Email spam detection using apache spark mllib

Knoldus

In this blog we will see the real use case of spark mllib that is email spam detection. With the help of using the apache spark mllib component we will detect that email will goes in spam folder or primary folder.

So now jump into the programming and see how it will implement. So first we will load the data from training from spam dataset and primary dataset as follow

val spam = sc.textFile("/home/sandy/Spark/enron1/spam/0052.2003-12-20.GP.spam.txt", 4)
val normal = sc.textFile("/home/sandy/Spark/enron1/ham/0022.1999-12-16.farmer.ham.txt", 4)

Next we need to use HashinTF or IDF to find the frequency of word in the mail and create a Vector which is helpful in creating the LabelPoints for the training

val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))

With the help of vectors we will create the LabelPoints , LabelPoints are the input for our model we will create label points as follows

View original post 204 more words

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s