Email Spam Detection Using Apache Spark MLlib

In this blog we will look at a real use case of Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will predict whether an email should go to the spam folder or the primary folder.

Let's jump into the code and see how it is implemented. First we load the training data from the spam dataset and the primary (ham) dataset as follows:

val spam = sc.textFile("/home/sandy/Spark/enron1/spam/0052.2003-12-20.GP.spam.txt", 4)
val normal = sc.textFile("/home/sandy/Spark/enron1/ham/0022.1999-12-16.farmer.ham.txt", 4)

Next we use HashingTF (optionally combined with IDF) to compute the frequency of each word in the mail and build a Vector, which we need in order to create the LabeledPoints used for training:

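import org.apache.spark.mllib.feature.HashingTF
val tf = new HashingTF()  // assumes the default feature dimension
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))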
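
To make the idea behind the term-frequency vector more concrete, here is a rough plain-Scala sketch of the hashing trick. It is only an illustration and not Spark's actual implementation: the names HashingSketch, bucket and termFrequencies are made up for this post, and the hash function Spark uses internally may differ.

```scala
// Illustrative sketch of the hashing trick (hypothetical helper, not Spark code).
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures), keeping the index non-negative.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many terms of one email fall into each bucket.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Array[Double] = {
    val counts = Array.fill(numFeatures)(0.0)
    terms.foreach(term => counts(bucket(term, numFeatures)) += 1.0)
    counts
  }
}
```

For example, HashingSketch.termFrequencies("free money free".split(" ").toSeq, 16) gives a 16-element array in which the bucket for "free" has a count of at least 2. Two different words can land in the same bucket, which is the price paid for not having to keep a dictionary of all words.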

With these vectors we create the LabeledPoints. LabeledPoints are the input for our model, and we create them as follows:
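
A minimal sketch of this step could look like the code below. It assumes the usual convention of label 1.0 for spam and 0.0 for normal mail; the object Labeling, the helper labelExamples and the union of both sets into one training RDD are assumptions made for illustration, not something fixed by the steps above.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper: turns the two feature RDDs into one labeled training set.
object Labeling {
  def labelExamples(spamFeatures: RDD[Vector],
                    normalFeatures: RDD[Vector]): RDD[LabeledPoint] = {
    // Assumed convention: 1.0 marks spam, 0.0 marks normal (ham) mail.
    val positiveExamples = spamFeatures.map(features => LabeledPoint(1.0, features))
    val negativeExamples = normalFeatures.map(features => LabeledPoint(0.0, features))
    // Combine both classes into a single RDD for training.
    positiveExamples.union(negativeExamples)
  }
}
```

With the RDDs from the previous step this would be called as val trainingData = Labeling.labelExamples(spamFeatures, normalFeatures). Calling trainingData.cache() afterwards is often worthwhile, since iterative training algorithms read the data many times.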

