Email spam detection using Apache Spark MLlib


In this post we look at a real use case for Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will predict whether an email goes to the spam folder or the primary folder.

Let's jump into the code and see how it is implemented. First, we load the training data from the spam dataset and the primary (ham) dataset as follows:

val spam = sc.textFile("/home/sandy/Spark/enron1/spam/0052.2003-12-20.GP.spam.txt", 4)
val normal = sc.textFile("/home/sandy/Spark/enron1/ham/0022.1999-12-16.farmer.ham.txt", 4)

Next, we use HashingTF or IDF to compute the frequency of each word in the mail and build a Vector, which is then used to create the LabeledPoints for training:

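// tf is assumed to be an org.apache.spark.mllib.feature.HashingTF instance created earlier
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))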
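
As a rough illustration of what a hashed term-frequency vector is, the plain-Scala sketch below maps each word to a bucket and counts how many words land in each bucket. It is a conceptual sketch only, not MLlib's implementation: the names `HashedTermFrequency`, `bucket` and `termFrequencies` are hypothetical, the 16 buckets are an arbitrary choice, and MLlib's own hash function may differ.

```scala
// Conceptual sketch of hashed term frequencies; NOT MLlib's actual implementation.
object HashedTermFrequency {
  // Map a term to a bucket in [0, numFeatures) via a non-negative modulo of its hash code.
  def bucket(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many terms fall into each bucket; only non-zero buckets are stored.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.groupBy(t => bucket(t, numFeatures)).map { case (idx, ts) => idx -> ts.size.toDouble }
}

// Hypothetical sample message; 16 buckets is an assumption chosen to keep the example small.
val tfSketch = HashedTermFrequency.termFrequencies("buy now buy cheap".split(" ").toSeq, 16)
```

Because different words can share a bucket, a larger number of features lowers the chance of collisions at the cost of a wider vector.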

Using these vectors, we create LabeledPoints, which are the input to our model.
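
A minimal sketch of building such labeled points is shown here, assuming spark-mllib is on the classpath. The excerpt does not show the original code for this step, so the 10000 hash buckets, the sample messages and the 1.0/0.0 label convention are all assumptions made for illustration.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Assumption: 10000 hash buckets; the excerpt does not show how tf was configured.
val tf = new HashingTF(numFeatures = 10000)

// Hypothetical sample messages standing in for lines of the spam and ham files.
val spamText = "win money now"
val normalText = "meeting notes attached"

// Assumed convention: label 1.0 marks spam, label 0.0 marks normal mail.
val positiveExample = LabeledPoint(1.0, tf.transform(spamText.split(" ").toSeq))
val negativeExample = LabeledPoint(0.0, tf.transform(normalText.split(" ").toSeq))
```

On RDDs of feature vectors the same constructor would be applied inside a `map`, for example `spamFeatures.map(features => LabeledPoint(1.0, features))`.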



Convolutional Neural Networks backpropagation: from intuition to derivation

Grzegorz Gwardys

Disclaimer: It is assumed that the reader is familiar with terms such as Multilayer Perceptron, delta errors and backpropagation. If not, it is recommended to read, for example, chapter 2 of the free online book ‘Neural Networks and Deep Learning’ by Michael Nielsen.

Convolutional Neural Networks (CNNs) are now a standard approach to image classification: there are publicly accessible deep learning frameworks, trained models and services. It’s more time-consuming to install something like Caffe than to perform state-of-the-art object classification or detection. We also have many ways of acquiring knowledge: there is a large number of deep learning courses/MOOCs, free e-books, and even direct access to the strongest deep/machine learning minds, such as Yoshua Bengio, Andrew Ng or Yann LeCun, via Quora, Facebook or G+.

Nevertheless, when I wanted to get deeper insight into CNNs, I could not find a “CNN backpropagation for dummies”. Notoriously…


Give Mesos and External Volumes a spin with playa-mesos

OLD - {code} by Dell EMC

Mesos is an important platform to consider if you’re interested in running containers in a highly available manner, operating an enterprise-friendly container platform, or building application platforms to operate complex distributed applications. It should be thought of as collaborative with and complementary to the container ecosystem. For some, it will sit at the scheduling layer only, and for others it will span from scheduling to the container runtime. Mesos represents a new way of thinking about how we operate and consume data center resources.

The Mesos platform is often adopted when a data center is moving towards the following key points:

  • A homogeneous operating environment where all compute resources can run all workloads
    • Workloads that aren’t virtualization-friendly and used to live in data center silos can now be scheduled alongside other workloads (e.g. Hadoop and Cassandra)
    • IaaS and virtualization are no longer needed to pool resources
  • Providing simple but highly available applications
    • Basic capabilities include…


Is Neural Network Better Off with Big Data


How does a neural network, or for that matter any machine learning model, relate to Big Data? Do we get a better-quality model with more data? That’s what we will explore in this post. We will explore sample complexity, i.e. the way model performance varies with training sample size. This is particularly interesting from a Big Data point of view. We will also look at how model performance varies with model complexity.

Although I have used a multilayer neural network for my experiments, the findings should…


Raspberry Pi Lights: how to sync Christmas lights to MIDI audio

The Raspberry Pi and I

** UPDATE 9/21/2014 **

I updated the source code today. Now lightorgan supports more than 7 output channels. It chooses the pin to light up based on both the pitch and the octave of every note, so the number of supported output channels is limited only by the dynamic range of the MIDI file. I observed that this worked in practice for at least 24 channels on several Christmas songs. This is cool because the new Raspberry Pi Model B now supports up to 28 pins! See Gordon’s page at

Also, the WiringPi pins that lightorgan uses are now configurable. Just modify the array called pinMapping[] near the top of the lightorgan.c file to add, remove, or remap a lightorgan channel to a corresponding WiringPi pin. Recompile with your changes and you should be good to go.

Check out the new source code from the google…


Backup & Restore Your Logstash/Grafana Dashboards

Web Development Insights

I created a Chef cookbook with which you can back up and restore your Logstash/Grafana dashboards. You can find it here.

It wouldn’t be an exaggeration to say that Logstash and Grafana have changed my life. I can’t even remember how I monitored or investigated performance issues before having them. When I first installed those tools and started to feed them with data, I was really excited by the possibilities they offered me. Building dashboards was so easy and fluid. Create a widget, select the data to display and voilà, you have a neat-looking graph! With time I added more and more dashboards to both Logstash and Grafana. I now have dozens in each. They show me everything I need to know, and when I find a piece missing, I add it right away.

A sample Grafana dashboard

Everything went well, life…


Why We Chose Kubernetes Over ECS

Web Development Insights

In our last post, we saw how Docker changed the way we treat our infrastructure and what changes it brought to the domain of service orchestration.
In this post, we’re going to take a tour of two of the leading Docker orchestration frameworks out there: ECS (Elastic Container Service) by AWS, and Kubernetes, an orchestration framework that began at Google and later became open source.

3 months ago when we, at, came to evaluate which Docker orchestration framework to use, we gave ECS the first priority. We were already familiar with AWS services, and since we already had our whole infrastructure there, it was the default choice. After testing the service for a while we had the feeling it was not mature enough and missing some key features we needed (more on that later), so we went to test another orchestration framework: Kubernetes. We were glad to discover that…
