Apache Spark Cluster Internals: How Spark Jobs Are Computed by the Cluster


In this blog we explain how a Spark cluster computes jobs. A Spark job is a collection of stages, and each stage is a collection of tasks. Before the deep dive, let's first look at the Spark cluster architecture.
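To make the job/stage/task hierarchy concrete, here is a minimal sketch (assuming a `SparkContext` named `sc`, e.g. from `spark-shell`; the input path is hypothetical). A single action triggers one job, and Spark splits that job into stages at shuffle boundaries:

```scala
// Minimal sketch: one action -> one job -> stages split at shuffles.
val words  = sc.textFile("hdfs:///data/words.txt")    // hypothetical path
val pairs  = words.flatMap(_.split(" ")).map((_, 1))  // narrow transformations: same stage
val counts = pairs.reduceByKey(_ + _)                 // shuffle boundary -> new stage
counts.count()  // the action: launches one job with two stages,
                // each stage running one task per partition
```

Everything before the `reduceByKey` shuffle runs as tasks in one stage; the aggregation after the shuffle runs as a second stage.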


In the architecture above we can see the driver program. It is the main program of our Spark application, and it runs on the master node of the Spark cluster.

The cluster manager is responsible for allocating resources for a given job.

The worker nodes host executors, in which the tasks run and data is cached.

This is the basic architecture of an Apache Spark cluster.
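The pieces above come together when the driver is started. A rough sketch, assuming a standalone cluster manager at a hypothetical host `master-host`: the driver creates the `SparkContext`, and the master URL in its configuration tells Spark which cluster manager should allocate executors on the workers.

```scala
// Illustrative only: the driver is the process that creates the SparkContext;
// the master URL names the cluster manager that allocates executors.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ClusterInternalsDemo")     // hypothetical application name
  .setMaster("spark://master-host:7077")  // standalone cluster manager (assumed host)
  .set("spark.executor.memory", "2g")     // memory for each executor on the worker nodes
val sc = new SparkContext(conf)           // driver program starts here
```

With `setMaster("local[*]")` instead, the same program runs driver and executors in a single JVM, which is handy for experimenting.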

Now let's discuss the different RDD types created by transformations:

  • HadoopRDD
  • FilteredRDD
  • ShuffledRDD

HadoopRDD: Spark creates an RDD from a Hadoop InputFormat, so it makes a new HadoopRDD and maps partitions to the Hadoop block size by default…
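You can see these RDD types for yourself by inspecting the lineage of a simple pipeline with `toDebugString` (sketch below; assumes `sc` from `spark-shell`, and the input path is hypothetical). Note that `FilteredRDD` comes from older Spark releases; recent versions implement `filter` as a `MapPartitionsRDD`, while the shuffle still produces a `ShuffledRDD`:

```scala
// Sketch: inspecting which RDD types a simple pipeline creates.
val lines    = sc.textFile("hdfs:///data/input.txt")    // backed by a HadoopRDD
val filtered = lines.filter(_.nonEmpty)                 // narrow transformation
val shuffled = filtered.map((_, 1)).reduceByKey(_ + _)  // ShuffledRDD after the shuffle
println(shuffled.toDebugString)  // prints the lineage, one RDD per line,
                                 // with stage boundaries shown by indentation
```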


