In this blog post we explain how a Spark cluster computes jobs. A Spark job is a collection of stages, and each stage is a collection of tasks. Before we dive in, let's first look at the Spark cluster architecture.
In the cluster diagram above we can see the driver program, which is the main program of our Spark application; the driver runs on the master node of the Spark cluster.
The cluster manager is responsible for allocating resources for a given job.
The worker nodes host executors, in which the tasks run and data is cached.
This is the basic architecture of an Apache Spark cluster.
Now let's discuss the different RDD types created by transformations:
HadoopRDD: Spark creates an RDD from a Hadoop InputFormat; it builds a HadoopRDD and maps its partitions to the Hadoop block size by default…