Edureka has one of the most detailed and comprehensive courses on Apache Spark and Hadoop available online. But before signing up for any online training, go through this overview to get a basic grasp of the technology and its fundamentals.
To learn Spark and Hadoop, you need to start with the basics, i.e. Big Data and the emergence of Hadoop.
Moving forward, you need to focus on the main reason Hadoop became popular: HDFS (Hadoop Distributed File System).
Then take a deep dive into the Hadoop ecosystem and learn the various tools inside it along with their functionalities, so that you know how to create a solution tailored to your requirements.
The main components of HDFS are NameNode and DataNode.
NameNode
It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, the size of the files, permissions, hierarchy, etc. It records every change that takes place to the file system metadata.
For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live. It keeps a record of all the blocks in HDFS and the nodes on which those blocks are stored.
DataNode
These are the slave daemons that run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating, deleting, and replicating blocks based on the decisions taken by the NameNode.
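To make this division of labour concrete, here is a minimal Scala sketch against Hadoop's FileSystem API. Asking for a file's block locations is a metadata query (answered by the NameNode), while the hosts returned are the DataNodes that actually store the replicas. The path is hypothetical, and the code assumes an HDFS configuration is on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsMetadataDemo {
  def main(args: Array[String]): Unit = {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // "/data/sample.txt" is a hypothetical path; substitute a real file
    val status = fs.getFileStatus(new Path("/data/sample.txt"))

    // Block locations are metadata, so this query is answered by the NameNode...
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    blocks.foreach { b =>
      // ...while each host listed here is a DataNode holding a replica of that block
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
    }
    fs.close()
  }
}
```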
For processing, we use YARN (Yet Another Resource Negotiator). The components of YARN are the ResourceManager and the NodeManager.
ResourceManager
It is a cluster-level component (one per cluster) that runs on the master machine. It manages resources and schedules applications running on top of YARN.
NodeManager
It is a node-level component (one per node) that runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also keeps track of node health and log management. It continuously communicates with the ResourceManager to remain up to date.
So, with HDFS storing the data and YARN managing resources, you can perform parallel processing on HDFS using MapReduce.
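As an illustration, here is a minimal sketch of the classic MapReduce word count, written in Scala against the Hadoop MapReduce API (the same program the Hadoop documentation shows in Java), with input and output HDFS paths taken from the command line:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Map phase: emit (word, 1) for every token in a line
class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
}

// Reduce phase: sum the counts for each word
class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setCombinerClass(classOf[IntSumReducer])
    job.setReducerClass(classOf[IntSumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    // Input and output are HDFS paths passed on the command line
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```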
Next come the concepts of Pig, Hive and HBase.
Moving on to Spark, you need to learn Scala, as the Spark shell runs on Scala by default.
- Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way.
- It supports both object-oriented and functional programming styles, thus helping programmers be more productive (see the short sketch after this list).
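Here is a tiny sketch of what that mix of styles looks like in practice; the Employee class and the numbers are made up purely for illustration:

```scala
// A made-up Employee type: case classes give the object-oriented side,
// while map/filter on immutable collections give the functional side.
case class Employee(name: String, salary: Double)

object ScalaStylesDemo {
  def main(args: Array[String]): Unit = {
    val team = List(Employee("Asha", 90000), Employee("Ben", 75000))

    // Functional style: transform the collection instead of mutating it
    val afterRaise = team.map(e => e.copy(salary = e.salary * 1.1))

    afterRaise
      .filter(_.salary > 85000)
      .foreach(e => println(f"${e.name}: ${e.salary}%.2f"))
  }
}
```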
Moving further forward, you need to learn about RDDs, which are the basic building blocks of any Spark code.
- RDD (Resilient Distributed Dataset) is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- RDDs are read-only collections of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
- RDDs can be created from multiple data sources, e.g. Scala collections, the local file system, Hadoop (HDFS), Amazon S3, HBase tables, etc., as the sketch below shows.
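A minimal sketch of those ideas: an RDD built from a Scala collection, lazy transformations, and an action that triggers the computation. It assumes a local-mode Spark setup, and the file path in the comment is just a placeholder.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD built from a plain Scala collection, split into 4 partitions
    val numbers = sc.parallelize(1 to 100, 4)

    // Transformations (filter/map) are lazy; the reduce action triggers the job
    val evenSum = numbers.filter(_ % 2 == 0).map(_ * 2).reduce(_ + _)
    println(s"Twice the sum of evens: $evenSum")

    // The same API reads files; hdfs:// and s3a:// URIs work as well
    // (the path below is a placeholder):
    // val lines = sc.textFile("hdfs:///data/input.txt")

    sc.stop()
  }
}
```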
Spark SQL is another main component of Spark, which is very important for processing structured data in a SQL-style format.
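A short sketch of that SQL style, assuming a local SparkSession; the data here is an in-memory Seq just for illustration, but a DataFrame read from JSON, CSV, or Parquet is queried the same way:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Build a small DataFrame from an in-memory Seq (made-up names and ages)
    val people = Seq(("Alice", 34), ("Bob", 28), ("Cara", 45)).toDF("name", "age")

    // Register it as a temporary view and query it in plain SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```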
Next comes Spark's machine learning library, i.e. MLlib, and how it is used to run various ML algorithms, such as regression and K-means clustering, through Spark.
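As a taste of MLlib, here is a minimal K-means sketch using the DataFrame-based API; the four 2-D points are made up, and k=2 is chosen simply to split them into two obvious clusters:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KMeansSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Tiny made-up 2-D points; a real job would load a dataset from HDFS/S3
    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
    ).map(Tuple1.apply).toDF("features")

    // Fit K-means with k = 2 and print the learned cluster centers
    val model = new KMeans().setK(2).setSeed(1L).fit(data)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```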
Flume also plays an important role in ingesting streaming data, and so does Kafka.
Spark itself has the ability to process streaming data, which is done through Spark Streaming using DStreams.
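Here is a minimal DStream sketch, assuming a plain text source on localhost:9999 (e.g. `nc -lk 9999` while testing) and 5-second micro-batches:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: streaming needs at least one thread for the receiver and one for processing
    val conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Assumes a text source on localhost:9999 (a placeholder for a real feed)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```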
Edureka's Apache Spark and Scala Certification Training offers a detailed course specifically designed for the CCA175 exam, covering all of the above-mentioned topics.
Edureka also provides a good list of Spark videos. I would recommend going through the Edureka Spark Playlist as well as the Spark Tutorial.
There are a lot of Hadoop videos too.
Hope this helps.