Hey, you can check out the following interview questions on Spark:
What is Spark?
Spark is a scheduling, monitoring and distributing engine for big data. It is a cluster computing platform designed to be fast and general purpose. Spark extends the popular MapReduce model. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.
What is Apache Spark?
Spark is a fast, easy-to-use and flexible data processing framework. Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala/MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
Explain key features of Spark.
- Allows integration with Hadoop and files included in HDFS.
- Spark has an interactive language shell, as it ships with an independent interpreter for Scala (the language in which Spark is written).
- Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across computing nodes in a cluster.
- Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis and graph processing.
Define RDD?
RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. An RDD is a read-only, partitioned collection of records that is:
Immutable – RDDs cannot be altered once created; transformations produce new RDDs.
Resilient – If a node holding a partition fails, the data can be recomputed (from its lineage) or taken over by another node.
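A minimal Scala sketch of these properties, assuming a running SparkContext named sc (as provided by spark-shell); the values are made up for illustration:

    // distribute a local collection as an RDD
    val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
    // transformations return a NEW RDD; `numbers` itself is never modified
    val doubled = numbers.map(_ * 2)
    // an action materializes the result: 2, 4, 6, 8, 10
    println(doubled.collect().mkString(", "))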
How to run Spark in standalone client mode?
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client \
  --master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10
How to run Spark in standalone cluster mode?
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode cluster \
  --master spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10
How to run Spark in YARN client mode?
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode client \
  --master yarn \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10
How to run Spark in YARN cluster mode?
spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --deploy-mode cluster \
  --master yarn \
  $SPARK_HOME/examples/lib/spark-examples_version.jar 10
What operations does an RDD support?
RDDs support two types of operations: transformations, which build a new RDD from an existing one, and actions, which compute a result and return it to the driver or write it to storage.
What do you understand by Transformations in Spark?
Transformations are functions applied on an RDD that result in another RDD. They are not executed until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and produces a new RDD, while filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument.
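A short Scala illustration, assuming a SparkContext sc is available (as in spark-shell); note that nothing runs until the action at the end:

    val words = sc.parallelize(Seq("spark", "hadoop", "spark streaming"))
    // transformations: lazily describe new RDDs
    val lengths   = words.map(word => word.length)
    val sparkOnly = words.filter(word => word.contains("spark"))
    // action: triggers the actual computation and prints 2
    println(sparkOnly.count())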
What is Hive on Spark?
Hive contains significant support for Apache Spark; Hive execution can be configured to use Spark as its engine:
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
Name commonly-used Spark Ecosystems?
- Spark SQL (Shark) – for structured data processing and SQL queries
- Spark Streaming – for processing live data streams
- GraphX – for generating and computing graphs
- MLlib – machine learning algorithms
- SparkR – to promote R programming in the Spark engine
What are the main components of Spark?
- Spark Core: Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines RDDs.
- Spark SQL: Spark SQL is Spark’s package for working with structured data. It allows querying data via SQL as well as HQL (the Hive Query Language).
- Spark Streaming: Spark Streaming is a Spark component that enables processing of live streams of data. Examples of data streams include logfiles generated by production web servers.
- MLlib: Spark comes with a library containing common machine learning (ML) functionality, called MLlib. MLlib provides multiple types of machine learning algorithms.
- GraphX: GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and performing graph-parallel computations.
How does Spark Streaming work?
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
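A hedged Scala sketch of the classic streaming word count; the local master, host and port below are placeholders chosen for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))          // 1-second batch interval

    // DStream from a socket source (placeholder host/port)
    val lines  = ssc.socketTextStream("localhost", 9999)
    val words  = lines.flatMap(_.split(" "))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)     // each batch is processed as an RDD
    counts.print()

    ssc.start()              // start receiving and processing
    ssc.awaitTermination()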
Define Spark Streaming. Does Spark support stream processing?
Yes. Spark Streaming is an extension to the Spark API that allows stream processing of live data streams. Data from sources such as Flume and HDFS is streamed and finally processed to file systems, live dashboards and databases. It is similar to batch processing in that the input data is divided into streams, like batches.
What file systems does Spark support?
- Hadoop Distributed File System (HDFS)
- Local File system
- Amazon S3
What is YARN?
Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. Running Spark on YARN necessitates a binary distribution of Spark that is built with YARN support.
List the functions of Spark SQL?
Spark SQL is capable of:
- Loading data from a variety of structured sources
- Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
- Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
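A short Scala sketch of these capabilities using the SparkSession entry point; the file path, view name and column names below are illustrative assumptions:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SqlExample").getOrCreate()

    // load data from a structured source (placeholder path)
    val people = spark.read.json("/path/to/people.json")
    people.createOrReplaceTempView("people")

    // query it with SQL inside the program
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // mix the result with regular Scala code
    adults.collect().foreach(println)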
What is the common workflow of a Spark program?
Every Spark program and shell session works as follows (see the sketch after this list):
- Create some input RDDs from external data.
- Transform them to define new RDDs using transformations like filter().
- Ask Spark to persist() any intermediate RDDs that will need to be reused.
- Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
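Putting those four steps together in a minimal Scala sketch, assuming a SparkContext sc and a placeholder input path:

    // 1. create an input RDD from external data (placeholder path)
    val lines = sc.textFile("hdfs:///path/to/input.txt")
    // 2. transform it into a new RDD
    val errors = lines.filter(_.contains("ERROR"))
    // 3. persist an intermediate RDD that will be reused
    errors.persist()
    // 4. actions kick off the parallel computation
    println(errors.count())
    println(errors.first())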
Difference between cache() and persist()?
With cache(), you use only the default storage level MEMORY_ONLY. With persist(), you can specify which storage level you want. So cache() is the same as calling persist() with the default storage level. Spark has many levels of persistence to choose from based on what our goals are. The default persist() will store the data in the JVM heap as unserialized objects. When we write data out to disk, that data is also always serialized. Different levels of persistence are MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER and DISK_ONLY.
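A brief Scala sketch of the difference, assuming two existing RDDs named logs and events (hypothetical names):

    import org.apache.spark.storage.StorageLevel

    // same as logs.persist(StorageLevel.MEMORY_ONLY)
    logs.cache()
    // persist() lets you choose the storage level explicitly
    events.persist(StorageLevel.MEMORY_AND_DISK_SER)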
What are benefits of Spark over MapReduce?
- Due to in-memory processing, Spark performs processing around 10-100x faster than Hadoop MapReduce, which relies on persistent (disk) storage for its data processing tasks.
- Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning and interactive SQL queries. Hadoop, by contrast, only supports batch processing.
- Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
- Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, while there is no iterative computing implemented by Hadoop.
What is Spark Executor?
When the SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. The final tasks from the SparkContext are transferred to executors for execution.
Name types of Cluster Managers in Spark?
The Spark framework supports three major types of Cluster Managers:
- Standalone: a basic manager to set up a cluster
- Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications
- Yarn: responsible for resource management in Hadoop
What are the steps that occur when you run a Spark application on a cluster?
- The user submits an application using spark-submit.
- Spark-submit launches the driver program and invokes the main() method specified by the user.
- The driver program contacts the cluster manager to ask for resources to launch executors.
- The cluster manager launches executors on behalf of the driver program.
- The driver process runs through the user application. Based on the RDD actions and transformations in the program, the driver sends work to executors in the form of tasks.
- Tasks are run on executor processes to compute and save results.
- If the driver’s main() method exits or it calls SparkContext.stop(), it terminates the executors and releases resources from the cluster manager.
What is Spark SQL?
Spark SQL is a module in Apache Spark that integrates relational processing (e.g., declarative queries and optimized storage) with Spark’s procedural programming API. Spark SQL makes two main additions. First, it offers much tighter integration between relational and procedural processing, through a declarative DataFrame API. Second, it includes a highly extensible optimizer, Catalyst.
Big data applications require a mix of processing techniques, data sources and storage formats. The earliest systems designed for these workloads, such as MapReduce, gave users a powerful, but low-level, procedural programming interface. Programming such systems was onerous and required manual optimization by the user to achieve high performance. As a result, multiple new systems sought to provide a more productive user experience by offering relational interfaces to big data. Systems like Pig, Hive and Shark all take advantage of declarative queries to provide richer automatic optimizations.
What is a SchemaRDD/DataFrame?
A SchemaRDD is an RDD composed of Row objects with additional schema information of the types in each column. Row objects are just wrappers around arrays of basic types (e.g., integers and strings).
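A small Scala sketch showing a DataFrame (the successor of SchemaRDD) carrying schema information alongside the rows; the case class and values are made up for illustration:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    val spark = SparkSession.builder().appName("SchemaExample").getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDF()
    people.printSchema()   // column names and types travel with the data
    people.show()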
What are Spark’s main features?
- Speed: Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk. Spark makes this possible by reducing the number of reads and writes to disk. It stores intermediate processing data in memory, using the concept of a Resilient Distributed Dataset (RDD), which allows it to transparently keep data in memory and persist it to disk only when needed. This helps to reduce most of the disk reads and writes – the main time-consuming factors – in data processing.
- Combines SQL, streaming, and complex analytics: In addition to simple “map” and “reduce” operations, Spark supports SQL queries, streaming data, and complex analytics such as machine learning and graph algorithms out-of-the-box. Not only that, users can combine all these capabilities seamlessly in a single workflow.
- Ease of Use: Spark lets you quickly write applications in Java, Scala, or Python. This helps developers create and run their applications in programming languages they already know, and makes it easy to build parallel apps.
- Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, S3.
Explain the popular use cases of Apache Spark.
Apache Spark is mainly used for:
- Iterative machine learning
- Interactive data analytics and processing
- Stream processing
- Sensor data processing
How can you remove the elements with a key present in any other RDD?
Use the subtractByKey() function.
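For example, in Scala, assuming a SparkContext sc (the RDD contents are made up for illustration):

    val main  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val other = sc.parallelize(Seq(("b", 99)))
    // keeps only the pairs whose key does NOT appear in `other`
    val result = main.subtractByKey(other)
    println(result.collect().mkString(", "))   // e.g. (a,1), (c,3)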
What is the difference between persist() and cache()?
persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
What are the various levels of persistence in Apache Spark?
Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk or in memory, or as a combination of both with different replication levels. The storage/persistence levels in Spark are:
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- DISK_ONLY
- OFF_HEAP
Explain the core components of a distributed Spark application.