Apache Spark is a widely used, general-purpose, distributed cluster computing framework. With this open-source tool, you can program an entire computer cluster with implicit data parallelism and fault tolerance. We have compiled a list of the most common Apache Spark interview questions. These questions will help you formulate a better strategy for your upcoming Apache Spark interview.
Apache Spark has rapidly gained popularity as a unified analytics engine for big data and machine learning. Therefore, it is advisable to learn Spark with the best Apache Spark tutorials for beginners in 2021.
Best Apache Spark Interview Questions and Answers
When preparing to answer Spark interview questions, it is essential to know the right buzzwords and learn the right technologies. This article features questions and answers about Spark Core, Spark Streaming, Spark SQL, GraphX, and MLlib, among others.
1. What is Apache Spark?
Answer: Known as a cluster computing platform, Apache Spark is an open-source framework for real-time processing. It is currently the most active Apache project and has a vibrant open-source community. Furthermore, Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark is one of the best-known projects in the Apache Software Foundation and is emerging as a market leader in Big Data. Spark is typically run on clusters with thousands of nodes. Many companies use Spark, including Amazon, eBay, and Yahoo!
2. What are the key features of Apache Spark?
Answer: Apache Spark has the following key features:
- Hadoop Integration
- Lazy Evaluation
- Machine Learning
- Multiple Format Support
- Real-Time Computation
3. Which languages does Apache Spark support, and which is the most popular?
Answer: Apache Spark supports Scala, Java, Python, and R. Among them, Scala and Python have interactive shells for Spark. You can access the Scala shell through ./bin/spark-shell and the Python shell via ./bin/pyspark. Scala is the most widely used language for Spark because Spark is written in Scala.
4. What is the difference between Spark and Hadoop?
Answer: The main differences between Spark and Hadoop MapReduce are:

| Spark | Hadoop MapReduce |
| --- | --- |
| Real-time and batch processing | Batch processing only |
| Up to 100x faster than Hadoop | Decent speed |
| Offers interactive modes | No interactive modes except Pig and Hive |
| Easy to learn due to high-level modules | Difficult to learn |
| Supports partition recovery | Fault-tolerant |
5. What advantages does Spark offer over Hadoop MapReduce?
Answer: Spark offers certain advantages over Hadoop MapReduce:
- No Disk-Dependency: Hadoop MapReduce is heavily disk-dependent, whereas Spark relies primarily on caching and in-memory data storage.
- Enhanced Speed: MapReduce uses persistent storage for all data processing tasks. On the other hand, Spark uses in-memory processing, which is about 10 to 100 times faster than Hadoop MapReduce.
- Iterative Computation: When a computation is repeated several times on the same dataset, it is called an iterative computation. Spark is capable of iterative computation, whereas Hadoop MapReduce is not.
- Multitasking: Hadoop supports batch processing only via inbuilt libraries. On the other hand, Apache Spark offers built-in libraries for performing multiple tasks from the same core, such as batch processing, interactive SQL queries, machine learning, and streaming.
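The advantage of iterative computation can be illustrated with a short pure-Python sketch (not actual Spark code, and the dataset and update rule are hypothetical): the same dataset is scanned on every iteration, which is why keeping it in memory, as Spark does, beats re-reading it from disk each pass.

```python
# Pure-Python sketch of iterative computation: the same dataset is
# scanned repeatedly, so caching it in memory (Spark's approach)
# avoids one disk read per iteration.
data = [1.0, 2.0, 3.0, 4.0]  # imagine this cached in memory across iterations

estimate = 0.0
for _ in range(20):  # each iteration re-reads the *same* dataset
    gradient = sum(x - estimate for x in data) / len(data)
    estimate += 0.5 * gradient  # step halfway toward the mean

print(round(estimate, 2))  # converges to the mean of the data: 2.5
```

With MapReduce, every one of those 20 passes would be a separate job re-reading its input from HDFS; with Spark, the cached dataset is read from memory after the first pass.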
6. What are the various functions of Spark Core?
Answer: Spark Core is a large-scale parallel and distributed data processing engine. It is a distributed execution engine that works seamlessly with Java, Python, and Scala APIs to provide a platform for ETL (Extract, Transform, Load) application development. Spark Core has the following functions:
- Monitoring, scheduling, and distributing jobs on a cluster.
- Working with storage systems.
- Fault recovery and memory management.
Additionally, libraries built on top of Spark Core enable it to process machine learning, streaming, and SQL queries.
7. How will you implement SQL in Spark?
Answer: Spark SQL modules allow you to integrate relational processing with Spark's functional programming API. You can query the data using SQL or HiveQL (Hive Query Language).
Furthermore, Spark SQL supports various data sources and allows you to weave SQL queries with code transformations. Spark SQL consists of four libraries: DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service.
8. List the various components of the Spark Ecosystem.
Answer: These are the five types of components in the Spark Ecosystem:
- GraphX: Enables graphs and graph-parallel computation.
- MLlib: It is used for machine learning.
- Spark Core: A powerful parallel and distributed processing platform.
- Spark Streaming: Handles real-time streaming data.
- Spark SQL: Combines Spark's functional programming API with relational processing.
9. Is there any API available for implementing graphs in Spark?
Answer: Apache Spark implements graphs and graph-parallel computing using GraphX. It extends the Spark RDD with a Resilient Distributed Property Graph. The graph is a directed multigraph and can have several edges in parallel.
A user-defined property is associated with each edge and vertex of the Resilient Distributed Property Graph. Multiple relationships are possible between the same vertices due to the parallel edges.
With GraphX, you can perform graph computations with a set of fundamental operators, such as joinVertices, mapReduceTriplets, and subgraph, along with an optimized version of Pregel.
In addition to graph algorithms, GraphX also offers graph builders to simplify graph analytics tasks.
10. Could you please explain the concept of RDD? Also, explain how you can create RDDs in Apache Spark.
Answer: RDD or Resilient Distribution Dataset comprises fault-tolerant operational elements that can run in parallel. All partitioned data in an RDD is immutable and distributed. Essentially, RDDs consist of portions of data that are stored in memory and shared among many nodes. Spark lazily evaluates these RDDs, which contributes to Apache Spark's faster speed. There are two types of RDDs:
- Hadoop Datasets: Apply functions to each record in HDFS (Hadoop Distributed File System) or other types of storage systems
- Parallelized Collections: Run an existing collection from the driver program in parallel
Apache Spark provides two methods for creating RDDs:
- Using the Driver program to parallelize a collection. It uses the parallelize() method of SparkContext.
- Through the loading of an external dataset from external storage, such as HDFS, HBase, and shared filesystems.
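What `parallelize()` conceptually does can be sketched in pure Python (this is not the Spark API; the `parallelize` function below is a hypothetical stand-in): the driver-side collection is sliced into partitions that can then be processed in parallel across the cluster.

```python
# Pure-Python sketch (not the Spark API) of what SparkContext.parallelize()
# conceptually does: slice a driver-side collection into partitions.
def parallelize(collection, num_partitions):
    """Split a collection into roughly equal contiguous partitions."""
    size = len(collection)
    return [collection[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

partitions = parallelize([1, 2, 3, 4, 5, 6, 7], 3)
print(partitions)  # [[1, 2], [3, 4], [5, 6, 7]]
```

In real Spark, each of these partitions would be handed to a task running on an executor rather than kept in a local list.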
11. Name various types of Cluster Managers in Spark.
Answer: There are three types of Cluster Managers in Spark:
- Apache Mesos: A general-purpose cluster manager that can also run Hadoop applications.
- Standalone: A simple cluster manager bundled with Spark, making it easy to set up a cluster.
- YARN: Hadoop's resource manager, responsible for resource management and scheduling across the cluster.
12. What do you understand by the Parquet file?
Answer: Parquet is a columnar format supported by several data processing systems. Spark SQL supports both read and write operations with Parquet. Columnar storage has the following advantages:
- Efficiently stores summarized data
- Limits I/O operations
- Consumes less space
- Follows type-specific encoding
- Can fetch specific columns without reading entire rows
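The core idea behind columnar formats like Parquet can be shown with a short pure-Python sketch (the table and values are hypothetical): when data is laid out one array per column, fetching a single column never touches the other columns' data.

```python
# Pure-Python sketch of row vs. columnar layout (the idea behind Parquet).
rows = [("alice", 30, "NY"), ("bob", 25, "SF"), ("carol", 35, "LA")]

# Columnar layout: one array per column, so reading a single column
# never touches the bytes of the other columns.
columns = {
    "name": ["alice", "bob", "carol"],
    "age":  [30, 25, 35],
    "city": ["NY", "SF", "LA"],
}

# Fetch only the "age" column — no I/O on names or cities.
ages = columns["age"]
print(sum(ages) / len(ages))  # average age: 30.0
```

This is also why columnar storage compresses well: each column holds values of a single type, enabling type-specific encoding.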
13. Is it possible to use Apache Spark for accessing and analyzing data stored in Cassandra databases?
Answer: Yes. Using the Spark Cassandra Connector, Apache Spark can access and analyze data stored in Cassandra databases. The connector includes a feature whereby Spark executors talk to local Cassandra nodes and request only local data.
You can connect Cassandra with Apache Spark to speed up queries by reducing the network traffic for sending data between Spark executors and Cassandra nodes.
14. Can you explain how you can use Apache Spark along with Hadoop?
Answer: Apache Spark has the advantage of being compatible with Hadoop. Together, they make a formidable tech duo. Using Apache Spark and Hadoop combines Spark's unparalleled processing power with HDFS and YARN's best features. Below are the ways to use Hadoop Components with Apache Spark:
- Batch & Real-Time Processing – MapReduce and Spark can work together, where the former handles batch processing, and the latter handles real-time processing.
- HDFS – Spark can make use of the HDFS to leverage the distributed replicated storage.
- MapReduce – Apache Spark can be used in conjunction with MapReduce in the same Hadoop cluster or independently as a processing framework.
- YARN – You can run Spark applications on YARN.
15. What do you mean by the worker node?
Answer: A worker node is any node capable of running application code in a cluster. The driver program must listen for and accept incoming connections from its executors, and it must be network-addressable from the worker nodes.
A worker node is a slave node: the master node assigns work to it. Worker nodes process the data stored on the node and report the available resources to the master node, which schedules tasks according to resource availability.
16. How will you connect Apache Spark with Apache Mesos?
Answer: The steps to connect Apache Spark with Apache Mesos are as follows:
- Set up the Spark driver program to connect with Apache Mesos.
- Place the Spark binary package in a location accessible by Mesos.
- In the same directory as Apache Mesos, install Apache Spark.
- Set the spark.mesos.executor.home property to the location where Apache Spark is installed.
17. Please explain the sparse vector in Spark.
Answer: A sparse vector stores only the non-zero entries of a vector to save space. It uses two parallel arrays:
- One for indices
- The other for values
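A minimal pure-Python sketch of this representation (the values are hypothetical; in Spark itself, MLlib exposes the same shape via `Vectors.sparse(size, indices, values)`):

```python
# Pure-Python sketch of a sparse vector: two parallel arrays, one holding
# the indices of the non-zero entries and one holding their values.
# Dense equivalent: [0.0, 0.0, 1.5, 0.0, 3.0, 0.0]  (size 6)
size = 6
indices = [2, 4]      # positions of the non-zero entries
values = [1.5, 3.0]   # the non-zero values themselves

def dense(size, indices, values):
    """Expand the sparse representation back into a dense list."""
    v = [0.0] * size
    for i, x in zip(indices, values):
        v[i] = x
    return v

print(dense(size, indices, values))  # [0.0, 0.0, 1.5, 0.0, 3.0, 0.0]
```

For a vector that is mostly zeros, storing two short arrays is far cheaper than storing every entry.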
18. Can you explain how to minimize data transfers while working with Spark?
Answer: To write Spark programs that run reliably and quickly, data transfers and shuffling need to be minimized. Apache Spark can minimize data transfers in several ways:
- Avoiding shuffle-heavy operations – ByKey operations, repartition, and other operations that trigger shuffles should be avoided where possible.
- Using Accumulators – Accumulators help in updating the values of variables while executing them in parallel
- Using Broadcast Variables – Broadcast variables improve the efficiency of joining small and large RDDs
19. Does Apache Spark provide checkpoints?
Answer: Apache Spark does provide checkpoints. In addition to allowing the program to run around the clock, they make it resilient against failures unrelated to the application logic. A lineage graph is used to recover RDDs after a failure.
Apache Spark provides an API for adding and managing checkpoints, and the user decides which data to checkpoint. Checkpoints are preferred when lineage graphs are lengthy and have many dependencies.
20. What are broadcast variables in Apache Spark? Why do we need them?
Answer: Instead of shipping a copy of a variable with tasks, a broadcast variable keeps a read-only cached version of the variable on each machine.
Additionally, broadcast variables are used to provide each node with a copy of a large input dataset. Apache Spark distributes broadcast variables using efficient broadcast algorithms to reduce communication costs.
With broadcast variables, there is no need to duplicate variables for each task. As a result, data can be processed quickly. In contrast to RDD lookup(), broadcast variables assist in storing a lookup table inside the memory, enhancing retrieval efficiency.
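The lookup-table use case can be sketched in pure Python (this is not the Spark API; the table and `enrich` function are hypothetical): one read-only copy of a small table is shared by all tasks instead of being shipped with each task.

```python
# Pure-Python sketch of the broadcast idea: ship ONE read-only copy of a
# small lookup table per machine instead of one copy per task.
country_lookup = {"us": "United States", "in": "India", "jp": "Japan"}

def enrich(records, broadcast_lookup):
    """Each 'task' reads the shared lookup instead of receiving its own copy."""
    return [(code, broadcast_lookup.get(code, "unknown")) for code in records]

partition = ["us", "jp", "xx"]
print(enrich(partition, country_lookup))
# [('us', 'United States'), ('jp', 'Japan'), ('xx', 'unknown')]
```

In real Spark, the table would be wrapped with `sc.broadcast(country_lookup)` and read inside tasks via its `.value` attribute; the pattern above shows why that beats a shuffle-based join for small tables.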
21. What are the limitations of using Apache Spark?
Answer: Here are the few limitations of using Apache Spark:
- There is no built-in file management system. Hence, integration with other platforms like Hadoop is required to take advantage of a file management system.
- Higher latency than engines that process records one at a time.
- It doesn't support real-time data stream processing. Streams of live data are partitioned into batches in Apache Spark and, after processing, are again converted into batches. In other words, Spark Streaming is micro-batch processing rather than truly real-time data processing.
- Fewer algorithms are available.
- Spark streaming doesn't support record-based window criteria.
- Developers must manually tune and distribute work across the cluster rather than relying on automatic optimization.
- Apache Spark's reliance on in-memory computation makes it expensive: the large amounts of RAM required become a bottleneck for cost-efficient big data processing.
22. What is YARN?
Answer: YARN (Yet Another Resource Negotiator) is Hadoop's central resource management platform, providing scalable operations across the cluster. YARN is a resource/container manager, similar to Mesos, whereas Spark is a data processing tool. Just like Hadoop MapReduce, Spark can run on YARN; doing so requires a YARN-supported binary distribution of Spark.
23. What is Spark Driver?
Answer: Spark Driver is a program that runs on the master node and declares transformations and actions on the data RDDs. Simply put, Spark drivers create a SparkContext, which is associated with a Spark Master. Additionally, the driver delivers RDD graphs to Master, where the standalone cluster manager runs.
24. What is Spark Executor?
Answer: When SparkContext connects to the cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on worker nodes. SparkContext then transfers tasks to the executors for execution.
25. What are the main operations of RDD?
Answer: RDD includes two main operations:
- Transformations
- Actions
26. What is the function of map()?
Answer: map() iterates over every element of the RDD, applies a function to each, and produces a new RDD.
27. What is the function of filter()?
Answer: filter() creates a new RDD by selecting the elements of the existing RDD that satisfy the function passed as an argument.
28. Define Transformations in Spark?
Answer: Transformations are functions performed on an RDD to create another RDD. A transformation is not executed until an action is performed. map() and filter() are examples of transformations.
29. What are the Actions in Spark?
Answer: Actions bring data from an RDD back to the local machine. They are RDD operations that produce non-RDD values. Actions in Spark include functions such as reduce() and take().
30. What file systems does Spark support?
Answer: Spark supports the following three file systems:
- Hadoop Distributed File System (HDFS)
- Amazon S3
- Local File system
31. Is there a module to implement SQL in Spark? How does it work?
Answer: Spark SQL integrates relational processing with Spark's functional programming API. You can query data using either SQL or Hive Query Language. If you are familiar with RDBMS, Spark SQL will be an easy transition from your earlier tools, where you could extend the possibilities of traditional relational data processing.
Spark SQL combines relational processing with Spark's functional programming. It also supports multiple data sources and makes it possible to combine SQL queries with code transformations, resulting in a robust tool.
Spark SQL consists of four libraries:
- Data Source API
- DataFrame API
- Interpreter & Optimizer
- SQL Service
32. What is the difference between the reduce() and take() functions?
Answer: reduce() is an action that repeatedly combines the elements of an RDD until a single value is left, whereas take(n) is an action that returns the first n elements of an RDD to the local node.
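The semantics can be sketched in pure Python (not actual Spark code): `functools.reduce` mirrors `rdd.reduce`, and slicing mirrors `rdd.take(n)`.

```python
# Pure-Python sketch: reduce() repeatedly combines values until one
# remains, while take(n) returns the first n elements to the driver.
from functools import reduce

data = [1, 2, 3, 4, 5]

total = reduce(lambda a, b: a + b, data)  # like rdd.reduce(lambda a, b: a + b)
first_three = data[:3]                    # like rdd.take(3)

print(total)        # 15
print(first_three)  # [1, 2, 3]
```

Note that in Spark the function passed to reduce() must be commutative and associative, because the combination happens in parallel across partitions.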
33. What are the similarities and differences between coalesce() and repartition() in Spark?
Answer: Both coalesce() and repartition() change the number of partitions in an RDD. The difference is that coalesce() avoids a full shuffle and can only reduce the number of partitions, whereas repartition() performs a full shuffle (internally it calls coalesce() with shuffling enabled) and uses a hash partitioner to distribute the whole dataset evenly across the specified number of partitions.
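The difference can be sketched in pure Python (partitions modeled as plain lists; the two functions below are hypothetical illustrations, not the Spark implementations):

```python
# Pure-Python sketch of coalesce vs. repartition semantics.
def coalesce(partitions, n):
    """Merge existing partitions down to n without a full shuffle:
    whole partitions are combined, elements are not redistributed."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Full shuffle: every element is re-assigned by hash, so data
    ends up spread over exactly n partitions."""
    shuffled = [[] for _ in range(n)]
    for part in partitions:
        for x in part:
            shuffled[hash(x) % n].append(x)
    return shuffled

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))     # [[1, 2, 4, 5], [3, 6]]
print(repartition(parts, 2))  # every element re-assigned by hash
```

The sketch shows why coalesce() is cheaper (it only concatenates existing partitions) and why repartition() gives a more even spread (every element moves to its hash-determined partition).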
34. Define PageRank in Spark? Give an example?
Answer: PageRank is an algorithm in Spark's GraphX library that measures the importance of each vertex in a graph. For example, in social media, if a person has a large following on Facebook, Instagram, or any other platform, their page will rank higher.
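The iteration at the heart of PageRank can be sketched in pure Python on a tiny hypothetical link graph (this is the classic formulation with a 0.85 damping factor, not GraphX code):

```python
# Pure-Python sketch of the PageRank iteration on a tiny graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # who links to whom
ranks = {page: 1.0 for page in links}

for _ in range(20):
    contribs = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = ranks[page] / len(outgoing)   # split rank over outgoing links
        for target in outgoing:
            contribs[target] += share
    # damping factor 0.85, as in the classic formulation
    ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}

# "c" receives links from both "a" and "b", so it ranks highest.
print(max(ranks, key=ranks.get))  # c
```

In GraphX the same computation runs in parallel over a distributed property graph via the Pregel-style API.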
35. Define RDD Lineage?
Answer: Apache Spark does not replicate data in memory, so lost data partitions are reconstructed using RDD lineage: a record of the transformations used to build a dataset from other datasets.
36. What is Sliding Window in Spark? Give an example?
Answer: Spark Streaming uses a Sliding Window to specify which batches of data to process together. You configure a window length (the duration of the window) and a sliding interval (how often the windowed operation runs), both expressed as multiples of the batch interval.
37. What are the benefits of Sliding Window operations?
Answer: Benefits of sliding window operations include:
- It merges the RDDs that fall within the specified window and performs operations on them to create new RDDs of the windowed DStream.
- Through the Spark Streaming library, it supports windowed computations for the transformation of RDDs.
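Windowed DStream semantics can be sketched in pure Python (the batch contents are hypothetical, and the lists stand in for RDDs): a window of length 3 batch intervals sliding by 2 intervals merges the batches it covers.

```python
# Pure-Python sketch of windowed DStream semantics: batches arrive at a
# fixed interval; each window merges the batches it covers.
batches = [[1], [2, 3], [4], [5, 6], [7]]  # one list per batch interval
window_length, slide_interval = 3, 2       # in units of batch intervals

windows = []
for end in range(window_length, len(batches) + 1, slide_interval):
    merged = [x for batch in batches[end - window_length:end] for x in batch]
    windows.append(merged)

print(windows)  # [[1, 2, 3, 4], [4, 5, 6, 7]]
```

Note that consecutive windows overlap (batch `[4]` appears in both), which is exactly how overlapping windows behave in Spark Streaming.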
38. Can we trigger automated clean-ups in Spark?
Answer: Using Spark, we can trigger automated cleanups to handle the accumulated metadata. To accomplish this, you must set the parameter "spark.cleaner.ttl."
39. Why was SchemaRDD designed?
Answer: SchemaRDD aims to simplify code debugging and unit testing for SparkSQL developers.
40. Define SchemaRDD in Apache Spark RDD?
Answer: SchemaRDD is an RDD that contains row objects (wrappers around arrays of strings and integers) along with schema information describing the type of data in each column. It has since been renamed the DataFrame API.
41. What is another method than "spark.cleaner.ttl" to trigger automated clean-ups in Spark?
Answer: Apart from setting "spark.cleaner.ttl", you can trigger automated clean-ups in Spark by dividing long-running jobs into batches and writing the intermediate results to disk.
42. What is the basic difference between Spark SQL, HQL, and SQL?
Answer: Spark SQL supports both SQL and Hive Query Language (HQL) without requiring any syntax changes, and it allows joining SQL and HQL tables.
43. What is the role of Akka in Spark?
Answer: Spark used Akka (in versions prior to Spark 2.0) for scheduling and for communication between workers and masters, such as task assignments and registration requests.
44. When running Spark applications, is it necessary to install Spark on all the nodes of the YARN cluster?
Answer: No. Spark does not need to be installed on every node, because Spark applications execute on top of YARN or Mesos clusters without affecting the cluster itself.
45. Explain accumulators in Apache Spark.
Answer: Accumulators are variables that can only be added through commutative and associative operations. You can use them to implement counters or sums. The UI can track accumulators to help understand the progress of running stages. Spark has native support for numeric accumulators. We can create named and unnamed accumulators.
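The accumulator contract can be sketched in pure Python (the `Accumulator` class and `process` function below are hypothetical stand-ins, not Spark's API): tasks may only *add* to the variable, and the driver reads the aggregated result.

```python
# Pure-Python sketch of an accumulator: tasks only *add* to it (a
# commutative, associative operation); the driver reads the total.
class Accumulator:
    def __init__(self, value=0):
        self.value = value

    def add(self, amount):  # the only operation tasks may perform
        self.value += amount

bad_records = Accumulator()

def process(partition, acc):
    """A simulated task: count 'bad' (negative) records as a side channel."""
    for record in partition:
        if record < 0:
            acc.add(1)

for part in [[1, -2, 3], [-4, 5]]:  # two simulated tasks
    process(part, bad_records)

print(bad_records.value)  # 2
```

In real Spark, the equivalent is `acc = sc.accumulator(0)`, with tasks calling `acc.add(1)` and the driver reading `acc.value`.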
46. Explain Caching in Spark Streaming.
Answer: With DStreams, developers can cache/persist the stream's data in memory, which is useful when the same data in a DStream will be processed multiple times. Use the persist() method on a DStream to accomplish this. For input streams that receive data over the network (such as Kafka, Flume, or sockets), the default persistence level replicates the data to two nodes for fault tolerance.
47. How can Spark be connected to Apache Mesos?
Answer: Spark and Mesos can be connected as follows:
- Configure the spark driver program to connect to Mesos.
- Spark binary package should be in a location accessible by Mesos.
- Install Apache Spark in the same location as Apache Mesos and configure the property spark.mesos.executor.home to point to the location where it is installed.
48. What do you understand by Lazy Evaluation?
Answer: Spark is intelligent in the way it operates on data. When you ask Spark to operate on a dataset, it notes your instructions but does not execute anything until an action requires it. If map() is called on an RDD, the operation is not performed immediately; Spark does not evaluate transformations until an action is invoked. This helps optimize the overall data processing workflow.
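Lazy evaluation can be sketched in pure Python (the `LazyDataset` class is a hypothetical illustration, not Spark's implementation): transformations are only recorded, and nothing runs until an action such as collect() is called.

```python
# Pure-Python sketch of lazy evaluation: transformations are *recorded*;
# the plan only runs when an action (collect) is invoked.
class LazyDataset:
    def __init__(self, data):
        self.data = data
        self.ops = []                      # recorded, not executed

    def map(self, f):                      # transformation: just note it
        self.ops.append(("map", f))
        return self

    def filter(self, f):                   # transformation: just note it
        self.ops.append(("filter", f))
        return self

    def collect(self):                     # action: now run the plan
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(ds.ops), "transformations recorded, nothing computed yet")
print(ds.collect())                        # [20, 30, 40]
```

Deferring execution like this is what lets Spark inspect the whole chain of transformations and optimize it before any data is touched.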
49. List some use cases where Spark outperforms Hadoop in processing.
Answer: Here are some use cases where Spark outperforms Hadoop in processing:
- Sensor Data Processing: Apache Spark's in-memory computing works well here, as data is retrieved and combined from various sources.
- Real-Time Processing: Spark is preferred over Hadoop for real-time data querying, for example in stock markets, telecommunications, healthcare, and banking.
- Stream Processing: Apache Spark is the best solution for processing logs and detecting frauds in live streams for alerts.
- Big Data Processing: Spark processes medium and large datasets up to 100 times faster than Hadoop.
50. What do you mean by in-memory processing?
Answer: In-memory processing refers to accessing data directly from main memory (RAM) rather than from disk whenever an operation requires it. This methodology significantly reduces data-transfer delays. Spark uses it to query and process large chunks of data.
If you have made it this far, then certainly you are willing to learn more about Apache Spark. Here are some more resources related to Spark that we think will be useful to you.