Are you looking for a job that uses your Apache Spark skills? If so, this article covers Spark interview questions and answers that can help you prepare and land the role. Apache Spark is in huge demand in the IT industry and rising rapidly, and companies like Amazon and Shopify hire Spark professionals regularly. According to payscale.com, the average salary of an Apache Spark professional is $110,000 for roles such as Data Scientist, Data Engineer, and Software Engineer. Spark can run on Kubernetes, on Hadoop, and in the cloud.
Apache Spark is an open-source data processing framework that performs analytical operations on very large datasets efficiently. Every year, millions more people connect to the internet, creating a huge need to process data and return the right answer to each user's query. Big social media companies like Facebook and Twitter use big data to manage and compare their data.
Spark Interview Questions
Frequently asked basic Spark questions
What is Spark used for?
Spark is used for large-scale data processing; it supports batch processing, streaming data, SQL queries, and machine learning.
Is Apache Spark a language?
No, Apache Spark is not a language. Spark is written in Scala, and it provides APIs for Java, Python, and Scala.
Why is my Spark job slow?
Running out of memory is the number one cause of slow jobs. It is typically the result of incorrect API usage, inefficient queries, or wrong configuration, and it also depends on the workload.
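When a job hits out-of-memory errors, a common first step is to raise the memory allocations at submit time. A minimal sketch (the application JAR name and the sizes below are placeholders; tune them for your cluster):

```shell
# Increase executor and driver memory for a memory-starved Spark job.
# my-app.jar and all sizes/counts here are illustrative placeholders.
spark-submit \
  --executor-memory 4g \
  --driver-memory 2g \
  --num-executors 10 \
  --executor-cores 4 \
  my-app.jar
```

If memory pressure persists after raising these limits, the queries and partitioning usually need to be revisited rather than the configuration.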
How do I check my Spark version?
Run any of the following commands to get the Spark version:
- spark-submit --version
- spark-shell --version
- spark-sql --version
What is the latest Spark version?
At the time of writing, the latest version of Spark is 3.1.1.
How do you start a spark application?
Run one of the commands below:
For Scala: $ SPARK_HOME/bin/spark-shell
For Python: $ SPARK_HOME/bin/pyspark
What are the various functions of Spark Core?
Answer: Spark Core provides the following functions:
- Overseeing essential I/O functionality.
- Distributed task dispatching.
- Programming and monitoring the Spark cluster.
- Fault recovery.
- Overcoming MapReduce's disk I/O bottleneck by using in-memory computation.
- Job scheduling.
Name some sources from where Spark streaming component can process real-time data.
Answer: Apache Kafka, Amazon Kinesis, Apache Flume, Twitter
Name some companies that are already using Spark Streaming.
Answer: Uber, Pinterest, Netflix, Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks.
What is the bottom layer of abstraction in the Spark Streaming API?
Answer: The DStream (Discretized Stream), which is internally represented as a continuous sequence of RDDs.
What is RDD?
Answer: RDD stands for Resilient Distributed Dataset. The data in an RDD is divided into logical partitions, which are computed on different nodes of the cluster. RDDs are used when you have a massive amount of data that cannot be stored on a single system: the data is distributed across the nodes, each subset of the data is called a partition, and each partition is later processed by a task.
RDDs have the following properties:
- Immutability and partitioning
- Coarse-grained operations
- Fault Tolerance
- Lazy evaluations
The data can come from various sources:
- Text File
- CSV File
- JSON File
- Database (via JDBC driver)
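The partition-and-process model above can be sketched in plain Python. This is a simplified, single-machine illustration with hypothetical helper names, not the real Spark API; in Spark, the partitions would be distributed across cluster nodes and processed by separate tasks:

```python
# Plain-Python sketch of how an RDD splits data into partitions that are
# then processed independently. partition() and process_partitions() are
# illustrative helpers, not Spark functions.

def partition(data, num_partitions):
    """Split a list into roughly equal chunks, one per partition."""
    size = len(data)
    return [data[i * size // num_partitions:(i + 1) * size // num_partitions]
            for i in range(num_partitions)]

def process_partitions(partitions, task):
    """Run the same task over each partition, as Spark executors would."""
    return [task(p) for p in partitions]

lines = ["a b", "c", "d e f", "g h"]
parts = partition(lines, 2)
# Count words within each partition independently.
word_counts = process_partitions(parts, lambda p: sum(len(l.split()) for l in p))
print(word_counts)  # [3, 5]
```

Because each partition is self-contained, losing one node only requires recomputing that node's partitions, which is the basis of RDD fault tolerance.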
How can we split a single HDFS block into partitions in an RDD?
Answer: data = context.textFile("/user/interviewquestions/file-name") — by default, one partition is created per block.
data = context.textFile("/user/interviewquestions/file-name", 30) — this creates 30 partitions for the file, i.e. two partitions per block here.
What are actions and transformations?
- Transformations create a new Resilient Distributed Dataset from an existing one. They are lazy: they only execute once an action is called. Common transformations include:
- mapPartitions(func, preservesPartitioning=False)
- sample(withReplacement, fraction, seed)
- union(otherRDD)
- intersection(otherRDD)
- join(otherDataset, [numTasks])
- sortByKey(ascending=True, numPartitions=None, keyfunc=<function RDD.<lambda>>)
- aggregateByKey(zeroValue, seqOp, combOp, [numTasks])
- reduceByKey(func, [numTasks])
- Actions trigger execution of the queued transformations and return a value to the driver program. Common actions include:
- count()
- collect()
- top()
- reduce()
- fold()
- foreach()
- aggregate()
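The transformation/action split can be illustrated in plain Python. This is a minimal single-machine sketch with a hypothetical `LazyDataset` class, not the real Spark API; it only mimics how transformations are recorded but not executed until an action runs:

```python
# Minimal sketch of Spark's lazy-transformation model.
# LazyDataset is a hypothetical class for illustration only.

class LazyDataset:
    """Records transformations but runs nothing until an action is called."""
    def __init__(self, data):
        self._data = data
        self._ops = []              # queued (lazy) transformations

    def map(self, func):            # transformation: just queue it
        self._ops.append(("map", func))
        return self

    def filter(self, pred):         # transformation: just queue it
        self._ops.append(("filter", pred))
        return self

    def collect(self):              # action: the pipeline actually runs now
        result = self._data
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

ds = LazyDataset([1, 2, 3, 4, 5])
ds.map(lambda x: x * 2).filter(lambda x: x > 4)  # nothing computed yet
print(ds.collect())  # [6, 8, 10]
```

Deferring work this way lets Spark inspect the whole chain of transformations and optimize it (for example, pipelining map and filter into one pass) before any data is touched.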
What is the role of cache() and persist()?
Spark has five storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and DISK_ONLY.
With cache(), you always use the default level, MEMORY_ONLY. With persist(), you can specify which storage level you want.
Other resources that may help you clear the job interview