Spark Interview Questions
What are the various functions of Spark Core?
Answer: The various functions of Spark Core are:
- It oversees essential I/O functionalities.
- Distributed task dispatching.
- Programming and monitoring of the Spark cluster.
- Fault recovery.
- It overcomes the disk-I/O bottleneck of MapReduce by using in-memory computation.
- Job scheduling.
Name some sources from which the Spark Streaming component can process real-time data.
Answer: Apache Kafka, Amazon Kinesis, Apache Flume, Twitter
Name some companies that are already using Spark Streaming.
Answer: Uber, Pinterest, Netflix, Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks.
What is the bottom layer of abstraction in the Spark Streaming API?
Answer: DStream (Discretized Stream), which is internally represented as a continuous sequence of RDDs.
What is RDD?
Answer: RDD stands for Resilient Distributed Dataset. The data in an RDD is divided into logical partitions, which are computed on different nodes of the cluster. RDDs are used when you have a massive amount of data that cannot be stored on a single system: the data is distributed across the nodes, each subset of the data is called a partition, and each partition is later processed by a task.
RDDs have the following properties –
- Immutability and partitioning
- Coarse-grained operations
- Fault Tolerance
- Lazy evaluations
The data can come from various sources:
- Text File
- CSV File
- JSON File
- Database (via JDBC driver)
How can we split a single HDFS block into multiple RDD partitions?
Answer: data = context.textFile("/user/interviewquestions/file-name") — by default, one partition is created per HDFS block.
data = context.textFile("/user/interviewquestions/file-name", 30) — this asks for a minimum of 30 partitions for the file; for example, if the file spans 15 blocks, each block yields 2 partitions.
What are actions and transformations?
Answer: Transformations create a new Resilient Distributed Dataset from an existing one. They are lazy: they execute only once an action is called. Examples of transformations:
- mapPartitions(func, preservesPartitioning=False)
- sample(withReplacement, fraction, seed)
- union(otherRDD)
- intersection(otherRDD)
- join(otherDataset, [numTasks])
- sortByKey(ascending=True, numPartitions=None, keyfunc=<function <lambda>>)
- aggregateByKey(zeroValue, seqOp, combOp, [numTasks])
- reduceByKey(func, [numTasks])
Actions trigger the execution of the transformation lineage and return a result to the driver program (or write it to storage). Examples of actions:
- count()
- collect()
- top()
- reduce()
- fold()
- foreach()
- aggregate()
What is the role of cache () and persist ()?
Answer: Spark has five main storage levels: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, and DISK_ONLY (each also has a replicated _2 variant).
With cache(), you always get MEMORY_ONLY. With persist(), you can specify whichever storage level you want.