Spark Interview Questions and Answers for Experienced

    Spark Interview Questions

    What are the various functions of Spark Core?

    Answer: The main functions of Spark Core are:

    1. It oversees essential I/O functionalities.
    2. Distributed task dispatching.
    3. Programming and monitoring of the Spark cluster.
    4. Fault recovery.
    5. It overcomes the main limitation of MapReduce by using in-memory computation.
    6. Job scheduling.

    Name some sources from which the Spark Streaming component can process real-time data.

    Answer: Apache Kafka, Amazon Kinesis, Apache Flume, Twitter
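
    As an illustration, reading a Kafka topic can look like the minimal sketch below. It uses Structured Streaming's Kafka source rather than the DStream API covered next; the broker address, topic name, and the availability of the spark-sql-kafka connector package are assumptions made for the example.

        # Minimal sketch: consume a Kafka topic with Structured Streaming.
        # Broker "localhost:9092" and topic "events" are placeholder assumptions.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("kafka-stream-example").getOrCreate()

        kafka_df = (spark.readStream
                    .format("kafka")                                      # needs the spark-sql-kafka package
                    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
                    .option("subscribe", "events")                        # assumed topic
                    .load())

        # Kafka records arrive as binary key/value columns; cast them to strings.
        messages = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

        # Print each micro-batch to the console until the query is stopped.
        query = messages.writeStream.format("console").start()
        query.awaitTermination()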

    Name some companies that are already using Spark Streaming.

    Answer: Uber, Pinterest, Netflix, Oracle, Hortonworks, Cisco, Verizon, Visa, Microsoft, Databricks.

    What is the bottom layer of abstraction in the Spark Streaming API?

    Answer: DStream (Discretized Stream).
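
    A minimal DStream sketch, assuming text arrives on a local TCP socket (for instance fed by nc -lk 9999); the host, port, and batch interval are illustrative choices:

        # Word counts over a socket text stream using the DStream API.
        from pyspark import SparkContext
        from pyspark.streaming import StreamingContext

        sc = SparkContext(appName="dstream-example")
        ssc = StreamingContext(sc, 5)                      # 5-second micro-batches

        lines = ssc.socketTextStream("localhost", 9999)    # DStream of text lines
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(lambda a, b: a + b))

        counts.pprint()        # print a sample of each batch's results

        ssc.start()
        ssc.awaitTermination()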

    What is RDD?

    Answer: RDD stands for Resilient Distributed Dataset. The data in an RDD is divided into logical partitions, which are computed on different nodes of the cluster. RDDs are used when you have a massive amount of data that cannot be stored on a single system: the data is distributed across the nodes, each subset of the data is called a partition, and each partition is later processed by a task.

    RDDs have the following properties –

    1. Immutability and partitioning
    2. Coarse-grained operations
    3. Fault Tolerance
    4. Lazy evaluation
    5. Persistence

    The data can come from various sources:

    1. Text File
    2. CSV File
    3. JSON File
    4. Database (via JDBC driver)
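
    A minimal sketch of building RDDs from such sources; the file paths are placeholders, and the JSON example goes through the DataFrame reader before dropping down to an RDD:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("rdd-sources").getOrCreate()
        sc = spark.sparkContext

        # From an in-memory collection, split into 4 partitions.
        nums = sc.parallelize(range(100), numSlices=4)

        # From a plain text or CSV file (one element per line).
        lines = sc.textFile("/user/interviewquestions/file-name")   # placeholder path

        # From a JSON file via the DataFrame reader, then down to an RDD of Rows.
        json_rdd = spark.read.json("/data/events.json").rdd         # placeholder path

        print(nums.getNumPartitions())   # -> 4 logical partitions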

    How can we split a single HDFS block into multiple RDD partitions?

    Answer: data = context.textFile("/user/interviewquestions/file-name") creates, by default, one partition per HDFS block.

    data = context.textFile("/user/interviewquestions/file-name", 30) asks Spark to create at least 30 partitions for the file; for example, a file spanning 15 blocks would end up with 2 partitions per block.
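
    The same idea as a small runnable sketch (the path is the placeholder used above; the actual partition counts depend on the file's block layout):

        from pyspark import SparkContext

        context = SparkContext(appName="partitions-example")

        # Default: one partition per HDFS block.
        data = context.textFile("/user/interviewquestions/file-name")

        # Ask for at least 30 partitions regardless of block count.
        data30 = context.textFile("/user/interviewquestions/file-name", 30)

        print(data.getNumPartitions(), data30.getNumPartitions())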

    What are actions and transformations?

    1. Transformations create a new Resilient Distributed Dataset from an existing one. They are lazy: they are only executed once an action is called.
    2. Actions trigger the actual computation; they return a value to the driver program or write data to external storage. (A combined sketch follows the two lists below.)

    Transformations

    1. map(func)
    2. flatMap(func)
    3. filter(func)
    4. mapPartitions(func, preservesPartitioning=False)
    5. mapPartitionsWithIndex(func)
    6. sample(withReplacement, fraction, seed)
    7. union(otherRDD)
    8. intersection(otherRDD)
    9. distinct([numTasks])
    10. join(otherDataset, [numTasks])
    11. sortByKey(ascending=True, numPartitions=None, keyfunc=<lambda>)
    12. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    13. reduceByKey(func, [numTasks])
    14. groupByKey([numTasks])

    Actions

    1. count()
    2. collect()
    3. take(n)
    4. top()
    5. countByValue()
    6. reduce()
    7. fold()
    8. foreach()
    9. aggregate()
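
    A small sketch tying the two lists together, as referenced above: the transformations only describe the computation, and nothing runs until an action is invoked. The sample data is a placeholder.

        from pyspark import SparkContext

        sc = SparkContext(appName="transformations-actions")

        rdd = sc.parallelize(["spark", "hadoop", "spark", "kafka", "hive"])

        # Transformations: lazily build new RDDs describing the computation.
        pairs = rdd.map(lambda word: (word, 1))
        counts = pairs.reduceByKey(lambda a, b: a + b)
        frequent = counts.filter(lambda kv: kv[1] > 1)

        # Actions: trigger execution and return results to the driver.
        print(frequent.collect())   # [('spark', 2)]
        print(rdd.count())          # 5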

    What is the role of cache() and persist()?

    Answer: Spark offers several storage levels; the most commonly used ones are:

    1. MEMORY_ONLY
    2. MEMORY_ONLY_SER
    3. MEMORY_AND_DISK
    4. MEMORY_AND_DISK_SER
    5. DISK_ONLY

    With cache(), only the MEMORY_ONLY level is used. With persist(), you can specify whichever storage level you want.
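
    A minimal sketch of the difference, using placeholder data; the StorageLevel constants come from pyspark:

        from pyspark import SparkContext, StorageLevel

        sc = SparkContext(appName="cache-persist")

        rdd = sc.parallelize(range(1_000_000))

        cached = rdd.map(lambda x: x * 2).cache()                                 # stored as MEMORY_ONLY
        spilled = rdd.map(lambda x: x * 3).persist(StorageLevel.MEMORY_AND_DISK)  # explicit storage level

        # Actions materialize (and store) the RDDs.
        print(cached.count(), spilled.count())

        spilled.unpersist()   # release the stored blocks when no longer needed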
