Data Engineer Interview Questions and Answers (2026) – SQL, Python, Spark + System Design

You don’t pass a data engineer interview by listing tools. You pass by proving you can move data reliably, at scale, without blowing up cost, and without lying to downstream teams.

This article gives you the data engineer interview questions you’ll face in 2026, focused on the core skills: strong SQL, practical Python, distributed processing (Spark), orchestration (Airflow), streaming (Kafka), cloud fundamentals, and system design trade-offs. It also covers the rubric interviewers score you on, plus answer patterns that don’t sound rehearsed.

How data engineer interviews work in 2026

Most data engineer interviews follow a predictable loop. The names of tools change by company, but the evaluated skills stay consistent.

1) Recruiter / hiring manager screen (30–45 min)

They’re not testing depth. They’re testing:

  • Can you explain your last pipeline clearly?
  • Do you understand why choices were made (not just “we used Spark”)?
  • Can you talk about failures without excuses?

Gotcha: If you can’t describe input → transformations → output → SLA → data quality → backfill strategy, you’re not ready.

2) SQL + Python technical screens (45–60 min each)

  • SQL: windows, dedup, incremental loads, CDC/SCD2, debugging wrong counts.
  • Business metrics in SQL: retention, funnels, DAU/WAU/MAU.
  • Python: parsing, batching, idempotency, memory, correctness under retries.

Gotcha: Most candidates lose points on edge cases, not syntax.

3) System design (60–90 min)

You design a pipeline or platform component. Interviewers grade:

  • contracts, SLAs, failure modes
  • incremental strategy
  • schema evolution
  • observability + cost

Gotcha: “We’ll just replay everything” is not a strategy. Backfills need guardrails.

4) One “tools” round (Spark/Airflow/Kafka/dbt/warehouse)

This is usually scenario-based:

  • performance debugging
  • orchestration correctness
  • streaming semantics (late events, exactly-once illusions)
  • warehouse/lakehouse modeling decisions

5) Behavioral / collaboration (30–60 min)

Data engineering is cross-functional pain. They want proof you can:

  • handle ambiguous requirements
  • push back on bad asks
  • keep pipelines stable during change

How to answer like a strong candidate

Use a consistent structure:

  1. clarify assumptions
  2. propose approach
  3. discuss edge cases
  4. estimate complexity
  5. mention production concerns.

The scoring rubric interviewers use

Interviewers rarely grade with one yes/no checkbox. They score multiple dimensions, often subconsciously.

Rubric dimension 1: Correctness (baseline)

You must produce correct results for:

  • typical inputs
  • edge cases
  • “dirty data” (nulls, malformed rows, duplicates)

How to signal correctness fast

  • State assumptions explicitly (“timestamps are epoch seconds”)
  • Add validations (“skip records missing user_id”)
  • Provide a small example and walk through output

Rubric dimension 2: Data engineering thinking (the differentiator)

They look for production instincts:

  • Idempotency: safe to retry and re-run
  • Determinism: consistent output ordering/behavior
  • Observability: logs, metrics, and failure visibility
  • Backfills: ability to recompute safely for a date range

Rubric dimension 3: Performance and scale awareness

You don’t need to “over-optimize,” but you must reason about:

  • time complexity
  • memory usage
  • partitioning/shuffles (Spark)
  • expensive operations (wide joins, full table scans)

Rubric dimension 4: Communication

Clear, structured answers win:

  • talk through steps
  • name trade-offs
  • keep queries readable
  • keep code modular

Rubric dimension 5: Tool fluency (without tool worship)

Strong candidates treat tools as implementations of principles:

  • Airflow = orchestration + retries/backfills
  • Spark = distributed execution + shuffles/partitions
  • Kafka = ordered partitions + consumer groups + replay

The fastest way to answer ANY DE interview question (the “4-step” pattern)

When you’re stuck, do this:

  1. Clarify grain + keys
  2. List edge cases (duplicates, late data, nulls, deletes)
  3. Pick the simplest correct approach
  4. Add scale + reliability upgrades (partitioning, idempotency, observability)

This makes you sound senior even when the question is basic.

SQL Data Engineer Interview Questions

What’s the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?

Answer:

  • INNER JOIN returns only matching rows.
  • LEFT JOIN keeps all left rows and matches right rows when possible (null if missing).
  • FULL OUTER JOIN returns all rows from both sides with nulls where unmatched.

Interview tip: Mention where you’d use each in pipelines (enrichment, reconciliation, completeness checks).

How do you find duplicate rows by business key?

Answer: Group by the business key and filter COUNT(*) > 1.

SELECT user_id, event_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id, event_id
HAVING COUNT(*) > 1;

How do you keep only the latest record per key?

Answer: Use ROW_NUMBER() partitioned by the key and ordered by timestamp descending.

SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_profile t
) x
WHERE rn = 1;

Common follow-up: “What if timestamps tie?”
Add a tie-breaker column like ingested_at or event_id.

ROW_NUMBER vs RANK vs DENSE_RANK?

Answer:

  • ROW_NUMBER() assigns unique sequence numbers (no ties).
  • RANK() gives same rank for ties and leaves gaps.
  • DENSE_RANK() gives same rank for ties without gaps.

Get the latest record per user

What they’re testing: dedup strategy + tie-breaking + performance.

Answer (window function):

SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY updated_at DESC, event_id DESC
    ) AS rn
  FROM events t
) x
WHERE rn = 1;

Gotcha to mention: If timestamps can tie, add a stable tie-breaker (event_id, ingestion_id).

Find users with gaps-and-islands (consecutive days)

What they’re testing: analytic thinking.

Answer (classic trick):

WITH d AS (
  SELECT user_id, activity_date,
         DATEADD(day, -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date), activity_date) AS grp
  FROM activity
)
SELECT user_id, MIN(activity_date) AS start_date, MAX(activity_date) AS end_date, COUNT(*) AS days
FROM d
GROUP BY user_id, grp;

Sessionize events with 30-minute inactivity

Answer (lag + sum):

WITH x AS (
  SELECT *,
         CASE WHEN TIMESTAMPDIFF(minute,
              LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts),
              event_ts
         ) > 30 THEN 1 ELSE 0 END AS new_session
  FROM web_events
),
y AS (
  SELECT *,
         SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
  FROM x
)
SELECT user_id, session_id,
       MIN(event_ts) AS session_start,
       MAX(event_ts) AS session_end,
       COUNT(*) AS events
FROM y
GROUP BY user_id, session_id;

Gotcha: For the “exactly 30 minutes” boundary, clarify whether it starts a new session.

Incremental load: load only new/changed rows

Best answer: Use high-watermark or CDC, depending on source guarantees.

  • If you have a reliable updated_at: watermark by updated_at
  • If updates can arrive late/out-of-order: use CDC or store a hash and re-check a lookback window

Gotcha: Always add a lookback window unless you can prove strict monotonicity.
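
A minimal sketch of the watermark-plus-lookback idea in plain Python (the helpers get_last_watermark, extract_since, and save_watermark are hypothetical, and the one-hour lookback is an assumption to tune per source):

from datetime import timedelta

LOOKBACK = timedelta(hours=1)  # re-read a small window to catch late or out-of-order updates

def incremental_extract(get_last_watermark, extract_since, save_watermark):
    # read the stored high watermark, pull rows updated since (watermark - lookback),
    # then persist the new watermark; merging by key downstream keeps this idempotent
    last_wm = get_last_watermark()
    cutoff = last_wm - LOOKBACK
    rows = extract_since(cutoff)  # e.g. SELECT ... WHERE updated_at >= :cutoff
    new_wm = max((r["updated_at"] for r in rows), default=last_wm)
    save_watermark(new_wm)
    return rows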

SCD Type 2 for dimension updates

Answer (conceptual):

  • Natural key: customer_id
  • Version columns: valid_from, valid_to, is_current
  • On change: expire current row, insert new row

Gotcha: Handle no-op updates (same values) or you’ll create fake history.

Why do counts not match between source and warehouse?

Senior answer checklist:

  • duplicates in source / replays
  • join explosion (many-to-many)
  • timezone truncation differences
  • late data excluded by watermark
  • filters applied in different places
  • distinct vs count(*) mismatch due to grain confusion

Read Next: Tricky SQL Queries for Interview

Python Data Engineer Interview Questions

Flatten nested JSON

What they test: recursion + schema variability.

def flatten(d, prefix=""):
    out = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

Gotcha: Arrays. Decide whether to explode them or keep JSON strings.

Make this ETL idempotent

Best answer: Idempotency is not a vibe. It’s a mechanism:

  • deterministic primary keys
  • upserts/merge by key
  • write to temp + atomic swap
  • exactly-once is usually achieved through dedup + merge, not magic
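
One of those mechanisms in plain Python: write the partition to a temporary file, then atomically swap it into place, so a retried run overwrites instead of appending (the path and column names are illustrative):

import csv
import os
import tempfile

def write_partition_atomic(rows, final_path, fieldnames=("user_id", "ts", "val")):
    # stage the output next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, final_path)  # atomic swap: re-runs replace, never duplicate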

Read a 200GB file – don’t OOM

Correct answer: stream/chunk.

import csv

def iter_rows(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

Gotcha: If you just call pandas.read_csv on the whole file, you fail.

How do you read large files without running out of memory?

Answer: Stream input in chunks or line-by-line, and process incrementally.

def read_lines(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

Retries: how do you avoid double-writing?

Answer:

  • Use write-ahead state (checkpoint table) or merge by key
  • Make writes atomic (staging + rename, transactional table formats)
  • Store a run_id and enforce uniqueness
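
A hedged sketch of the run_id idea: record each (job, partition) pair in a ledger and skip work that already completed; here an in-memory set stands in for a checkpoint table with a uniqueness constraint:

def run_once(ledger, job_id, partition, do_write):
    # ledger: any set-like store of completed (job_id, partition) pairs
    key = (job_id, partition)
    if key in ledger:          # a retry of a finished run becomes a no-op
        return "skipped"
    do_write(partition)        # the actual write, ideally atomic (staging + swap)
    ledger.add(key)            # mark complete only after the write lands
    return "written"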

What tests matter for pipelines?

Answer that scores:

  • unit tests for transforms
  • contract tests for schema + nullability + constraints
  • integration test for one small run end-to-end
  • data quality checks in prod (tests are not enough)
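
As an example, a unit test for the flatten transform shown in the Python section could look like this (pytest assumed as the runner; the my_transforms module path is hypothetical):

# test_flatten.py -- run with: pytest
from my_transforms import flatten  # wherever the flatten() from the Python section lives

def test_flatten_nested_dict():
    doc = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}
    assert flatten(doc) == {"a": 1, "b.c": 2, "b.d.e": 3}

def test_flatten_empty_dict():
    assert flatten({}) == {}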

What’s idempotency in Python ETL terms?

Answer:

Running the job twice should not double-count or duplicate outputs. Use deterministic keys, overwrite-per-partition, or upserts.

List vs generator: what’s the difference and why does it matter?

Answer:

Lists hold everything in memory; generators yield items lazily. Generators are ideal for large ETL inputs and streaming transformations.
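
A tiny illustration: both lines compute the same total, but the list version materializes ten million items in memory first, while the generator sums them one at a time.

total_list = sum([i * i for i in range(10_000_000)])  # builds the whole list, then sums
total_gen = sum(i * i for i in range(10_000_000))     # lazy: constant memory
assert total_list == total_gen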

Spark / distributed processing questions for Data Engineer

What causes a shuffle? How do you reduce it?

Answer: joins, groupBy, distinct, repartition. Reduce by:

  • broadcast small tables
  • pre-aggregate
  • partition by join keys
  • avoid wide transformations

Gotcha: Repartitioning blindly can increase shuffle cost.
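
For the first point, a broadcast join in PySpark might look like this (a sketch with toy data; assumes a local Spark session):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast_join_sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "DE", 80.0)], ["order_id", "country_code", "amount"])
dim_country = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")], ["country_code", "country_name"])

# Broadcasting the small dimension ships it to every executor,
# so the large fact table is joined without a shuffle on the join key.
enriched = orders.join(F.broadcast(dim_country), on="country_code", how="left")
enriched.show()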

Explain data skew and how to fix it

Strong fixes:

  • salting hot keys
  • skew hints (when available)
  • split hot keys into separate path
  • two-stage aggregation (partial then final)

Gotcha: Broadcast joins only help when one side is genuinely small; they don’t remove skew from downstream aggregations on the hot key.
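
A minimal salting plus two-stage aggregation sketch in PySpark (the column names and salt factor are illustrative):

from pyspark.sql import functions as F

SALT_BUCKETS = 16  # spread each hot key across 16 pseudo-keys

def salted_count(df, key_col="customer_id"):
    # stage 1: aggregate by (key, salt) so no single task owns a hot key
    salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    partial = salted.groupBy(key_col, "salt").agg(F.count("*").alias("partial_cnt"))
    # stage 2: combine the partial counts back to the real key
    return partial.groupBy(key_col).agg(F.sum("partial_cnt").alias("cnt"))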

Caching: when is it bad?

Answer: caching helps only when:

  • reused multiple times
  • dataset fits in memory
  • you control lineage blowups

Gotcha: Caching a huge dataset triggers eviction thrash and kills performance.

Parquet/Delta small files problem

Answer: Too many small files increases planning overhead and read amplification. Fix:

  • compaction jobs
  • optimize/auto-compaction features (format-dependent)
  • correct partition sizing

Repartition vs coalesce: what’s the difference?

Answer:

  • repartition(n) reshuffles data to create evenly balanced partitions (more expensive).
  • coalesce(n) reduces partitions without a full shuffle when possible (cheaper, but may create imbalance).
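
In PySpark terms (a small runnable sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning_sketch").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")

balanced = df.repartition(200, "customer_id")  # full shuffle, evenly hashed by key
fewer = df.coalesce(10)                        # narrow merge of partitions: cheaper, possibly uneven
print(balanced.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())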

What is data skew and how do you spot it?

Answer: Skew happens when certain keys dominate (e.g., one customer has 80% of events), causing stragglers and long stages. Spot it via:

  • skewed key distribution metrics
  • long tails in stage duration
  • huge partition sizes for certain keys

Airflow / orchestration questions

How do you design a backfill safely?

Answer:

  • parameterize by partition/date
  • isolate side effects (write per partition)
  • throttle concurrency
  • enforce idempotency
  • run data quality checks per partition

Gotcha: Backfills can DDoS your warehouse if you don’t cap concurrency (see the sketch below).
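
A sketch of those guardrails in plain Python: iterate over date partitions with a small worker pool so concurrency stays capped (run_partition is a hypothetical idempotent job that processes one date):

from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def backfill(run_partition, start: date, end: date, max_workers: int = 4):
    # run_partition(d) must be idempotent: safe to retry for any single date
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:  # throttles concurrency
        for d, result in zip(days, pool.map(run_partition, days)):
            print(f"{d}: {result}")  # in production: log and run data quality checks per partition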

What is a DAG in Airflow?

Answer: A DAG is a directed acyclic graph representing task dependencies and execution order. In DE pipelines, DAGs define ingestion, transformation, validation, and publishing steps.
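
A minimal DAG sketch in Airflow 2.x style (task bodies are placeholders, and exact parameter names can vary slightly between Airflow versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):
    ...  # pull raw data for the run's data interval

def transform(**context):
    ...  # clean, deduplicate, and publish to the warehouse

with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # one run per data interval
    catchup=False,      # enable only with safe, idempotent backfills
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    ingest_task >> transform_task  # ingestion must succeed before transformation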

What makes a DAG maintainable?

Answer:

  • small tasks with clear contracts
  • retries + timeouts with intent
  • alerting that’s actionable
  • sensors avoided or tuned (don’t block workers forever)

What is idempotency and why does it matter in orchestration?

Answer:

Idempotent tasks produce the same output even if retried or rerun. It’s essential for safe retries, backfills, and partial failures.

How do you monitor an Airflow pipeline in production?

Answer:

  • SLA alerts (late runs)
  • task failure alerts + error budgets
  • metrics: row counts, null rates, freshness
  • logs with correlation IDs (dag_id/run_id)
  • dashboards: success rate, runtime, lag

Streaming (Kafka) Data Engineer Interview Questions

Event time vs processing time

Answer:

  • event time = when it happened
  • processing time = when you saw it
  • late events require watermarks + retractions or compensating logic

Gotcha: “We’ll just use processing time” breaks analytics correctness.

What is a Kafka topic and partition?

Answer:

A topic is a named stream of records. Partitions split the topic into ordered logs that enable parallelism; ordering is guaranteed within a partition.

What is a consumer group?

Answer:

A consumer group is multiple consumers sharing work. Kafka assigns partitions to consumers within the group so they can process in parallel.

How do you handle out-of-order events?

Answer:

  • process by event time, not ingestion time
  • allow a lateness window (watermarking concept)
  • use upserts in your serving layer
  • design “corrections” (retractions/updates) downstream
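
A toy lateness-window sketch in Python: bucket by event time and only treat a minute bucket as final once the watermark (max event time seen minus allowed lateness) has passed it. Real engines such as Spark Structured Streaming or Flink manage this state for you; this only shows the idea.

from collections import defaultdict
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)

def bucket_with_watermark(events):
    # events: iterable of (event_time: datetime, value: int), possibly out of order
    buckets = defaultdict(int)     # minute bucket (event time) -> running sum
    max_event_time = datetime.min
    for event_time, value in events:
        max_event_time = max(max_event_time, event_time)
        buckets[event_time.replace(second=0, microsecond=0)] += value
    watermark = max_event_time - ALLOWED_LATENESS
    # buckets fully before the watermark are safe to publish;
    # anything arriving later needs a correction/retraction downstream
    closed = {b: v for b, v in buckets.items() if b + timedelta(minutes=1) <= watermark}
    return closed, watermark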

How do you avoid duplicates in streaming pipelines?

Answer:

Use deterministic event IDs, idempotent writes, and de-dup windows keyed by (entity_id, event_id). If duplicates can arrive days later, persist a dedup key store with TTL.
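
A minimal dedup-key store with TTL, sketched in Python; the in-memory dict stands in for something durable such as Redis or a keyed state store:

import time

class DedupStore:
    # remembers (entity_id, event_id) pairs for ttl_seconds and drops repeats inside that window
    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.seen = {}  # dedup key -> first-seen unix timestamp

    def is_duplicate(self, entity_id, event_id, now=None):
        now = now if now is not None else time.time()
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}  # lazy TTL eviction
        key = (entity_id, event_id)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False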

Exactly-once delivery – real or marketing?

Answer:

Exactly-once is usually end-to-end semantics, not a single knob. You still need:

  • deterministic keys
  • deduplication
  • transactional sinks or merge

Data modeling + warehouse/lakehouse questions (2026)

What’s the grain of your fact table?

Answer: The grain is the unique event/unit your fact represents. Define it before writing SQL.

Gotcha: Grain confusion causes “counts don’t match” and join explosions.

Star schema vs wide table

Answer:

  • star schema: stable, scalable, reusable
  • wide table: faster for single use-case, risk of duplication + inconsistency

Star schema vs snowflake schema

Answer:

Star schemas denormalize dimensions for simpler queries; snowflake schemas normalize dimensions to reduce redundancy at the cost of extra joins.

Fact vs dimension tables

Answer:

  • Facts store measurable events (orders, clicks, payments).
  • Dimensions store descriptive attributes (customer, product, campaign).

What is an OLAP vs OLTP workload?

Answer:

OLTP focuses on fast transactions and updates; OLAP focuses on analytical queries across large history with scans, aggregates, and joins.

What is a lakehouse (practically)?

Answer:

A lakehouse stores data in object storage like a lake but adds warehouse-like features: table management, schema enforcement, and reliable reads/writes for analytics.

How do you design tables for performance?

Answer:

  • choose columnar formats (Parquet/ORC) when possible
  • partition by common filters (event_date, tenant_id)
  • avoid too many tiny files (compact)
  • cluster/sort by high-cardinality filters if supported
  • keep grain clear (one row per entity per time)

Data quality: what checks do you implement?

Answer:

  • schema checks (expected columns/types)
  • completeness checks (row counts, missing partitions)
  • validity checks (ranges, allowed values)
  • uniqueness checks (keys)
  • freshness checks (data not late)
  • reconciliation checks (source vs target totals)
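
A few of these checks as plain Python assertions over one partition of rows (a sketch; in practice these live in dbt tests, Great Expectations, or your own check framework, and the updated_at ISO-string column is an assumption):

from datetime import datetime, timedelta

def run_quality_checks(rows, key="order_id", max_staleness_hours=24):
    # rows: list of dicts for one partition; raises AssertionError on the first failure
    assert rows, "completeness: partition is empty"

    keys = [r[key] for r in rows]
    assert all(k is not None for k in keys), "validity: null business keys found"
    assert len(keys) == len(set(keys)), "uniqueness: duplicate business keys found"

    newest = max(datetime.fromisoformat(r["updated_at"]) for r in rows)
    assert datetime.utcnow() - newest < timedelta(hours=max_staleness_hours), "freshness: newest row too old"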

Cloud + cost/security questions (2026)

How do you reduce warehouse costs?

Strong answer list:

  • partition pruning / clustering
  • reduce scans with incremental models
  • pre-aggregate for common queries
  • stop runaway queries (governance)
  • right-size compute, schedule heavy jobs off-peak

How do you handle PII?

Answer:

  • classify PII fields
  • least privilege access
  • masking/tokenization where needed
  • audit access
  • avoid copying PII into “convenience” tables

Gotcha: “We’ll just lock the bucket” is not a plan.

What are “small files” and why do they hurt?

Answer:

Too many small files increase metadata overhead and slow down reads in object storage-based systems. Compaction improves scan efficiency and reduces planning time.

What is least privilege?

Answer:

Least privilege means users and services only get the minimum permissions needed. It limits blast radius if credentials leak.

How do you design for failures in cloud?

Answer:

  • retries with backoff for transient failures
  • idempotent outputs for safe re-runs
  • checkpoints for streaming jobs
  • dead-letter queues for bad events
  • alerts on error rates and lag
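
The first two bullets together, as a small Python sketch: exponential backoff with jitter around an operation that must itself be idempotent (attempt counts and delays are assumptions to tune):

import random
import time

def with_retries(op, max_attempts=5, base_delay=1.0):
    # op must be idempotent, otherwise retries can double-apply side effects
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:  # narrow this to transient errors (timeouts, throttling) in real code
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))  # 1s, 2s, 4s ... plus jitter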

System design: the 6-layer framework (interview-proof)

Use this framework to answer almost any data engineering system design prompt (batch or streaming). It keeps you structured and interviewers love structure.

Layer 1: Ingestion

  • batch: files, APIs, database extracts, CDC logs
  • streaming: Kafka topics, event buses
    Key questions: volume, latency, schema, duplication rate

Layer 2: Raw storage (immutable)

  • object storage or raw tables
  • store exact source payloads (for replay)
    Key questions: retention, partitioning, encryption

Layer 3: Processing / transformation

  • Spark jobs, SQL transforms, dbt models
  • validate, clean, enrich, deduplicate
    Key questions: incremental vs full refresh, compute cost

Layer 4: Serving layer

  • warehouse/lakehouse tables
  • marts for BI, feature tables for ML
    Key questions: query patterns, SLA, concurrency

Layer 5: Orchestration

  • scheduling, dependencies, retries
  • backfills, versioning, CI/CD for data changes
    Key questions: idempotency, data interval logic

Layer 6: Reliability + governance

  • data quality checks, lineage, access control, monitoring
    Key questions: ownership, alerts, SLOs, audits

The “senior signals” to add (even in a junior-friendly role)

  • define SLAs: freshness, completeness, correctness
  • define failure modes and what happens next
  • plan backfills and reprocessing
  • mention schema evolution strategy

Coding programs: ALL approaches + outputs

Program 1: Latest record per key (dedup)

Input

rows = [
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "A"},
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "B"},
  {"user_id": 2, "ts": "2026-01-01T09:00:00", "val": "X"},
]

Assume ties broken by last occurrence in input.

Approach A: sort + overwrite dict (simple, O(n log n))

# ISO-8601 timestamp strings sort lexicographically, so string comparison on "ts" is safe here
def latest_per_user_sort(rows):
    rows2 = sorted(rows, key=lambda r: (r["user_id"], r["ts"]))
    out = {}
    for r in rows2:
        out[r["user_id"]] = r
    return list(out.values())

Approach B: single pass with comparison (O(n))

def latest_per_user_one_pass(rows):
    out = {}
    for r in rows:
        k = r["user_id"]
        if k not in out or r["ts"] >= out[k]["ts"]:
            out[k] = r
    return list(out.values())

Expected output

[
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "B"},
  {"user_id": 2, "ts": "2026-01-01T09:00:00", "val": "X"},
]

Gotcha to say out loud: define tie-breaking and timestamp parsing rules.

Program 2: Flatten nested JSON

Input

doc = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}

Approach A: recursion

def flatten_dict(d, prefix=""):
    out = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten_dict(v, key))
        else:
            out[key] = v
    return out

Expected output

{"a": 1, "b.c": 2, "b.d.e": 3}

Gotcha: arrays and mixed types; clarify whether to explode or stringify them.

Program 3: Top-K heavy hitters

Input

items = ["a","b","a","c","a","b"]
k = 2

Approach A: Counter (fast, O(n))

from collections import Counter

def topk_counter(items, k):
    return Counter(items).most_common(k)

Approach B: dictionary + heap (good when unique is huge)

import heapq

def topk_heap(items, k):
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1
    heap = [(-c, x) for x, c in counts.items()]
    heapq.heapify(heap)
    return [(x, -c) for c, x in heapq.nsmallest(k, heap)]

Expected output

[("a", 3), ("b", 2)]

Gotcha: if streaming unbounded, mention approximate algorithms (Count-Min Sketch) to sound senior.

Program 4: SCD Type 2 + CDC MERGE

Scenario: You receive daily CDC updates: inserts/updates/deletes.

Approach (SQL MERGE conceptual)

  • Match on business key
  • If changed: expire current row, insert new version
  • If delete: mark as deleted or expire row (policy decision)

Example pattern (pseudo-SQL; adapt per warehouse):

-- 1) Expire changed current rows
UPDATE dim_customer d
SET valid_to = c.change_ts, is_current = false
FROM cdc_customer c
WHERE d.customer_id = c.customer_id
  AND d.is_current = true
  AND c.op IN ('U')
  AND (d.name <> c.name OR d.email <> c.email);

-- 2) Insert new versions
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
SELECT customer_id, name, email, change_ts, NULL, true
FROM cdc_customer
WHERE op IN ('I','U');

Gotcha: handle “update with same values” or you’ll bloat history.

Comparison Table

Method | Difficulty | Time Complexity | Best For
For/While loop (hash map) | Easy | O(n) | Latest-per-key dedup, fast ETL transforms
Recursion (divide & conquer) | Medium | O(n) | Showing recursion skills, nested structures
Sort + overwrite | Easy | O(n log n) | Simple explanation, moderate data sizes
Heap (nlargest) | Medium | O(n + m log k) | Top-K with many uniques and small k
Built-in libs (Counter, pandas normalize) | Easy | O(n) to O(n log k) | Clean readable solutions under time pressure
SQL MERGE / update+insert | Medium | Engine-dependent | SCD2 correctness, warehouse-native pipelines
Spark DataFrame merge | Medium–Hard | Distributed O(n) | Large-scale CDC/SCD processing

Figure: Example batch and streaming data pipeline architecture for data engineer interviews.

Junior vs Senior: what different answers sound like

Area | Junior passes when they… | Senior passes when they…
SQL | write a correct query | define grain, edge cases, and performance
Python | solve the problem | make it idempotent + testable + memory-safe
Spark | know the transforms | predict shuffles, skew, file sizing, AQE behavior
Airflow | can schedule jobs | design safe backfills + failure containment
Kafka | know the basics | reason about event time, late data, replays
System design | draw boxes | define contracts, SLAs, backfills, observability, cost

Basic Questions

How do I explain a pipeline like a senior?

Input → contracts → transformations → output → SLA → quality checks → backfill strategy → cost levers.

What’s an idempotent pipeline?

Rerunning it produces the same final result without duplicates or double side effects.

What is CDC?

Change Data Capture: inserts/updates/deletes from a source system that must be applied correctly downstream.

What is SCD Type 2?

A modeling pattern that preserves historical versions of dimension attributes across time.

What causes Spark shuffles?

Wide operations like join, groupBy, distinct, orderBy, and repartition.

How do you handle data skew?

Salt hot keys, split hot partitions, use two-stage aggregation, or broadcast strategically.

What’s the difference between event time and processing time?

Event time is when an event occurred; processing time is when you processed it. Late events break naive processing-time logic.

Is exactly-once real?

It’s end-to-end semantics achieved with dedup keys + transactional sinks + careful replay handling.

What’s the grain of a table?

The single unit a row represents. Define it before modeling or querying.

FAQ

How do I prepare for a data engineer interview in 2026?

Master SQL windows + incremental loads, Python ETL patterns, and one distributed engine (Spark). Then drill system design with SLAs, backfills, and observability.

What should I study first for a data engineer interview?

Study SQL windows + dedup + incremental loading first. Then learn Python ETL patterns (idempotency, retries, memory). Then drill system design with SLAs and backfills.

How many questions should I practice?

Practice 30–50 high-quality problems across SQL, Python, and system design. Repeating a smaller set until you can explain edge cases beats collecting 300 links.

What’s the hardest part of DE interviews?

Edge cases and operational thinking: late data, retries, backfills, schema evolution, and cost.

How do I handle “unknown” questions in interviews?

Say what you know, ask clarifying questions, propose a safe default, and mention how you’d validate in production.

Conclusion + 2026 learning roadmap (practical, not theoretical)

To master Data Engineer Interview Questions and Answers in 2026, prepare for both coding correctness and production thinking. Interviews reward candidates who can write clean SQL/Python and design systems that survive retries, backfills, schema changes, and real-world failures.

  • Week 1–2: SQL windows, dedup, sessionization, CDC/SCD2
  • Week 3–4: Python ETL patterns, idempotency, tests, memory
  • Week 5–6: Spark performance basics: shuffle, partitioning, skew
  • Week 7: Airflow backfills + failure containment
  • Week 8: Streaming semantics (event time, late arrivals, replays)
  • Ongoing: system design drills using the 6-layer framework
