Data Engineer Interview Questions and Answers (2026) – SQL, Python, Spark + System Design

You don’t pass a data engineer interview by listing tools. You pass by proving you can move data reliably, at scale, without blowing up cost, and without lying to downstream teams.

This article gives you the data engineer interview questions you’ll face in 2026, focused on the core skills: strong SQL, practical Python, distributed processing (Spark), orchestration (Airflow), streaming (Kafka), cloud fundamentals, and system design trade-offs. It also covers the rubric interviewers score you on, plus answer patterns that don’t sound rehearsed.

How data engineer interviews work in 2026

Most data engineer interviews follow a predictable loop. The names of tools change by company, but the evaluated skills stay consistent.

1) Recruiter / hiring manager screen (30–45 min)

They’re not testing depth. They’re testing:

  • Can you explain your last pipeline clearly?
  • Do you understand why choices were made (not just “we used Spark”)?
  • Can you talk about failures without excuses?

Gotcha: If you can’t describe input → transformations → output → SLA → data quality → backfill strategy, you’re not ready.

2) SQL + Python technical screens (45–60 min each)

  • SQL: windows, dedup, incremental loads, CDC/SCD2, debugging wrong counts.
  • Business metrics in SQL: retention, funnels, DAU/WAU/MAU.
  • Python: parsing, batching, idempotency, memory, correctness under retries.

Gotcha: Most candidates lose points on edge cases, not syntax.

3) System design (60–90 min)

You design a pipeline or platform component. Interviewers grade:

  • contracts, SLAs, failure modes
  • incremental strategy
  • schema evolution
  • observability + cost

Gotcha: “We’ll just replay everything” is not a strategy. Backfills need guardrails.

4) One “tools” round (Spark/Airflow/Kafka/dbt/warehouse)

This is usually scenario-based:

  • performance debugging
  • orchestration correctness
  • streaming semantics (late events, exactly-once illusions)
  • warehouse/lakehouse modeling decisions

5) Behavioral / collaboration (30–60 min)

Data engineering is cross-functional pain. They want proof you can:

  • handle ambiguous requirements
  • push back on bad asks
  • keep pipelines stable during change

How to answer like a strong candidate

Use a consistent structure:

  1. clarify assumptions
  2. propose approach
  3. discuss edge cases
  4. estimate complexity
  5. mention production concerns.

The scoring rubric interviewers use

Interviewers rarely grade with one yes/no checkbox. They score multiple dimensions, often subconsciously.

Rubric dimension 1: Correctness (baseline)

You must produce correct results for:

  • typical inputs
  • edge cases
  • “dirty data” (nulls, malformed rows, duplicates)

How to signal correctness fast

  • State assumptions explicitly (“timestamps are epoch seconds”)
  • Add validations (“skip records missing user_id”)
  • Provide a small example and walk through output

Rubric dimension 2: Data engineering thinking (the differentiator)

They look for production instincts:

  • Idempotency: safe to retry and re-run
  • Determinism: consistent output ordering/behavior
  • Observability: logs, metrics, and failure visibility
  • Backfills: ability to recompute safely for a date range

Rubric dimension 3: Performance and scale awareness

You don’t need to “over-optimize,” but you must reason about:

  • time complexity
  • memory usage
  • partitioning/shuffles (Spark)
  • expensive operations (wide joins, full table scans)

Rubric dimension 4: Communication

Clear, structured answers win:

  • talk through steps
  • name trade-offs
  • keep queries readable
  • keep code modular

Rubric dimension 5: Tool fluency (without tool worship)

Strong candidates treat tools as implementations of principles:

  • Airflow = orchestration + retries/backfills
  • Spark = distributed execution + shuffles/partitions
  • Kafka = ordered partitions + consumer groups + replay

The fastest way to answer ANY DE interview question (the “4-step” pattern)

When you’re stuck, do this:

  1. Clarify grain + keys
  2. List edge cases (duplicates, late data, nulls, deletes)
  3. Pick the simplest correct approach
  4. Add scale + reliability upgrades (partitioning, idempotency, observability)

This makes you sound senior even when the question is basic.

SQL Data Engineer Interview Questions

What’s the difference between INNER JOIN, LEFT JOIN, and FULL OUTER JOIN?

Answer:

  • INNER JOIN returns only matching rows.
  • LEFT JOIN keeps all left rows and matches right rows when possible (null if missing).
  • FULL OUTER JOIN returns all rows from both sides with nulls where unmatched.

Interview tip: Mention where you’d use each in pipelines (enrichment, reconciliation, completeness checks).

How do you find duplicate rows by business key?

Answer: Group by the business key and filter COUNT(*) > 1.

SELECT user_id, event_id, COUNT(*) AS cnt
FROM events
GROUP BY user_id, event_id
HAVING COUNT(*) > 1;

How do you keep only the latest record per key?

Answer: Use ROW_NUMBER() partitioned by the key and ordered by timestamp descending.

SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_profile t
) x
WHERE rn = 1;

Common follow-up: “What if timestamps tie?”
Add a tie-breaker column like ingested_at or event_id.

ROW_NUMBER vs RANK vs DENSE_RANK?

Answer:

  • ROW_NUMBER() assigns unique sequence numbers (no ties).
  • RANK() gives same rank for ties and leaves gaps.
  • DENSE_RANK() gives same rank for ties without gaps.

Get the latest record per user

What they’re testing: dedup strategy + tie-breaking + performance.

Answer (window function):

SELECT *
FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY updated_at DESC, event_id DESC
    ) AS rn
  FROM events t
) x
WHERE rn = 1;

Gotcha to mention: If timestamps can tie, add a stable tie-breaker (event_id, ingestion_id).

Find users with gaps-and-islands (consecutive days)

What they’re testing: analytic thinking.

Answer (classic trick):

WITH d AS (
  SELECT user_id, activity_date,
         DATEADD(day, -ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY activity_date), activity_date) AS grp
  FROM activity
)
SELECT user_id, MIN(activity_date) AS start_date, MAX(activity_date) AS end_date, COUNT(*) AS days
FROM d
GROUP BY user_id, grp;

Sessionize events with 30-minute inactivity

Answer (lag + sum):

WITH x AS (
  SELECT *,
         CASE WHEN TIMESTAMPDIFF(minute,
              LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts),
              event_ts
         ) > 30 THEN 1 ELSE 0 END AS new_session
  FROM web_events
),
y AS (
  SELECT *,
         SUM(new_session) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
  FROM x
)
SELECT user_id, session_id,
       MIN(event_ts) AS session_start,
       MAX(event_ts) AS session_end,
       COUNT(*) AS events
FROM y
GROUP BY user_id, session_id;

Gotcha: For the “exactly 30 minutes” boundary, clarify whether it starts a new session.

Incremental load: load only new/changed rows

Best answer: Use high-watermark or CDC, depending on source guarantees.

  • If you have a reliable updated_at: watermark by updated_at
  • If updates can arrive late/out-of-order: use CDC or store a hash and re-check a lookback window

Gotcha: Always add a lookback window unless you can prove strict monotonicity.
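
A minimal sketch of the watermark-plus-lookback idea in plain Python (the helpers get_last_watermark, extract_since, and save_watermark are hypothetical, and the one-hour lookback is an assumption to tune per source):

from datetime import timedelta

LOOKBACK = timedelta(hours=1)  # re-read a small window to catch late or out-of-order updates

def incremental_extract(get_last_watermark, extract_since, save_watermark):
    # read the stored high watermark, pull rows updated since (watermark - lookback),
    # then persist the new watermark; merging by key downstream keeps this idempotent
    last_wm = get_last_watermark()
    cutoff = last_wm - LOOKBACK
    rows = extract_since(cutoff)  # e.g. SELECT ... WHERE updated_at >= :cutoff
    new_wm = max((r["updated_at"] for r in rows), default=last_wm)
    save_watermark(new_wm)
    return rows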

SCD Type 2 for dimension updates

Answer (conceptual):

  • Natural key: customer_id
  • Version columns: valid_from, valid_to, is_current
  • On change: expire current row, insert new row

Gotcha: Handle no-op updates (same values) or you’ll create fake history.

Why do counts not match between source and warehouse?

Senior answer checklist:

  • duplicates in source / replays
  • join explosion (many-to-many)
  • timezone truncation differences
  • late data excluded by watermark
  • filters applied in different places
  • distinct vs count(*) mismatch due to grain confusion

Read Next: Tricky SQL Queries for Interview

Python Data Engineer Interview Questions

Flatten nested JSON

What they test: recursion + schema variability.

def flatten(d, prefix=""):
    out = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten(v, key))
        else:
            out[key] = v
    return out

Gotcha: Arrays. Decide whether to explode them or keep JSON strings.

Make this ETL idempotent

Best answer: Idempotency is not a vibe. It’s a mechanism:

  • deterministic primary keys
  • upserts/merge by key
  • write to temp + atomic swap
  • exactly-once is usually achieved through dedup + merge, not magic
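
One of those mechanisms in plain Python: write the partition to a temporary file, then atomically swap it into place, so a retried run overwrites instead of appending (the path and column names are illustrative):

import csv
import os
import tempfile

def write_partition_atomic(rows, final_path, fieldnames=("user_id", "ts", "val")):
    # stage the output next to the target so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(fieldnames))
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, final_path)  # atomic swap: re-runs replace, never duplicate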

Read a 200GB file – don’t OOM

Correct answer: stream/chunk.

import csv

def iter_rows(path):
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

Gotcha: If you just call pandas.read_csv on the whole file, you fail.

How do you read large files without running out of memory?

Answer: Stream input in chunks or line-by-line, and process incrementally.

def read_lines(path):
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

Retries: how do you avoid double-writing?

Answer:

  • Use write-ahead state (checkpoint table) or merge by key
  • Make writes atomic (staging + rename, transactional table formats)
  • Store a run_id and enforce uniqueness
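
A hedged sketch of the run_id idea: record each (job, partition) pair in a ledger and skip work that already completed; here an in-memory set stands in for a checkpoint table with a uniqueness constraint:

def run_once(ledger, job_id, partition, do_write):
    # ledger: any set-like store of completed (job_id, partition) pairs
    key = (job_id, partition)
    if key in ledger:          # a retry of a finished run becomes a no-op
        return "skipped"
    do_write(partition)        # the actual write, ideally atomic (staging + swap)
    ledger.add(key)            # mark complete only after the write lands
    return "written"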

What tests matter for pipelines?

Answer that scores:

  • unit tests for transforms
  • contract tests for schema + nullability + constraints
  • integration test for one small run end-to-end
  • data quality checks in prod (tests are not enough)
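
As an example, a unit test for the flatten transform shown in the Python section could look like this (pytest assumed as the runner; the my_transforms module path is hypothetical):

# test_flatten.py -- run with: pytest
from my_transforms import flatten  # wherever the flatten() from the Python section lives

def test_flatten_nested_dict():
    doc = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}
    assert flatten(doc) == {"a": 1, "b.c": 2, "b.d.e": 3}

def test_flatten_empty_dict():
    assert flatten({}) == {}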

What’s idempotency in Python ETL terms?

Answer:

Running the job twice should not double-count or duplicate outputs. Use deterministic keys, overwrite-per-partition, or upserts.

List vs generator: what’s the difference and why does it matter?

Answer:

Lists hold everything in memory; generators yield items lazily. Generators are ideal for large ETL inputs and streaming transformations.
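
A tiny illustration: both lines compute the same total, but the list version materializes ten million items in memory first, while the generator sums them one at a time.

total_list = sum([i * i for i in range(10_000_000)])  # builds the whole list, then sums
total_gen = sum(i * i for i in range(10_000_000))     # lazy: constant memory
assert total_list == total_gen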

Spark / distributed processing questions for Data Engineer

What causes a shuffle? How do you reduce it?

Answer: joins, groupBy, distinct, repartition. Reduce by:

  • broadcast small tables
  • pre-aggregate
  • partition by join keys
  • avoid wide transformations

Gotcha: Repartitioning blindly can increase shuffle cost.
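
For the first point, a broadcast join in PySpark might look like this (a sketch with toy data; assumes a local Spark session):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast_join_sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 120.0), (2, "DE", 80.0)], ["order_id", "country_code", "amount"])
dim_country = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")], ["country_code", "country_name"])

# Broadcasting the small dimension ships it to every executor,
# so the large fact table is joined without a shuffle on the join key.
enriched = orders.join(F.broadcast(dim_country), on="country_code", how="left")
enriched.show()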

Explain data skew and how to fix it

Strong fixes:

  • salting hot keys
  • skew hints (when available)
  • split hot keys into separate path
  • two-stage aggregation (partial then final)

Gotcha: Broadcast joins only help when one side is genuinely small; they don’t remove skew from downstream aggregations on the hot key.
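
A minimal salting plus two-stage aggregation sketch in PySpark (the column names and salt factor are illustrative):

from pyspark.sql import functions as F

SALT_BUCKETS = 16  # spread each hot key across 16 pseudo-keys

def salted_count(df, key_col="customer_id"):
    # stage 1: aggregate by (key, salt) so no single task owns a hot key
    salted = df.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    partial = salted.groupBy(key_col, "salt").agg(F.count("*").alias("partial_cnt"))
    # stage 2: combine the partial counts back to the real key
    return partial.groupBy(key_col).agg(F.sum("partial_cnt").alias("cnt"))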

Caching: when is it bad?

Answer: caching helps only when:

  • reused multiple times
  • dataset fits in memory
  • you control lineage blowups

Gotcha: Caching a huge dataset triggers eviction thrash and kills performance.

Parquet/Delta small files problem

Answer: Too many small files increases planning overhead and read amplification. Fix:

  • compaction jobs
  • optimize/auto-compaction features (format-dependent)
  • correct partition sizing

Repartition vs coalesce: what’s the difference?

Answer:

  • repartition(n) reshuffles data to create evenly balanced partitions (more expensive).
  • coalesce(n) reduces partitions without a full shuffle when possible (cheaper, but may create imbalance).
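
In PySpark terms (a small runnable sketch):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning_sketch").getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "customer_id")

balanced = df.repartition(200, "customer_id")  # full shuffle, evenly hashed by key
fewer = df.coalesce(10)                        # narrow merge of partitions: cheaper, possibly uneven
print(balanced.rdd.getNumPartitions(), fewer.rdd.getNumPartitions())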

What is data skew and how do you spot it?

Answer: Skew happens when certain keys dominate (e.g., one customer has 80% of events), causing stragglers and long stages. Spot it via:

  • skewed key distribution metrics
  • long tails in stage duration
  • huge partition sizes for certain keys

Airflow / orchestration questions

How do you design a backfill safely?

Answer:

  • parameterize by partition/date
  • isolate side effects (write per partition)
  • throttle concurrency
  • enforce idempotency
  • run data quality checks per partition

Gotcha: Backfills can DDoS your warehouse if you don’t cap concurrency (see the sketch below).
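
A sketch of those guardrails in plain Python: iterate over date partitions with a small worker pool so concurrency stays capped (run_partition is a hypothetical idempotent job that processes one date):

from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def backfill(run_partition, start: date, end: date, max_workers: int = 4):
    # run_partition(d) must be idempotent: safe to retry for any single date
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:  # throttles concurrency
        for d, result in zip(days, pool.map(run_partition, days)):
            print(f"{d}: {result}")  # in production: log and run data quality checks per partition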

What is a DAG in Airflow?

Answer: A DAG is a directed acyclic graph representing task dependencies and execution order. In DE pipelines, DAGs define ingestion, transformation, validation, and publishing steps.
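
A minimal DAG sketch in Airflow 2.x style (task bodies are placeholders, and exact parameter names can vary slightly between Airflow versions):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest(**context):
    ...  # pull raw data for the run's data interval

def transform(**context):
    ...  # clean, deduplicate, and publish to the warehouse

with DAG(
    dag_id="daily_events_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # one run per data interval
    catchup=False,      # enable only with safe, idempotent backfills
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    ingest_task >> transform_task  # ingestion must succeed before transformation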

What makes a DAG maintainable?

Answer:

  • small tasks with clear contracts
  • retries + timeouts with intent
  • alerting that’s actionable
  • sensors avoided or tuned (don’t block workers forever)

What is idempotency and why does it matter in orchestration?

Answer:

Idempotent tasks produce the same output even if retried or rerun. It’s essential for safe retries, backfills, and partial failures.

How do you monitor an Airflow pipeline in production?

Answer:

  • SLA alerts (late runs)
  • task failure alerts + error budgets
  • metrics: row counts, null rates, freshness
  • logs with correlation IDs (dag_id/run_id)
  • dashboards: success rate, runtime, lag

Streaming (Kafka) Data Engineer Interview Questions

Event time vs processing time

Answer:

  • event time = when it happened
  • processing time = when you saw it
  • late events require watermarks + retractions or compensating logic

Gotcha: “We’ll just use processing time” breaks analytics correctness.

What is a Kafka topic and partition?

Answer:

A topic is a named stream of records. Partitions split the topic into ordered logs that enable parallelism; ordering is guaranteed within a partition.

What is a consumer group?

Answer:

A consumer group is multiple consumers sharing work. Kafka assigns partitions to consumers within the group so they can process in parallel.

How do you handle out-of-order events?

Answer:

  • process by event time, not ingestion time
  • allow a lateness window (watermarking concept)
  • use upserts in your serving layer
  • design “corrections” (retractions/updates) downstream
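
A toy lateness-window sketch in Python: bucket by event time and only treat a minute bucket as final once the watermark (max event time seen minus allowed lateness) has passed it. Real engines such as Spark Structured Streaming or Flink manage this state for you; this only shows the idea.

from collections import defaultdict
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)

def bucket_with_watermark(events):
    # events: iterable of (event_time: datetime, value: int), possibly out of order
    buckets = defaultdict(int)     # minute bucket (event time) -> running sum
    max_event_time = datetime.min
    for event_time, value in events:
        max_event_time = max(max_event_time, event_time)
        buckets[event_time.replace(second=0, microsecond=0)] += value
    watermark = max_event_time - ALLOWED_LATENESS
    # buckets fully before the watermark are safe to publish;
    # anything arriving later needs a correction/retraction downstream
    closed = {b: v for b, v in buckets.items() if b + timedelta(minutes=1) <= watermark}
    return closed, watermark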

How do you avoid duplicates in streaming pipelines?

Answer:

Use deterministic event IDs, idempotent writes, and de-dup windows keyed by (entity_id, event_id). If duplicates can arrive days later, persist a dedup key store with TTL.
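
A minimal dedup-key store with TTL, sketched in Python; the in-memory dict stands in for something durable such as Redis or a keyed state store:

import time

class DedupStore:
    # remembers (entity_id, event_id) pairs for ttl_seconds and drops repeats inside that window
    def __init__(self, ttl_seconds=7 * 24 * 3600):
        self.ttl = ttl_seconds
        self.seen = {}  # dedup key -> first-seen unix timestamp

    def is_duplicate(self, entity_id, event_id, now=None):
        now = now if now is not None else time.time()
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}  # lazy TTL eviction
        key = (entity_id, event_id)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False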

Exactly-once delivery – real or marketing?

Answer:

Exactly-once is usually end-to-end semantics, not a single knob. You still need:

  • deterministic keys
  • deduplication
  • transactional sinks or merge

Data modeling + warehouse/lakehouse questions (2026)

What’s the grain of your fact table?

Answer: The grain is the unique event/unit your fact represents. Define it before writing SQL.

Gotcha: Grain confusion causes “counts don’t match” and join explosions.

Star schema vs wide table

Answer:

  • star schema: stable, scalable, reusable
  • wide table: faster for single use-case, risk of duplication + inconsistency

Star schema vs snowflake schema

Answer:

Star schemas denormalize dimensions for simpler queries; snowflake schemas normalize dimensions to reduce redundancy at the cost of extra joins.

Fact vs dimension tables

Answer:

  • Facts store measurable events (orders, clicks, payments).
  • Dimensions store descriptive attributes (customer, product, campaign).

What is an OLAP vs OLTP workload?

Answer:

OLTP focuses on fast transactions and updates; OLAP focuses on analytical queries across large history with scans, aggregates, and joins.

What is a lakehouse (practically)?

Answer:

A lakehouse stores data in object storage like a lake but adds warehouse-like features: table management, schema enforcement, and reliable reads/writes for analytics.

How do you design tables for performance?

Answer:

  • choose columnar formats (Parquet/ORC) when possible
  • partition by common filters (event_date, tenant_id)
  • avoid too many tiny files (compact)
  • cluster/sort by high-cardinality filters if supported
  • keep grain clear (one row per entity per time)

Data quality: what checks do you implement?

Answer:

  • schema checks (expected columns/types)
  • completeness checks (row counts, missing partitions)
  • validity checks (ranges, allowed values)
  • uniqueness checks (keys)
  • freshness checks (data not late)
  • reconciliation checks (source vs target totals)
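
A few of these checks as plain Python assertions over one partition of rows (a sketch; in practice these live in dbt tests, Great Expectations, or your own check framework, and the updated_at ISO-string column is an assumption):

from datetime import datetime, timedelta

def run_quality_checks(rows, key="order_id", max_staleness_hours=24):
    # rows: list of dicts for one partition; raises AssertionError on the first failure
    assert rows, "completeness: partition is empty"

    keys = [r[key] for r in rows]
    assert all(k is not None for k in keys), "validity: null business keys found"
    assert len(keys) == len(set(keys)), "uniqueness: duplicate business keys found"

    newest = max(datetime.fromisoformat(r["updated_at"]) for r in rows)
    assert datetime.utcnow() - newest < timedelta(hours=max_staleness_hours), "freshness: newest row too old"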

Cloud + cost/security questions (2026)

How do you reduce warehouse costs?

Strong answer list:

  • partition pruning / clustering
  • reduce scans with incremental models
  • pre-aggregate for common queries
  • stop runaway queries (governance)
  • right-size compute, schedule heavy jobs off-peak

How do you handle PII?

Answer:

  • classify PII fields
  • least privilege access
  • masking/tokenization where needed
  • audit access
  • avoid copying PII into “convenience” tables

Gotcha: “We’ll just lock the bucket” is not a plan.

What are “small files” and why do they hurt?

Answer:

Too many small files increase metadata overhead and slow down reads in object storage-based systems. Compaction improves scan efficiency and reduces planning time.

What is least privilege?

Answer:

Least privilege means users and services only get the minimum permissions needed. It limits blast radius if credentials leak.

How do you design for failures in cloud?

Answer:

  • retries with backoff for transient failures
  • idempotent outputs for safe re-runs
  • checkpoints for streaming jobs
  • dead-letter queues for bad events
  • alerts on error rates and lag
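
The first two bullets together, as a small Python sketch: exponential backoff with jitter around an operation that must itself be idempotent (attempt counts and delays are assumptions to tune):

import random
import time

def with_retries(op, max_attempts=5, base_delay=1.0):
    # op must be idempotent, otherwise retries can double-apply side effects
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except Exception:  # narrow this to transient errors (timeouts, throttling) in real code
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5))  # 1s, 2s, 4s ... plus jitter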

System design: the 6-layer framework (interview-proof)

Use this framework to answer almost any data engineering system design prompt (batch or streaming). It keeps you structured and interviewers love structure.

Layer 1: Ingestion

  • batch: files, APIs, database extracts, CDC logs
  • streaming: Kafka topics, event buses
    Key questions: volume, latency, schema, duplication rate

Layer 2: Raw storage (immutable)

  • object storage or raw tables
  • store exact source payloads (for replay)
    Key questions: retention, partitioning, encryption

Layer 3: Processing / transformation

  • Spark jobs, SQL transforms, dbt models
  • validate, clean, enrich, deduplicate
    Key questions: incremental vs full refresh, compute cost

Layer 4: Serving layer

  • warehouse/lakehouse tables
  • marts for BI, feature tables for ML
    Key questions: query patterns, SLA, concurrency

Layer 5: Orchestration

  • scheduling, dependencies, retries
  • backfills, versioning, CI/CD for data changes
    Key questions: idempotency, data interval logic

Layer 6: Reliability + governance

  • data quality checks, lineage, access control, monitoring
    Key questions: ownership, alerts, SLOs, audits

The “senior signals” to add (even in a junior-friendly role)

  • define SLAs: freshness, completeness, correctness
  • define failure modes and what happens next
  • plan backfills and reprocessing
  • mention schema evolution strategy

Coding programs: ALL approaches + outputs

Program 1: Latest record per key (dedup)

Input

rows = [
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "A"},
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "B"},
  {"user_id": 2, "ts": "2026-01-01T09:00:00", "val": "X"},
]

Assume ties broken by last occurrence in input.

Approach A: sort + overwrite dict (simple, O(n log n))

# ISO-8601 timestamp strings sort lexicographically, so string comparison on "ts" is safe here
def latest_per_user_sort(rows):
    rows2 = sorted(rows, key=lambda r: (r["user_id"], r["ts"]))
    out = {}
    for r in rows2:
        out[r["user_id"]] = r
    return list(out.values())

Approach B: single pass with comparison (O(n))

def latest_per_user_one_pass(rows):
    out = {}
    for r in rows:
        k = r["user_id"]
        if k not in out or r["ts"] >= out[k]["ts"]:
            out[k] = r
    return list(out.values())

Expected output

[
  {"user_id": 1, "ts": "2026-01-01T10:00:00", "val": "B"},
  {"user_id": 2, "ts": "2026-01-01T09:00:00", "val": "X"},
]

Gotcha to say out loud: define tie-breaking and timestamp parsing rules.

Program 2: Flatten nested JSON

Input

doc = {"a": 1, "b": {"c": 2, "d": {"e": 3}}}

Approach A: recursion

def flatten_dict(d, prefix=""):
    out = {}
    for k, v in d.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            out.update(flatten_dict(v, key))
        else:
            out[key] = v
    return out

Expected output

{"a": 1, "b.c": 2, "b.d.e": 3}

Gotcha: arrays and mixed types; clarify whether to explode or stringify them.

Program 3: Top-K heavy hitters

Input

items = ["a","b","a","c","a","b"]
k = 2

Approach A: Counter (fast, O(n))

from collections import Counter

def topk_counter(items, k):
    return Counter(items).most_common(k)

Approach B: dictionary + heap (good when unique is huge)

import heapq

def topk_heap(items, k):
    counts = {}
    for x in items:
        counts[x] = counts.get(x, 0) + 1
    heap = [(-c, x) for x, c in counts.items()]
    heapq.heapify(heap)
    return [(x, -c) for c, x in heapq.nsmallest(k, heap)]

Expected output

[("a", 3), ("b", 2)]

Gotcha: if streaming unbounded, mention approximate algorithms (Count-Min Sketch) to sound senior.

Program 4: SCD Type 2 + CDC MERGE

Scenario: You receive daily CDC updates: inserts/updates/deletes.

Approach (SQL MERGE conceptual)

  • Match on business key
  • If changed: expire current row, insert new version
  • If delete: mark as deleted or expire row (policy decision)

Example pattern (pseudo-SQL; adapt per warehouse):

-- 1) Expire changed current rows
UPDATE dim_customer d
SET valid_to = c.change_ts, is_current = false
FROM cdc_customer c
WHERE d.customer_id = c.customer_id
  AND d.is_current = true
  AND c.op IN ('U')
  AND (d.name <> c.name OR d.email <> c.email);

-- 2) Insert new versions
INSERT INTO dim_customer (customer_id, name, email, valid_from, valid_to, is_current)
SELECT customer_id, name, email, change_ts, NULL, true
FROM cdc_customer
WHERE op IN ('I','U');

Gotcha: handle “update with same values” or you’ll bloat history.

Comparison Table

Method | Difficulty | Time Complexity | Best For
For/While loop (hash map) | Easy | O(n) | Latest-per-key dedup, fast ETL transforms
Recursion (divide & conquer) | Medium | O(n) | Showing recursion skills, nested structures
Sort + overwrite | Easy | O(n log n) | Simple explanation, moderate data sizes
Heap (nlargest) | Medium | O(n + m log k) | Top-K with many uniques and small k
Built-in libs (Counter, pandas normalize) | Easy | O(n) to O(n log k) | Clean readable solutions under time pressure
SQL MERGE / update+insert | Medium | Engine-dependent | SCD2 correctness, warehouse-native pipelines
Spark DataFrame merge | Medium–Hard | Distributed O(n) | Large-scale CDC/SCD processing

Figure: Example batch and streaming data pipeline architecture for data engineer interviews.

Junior vs Senior: what different answers sound like

Area | Junior passes when they… | Senior passes when they…
SQL | write a correct query | define grain, edge cases, and performance
Python | solve the problem | make it idempotent + testable + memory-safe
Spark | know the transforms | predict shuffles, skew, file sizing, AQE behavior
Airflow | can schedule jobs | design safe backfills + failure containment
Kafka | know the basics | reason about event time, late data, replays
System design | draw boxes | define contracts, SLAs, backfills, observability, cost

Basic Questions

How do I explain a pipeline like a senior?

Input → contracts → transformations → output → SLA → quality checks → backfill strategy → cost levers.

What’s an idempotent pipeline?

Rerunning it produces the same final result without duplicates or double side effects.

What is CDC?

Change Data Capture: inserts/updates/deletes from a source system that must be applied correctly downstream.

What is SCD Type 2?

A modeling pattern that preserves historical versions of dimension attributes across time.

What causes Spark shuffles?

Wide operations like join, groupBy, distinct, orderBy, and repartition.

How do you handle data skew?

Salt hot keys, split hot partitions, use two-stage aggregation, or broadcast strategically.

What’s the difference between event time and processing time?

Event time is when an event occurred; processing time is when you processed it. Late events break naive processing-time logic.

Is exactly-once real?

It’s end-to-end semantics achieved with dedup keys + transactional sinks + careful replay handling.

What’s the grain of a table?

The single unit a row represents. Define it before modeling or querying.

FAQ

How do I prepare for a data engineer interview in 2026?

Master SQL windows + incremental loads, Python ETL patterns, and one distributed engine (Spark). Then drill system design with SLAs, backfills, and observability.

What should I study first for a data engineer interview?

Study SQL windows + dedup + incremental loading first. Then learn Python ETL patterns (idempotency, retries, memory). Then drill system design with SLAs and backfills.

How many questions should I practice?

Practice 30–50 high-quality problems across SQL, Python, and system design. Repeating a smaller set until you can explain edge cases beats collecting 300 links.

What’s the hardest part of DE interviews?

Edge cases and operational thinking: late data, retries, backfills, schema evolution, and cost.

How do I handle “unknown” questions in interviews?

Say what you know, ask clarifying questions, propose a safe default, and mention how you’d validate in production.

Conclusion + 2026 learning roadmap (practical, not theoretical)

To master Data Engineer Interview Questions and Answers in 2026, prepare for both coding correctness and production thinking. Interviews reward candidates who can write clean SQL/Python and design systems that survive retries, backfills, schema changes, and real-world failures.

  • Week 1–2: SQL windows, dedup, sessionization, CDC/SCD2
  • Week 3–4: Python ETL patterns, idempotency, tests, memory
  • Week 5–6: Spark performance basics: shuffle, partitioning, skew
  • Week 7: Airflow backfills + failure containment
  • Week 8: Streaming semantics (event time, late arrivals, replays)
  • Ongoing: system design drills using the 6-layer framework
