Data Engineering Interview Prep: SQL, ETL Pipelines, and System Design for International Candidates

Master the SQL, ETL, and system design questions that separate hired data engineers from rejected ones — with a visa-aware prep timeline baked in.

By F1Jobs Team · 2026-03-12 · 11 min read

An engineer reviewing a data pipeline architecture sketch on paper at a cafe table, surrounded by schema diagrams and a laptop showing a database interface

You've cleared the recruiter screen. You've convinced the hiring manager you understand data modeling. Now comes the three-to-five-hour technical loop — SQL rounds, a system design whiteboard, maybe a take-home pipeline exercise — and you need to perform well enough in each to get the collective "hire" vote. If you're also managing an OPT clock or H-1B cap-gap, you have less margin to take three months and fail before pivoting.

This guide is built specifically for international students and professionals entering the data engineering market in the US. It maps the full interview loop, covers the SQL and pipeline topics that actually show up, walks through system design framing that experienced interviewers reward, and gives you a realistic prep timeline you can execute while tracking your employment authorization window.

What the data engineering interview loop looks like

Most mid-size and large companies run a four-to-six-stage process for data engineer roles. Understanding the shape of it helps you allocate prep time correctly.

Stage	Format	What's evaluated
Recruiter screen	30 min phone	Communication, sponsorship eligibility, high-level background
Hiring manager screen	45 min video	Past project depth, team fit, data platform familiarity
SQL / coding round	60-90 min live coding	Window functions, CTEs, aggregations, Python data manipulation
System design round	60 min whiteboard	Pipeline architecture, trade-offs, scalability reasoning
Take-home or second coding	2-4 hr async	End-to-end pipeline in a real framework (Spark, Airflow, dbt)
Final loop / debrief	45 min	Behavioral + deep dive on take-home

Some companies compress stages two and three together, or replace the take-home with a second live coding round. The SQL round and system design round appear almost universally at companies above 100 engineers.

SQL interview questions for data engineers

SQL is tested harder in data engineering interviews than in software engineering interviews. The expectation is fluency, not just familiarity. Here are the categories interviewers actually test, in order of difficulty.

Foundation tier — expected at all levels

Filtering with WHERE, aggregations with GROUP BY and HAVING, CASE expressions
JOINs across three or more tables, including LEFT/FULL OUTER JOIN semantics
Subqueries and their rewrite as CTEs
Handling NULLs explicitly (COALESCE, NULLIF, IS NULL in conditions)

Core tier — tested at every mid-level interview

Window functions are the most commonly tested advanced SQL topic for data engineers. Practice these patterns until they're automatic:

-- Running total partitioned by user
SELECT
  user_id,
  event_date,
  revenue,
  SUM(revenue) OVER (
    PARTITION BY user_id
    ORDER BY event_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM events;

-- Find the previous event timestamp for gap analysis
SELECT
  session_id,
  event_ts,
  LAG(event_ts) OVER (PARTITION BY session_id ORDER BY event_ts) AS prev_ts,
  event_ts - LAG(event_ts) OVER (PARTITION BY session_id ORDER BY event_ts) AS gap_seconds
FROM clickstream;

Interviewers also frequently give "deduplicate a table" prompts. The canonical answer uses ROW_NUMBER():

-- Keep only the most recent record per user
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
  FROM user_events
)
SELECT * FROM ranked WHERE rn = 1;

Optimization tier — tested at senior level

You may be asked to look at a slow query and explain why it's slow. Key concepts:

Full table scans vs index seeks — an EXPLAIN ANALYZE output in Postgres, or EXPLAIN in BigQuery/Redshift, tells you whether the planner is using indexes
Predicate pushdown — filter early, join late
Partition pruning — in partitioned tables (Hive, BigQuery), putting the partition column in WHERE avoids scanning the entire table
Skew — when one key value contains disproportionate rows, a GROUP BY or JOIN on that key creates a hot partition that slows the whole job

For practice resources, StrataScratch has real interview questions tagged by company and role. LeetCode's SQL section covers the window function patterns. Mode Analytics' SQL tutorial is solid for people coming from lighter SQL backgrounds.

ETL and pipeline interview questions

ETL (extract, transform, load) questions test whether you can reason about real-world data plumbing — not just write correct code, but make trade-offs about reliability, latency, and cost.

Common ETL interview question patterns

"Walk me through how you'd build a pipeline that ingests Salesforce data into your warehouse daily."

A strong answer covers:

Source connector type (API-based, not direct DB access for SaaS sources)
Extraction strategy (full load vs incremental; for incremental, what's the watermark — updated_at, cursor ID, or CDC log)
Landing zone (raw S3/GCS bucket before any transformation)
Schema-on-read vs schema-on-write trade-off in the landing zone
Transformation layer (dbt for SQL-based transforms, Spark for heavy processing)
Loading strategy into the warehouse (upsert/MERGE vs truncate-and-reload)
Monitoring and alerting (data quality checks, row count anomaly detection)

"How do you handle late-arriving data in a streaming pipeline?"

This is a Spark/Flink question. The answer centers on watermarks — a threshold of how far behind real-time you're willing to wait before closing a window. In Spark Structured Streaming:

withWatermark("event_time", "10 minutes") tells the engine to accept events up to 10 minutes late
After the watermark threshold, late data is dropped (or written to a dead-letter queue if you configure it)
The trade-off is freshness vs completeness

"Describe how you'd design an idempotent pipeline."

Idempotency means running the same pipeline twice produces the same result as running it once. The standard patterns:

Write to a staging table, then MERGE/upsert into the target using a natural key
Use deterministic partition names (date-based) so a re-run overwrites the correct partition
Track pipeline runs in a metadata table with run_id, status, rows_processed

Airflow/Prefect orchestration questions

Interviewers at companies using Airflow often ask about DAG design. Key concepts to know:

Idempotent tasks — tasks should be safe to retry without side effects
XCom — passing small data between tasks (not large datasets; those go to storage)
SLA misses — Airflow can alert when a DAG or task exceeds its expected duration
Sensor tasks — waiting for an upstream file or event before proceeding; understand the poke_interval and timeout parameters and why a sensor that runs too frequently can starve the scheduler

For a deeper walkthrough on pipeline architecture patterns and how to discuss them in an interview, see our system design interview guide for new grad engineers.

Data pipeline system design interview

System design for data engineering differs from backend system design in its emphasis on data freshness, correctness, and cost at scale rather than request latency and API throughput. The mental model to use is the five-layer pipeline:

Ingestion → Transport → Storage → Transformation → Serving

Walk interviewers through each layer, justifying each choice against the requirements you clarified at the start.

Clarifying questions to ask before drawing anything

What is the data volume? (MB/day vs TB/day changes almost every answer)
What is the latency SLA? Real-time (sub-second), near-real-time (minutes), or batch (hours to daily)?
What are the downstream consumers — a dashboard, an ML model, an operational application?
Is the source structured, semi-structured (JSON/Avro), or unstructured?
What's the retention requirement? (Affects storage tier and partitioning strategy)

A concrete example — event analytics pipeline

Prompt: "Design a pipeline that ingests clickstream events from a mobile app, aggregates them into hourly session summaries, and feeds a dashboard used by product managers."

A production-grade answer:

Ingestion — Mobile SDK sends events to a load balancer → Kafka cluster (events are buffered if downstream is slow; Kafka's log compaction handles duplicates)
Transport — Kafka topic clickstream.raw, partitioned by user_id for ordered processing per user
Real-time processing — Spark Structured Streaming reads from Kafka, applies watermark of 5 minutes for late data, aggregates into hourly session windows, writes to Delta Lake on S3
Batch correction job — Airflow DAG runs at T+2 hours to backfill any events that arrived after the streaming watermark; writes corrected aggregates back to Delta Lake using MERGE
Serving — Delta Lake table is registered in the Hive metastore; BI tool (Tableau, Looker) queries via Presto/Trino; query results are cached at the BI layer for dashboard performance

The streaming + batch "lambda lite" pattern, or the equivalent using Delta Lake's streaming + batch unification, is a strong signal in senior-level interviews.

For general system design framing used across engineering roles, see system design interview prep for backend roles.

Spark and Kafka interview prep

Spark and Kafka are the two distributed systems frameworks most commonly tested in data engineering interviews at mid-to-large companies. Here's what interviewers actually focus on.

Spark concepts interviewers test

Transformations vs actions — Spark is lazy. Transformations (map, filter, groupBy) build a DAG but don't execute. Actions (count, collect, write) trigger execution. A common interview trap: writing a loop that calls an action inside a transformation, causing unexpected recomputation.

Shuffles — when Spark needs to move data between partitions (groupBy, join, distinct), it triggers a shuffle. Shuffles are expensive because they involve disk I/O and network transfer. Key interview answers:

Use reduceByKey instead of groupByKey when possible — it aggregates locally before shuffling
Use broadcast joins for small lookup tables to avoid shuffling the large table
Repartition on the join key before a large join to pre-sort partitions

Spark UI diagnosis — if you claim Spark experience, expect to be asked what you'd look at in the Spark UI for a slow stage. The answers: stage timeline (which task is the longest), task distribution (one task taking 10x longer than others signals data skew), GC time (too much GC means memory pressure, increase executor memory or reduce partition size).

Partitioning — the default number of partitions after a shuffle is 200 (spark.sql.shuffle.partitions). For small datasets this creates 200 tiny files and overhead; for large datasets it may create partitions too large to fit in executor memory. Interviewers ask how you'd tune this.

Kafka concepts interviewers test

Topic partitioning and consumer groups — a Kafka topic is divided into partitions. Within a consumer group, each partition is consumed by exactly one consumer. To scale throughput, you add partitions (and corresponding consumers). This means: partition count sets the maximum parallelism of your consumer group.

Offset management — Kafka tracks what each consumer group has read using offsets. "At-least-once" delivery (the common default) means a consumer may re-process a message after a failure; your processing logic must be idempotent to handle this correctly. "Exactly-once" semantics are available in newer Kafka versions but require careful configuration of the producer, broker, and consumer.

Consumer lag — the difference between the latest offset in a partition and the consumer's current offset. High consumer lag means your pipeline is falling behind real-time. Monitoring lag (via kafka-consumer-groups.sh or a monitoring tool) and alerting on thresholds is a standard operational question.

Replication factor — Kafka replicates each partition across multiple brokers for durability. A replication factor of 3 (one leader, two followers) is typical in production. Interviewers may ask about min.insync.replicas — the minimum number of replicas that must acknowledge a write before it's considered successful.

Visa and sponsorship context for data engineers

Data engineering is a strong field for H-1B sponsorship because the specialty-occupation bar is straightforwardly met — the role requires theoretical and practical application of computer science, information systems, or a closely related engineering discipline, which aligns squarely with USCIS specialty-occupation criteria under 8 CFR 214.2(h)(4).

A few specifics worth knowing:

OPT / STEM OPT alignment — data engineering roles at companies that are E-Verify participants qualify for the 24-month STEM OPT extension if your degree is in computer science, electrical engineering, information systems, statistics, or another field on the DHS STEM Designated Degree Program List. Your employer must file the I-983 Training Plan, which documents the learning objectives of your role. Keep the plan specific — list the Spark frameworks, data warehouse platforms, and pipeline tools you're working with. The 90-day unemployment limit under OPT applies from your EAD start date, so align your start date to when you have a signed offer.

Cap-exempt options — if you're considering a PhD or postdoctoral research position, universities and qualifying nonprofit research organizations are cap-exempt H-1B employers. Data engineering roles at these institutions (building research data infrastructure) can be filed at any time without waiting for the H-1B lottery. See our full guide on cap-exempt H-1B employers.

PERM and EB categories — data engineering maps cleanly to EB-2 (advanced degree or National Interest Waiver for senior researchers) and EB-3 (skilled workers). EB-2 NIW self-petition is a realistic path for data engineers who have published work on novel methods or built infrastructure that has demonstrated national-scale impact, though the standard is meaningfully higher than "experienced engineer." For EB-3, PERM labor certification is the typical path and takes 12-24 months before the I-140 stage.

For a full breakdown of the data engineer sponsorship landscape, see our data engineer H-1B sponsorship guide.

Prep timeline for international candidates

The specific constraint for international candidates is the employment authorization window. If you're on STEM OPT, you may have 18-24 months remaining. If you're in cap-gap after the H-1B lottery, you have a hard deadline. This timeline is calibrated for 8 weeks of focused prep, which fits within a typical job search starting 2 months before your target start date.

Week 1 — Baseline assessment. Do three LeetCode SQL mediums, one system design practice problem (write it out, don't just think about it), and review your current resume against the job description patterns you're targeting. Identify your weakest area from SQL, distributed systems, or pipeline design.
Week 2 — SQL depth. Complete the StrataScratch top-50 data engineer questions. Focus on window functions, deduplication, and multi-table joins. Time yourself — target 20-30 minutes per medium question.
Week 3 — ETL concepts. Read the Airflow documentation on DAG best practices. Set up a local Airflow instance (Docker Compose works) and build a simple three-task DAG that reads a CSV, transforms it, and writes output. Having hands-on experience with the scheduler is a strong interview differentiator.
Week 4 — Spark fundamentals. Complete the Databricks free Spark training (Apache Spark Programming with Databricks). Focus on the shuffle internals module and the Spark UI walkthrough.
Week 5 — System design practice. Practice two full system design problems out loud or with a partner. Use the five-layer framing. Record yourself if you don't have a practice partner — watching the recording is uncomfortable but effective.
Week 6 — Kafka and streaming. Read Confluent's Kafka documentation on consumer groups and offset management. Build a small producer-consumer demo on your local machine. Understand watermarks in Spark Structured Streaming at the code level.
Week 7 — Mock interviews. Do two to three mock interviews on Pramp or Interviewing.io. Pay attention to how you communicate trade-offs, not just whether your answers are correct. See our full guide on coding interview prep timelines for international students.
Week 8 — Application push. Have 15-20 applications out at companies with documented H-1B sponsorship history. Prepare your "do you need sponsorship" answer — be direct, positive, and brief. See our guide on how to answer the sponsorship question in interviews.

Common mistakes

Over-indexing on LeetCode algorithmic problems. Data engineering interviews rarely test tree traversals or dynamic programming. The prep time that software engineers spend on graph algorithms should go toward SQL window functions, distributed systems concepts, and pipeline design patterns.

Writing SQL that works but can't explain. Interviewers probe reasoning. If you write a LAG() function correctly but can't explain what the window frame means, the round goes poorly. Practice explaining every query clause out loud.

Treating system design as a monologue. Experienced interviewers want a conversation. If you launch into a design without asking about requirements, you signal that you build solutions before understanding problems — a real red flag for data engineers who often work with ambiguous stakeholder requests.

Ignoring data quality in system design. Pipelines that don't account for NULL values, schema drift, upstream source failures, or duplicate events are incomplete answers. Always mention where your design handles bad data — a dead-letter queue, a data quality assertion step, or an alerting rule.

Underestimating the take-home. Take-home pipeline exercises separate candidates who've read about data engineering from those who've actually built things. Set up a local environment early in your prep so debugging a Docker Compose error isn't a take-home surprise.

Treating the sponsorship question as an obstacle. Bring it up proactively with the recruiter in the first call. Companies that won't sponsor will tell you early; companies that do sponsor appreciate the directness. Uncertainty about your timeline is worse for both sides than clarity.

Frequently asked questions

What SQL topics are most commonly tested in data engineering interviews?

Window functions (ROW_NUMBER, RANK, LAG/LEAD), aggregations with GROUP BY and HAVING, self-joins, CTEs, and query optimization (explain plans, indexes) are the most frequently tested topics. Many interviewers give a "dirty data" prompt — a table with NULLs, duplicates, or inconsistent types — and ask you to clean and aggregate it in a single query. Practice these on platforms like LeetCode SQL or StrataScratch using real interview questions tagged for data engineer roles.

How should I approach a data pipeline system design question in an interview?

Use a five-layer framing — ingestion, transport, storage, transformation, and serving — and drive with clarifying questions before drawing anything. Ask about data volume, latency SLA (real-time vs batch), upstream source reliability, and downstream consumer type. Then anchor each layer choice to those requirements. For high-throughput streaming use cases name Kafka as the transport and Spark Structured Streaming or Flink as the processor; for batch use cases name Airflow or Prefect for orchestration with Spark or dbt for transformation.

Is data engineering a realistic career path for getting H-1B sponsorship?

Yes — data engineering is squarely within H-1B specialty-occupation criteria because the role requires a bachelor's degree or higher in a specific technical discipline (computer science, information systems, or closely related). Most large tech, fintech, and healthcare companies actively sponsor, and the role's project-based nature makes it easy to document the employer-employee relationship that USCIS requires. Check the DOL LCA disclosure database to verify a specific company's sponsorship history before applying.

How does the 90-day OPT unemployment limit affect a data engineer job search?

The 90-day clock starts the day your OPT EAD becomes valid and counts any day you are not employed in a position directly related to your degree. Data engineering roles qualify as STEM-related for both OPT and STEM OPT purposes when your degree is in computer science, electrical engineering, information systems, or a recognized STEM field. To avoid burning unemployment days during a long job search, start applying 3-4 months before your EAD start date and line up offers to begin close to that date.

What Spark and Kafka concepts do interviewers actually test at the mid-level?

For Spark, interviewers focus on the difference between transformations and actions, the cost of shuffles (why groupByKey is slower than reduceByKey), partitioning strategies, and how to diagnose a slow stage using the Spark UI. For Kafka, they test topic partitioning and consumer group mechanics, offset management (at-least-once vs exactly-once semantics), and how you handle consumer lag. Being able to draw a Kafka-to-Spark Structured Streaming pipeline on a whiteboard and explain backpressure handling is a strong mid-level signal.

Working through the job search while tracking an OPT or H-1B timeline? F1Jobs — we help international data engineers find sponsoring companies and navigate the process from first application to offer.