Quick summary
Summarize this blog with AI
Data engineering and analytics engineering interviews increasingly test more than generic SQL. You may still write joins, aggregations, CTEs, and window functions, but the prompt often includes warehouse realities: nested event payloads, incremental models, late-arriving rows, slowly changing dimensions, data tests, and compute cost.
That is where Snowflake and dbt-style prep differs from ordinary analyst SQL prep. The syntax matters, but the stronger signal is whether your SQL choices hold up in a maintained warehouse.
This guide focuses on the Snowflake and dbt concepts most likely to appear in SQL-heavy interviews: QUALIFY, LATERAL FLATTEN, semi-structured data, facts and dimensions, SCDs, incremental models, tests, and performance tradeoffs.
What the Interviewer Is Testing
A generic SQL screen asks, "Can you get the correct result?" A Snowflake or dbt-oriented screen adds several questions:
- Can this transformation run repeatedly without corrupting data?
- Does the model have a clear grain?
- Will it handle late-arriving or updated source rows?
- Can analysts reuse it without copying business logic?
- Do tests protect the assumptions?
- Is the warehouse doing unnecessary work?
Expect prompts like:
- Deduplicate a raw landing table.
- Keep the latest record per business key.
- Flatten nested JSON event properties.
- Build a fact table and supporting dimensions.
- Track plan changes over time.
- Design an incremental dbt model.
- Add tests for uniqueness, freshness, and referential integrity.
- Reduce cost without changing the answer.
Core SQL Still Comes First
Do not skip fundamentals. You still need:
- Joins and anti-joins.
- Aggregations and conditional aggregation.
- Window functions for ranking and deduplication.
- Date filters and date buckets.
- CTEs that make transformations readable.
- NULL and duplicate handling.
- Validation with row counts and distinct counts.
The difference is context. "Find the latest order per customer" becomes "build a current customer snapshot from a stream of updates." The SQL pattern is familiar, but the production risk is larger.
QUALIFY for Window Filters
Snowflake supports QUALIFY, which filters rows after window functions are evaluated. It is useful for latest-row, top-N, and deduplication queries.
SELECT
user_id,
event_id,
event_name,
event_timestamp
FROM raw_events
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY event_timestamp DESC, event_id DESC
) = 1;
The interview detail is not just knowing the keyword. Explain the tie-breaker. If two events have the same timestamp, event_id DESC makes the chosen row deterministic.
Also know the portable version:
WITH ranked_events AS (
SELECT
user_id,
event_id,
event_name,
event_timestamp,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY event_timestamp DESC, event_id DESC
) AS rn
FROM raw_events
)
SELECT *
FROM ranked_events
WHERE rn = 1;
In an interview, say: "QUALIFY is concise in Snowflake, but the CTE version is more portable across SQL dialects."
FLATTEN for Nested Data
Snowflake's FLATTEN table function expands arrays or objects from semi-structured data into rows. It is commonly used with LATERAL when a row contains a JSON array.
Example payload:
{
"event_id": "evt_123",
"user_id": "u_9",
"products": [
{"product_id": "p1", "quantity": 2},
{"product_id": "p2", "quantity": 1}
]
}
Flattening products:
SELECT
e.event_id,
e.payload:user_id::string AS user_id,
item.value:product_id::string AS product_id,
item.value:quantity::number AS quantity
FROM raw_events e,
LATERAL FLATTEN(input => e.payload:products) item;
The key concept is grain. Before flattening, the table is one row per event. After flattening, it is one row per event product. If you later count events with COUNT(*), you will overcount events that contain multiple products.
If you need to keep events that have no products, discuss outer flattening:
SELECT
e.event_id,
item.value:product_id::string AS product_id
FROM raw_events e,
LATERAL FLATTEN(input => e.payload:products, outer => TRUE) item;
A strong answer says: "I would flatten only in a product-level model, document that the grain changed, and aggregate back before joining to event-level metrics."
VARIANT and Semi-Structured Tradeoffs
Snowflake can store semi-structured data in VARIANT, but "keep everything as JSON" is not a complete strategy.
A practical warehouse design usually does this:
- Keep raw payloads for traceability and replay.
- Extract stable, frequently used fields into typed staging columns.
- Flatten repeated arrays into child models when analysts query them often.
- Preserve the raw field when schema drift is expected.
- Add tests for required extracted fields.
Example staging model:
SELECT
payload:event_id::string AS event_id,
payload:user_id::string AS user_id,
payload:event_name::string AS event_name,
payload:event_timestamp::timestamp_ntz AS event_timestamp,
payload AS raw_payload,
loaded_at
FROM raw_events;
The interview tradeoff: typed columns make common queries simpler and safer; raw payloads preserve flexibility when the event schema changes.
Facts, Dimensions, and Grain
Warehouse interviews often test whether you can name the grain of a model.
- Fact tables: Measurements or events, such as orders, payments, page views, sessions, or tickets.
- Dimension tables: Descriptive entities, such as customers, products, accounts, campaigns, or plans.
- Bridge tables: Many-to-many relationships, such as users to accounts or products to categories.
For every model, say the grain in one sentence:
fct_orders: one row per order.fct_order_items: one row per order item.dim_customers: one row per current customer.dim_customer_history: one row per customer validity period.
If you cannot state the grain, the model is not ready for downstream dashboards.
Slowly Changing Dimensions
Slowly changing dimensions test whether you understand history. A Type 1 dimension overwrites changes. A Type 2 dimension keeps historical versions.
Current-state customer dimension:
customer_id | plan
------------+------
c_1 | pro
Historical customer dimension:
customer_id | plan | valid_from | valid_to | is_current
------------+-------+------------+------------+-----------
c_1 | free | 2026-01-01 | 2026-03-15 | false
c_1 | pro | 2026-03-15 | null | true
The right choice depends on the question. If support only needs the current plan, Type 1 may be enough. If finance wants revenue by the plan a customer had at purchase time, you need historical validity.
Current record query:
SELECT
customer_id,
plan,
updated_at
FROM customer_updates
QUALIFY ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY updated_at DESC, ingestion_id DESC
) = 1;
A stronger answer mentions time semantics. Is updated_at source event time, warehouse ingestion time, or processing time? Late-arriving records can break a model that uses the wrong timestamp.
dbt Model Layers
dbt terminology appears often in analytics engineering interviews. A clean model stack usually has these layers:
- Sources: Raw tables loaded from applications or third-party systems.
- Staging: Light cleanup, renaming, type casting, and one-to-one source shaping.
- Intermediate: Reusable business logic, joins, sessionization, and deduplication.
- Marts: Final facts and dimensions used by dashboards and analysts.
Do not put heavy business logic in staging. Staging should make source data predictable. Business definitions belong in intermediate or mart models where the intent is clear.
Incremental Models
dbt incremental models process new or changed records instead of rebuilding the entire table every run. They are valuable for large tables, but they introduce correctness risk.
A simplified incremental filter looks like this:
SELECT
order_id,
customer_id,
status,
updated_at
FROM source_orders
WHERE updated_at >= (
SELECT COALESCE(MAX(updated_at), '1900-01-01')
FROM target_orders
);
That pattern is incomplete by itself. In an interview, discuss:
- The unique key used for merge or upsert.
- Late-arriving source records.
- Updates to previously loaded rows.
- Deletes and cancellations.
- A lookback window to reprocess recent records.
- When to run a full refresh.
- How to reconcile the target to the source.
A stronger answer:
"I would configure the model as incremental with a stable unique_key, use merge behavior where supported, include a short lookback window for late updates, and add uniqueness and freshness tests so we know the target remains trustworthy."
dbt Tests
dbt data tests are a common interview topic because they turn assumptions into checks.
Know the basic tests:
not_nullfor required fields.uniquefor primary or business keys.relationshipsfor referential integrity.accepted_valuesfor constrained status/type columns.
Example:
models:
- name: fct_orders
columns:
- name: order_id
data_tests:
- not_null
- unique
- name: customer_id
data_tests:
- not_null
- relationships:
arguments:
to: ref('dim_customers')
field: customer_id
- name: status
data_tests:
- accepted_values:
arguments:
values: ['paid', 'refunded', 'cancelled']
Also mention custom tests. Revenue should reconcile to the source. Freshness should alert if raw data stops arriving. A dashboard-critical table may need a check that daily row count does not unexpectedly drop to zero.
Late-Arriving Data
Late-arriving data is a frequent interview differentiator. A naive incremental model can miss records if it only loads rows newer than the target's maximum timestamp.
Example risk:
- The target table has loaded through May 20.
- A source record for May 18 arrives on May 21.
- The incremental filter only loads records after May 20.
- The late May 18 record is missed.
A common mitigation is a lookback window:
WHERE updated_at >= (
SELECT DATEADD(day, -3, COALESCE(MAX(updated_at), '1900-01-01'))
FROM target_orders
)
This reprocesses recent data and relies on a merge key to update existing rows safely. The right window depends on source behavior and business tolerance.
Performance and Cost
Snowflake performance questions are often cost questions. A query that scans too much data is slow and expensive.
Practical talking points:
- Select only the columns needed in intermediate models.
- Filter early only when it is logically safe.
- Materialize reused expensive transformations.
- Avoid repeatedly flattening the same nested payload downstream.
- Use clustering thoughtfully for large tables with common filter patterns.
- Inspect query profiles instead of guessing.
- Balance full-refresh simplicity against incremental-model complexity.
Be careful with "filter early." It is good for performance only if it does not change the business population. Filtering before a first-ever calculation can produce the wrong answer faster.
A Strong End-to-End Answer
Prompt: "Raw event data arrives as JSON. Analysts need daily conversion by marketing channel. Records can arrive late. How would you model this?"
A strong answer:
"I would keep the raw events table immutable for replay. Then I would build a staging model that extracts typed fields such as event_id, user_id, event_name, event_timestamp, marketing_channel, and loaded_at from the VARIANT payload. If the payload contains repeated products or attributes, I would flatten those into separate child models and document that the grain changes. For conversion, I would build an intermediate model at user-day or session grain depending on the definition, then a mart with daily conversion by channel. I would make the large downstream model incremental with a stable unique key and a lookback window for late events. I would add not-null and unique tests for event_id, accepted-values tests for event_name where appropriate, relationship tests to dimensions, and freshness checks on the raw source. Finally, I would validate daily counts against the source before trusting the dashboard."
This answer connects syntax, grain, model layers, tests, late data, and warehouse cost. That is the bar.
Seven-Day Prep Plan
| Day | Focus | Practice task |
|---|---|---|
| 1 | QUALIFY and windows | Latest record, top-N, dedup, deterministic tie-breakers. |
| 2 | Nested data | Extract JSON fields and flatten arrays without losing grain. |
| 3 | Facts and dimensions | Name model grain and choose fact vs dimension boundaries. |
| 4 | SCDs | Design current-state and historical customer dimensions. |
| 5 | dbt layers and tests | Design source, staging, intermediate, mart, and data tests. |
| 6 | Incremental models | Handle unique keys, lookbacks, late data, and full refreshes. |
| 7 | Mock system prompt | Talk through an end-to-end event pipeline. |
Final Checklist
Before a Snowflake or dbt SQL interview, make sure you can explain:
- Why
QUALIFYis useful and how to write the CTE alternative. - How
FLATTENchanges row grain. - When to extract
VARIANTfields into typed columns. - The difference between fact, dimension, bridge, and history tables.
- When Type 1 and Type 2 dimensions are appropriate.
- How incremental models can miss late-arriving records.
- Which dbt tests protect each model assumption.
- How query design affects Snowflake cost.
- How to validate row counts, uniqueness, freshness, and reconciliation.
The goal is not to sound like a documentation index. The goal is to show that your SQL is maintainable: correct grain, clear model boundaries, tested assumptions, safe incremental logic, and cost-aware execution.