SQL Data Modeling for Analyst Interviews: Grain, Keys

Quick summary

Summarize this blog with AI

Many analyst SQL interviews start with query writing, then quietly move into data modeling. The interviewer may ask how tables should relate, why a join duplicates rows, whether a metric should be calculated from events or orders, or how you would sketch an entity relationship diagram. Those questions are not only for database administrators. They test whether you can reason about data before writing SQL against it.

The good news: you do not need to sound like a database architect. You need a reliable vocabulary and a habit of defining the grain before you join or aggregate.

From concept to interview answer

Data modeling questions are where memorized SQL answers stop working.

SQLPad helps you practice schema-based interview questions with realistic business tables, so you learn to define grain, prevent duplicate joins, and explain the metric before writing the query.

Practice schema-based SQL interviews

Start with Grain

The grain is what one row represents. If you skip this step, the rest of the model is easy to misunderstand.

users: one row per user.
orders: one row per order.
order_items: one row per product on an order.
payments: one row per payment attempt or transaction.
events: one row per user action.

Two tables can both be “about orders” while having different grains. orders might have one row per checkout. order_items might have three rows for the same order if the customer bought three products. Joining them and then counting orders without care will inflate the answer.

In interviews, say: “Before I write the query, I want to confirm the grain of each table.” That sentence prevents many downstream mistakes.

Keys and Cardinality

A primary key uniquely identifies a row. A foreign key points to a row in another table. Cardinality describes how many rows can match across that relationship.

Relationship	Example	Interview Risk
One-to-one	One user profile per user	Usually safe to join without changing row count
One-to-many	One user has many orders	Counts and sums can multiply after the join
Many-to-many	Products can appear in many orders, orders contain many products	Needs a bridge table such as `order_items`

If an interviewer asks why a join creates duplicate rows, the answer is usually cardinality, not a missing DISTINCT. DISTINCT may hide the symptom, but it does not prove the metric is correct.

A Simple ER Model for Ecommerce Analytics

For an ecommerce analytics prompt, a clean first-pass model might look like this:

customers(customer_id, signup_at, acquisition_channel)
orders(order_id, customer_id, ordered_at, order_status)
order_items(order_item_id, order_id, product_id, quantity, item_price)
products(product_id, category, list_price)
payments(payment_id, order_id, payment_status, paid_at, amount)

The relationships are:

One customer can have many orders.
One order can have many order items.
One product can appear in many order items.
One order can have zero, one, or multiple payment attempts.

That last line matters. If payments are one row per attempt, then summing payment amounts can overstate revenue unless you filter to successful captures and handle refunds. The schema shape tells you where metric definitions can break.

Normalization and Denormalization

Normalization means storing facts once and using relationships to connect them. For example, product category belongs in products, not copied into every row of order_items. Normalized models reduce update errors and make the source of truth clearer.

Denormalization means intentionally copying or precomputing data to make reads faster or reporting easier. A daily revenue summary table is denormalized. So is a wide dashboard table with customer attributes already attached.

A balanced interview answer:

“I would keep the source model normalized so updates and constraints are reliable. For analytics, I might build denormalized marts or summary tables when the query is expensive, frequently reused, or needed by a dashboard. I would document the grain and refresh logic so consumers do not mistake a summary table for raw transactions.”

Views, Materialized Views, and Summary Tables

Views can make repeated logic easier to use, but they do not automatically make a query faster. A normal view is usually stored SQL. The database still evaluates it when queried.

Materialized views and summary tables store results. They can speed up dashboards, but they introduce refresh timing, stale data, and ownership questions.

Object	Best Use	Main Tradeoff
View	Shared definition for clean joins or metrics	Can still be slow if the underlying query is slow
Materialized view	Reusable expensive result	Needs refresh strategy
Summary table	Dashboard-ready metric table	Needs data quality checks and lineage

For analyst interviews, it is usually enough to know when these tools are appropriate and what can go wrong.

Validation Queries That Show Modeling Maturity

Data modeling is not only diagramming. It is also proving that assumptions hold.

Check whether a key is unique:

SELECT
  order_id,
  COUNT(*) AS row_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

Check whether a join will multiply rows:

SELECT
  o.order_id,
  COUNT(*) AS joined_rows
FROM orders o
JOIN order_items oi
  ON o.order_id = oi.order_id
GROUP BY o.order_id
HAVING COUNT(*) > 1;

This second query is not necessarily bad. It confirms that order_items is lower grain than orders. The mistake would be joining first and then counting * as if each row were still an order.

Check for orphaned foreign keys:

SELECT oi.order_id
FROM order_items oi
LEFT JOIN orders o
  ON oi.order_id = o.order_id
WHERE o.order_id IS NULL;

Check whether a dimension value is unexpectedly missing:

SELECT
  COUNT(*) AS items_missing_product
FROM order_items oi
LEFT JOIN products p
  ON oi.product_id = p.product_id
WHERE p.product_id IS NULL;

These validation queries are simple, but they signal that you care about correctness before presentation.

How to Answer a Schema Design Prompt

If an interviewer asks you to design tables for a product analytics problem, use this sequence:

State the business questions the model must answer.
Name the core entities.
Define the grain of each table.
Identify primary keys and foreign keys.
Call out one-to-many and many-to-many relationships.
Describe the metric risk, such as duplicate joins or late-arriving events.
Suggest validation checks.

You do not need a perfect design. You need a design that is explicit enough for someone else to query without guessing.

Common Interview Mistakes

Using DISTINCT before explaining why duplicates appeared.
Counting rows after joining a high-grain table to a low-grain table.
Assuming every ID column is unique without checking.
Ignoring payment attempts, refunds, cancellations, and late-arriving events.
Designing only for storage and not for the metrics people need to answer.

The strongest candidates do not pretend the model is obvious. They ask the grain question, make assumptions visible, and validate the risky joins.

Turn this into practice

A good next session is one schema prompt, one duplicate-join diagnosis, and one aggregate query where you explain the metric grain out loud.

Build interview-ready SQL fluency

FAQ

Do analysts need to know normalization?

Yes, at a practical level. You should know why source tables are often normalized, why analytics tables are often denormalized, and what correctness risks come from copying or pre-aggregating data.

Should I draw ER diagrams in an interview?

If the prompt asks for schema design, yes. Even a simple boxes-and-lines diagram helps. Label the primary keys, foreign keys, and one-to-many relationships.

Are triggers important for analyst interviews?

Usually less than grain, keys, joins, and constraints. It is useful to know that triggers run automatically on database changes, but most analyst interviews care more about whether you can query and validate a model correctly.

SQL Data Modeling for Analyst Interviews: Grain, Keys, Normalization, and ER Diagrams