Pandas GroupBy Interview Questions: agg, transform

Q: What is the difference between agg and transform?

agg usually reduces each group to summary rows. transform returns values aligned to the original rows. Use agg for summaries and transform for row-level features based on group-level calculations.

Q: Why does pivot fail with duplicate entries?

pivot requires exactly one value for each index-column pair. If multiple rows match, use pivot_table and specify the aggregation rule.

Q: How do I avoid duplicate rows after a merge?

Know the expected relationship and use validate . If each order should match one customer row, use validate="many_to_one" . If validation fails, fix the data or grouping before trusting the result.

Quick summary

Summarize this blog with AI

ChatGPT Claude Grok Perplexity

Pandas interview questions usually test data shape more than method memorization. The interviewer wants to know whether you can keep the row grain clear while grouping, merging, reshaping, and explaining the result.

The most common failure is using a correct-looking pandas method that silently changes the row count or duplicates rows after a merge. This guide gives you a practice set around the patterns that matter most: groupby, agg, transform, top-N per group, pivot_table, and grouped time series.

The Sample Dataset

We will use one small order table. One row is one order.

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 7],
    "customer_id": [10, 10, 20, 20, 20, 30, 30],
    "order_date": pd.to_datetime([
        "2026-01-03", "2026-01-10", "2026-01-05",
        "2026-02-01", "2026-02-08", "2026-01-20", "2026-02-20"
    ]),
    "revenue": [80, 120, 50, 75, 125, 200, 150],
    "channel": ["paid", "email", "paid", "email", "paid", "organic", "organic"]
})

Before coding, say the grain out loud: one row per order. After every operation, ask whether the output is still one row per order, one row per customer, one row per customer-month, or something else.

Question 1: Summarize Each Customer

Prompt: Return one row per customer with order count, total revenue, first order date, and last order date.

customer_summary = (
    orders
    .groupby("customer_id")
    .agg(
        order_count=("order_id", "count"),
        total_revenue=("revenue", "sum"),
        first_order=("order_date", "min"),
        last_order=("order_date", "max"),
    )
    .reset_index()
)

Expected output grain: one row per customer.

Why this is the right tool: agg reduces each group to summary values. Named aggregation keeps the output columns readable and avoids cleanup later.

Question 2: Add Customer Revenue to Every Order

Prompt: Keep one row per order, but add each customer's total revenue and each order's share of that customer revenue.

orders_with_share = orders.copy()
orders_with_share["customer_revenue"] = (
    orders_with_share.groupby("customer_id")["revenue"].transform("sum")
)
orders_with_share["share_of_customer_revenue"] = (
    orders_with_share["revenue"] / orders_with_share["customer_revenue"]
)

Expected output grain: one row per order.

Why this is the right tool: transform returns values aligned to the original rows. It is cleaner than aggregating and merging back when the desired result is row-preserving.

Question 3: Latest Order Per Customer

Prompt: Return each customer's latest order. If two orders have the same date, use the larger order_id.

latest_order = (
    orders
    .sort_values(
        ["customer_id", "order_date", "order_id"],
        ascending=[True, False, False],
    )
    .drop_duplicates("customer_id", keep="first")
)

Expected output grain: one row per customer.

Interview explanation: Sorting defines the rule. drop_duplicates then keeps the first row in each customer group. The tie-breaker makes the result deterministic.

If the interviewer asks for top two orders per customer, use cumcount after sorting:

ranked = orders.sort_values(
    ["customer_id", "revenue", "order_id"],
    ascending=[True, False, True],
)
ranked["revenue_position"] = ranked.groupby("customer_id").cumcount() + 1

top_two = ranked[ranked["revenue_position"] <= 2]

Use rank when ties should share a rank. Use cumcount when you need exactly N rows per group after deterministic ordering.

Question 4: Pivot Revenue by Channel

Prompt: Return one row per customer with columns for each channel's revenue.

This fails if a customer has multiple orders in the same channel:

orders.pivot(
    index="customer_id",
    columns="channel",
    values="revenue",
)

The correct version states how duplicate customer-channel pairs should be combined:

channel_revenue = (
    orders
    .pivot_table(
        index="customer_id",
        columns="channel",
        values="revenue",
        aggfunc="sum",
        fill_value=0,
    )
    .reset_index()
)

Interview explanation: pivot requires one value per index-column pair. pivot_table handles repeated pairs because you give it an aggregation rule.

Question 5: Monthly Revenue Per Customer

Prompt: Return monthly revenue for each customer.

monthly = (
    orders
    .set_index("order_date")
    .groupby("customer_id")
    .resample("MS")["revenue"]
    .sum()
    .reset_index()
)

This returns months where a customer has orders. If the prompt requires missing months to appear as zero, create a complete customer-month index:

customers = orders["customer_id"].sort_values().unique()
months = pd.date_range(
    orders["order_date"].min().to_period("M").to_timestamp(),
    orders["order_date"].max().to_period("M").to_timestamp(),
    freq="MS",
)

full_index = pd.MultiIndex.from_product(
    [customers, months],
    names=["customer_id", "order_date"],
)

monthly_complete = (
    monthly
    .set_index(["customer_id", "order_date"])
    .reindex(full_index, fill_value=0)
    .reset_index()
)

Interview explanation: Missing rows and zero revenue are different. If a customer-month must exist even without orders, build the missing combinations intentionally.

Question 6: Merge a Summary Back Safely

Prompt: Add customer-level total revenue back to each order.

customer_summary = (
    orders
    .groupby("customer_id", as_index=False)
    .agg(total_revenue=("revenue", "sum"))
)

orders_with_summary = orders.merge(
    customer_summary,
    on="customer_id",
    how="left",
    validate="many_to_one",
)

validate="many_to_one" catches a common mistake: the right-hand table accidentally has duplicate customer rows. If you only need a group-level value on each original row, transform is often simpler.

The Interview Script

Use this structure before you code:

Input grain: one row per order.
Output grain: one row per customer, customer-month, or original order.
Grouping keys: which columns define the group?
Aggregation rule: sum, count, max, first, mean, or custom logic?
Tie-breakers: how should equal dates or scores be handled?
Validation: do row counts and merge relationships make sense?

This is the pandas equivalent of defining SQL grain before joining or grouping. For SQL examples, see why SQL joins create duplicate rows.

Practice Drills

Try these without looking at the answers first:

Find the highest-revenue order per customer, with smaller order_id winning ties.
Calculate each channel's share of total revenue within each customer.
Return one row per customer-month-channel with missing combinations filled as zero.
Merge a customer table to orders and prove the relationship is many orders to one customer.

Answer sketch for the second drill:

channel = (
    orders
    .groupby(["customer_id", "channel"], as_index=False)
    .agg(channel_revenue=("revenue", "sum"))
)
channel["customer_revenue"] = (
    channel.groupby("customer_id")["channel_revenue"].transform("sum")
)
channel["channel_share"] = (
    channel["channel_revenue"] / channel["customer_revenue"]
)

Common Mistakes

Using agg when the original rows must remain.
Using pivot when duplicate index-column pairs exist.
Merging an aggregate back without checking duplicate keys.
Using rank when the requirement is exactly N rows per group.
Ignoring missing dates in time series output.
Explaining code without explaining the output grain.

FAQ

What is the difference between agg and transform?

agg usually reduces each group to summary rows. transform returns values aligned to the original rows. Use agg for summaries and transform for row-level features based on group-level calculations.

Why does pivot fail with duplicate entries?

pivot requires exactly one value for each index-column pair. If multiple rows match, use pivot_table and specify the aggregation rule.

Is groupby apply bad in interviews?

Not always, but it is often slower and less clear than built-in groupby operations. Use vectorized methods first. Reach for apply only when the logic truly requires custom per-group code.

How do I avoid duplicate rows after a merge?

Know the expected relationship and use validate. If each order should match one customer row, use validate="many_to_one". If validation fails, fix the data or grouping before trusting the result.

Pandas GroupBy Interview Questions: agg, transform, pivot_table, and Time Series