How to Solve Gaps and Islands SQL Interview Questions with LAG and ROW_NUMBER

SQL Updated Mar 25, 2026 7 mins read Leon Leon
How to Solve Gaps and Islands SQL Interview Questions with LAG and ROW_NUMBER cover image

Quick summary

Summarize this blog with AI

Introduction

Gaps and islands questions slow people down because they rarely look like standard “join and group by” interview prompts. The sample data is messy, the business story sounds custom, and the right answer usually takes a couple of structured steps.

But most of these problems reduce to the same core task: identify where a new run starts, assign a stable group identifier, and aggregate within each group.

That is all “gaps and islands” really means. An island is a contiguous run of related rows. A gap is the break between one run and the next. Interviewers reuse this structure for login streaks, sessionization, consecutive purchases, unchanged-price periods, subscription states, and activity windows.

Once you learn how to classify the pattern, the SQL becomes much more mechanical.

How to Spot the Pattern Quickly

If the prompt asks for one of these, you should think gaps-and-islands immediately:

  • consecutive active days or months
  • streaks
  • periods where a value stayed the same
  • grouped runs separated by a threshold gap
  • sessions built from event timestamps
  • start and end dates for continuous states

The right mental question is not “what fancy SQL trick do I need?” It is:

“What condition starts a new group?”

Pick the Pattern Before You Write the Query

Most interview versions fall into one of these buckets:

Problem shape Best starting pattern Why
Consecutive days or months ROW_NUMBER() plus shifted date key Best when continuity is strict and calendar-based
Value changes over time LAG() plus running SUM() Best when the previous row decides whether a new segment starts
Sessions with a time threshold LAG() plus time-diff flag Best when a gap such as 30 minutes defines the boundary

If you choose the grouping rule first, the code becomes easier to explain and debug.

Start by Locking the Grain

Before you touch a window function, define four things out loud:

  • What is the entity? User, account, device, subscription, product?
  • What is the ordering column? Date, timestamp, month bucket?
  • What makes two rows belong to the same island?
  • What event starts a new island?

Examples:

  • For daily login streaks, a new island starts when a day is missing.
  • For a status-history table, a new island starts when the status changes.
  • For sessionization, a new island starts when the time gap exceeds a threshold.

Many candidates know LAG() and ROW_NUMBER() but still lose points because they never define what continuity means for the business question.

Pattern A: Consecutive Dates with ROW_NUMBER()

When the prompt is about exact consecutive calendar periods, one of the cleanest patterns is:

  1. Deduplicate to the real business grain.
  2. Assign a row number ordered by date.
  3. Subtract the row-number offset from the date.
  4. Group by the derived key.
WITH login_days AS (
    SELECT DISTINCT
        user_id,
        login_at::date AS login_day
    FROM logins
), numbered AS (
    SELECT
        user_id,
        login_day,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY login_day
        ) AS rn
    FROM login_days
), grouped AS (
    SELECT
        user_id,
        login_day,
        login_day - (rn * INTERVAL '1 day') AS island_key
    FROM numbered
)
SELECT
    user_id,
    MIN(login_day) AS streak_start,
    MAX(login_day) AS streak_end,
    COUNT(*) AS streak_days
FROM grouped
GROUP BY user_id, island_key
ORDER BY user_id, streak_start;

Why this works: consecutive dates move forward one step at a time, and the row number also moves forward one step at a time. Their offset stays constant inside the streak.

This is the best pattern for:

  • daily streaks
  • consecutive months with purchases
  • contiguous calendar activity

Date-Arithmetic Dialect Notes

The grouping idea is portable, but date arithmetic syntax varies.

  • PostgreSQL: login_day - (rn * INTERVAL '1 day')
  • MySQL: DATE_SUB(login_day, INTERVAL rn DAY)
  • SQL Server: DATEADD(day, -rn, login_day)
  • BigQuery: DATE_SUB(login_day, INTERVAL rn DAY)

In interviews, it is fine to say, “I know the pattern; exact interval syntax is dialect-specific.” That is a much better move than freezing on a function name.

Pattern B: State Changes or Threshold Gaps with LAG()

When continuity depends on the previous row's value or timestamp, LAG() is usually the right starting point.

The workflow is:

  1. compare each row to the previous row
  2. flag the rows that start a new segment
  3. convert those flags into group IDs with a running sum
  4. aggregate each group
WITH ordered AS (
    SELECT
        account_id,
        event_ts,
        status,
        LAG(status) OVER (
            PARTITION BY account_id
            ORDER BY event_ts
        ) AS prev_status,
        LAG(event_ts) OVER (
            PARTITION BY account_id
            ORDER BY event_ts
        ) AS prev_event_ts
    FROM account_status_history
), flagged AS (
    SELECT
        account_id,
        event_ts,
        status,
        CASE
            WHEN prev_status IS NULL THEN 1
            WHEN status <> prev_status THEN 1
            WHEN event_ts - prev_event_ts > INTERVAL '30 minutes' THEN 1
            ELSE 0
        END AS starts_new_group
    FROM ordered
), grouped AS (
    SELECT
        account_id,
        event_ts,
        status,
        SUM(starts_new_group) OVER (
            PARTITION BY account_id
            ORDER BY event_ts
            ROWS UNBOUNDED PRECEDING
        ) AS group_id
    FROM flagged
)
SELECT
    account_id,
    status,
    MIN(event_ts) AS segment_start,
    MAX(event_ts) AS segment_end
FROM grouped
GROUP BY account_id, status, group_id
ORDER BY account_id, segment_start;

This is the better pattern whenever “same group” depends on previous-row comparison instead of strict calendar adjacency.

Worked Example: Login Streaks of Three or More Days

Suppose the prompt is:

“Return users who logged in on at least three consecutive days, along with the start and end of each streak.”

The clean approach is:

  • convert timestamps to days
  • deduplicate same-day logins
  • build islands of consecutive days
  • keep only streaks with length at least three
WITH login_days AS (
    SELECT DISTINCT
        user_id,
        login_at::date AS login_day
    FROM logins
), numbered AS (
    SELECT
        user_id,
        login_day,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY login_day
        ) AS rn
    FROM login_days
), grouped AS (
    SELECT
        user_id,
        login_day,
        login_day - (rn * INTERVAL '1 day') AS island_key
    FROM numbered
), streaks AS (
    SELECT
        user_id,
        MIN(login_day) AS streak_start,
        MAX(login_day) AS streak_end,
        COUNT(*) AS streak_days
    FROM grouped
    GROUP BY user_id, island_key
)
SELECT *
FROM streaks
WHERE streak_days >= 3
ORDER BY user_id, streak_start;

The key quality move here is the DISTINCT in the first CTE. If the business question is about active days, duplicate logins on the same day should not inflate the streak.

Worked Example: Sessionization with a 30-Minute Threshold

Now change the prompt:

“Group each user's events into sessions. A new session starts when more than 30 minutes pass between events.”

That is not a ROW_NUMBER() date-key problem. It is a previous-row comparison problem.

WITH ordered AS (
    SELECT
        user_id,
        event_ts,
        LAG(event_ts) OVER (
            PARTITION BY user_id
            ORDER BY event_ts
        ) AS prev_event_ts
    FROM events
), flagged AS (
    SELECT
        user_id,
        event_ts,
        CASE
            WHEN prev_event_ts IS NULL THEN 1
            WHEN event_ts - prev_event_ts > INTERVAL '30 minutes' THEN 1
            ELSE 0
        END AS starts_new_session
    FROM ordered
), grouped AS (
    SELECT
        user_id,
        event_ts,
        SUM(starts_new_session) OVER (
            PARTITION BY user_id
            ORDER BY event_ts
            ROWS UNBOUNDED PRECEDING
        ) AS session_id
    FROM flagged
)
SELECT
    user_id,
    session_id,
    MIN(event_ts) AS session_start,
    MAX(event_ts) AS session_end,
    COUNT(*) AS events_in_session
FROM grouped
GROUP BY user_id, session_id
ORDER BY user_id, session_start;

Same pattern, different grouping rule.

What to Say Before You Type

A strong interview explanation sounds like this:

“I am going to partition by user, order by the event date, define what starts a new group, build a stable group identifier, and then aggregate each group into start and end boundaries.”

That is much better than jumping straight to window functions without framing the problem.

If the prompt is ambiguous, ask one targeted question first:

  • Should multiple events on the same day count once or multiple times?
  • Does any missing day break the streak?
  • Is the boundary exact calendar adjacency or a threshold gap?
  • Are we grouping by value stability or by time continuity?

Debugging Checklist

If the final result looks wrong, inspect the intermediate rows before you aggregate.

Check these columns directly:

  • entity id
  • ordered date or timestamp
  • previous date or previous value
  • break flag
  • derived group id

Most errors become obvious there. You will usually catch the bug faster by validating the flagged rows than by rewriting the final aggregation repeatedly.

To go one step wider on adjacent interview patterns:

FAQ

Do I always need LAG() for gaps and islands?

No. If the problem is strict calendar consecutiveness, ROW_NUMBER() plus a shifted date key is often enough. LAG() becomes more useful when the previous row decides whether a new segment starts.

Why should I deduplicate to one row per day before building streaks?

Because many event tables contain multiple records per user per day. If the business question is about active days, duplicates will distort the grouping unless you collapse to the intended grain first.

What if the gap threshold is 30 days or 30 minutes instead of exact consecutiveness?

That usually points to the LAG() plus running-sum pattern. Compare the current row with the previous row, flag the boundary, then turn the flags into stable group IDs.

Interview Prep

Begin Your SQL, Python, and R Journey

Master 230 interview-style coding questions and build the data skills needed for analyst, scientist, and engineering roles.

Related Articles

All Articles