Forum

Posted by Luming, Dec. 14, 2022, 7:29 p.m.

140

Hi all,

I am wondering whats the difference between putting the "restaurant_id = 100011" in where and ON clause?

For example,
SELECT date, COALESCE(restaurant_id,100011), SUM(COALESCE(f.total_amt,0)) as daily_revenue
FROM dates d 
LEFT JOIN food_order f
ON d.date = DATE(f.created_at)  AND restaurant_id = 100011
WHERE DATE(date)>='2021-08-01' AND DATE(date)<='2021-08-31'
GROUP BY 1,2


vs

SELECT date, COALESCE(restaurant_id,100011), SUM(COALESCE(f.total_amt,0)) as daily_revenue
FROM dates d 
LEFT JOIN food_order f
ON d.date = DATE(f.created_at)
WHERE DATE(date)>='2021-08-01' AND DATE(date)<='2021-08-31' AND restaurant_id = 100011
GROUP BY 1,2

Best
Ming

Answers

Hi Ming,

This is a great question. It really boils down to how the engine executes the query (in which order).

The JOIN condition always run before the WHERE condition.

In your first queries:

Since the LEFT JOIN condition is first run, many restaurants that are not 100011 will still have a lot of matched records:  they don't meet the restaurant_id condition  but will meet the dates condition.

Those restaurants will still have a record with a valid date and everything else will be NULLs, and they are included before the aggregation function ran. So imagin if you are using an AVG, or statistics function, those NULL records may have an impact to your final results (after the aggregation).

It could be much clearer if you ran the following queries and see the results.

1) -- Results will include restaurant_id != 100011

SELECT date, f.restaurant_id, f.order_id
FROM dates d
LEFT JOIN food_order f
ON d.date = DATE(f.created_at)
AND restaurant_id = 100011
WHERE DATE(date)>='2021-08-01'
AND DATE(date)<='2021-08-31'
;

2) --  Results only include restaurant_id = 100011

SELECT date, f.restaurant_id, f.order_id
FROM dates d
LEFT JOIN food_order f
ON d.date = DATE(f.created_at)
WHERE DATE(date)>='2021-08-01' 
AND DATE(date)<='2021-08-31'
AND restaurant_id = 100011
;

Hope this helps.

-Leon

@Leon, would you downgrade someone who does a cross-join of date+restaurants to get all date/restaurant combinations to fill in the blank revenue dates? Yes, more intensive on the database but allows more configuration beyond just restaurant 100011 in the future.

WITH
 
all_dates_and_restaurants AS (
SELECT
DISTINCT
restaurant_id,
date
FROM dates
CROSS JOIN food_order
)
 
SELECT
date,
adr.restaurant_id,
SUM(COALESCE(total_amt, 0)) as daily_revenue
FROM all_dates_and_restaurants adr
LEFT JOIN food_order f
ON adr.restaurant_id = f.restaurant_id
AND adr.date = DATE(f.created_at)
WHERE EXTRACT(MONTH FROM date) = 8
AND EXTRACT(YEAR FROM date) = 2021
AND adr.restaurant_id = 100011
GROUP BY date, adr.restaurant_id
ORDER BY date ASC

 

Colin, Dec. 20, 2023, 8:44 p.m.

Hey @colin,

Great question! Generally, I don't recommend using Cartesian or cross joins. The main issue is that you'll likely be unable to utilize indexes from the Cartesian join results, rendering your query quite inefficient.

However, if you create a physical table from the cross-joined results and then add indexes, your subsequent queries might perform well.

Another point to consider, from a non-technical perspective, is that in the business world, metrics like sales, revenue, or traffic often follow a long-tail distribution. For instance, the top 10 restaurants might account for 90% of total orders. Consequently, a cross join might generate a lot of 'sparseness' in your results.

But given that this is a simulated scenario, and I think that a cross join does make the code more readable for others, it could be a good approach in this context.

Cheers,

-Leon

Leon, Dec. 20, 2023, 9:38 p.m.

Hey @colin,

Great question! Generally, I don't recommend using Cartesian or cross joins. The main issue is that you'll likely be unable to utilize indexes from the Cartesian join results, rendering your query quite inefficient.

However, if you create a physical table from the cross-joined results and then add indexes, your subsequent queries might perform well.

Another point to consider, from a non-technical perspective, is that in the business world, metrics like sales, revenue, or traffic often follow a long-tail distribution. For instance, the top 10 restaurants might account for 90% of total orders. Consequently, a cross join might generate a lot of 'sparseness' in your results.

But given that this is a simulated scenario, and I think that a cross join does make the code more readable for others, it could be a good approach in this context.

Cheers,

-Leon

Leon, Dec. 20, 2023, 9:55 p.m.
SQLPad user avatar

Leon (949)

Dec. 15, 2022, 9:38 a.m.