Forum

Posted by Luming, Nov. 25, 2021, 2:20 p.m.

#126

Hi Leo & Folks,
Here is my approach to this question:
WITH restaurant_avg_rating AS
(SELECT city,restaurant_id,
AVG(CASE WHEN thumb_up=True THEN 1 ELSE 0 END) as avg_rating
FROM trip t
JOIN trip_rating tr
ON t.trip_id = tr.trip_id
WHERE is_completed=True
GROUP BY 1,2
HAVING COUNT(DISTINCT t.trip_id)>=10)
SELECT *
FROM
(SELECT *,RANK()OVER(PARTITION BY city ORDER BY avg_rating DESC)as ranking
FROM restaurant_avg_rating) sub
WHERE ranking<=3
My question is even this query could run, however, it yields a different result. I am wondering why that happens? It that because the execution order of from->where->group by->agg function->having->select/self-define statement ->order by? If that's case, that's why we should filter the trip table by using 'HAVING COUNT(DISTINCT t.trip_id)>=10' first then further construct the query by using this 'having' statement?

Answers

Hey Luming,
That is a great question, the business logic behind this question is that we want to have an apple to apple comparison and exclude restaurants with less than 10 trips. (Because there might not be enough samples to really tell us how good the trip experience is).
You solution differs when computing average, the HAVING COUNT(DISTINCT t.trip_id)>=10 filters out restaurants with less than 10 trips in the final reporting, but when computing the average ratings, those restaurants were included.
Query 1:
SELECT city,restaurant_id,
AVG(CASE WHEN thumb_up=True THEN 1 ELSE 0 END) as avg_rating
FROM trip t
JOIN trip_rating tr
ON t.trip_id = tr.trip_id
WHERE is_completed=True
GROUP BY 1,2
HAVING COUNT(DISTINCT t.trip_id)>=10
ORDER BY 2;
Result 1:
city restaurant_id avg_rating
Shanghai 100002 0.92857142857142857143
London 100003 1.00000000000000000000
Portland 100004 0.92857142857142857143
Query 2:
SELECT city,restaurant_id,
AVG(CASE WHEN thumb_up=True THEN 1 ELSE 0 END) as avg_rating
FROM trip t
JOIN trip_rating tr
ON t.trip_id = tr.trip_id
WHERE is_completed=True
GROUP BY 1,2
-- HAVING COUNT(DISTINCT t.trip_id)>=10
ORDER BY 2;
Result 2:
city restaurant_id avg_rating
Berlin 100001 1.00000000000000000000
Shanghai 100002 0.92857142857142857143
London 100003 1.00000000000000000000
Portland 100004 0.92857142857142857143
As you can see, the HAVING clause was used in the last stage to remove unqualified restaurants, but those unqualified restaurants were included when computing average ratings.
Query 3:
WITH qualified_resaturant AS (
SELECT restaurant_id, COUNT(DISTINCT trip_id)
FROM trip
WHERE is_completed = TRUE
GROUP BY restaurant_id
HAVING COUNT(DISTINCT trip_id) >=10
),
avg_rating as (
SELECT city, T.restaurant_id, AVG(CASE WHEN thumb_up IS TRUE THEN 1 ELSE 0 END ) AS avg_rating
FROM trip T
INNER JOIN qualified_resaturant Q
ON Q.restaurant_id = T.restaurant_id
INNER JOIN trip_rating R
ON T.trip_id = R.trip_id
GROUP BY 1,2
)
SELECT * FROM avg_rating
WHERE restaurant_id IN (100001, 100002, 100003, 100004);
Result 3:
city restaurant_id avg_rating
London 100003 1.00000000000000000000
Portland 100004 0.93333333333333333333
Shanghai 100002 0.93333333333333333333
The avg_rating values are different.
Hope it answered your question, please let us know if you have any other questions.
- Leon

Really appreciate that Leon! Cheers

Luming, Nov. 26, 2021, 11:59 p.m.

Hi All - a little late to this thread, but I was struggling with this question my self.  I wanted to follow up on the business logic above of ensuring apples to apples comparisions by making sure there was enough information on trip experience.  After some investigation, it turns out that the number of trips per restaurant, as defined by the trip table, is always greater than the number of trip ratings a restaurant received, as defined by the trip_rating table.  This produces some interesting edge cases at the threshold of 10.  Using the following query:

SELECT city,
                restaurant_id,
                SUM(IF(thumb_up IS TRUE, 1, 0)) / COUNT(DISTINCT t.trip_id) AS avg_rating_t,
                SUM(IF(thumb_up IS TRUE, 1, 0)) / COUNT(DISTINCT tr.trip_id) AS avg_rating_tr,
                COUNT(t.trip_id) AS num_trips_t,
                COUNT(tr.trip_id) AS num_trips_tr
   FROM trip AS t
                 LEFT JOIN trip_rating AS tr
                ON t.trip_id = tr.trip_id
 GROUP BY city, restaurant_id
HAVING num_trips_t >= 10
      AND num_trips_tr < 10;

I generated the table: 

city restaurant_id avg_rating_t avg_rating_tr num_trips_t num_trips_tr
Berlin 100134 0.7000 0.7778 10 9
Chicago 100096 0.6000 0.8571 10 7
Chicago 100097 0.6000 0.7500 10 8
Delhi 100099 0.5000 0.6667 12 9
Los Angeles 100135 0.7000 0.7778 10 9
Portland 100090 0.6667 0.8889 12 9
Portland 100115 0.6364 0.8750 11 8
Seattle 100079 0.8182 1.0000 11 9
Tokyo 100141 0.8000 0.8889 10 9

What is interesting to me is that the solution effectively uses num_trips_t to determine if the restaurant is included in the results, but it uses num_trips_tr to calculate the average rating.  This ends up affecting the results such that restaurant 100079 in Seattle is ranked in the top three for the solution, but if you were to use the same num_trips (either t or tr) as both the threshold to be included AND as the denominator in calculating the average rating you get the results:

- num_trips_t produces an avg_rating of 0.8182, which ranks 4th (out of our top 3 spots)

OR

- num_trips_tr excludes the restaurant from the rankings all together because it is less than 10

Further, it would be possible with the current solution to have a restaurant with 10 or more trips as per the trip table but only one rating as per the trip_rating table with a 'thumbs-up', which would rank it as number one in that city, but with only one review.

So my question is, do we consider this a fair apples to apples comparison?  And I appreciate this is far more of a business logic question than a pure SQL question, but thought it was worth asking since I found this interesting!

 

Thanks!

Will

Will, Jan. 24, 2023, 11:31 a.m.
SQLPad user avatar

Leon (949)

Nov. 26, 2021, 9 a.m.