SQL Window Functions: Using Dense Rank Over Partition By
This is a follow-on post on SQL queries.
The next task I had to do in SQL involved reformatting our data into a very specific format for use in our charting tool. I found it pretty daunting at first ("can we even do this with SQL?"), but by doing the standard programming thing of starting slow and layering on complexity step by step, I got to where I needed to be.
What I found really helpful was being able to see how the results table was changing as I made edits to the query. From this, I was able to picture in my mind what keywords such as `RANK()`, `GROUP BY` and `CAST` were actually doing behind the scenes, which, more often than not, was really not what I thought they should have been doing. 😅
I've created some mock tables below to try to illustrate what I was trying to achieve. Turns out it's actually quite fiddly to devise contrived examples to make a point, so if it seems like a ridiculous problem, well then... deal with it! 🤪
Here's what my starting tables looked like and what I wanted my end result to be. To put a bit of a story around it, let's say we run an e-commerce store that sells hobby equipment. We have a bunch of users for whom we know some traits, like their favourite foods, car and hobbies. Some of these users make purchases in our store. We'd like to know for each type of hobby, how many purchases were made by users who listed those as a hobby.
One issue is that our `users` table doesn't check for unique emails. We'd therefore only like to consider the first answer provided if there are multiple users with the same email, e.g. user 4 below would be disregarded as they have the same email as user 1.
// Starting point for users table
| user_id | hashed_email | traits | answer | created_at |
| ------- | ------------ | -------- | ---------- | ---------- |
| 1 | wrwqefw3fsaf | fav_food | burger | 2021-03-01 |
| 1 | wrwqefw3fsaf | fav_car | volvo | 2021-03-01 |
| 1 | wrwqefw3fsaf | hobbies | ?h=running | 2021-03-01 |
| 2 | sfwrqwreasda | fav_food | chicken | 2021-03-01 |
| 2 | sfwrqwreasda | fav_car | mercedes | 2021-03-01 |
| 2 | sfwrqwreasda | hobbies | ?t=reading | 2021-03-01 |
| 3 | r4qcdas43fas | fav_food | fries | 2021-03-01 |
| 3 | r4qcdas43fas | fav_car | ferrari | 2021-03-01 |
| 3 | r4qcdas43fas | hobbies | ?i=Hiking | 2021-03-01 |
| 4 | wrwqefw3fsaf | fav_food | burger | 2021-03-04 |
| 4 | wrwqefw3fsaf | fav_car | volvo | 2021-03-04 |
| 4 | wrwqefw3fsaf | hobbies | ?h=hiking | 2021-03-04 |
// Second table with orders data
| hashed_email | sales | recorded_at |
| ------------ | ----- | ----------- |
| wrwqefw3fsaf | 100 | 2021-03-01 |
| sfwrqwreasda | 100 | 2021-03-01 |
| r4qcdas43fas | 100 | 2021-03-01 |
| sfwrqwreasda | 100 | 2021-03-01 |
| r4qcdas43fas | 100 | 2021-03-02 |
| sfwrqwreasda | 100 | 2021-03-02 |
| wrwqefw3fsaf | 100 | 2021-03-03 |
| sfwrqwreasda | 200 | 2021-03-03 |
| r4qcdas43fas | 200 | 2021-03-04 |
| dsa3qfa324se | 100 | 2021-03-04 |
// Desired output
| hobbies | transactions | sales |
| ------- | ------------ | ----- |
| running | 2 | 200 |
| reading | 4 | 500 |
| hiking | 3 | 400 |
| none | 1 | 100 |
Here are the steps I took to build up my query.
Step 1: I started by creating the joined table on `hashed_email`, so that I had the basic columns I needed to get me started. I chose a `LEFT OUTER JOIN` as I wanted all `orders` to be captured, regardless of whether we had logged the user in the `users` table. These rows would show up as `null` in the `traits` and `answer` columns if there's no `hashed_email` match.

The additional filters I applied were to only return the row where `traits = 'hobbies'` for each user, and to only consider orders recorded within a certain date range.

```sql
FROM datawarehouse.orders AS orders
LEFT OUTER JOIN datawarehouse.users AS users
  ON orders.hashed_email = users.hashed_email
  AND users.traits = 'hobbies'
WHERE orders.recorded_at BETWEEN '2021-03-01' AND '2021-03-05'
```
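If you want to convince yourself of that `null` behaviour, here's a quick sketch of my own (just adding a `SELECT` on top of the join above) using the mock data: the order from `dsa3qfa324se` has no match in `users`, so it's kept in the result but with `null` in the `traits` and `answer` columns.

```sql
-- Sketch only: inspecting the join output before any further filtering.
SELECT
  orders.hashed_email,
  orders.sales,
  users.traits,  -- null for orders with no users match
  users.answer   -- null for orders with no users match
FROM datawarehouse.orders AS orders
LEFT OUTER JOIN datawarehouse.users AS users
  ON orders.hashed_email = users.hashed_email
  AND users.traits = 'hobbies'
WHERE orders.recorded_at BETWEEN '2021-03-01' AND '2021-03-05'
```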
Step 2: As I was only interested in the earliest entry for each unique email, I needed a way of "grouping" user rows by `hashed_email` and then grabbing the earliest entry for each group. This can be done with the window function `DENSE_RANK()`.

What I did specifically was to create a new 'column' that I would be able to select, called `rank`, by running the following: `DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank`. The `UNIX_SECONDS()` bit was needed because of the particular idiosyncrasies of BigQuery's SQL dialect (which is slightly different to standard SQL), and also the format we had saved the `created_at` field in.

I then added a `SELECT` to choose specific columns from the `users` table, including my newly created `rank` column.

```sql
FROM datawarehouse.orders AS orders
LEFT OUTER JOIN (
  SELECT
    created_at,
    hashed_email,
    answer,
    traits,
    DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank
  FROM datawarehouse.users
) AS users
  ON orders.hashed_email = users.hashed_email
  AND users.traits = 'hobbies'
WHERE orders.recorded_at BETWEEN '2021-03-01' AND '2021-03-05'
```
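To make that a bit more concrete, here's a minimal sketch (run against the mock `users` table above) of what the window function actually produces: every row is kept, but each one gains a `rank` within its `hashed_email` partition, ordered by `created_at`.

```sql
SELECT
  user_id,
  hashed_email,
  traits,
  DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank
FROM datawarehouse.users
-- With the mock data: user 1's rows (created 2021-03-01) all get rank 1,
-- user 4's rows (same hashed_email, created 2021-03-04) get rank 2, and
-- users 2 and 3 get rank 1 as their emails only appear once.
```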
Step 3: Now that we've added a new column to our joined table, we can extend our filter to also take `rank` into account. In this case, since we ordered our rank in ascending order by the `created_at` date, I added a filter where `rank` is either `null` (where an order doesn't have a match in the `users` table) or `1` (the earliest user).

```sql
FROM datawarehouse.orders AS orders
LEFT OUTER JOIN (
  SELECT
    created_at,
    hashed_email,
    answer,
    traits,
    DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank
  FROM datawarehouse.users
) AS users
  ON orders.hashed_email = users.hashed_email
  AND users.traits = 'hobbies'
  AND (rank IS NULL OR rank = 1)
WHERE orders.recorded_at BETWEEN '2021-03-01' AND '2021-03-05'
```
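As an aside, BigQuery also has a `QUALIFY` clause for filtering on a window function directly, which would be an alternative way to keep only the earliest row per email inside the subquery itself. This is just a sketch of the idea, not what I ended up using:

```sql
-- Sketch of a QUALIFY-based alternative (assumes QUALIFY is available in
-- your BigQuery project). WHERE TRUE is included because BigQuery has
-- historically required a WHERE/GROUP BY/HAVING clause alongside QUALIFY.
SELECT created_at, hashed_email, answer, traits
FROM datawarehouse.users
WHERE TRUE
QUALIFY DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) = 1
```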
Step 4: We're nearly there with getting the table into the structure we need, but there's just one more step. The `answer` column is currently a bit dirty, including redundant characters like `?h=`, so we want to clean it up to enable us to group rows by hobby and count them. I used a `CASE` statement to do this. `LOWER()` ensures the `answer` fields are standardised to lowercase before looking for keywords like `running` with `LIKE`. What I'm effectively doing here is creating a new column called `hobbies` that gives me back the hobby in a nice, clean, standard format. I can then add a `GROUP BY` to my SQL query knowing all the rows will be picked up.

Note that `traits IS null` occurs when there's an order with a `hashed_email` that doesn't appear in the `users` table. I also added an `else` statement as a final catch which will, for instance, identify rows with a hobby we haven't specifically looked out for.

```sql
SELECT
  (CASE
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%running%' THEN 'running'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%reading%' THEN 'reading'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%hiking%' THEN 'hiking'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%sailing%' THEN 'sailing'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%cycling%' THEN 'cycling'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%swimming%' THEN 'swimming'
    WHEN traits IS NULL THEN 'none'
    ELSE 'Unsure - need to check'
  END) AS hobbies
```
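As a quick sanity check on the `LOWER()` + `LIKE` combination, here's a small sketch using a few of the mock `answer` values from above (the `UNNEST` is just a convenient way to fake a one-column table):

```sql
SELECT
  answer,
  (CASE
    WHEN LOWER(answer) LIKE '%running%' THEN 'running'
    WHEN LOWER(answer) LIKE '%reading%' THEN 'reading'
    WHEN LOWER(answer) LIKE '%hiking%' THEN 'hiking'
    ELSE 'Unsure - need to check'
  END) AS hobbies
FROM UNNEST(['?h=running', '?t=reading', '?i=Hiking']) AS answer
-- Returns running, reading and hiking: '?i=Hiking' still matches because
-- LOWER() folds the capital H before the LIKE comparison.
```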
Final step! Let's put it all together and do the actual counting of sales transactions and summing of sales value. This is done by using `COUNT()` and `SUM()` respectively, and tacking these on to the `SELECT`. Because of how we had saved sales data to the database, I had to ensure the `sales` column was converted to the right type (i.e. a number rather than a string), using `CAST`, to allow me to sum the column up.
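Here's a tiny sketch of that conversion. The `SAFE_CAST` variant is an assumption on my part, useful if you suspect the column might contain values that aren't valid numbers, since it returns `null` instead of erroring:

```sql
SELECT
  CAST('100' AS NUMERIC) AS sales_as_number,        -- 100
  SAFE_CAST('oops' AS NUMERIC) AS bad_value_is_null -- null rather than an error
```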
P.S. Note that the `GROUP BY hobbies` (our new column from step 4) comes right at the bottom of the query.
```sql
SELECT
  (CASE
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%running%' THEN 'running'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%reading%' THEN 'reading'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%hiking%' THEN 'hiking'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%sailing%' THEN 'sailing'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%cycling%' THEN 'cycling'
    WHEN traits = 'hobbies' AND LOWER(answer) LIKE '%swimming%' THEN 'swimming'
    WHEN traits IS NULL THEN 'none'
    ELSE 'Unsure - need to check'
  END) AS hobbies,
  COUNT(orders.hashed_email) AS transactions,
  SUM(CAST(orders.sales AS NUMERIC)) AS sales
FROM datawarehouse.orders AS orders
LEFT OUTER JOIN (
  SELECT
    created_at,
    hashed_email,
    answer,
    traits,
    DENSE_RANK() OVER (PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank
  FROM datawarehouse.users
) AS users
  ON orders.hashed_email = users.hashed_email
  AND users.traits = 'hobbies'
  AND (rank IS NULL OR rank = 1)
WHERE orders.recorded_at BETWEEN '2021-03-01' AND '2021-03-05'
GROUP BY hobbies
```
Phew, got there in the end! 🥳 Definitely took me a while to get here, but I'm really glad I worked through it to the end as I learnt a lot from having to do this task on my own. It's definitely all about building the query up (extremely) slowly, step by step, and really understanding what each additional level of complexity does.
One final note I'll add is that I could have used `RANK()` instead of `DENSE_RANK()` for this query, as I was only after `rank = 1`. The main difference between the two (compared in the sketch below) is that:

- `RANK()`: if there's a tie, rows are assigned the same rank, with the next rank(s) skipped. e.g. if you have 3 rows at rank 3, the next rank listed would be 6.
- `DENSE_RANK()`: ranks provided are consecutive. e.g. if you have 3 rows at rank 3, the next rank listed would be 4.
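Here's a small comparison sketch with some made-up scores (not from the tables above) showing the two side by side:

```sql
SELECT
  score,
  RANK() OVER (ORDER BY score) AS rank,
  DENSE_RANK() OVER (ORDER BY score) AS dense_rank
FROM UNNEST([10, 20, 20, 20, 30]) AS score
ORDER BY score
-- score | rank | dense_rank
--  10   |  1   |  1
--  20   |  2   |  2
--  20   |  2   |  2
--  20   |  2   |  2
--  30   |  5   |  3
```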