Avatar

Nǐ hǎo, I'm Julia.

SQL Window Functions: Using Dense Rank Over Partition By

#databases #sql

9 min read

This is a follow-on post on SQL queries.

The next task I had to do in SQL involved reformatting our data into a very specific format for use in our charting tool. I found it pretty daunting at first ("can we even do this with SQL?") but by doing the standard programming thing of starting slow and layering on complexity step by step, I got to where I needed.

What I found really helpful was being able to see how the results table was changing as I made edits to the query. From this, I was able to picture in my mind what keywords such as RANK() , GROUP BY and CAST were actually doing behind the scenes, which more often than not, was really not what I thought they should have been doing. 😅

I've created some mock tables below to try to illustrate what I was trying to achieve. Turns out it's actually quite fiddly to devise contrived examples to make a point, so if it seems like a ridiculous problem, well then... deal with it! 🤪

Here's what my starting tables looked like and what I wanted my end result to be. To put a bit of a story around it, let's say we run an e-commerce store that sells hobby equipment. We have a bunch of users for whom we know some traits, like their favourite foods, car and hobbies. Some of these users make purchases in our store. We'd like to know for each type of hobby, how many purchases were made by users who listed those as a hobby.

One issue is that our users table doesn't check for unique emails. We'd therefore only like to consider the first answer provided if there are multiple users with the same email. e.g. user 4 below would be disregarded as they have the same email as user 1.

// Starting point for users table

| user_id | hashed_email | traits   | answer     | created_at |
| ------- | ------------ | -------- | ---------- | ---------- |
| 1       | wrwqefw3fsaf | fav_food | burger     | 2021-03-01 |
| 1       | wrwqefw3fsaf | fav_car  | volvo      | 2021-03-01 |
| 1       | wrwqefw3fsaf | hobbies  | ?h=running | 2021-03-01 |
| 2       | sfwrqwreasda | fav_food | chicken    | 2021-03-01 |
| 2       | sfwrqwreasda | fav_car  | mercedes   | 2021-03-01 |
| 2       | sfwrqwreasda | hobbies  | ?t=reading | 2021-03-01 |
| 3       | r4qcdas43fas | fav_food | fries      | 2021-03-01 |
| 3       | r4qcdas43fas | fav_car  | ferrari    | 2021-03-01 |
| 3       | r4qcdas43fas | hobbies  | ?i=Hiking  | 2021-03-01 |
| 4       | wrwqefw3fsaf | fav_food | burger     | 2021-03-04 |
| 4       | wrwqefw3fsaf | fav_car  | volvo      | 2021-03-04 |
| 4       | wrwqefw3fsaf | hobbies  | ?h=hiking  | 2021-03-04 |

// Second table with orders data

| hashed_email | sales | recorded_at |
| ------------ | ----- | ----------- |
| wrwqefw3fsaf | 100   | 2021-03-01  |
| sfwrqwreasda | 100   | 2021-03-01  |
| r4qcdas43fas | 100   | 2021-03-01  |
| sfwrqwreasda | 100   | 2021-03-01  |
| r4qcdas43fas | 100   | 2021-03-02  |
| sfwrqwreasda | 100   | 2021-03-02  |
| wrwqefw3fsaf | 100   | 2021-03-03  |
| sfwrqwreasda | 200   | 2021-03-03  |
| r4qcdas43fas | 200   | 2021-03-04  |
| dsa3qfa324se | 100   | 2021-03-04  |

// Desired output

| hobbies | transactions | sales |
| ------- | ------------ | ----- |
| running | 1            | 200   |
| reading | 1            | 500   |
| hiking  | 1            | 400   |
| none    | 1            | 100   |

Here are the steps I took to build up my query.

  1. I started by creating the JOINs table on hashed_email, so that I could get the basic columns I needed to get me started. I chose a LEFT OUTER JOIN as I wanted all orders to be captured, regardless of whether we had logged the user in the users table. These rows would show up as null for the traits and answer column if there's no hashed_email match.

    Additional filters I applied was only returning a single row back for each user, where traits = 'hobbies'. I also only wanted to consider orders recorded between a certain date range.

    FROM datawarehouse.orders as orders
    LEFT OUTER JOIN datawarehouse.users as users
    ON orders.hashed_email = users.hashed_email AND users.traits = 'hobbies'
    WHERE orders.recorded_at between '2021-03-01' and '2021-03-05'
  2. As I was only interested in the earliest unique user by email, I needed a way of "grouping" user rows by hashed_email and then grabbing the earliest entry for each group. This can be done with the window function DENSE_RANK().

    What I did specifically was to create a new 'column' that I would be able to select called rank, by running the following: DENSE_RANK() OVER(PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank. The UNIX_SECONDS() bit was needed because of the particular idiosyncrasies with the BigQuery querying language (which is slightly different to standard SQL), and also the format we had saved the created_at field in.

    I then added a SELECT to choose specific columns from the users table, included my newly created rank column.

    FROM datawarehouse.orders as orders
    LEFT OUTER JOIN (
      SELECT created_at, hashed_email, answer, traits, DENSE_RANK() OVER(PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank,
      FROM datawarehouse.users
     ) as users
    ON orders.hashed_email = users.hashed_email AND users.traits = 'hobbies'
    WHERE orders.recorded_at between '2021-03-01' and '2021-03-05'
  3. Now that we've added a new column to our JOINs table, we can extend our filter to also take into account rank. In this case, since we ordered our rank in ascending order by the created_at date, I added a filter where rank was either null (where an order doesn't have a match on the users tables) or 1 (the earliest user).

    FROM datawarehouse.orders as orders
    LEFT OUTER JOIN (
      SELECT created_at, hashed_email, answer, traits, DENSE_RANK() OVER(PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank,
      FROM datawarehouse.users
     ) as users
    ON orders.hashed_email = users.hashed_email AND users.traits = 'hobbies' AND (rank IS null OR rank = 1)
    WHERE orders.recorded_at between '2021-03-01' and '2021-03-05'
  4. We're nearly there with getting the table in the structure we need, but there's just one more step. The answer column is currently a bit dirty, including redundant characters like ?h= , so we'd want to clean this up, to enable us to group rows by hobbies and count them. I used CASE statements in order to do this. LOWER() is used to ensure the answer fields are standardised to lowercase, before looking for keywords like running, with LIKE. What I'm effectively doing here is creating a new column called hobbies that gives me back the hobby in a nice, clean, standard format. I can then add a GROUP to my SQL query knowing all the rows will be picked up.

    Note that traits IS null occurs when there's an order with a hashed_email that doesn't appear in the users table. I also added an else statement as a final catch which will, for instance, identify rows with a hobby we haven't specifically looked out for.

    SELECT
      (case
        when traits = 'hobbies' and LOWER(answer) LIKE '%running%' then 'running'
        when traits = 'hobbies' and LOWER(answer) LIKE '%reading%' then 'reading'
        when traits = 'hobbies' and LOWER(answer) LIKE '%hiking%' then 'hiking'
        when traits = 'hobbies' and LOWER(answer) LIKE '%sailing%' then 'sailing'
    		when traits = 'hobbies' and LOWER(answer) LIKE '%cycling%' then 'cycling'
    		when traits = 'hobbies' and LOWER(answer) LIKE '%swimming%' then 'swimming'
        when traits IS null then 'none'
        else 'Unsure - need to check'
       end) as hobbies
  5. Final step! Let's put it all together and do the actual counting of sales transactions and sum of sales value. This is done by using COUNT() and SUM() respectively, and tacking these on to the SELECT. Because of how we had saved sales data to the database, I had to ensure the sales column data was in the right type (i.e. a number rather than string) to allow me to sum the columns up.

P.S. Note that the GROUP BY hobbies (our new column from step 4) comes right at the bottom of the query.

SELECT
  (case
    when traits = 'hobbies' and LOWER(answer) LIKE '%running%' then 'running'
    when traits = 'hobbies' and LOWER(answer) LIKE '%reading%' then 'reading'
    when traits = 'hobbies' and LOWER(answer) LIKE '%hiking%' then 'hiking'
    when traits = 'hobbies' and LOWER(answer) LIKE '%sailing%' then 'sailing'
		when traits = 'hobbies' and LOWER(answer) LIKE '%cycling%' then 'cycling'
		when traits = 'hobbies' and LOWER(answer) LIKE '%swimming%' then 'swimming'
    when traits IS null then 'none'
    else 'Unsure - need to check'
   end) as hobbies,
  COUNT( orders.hashed_email ) as transactions,
  SUM (CAST(orders.sales AS NUMERIC)) as sales,
FROM datawarehouse.orders orders
LEFT OUTER JOIN (
  select created_at, hashed_email, answer, traits, DENSE_RANK() OVER(PARTITION BY hashed_email ORDER BY UNIX_SECONDS(created_at)) AS rank,
  from datawarehouse.users
 ) as users
ON orders.hashed_email = users.hashed_email AND users.traits = 'hobbies' and (rank IS null or rank = 1)
WHERE orders.recorded_at between '2021-03-01' and '2021-03-05'
GROUP BY hobbies

Phew, got there in the end! 🥳 Definitely took me a while to get here, but I'm really glad I worked through it to the end as I learnt a lot from having to do this task on my own. It's definitely all about building the query up (extremely) slowly, step by step, and really understanding what each additional level of complexity does.

One final note I'll add is that I could have used RANK() instead of DENSE_RANK() for this query, as I was only after rank = 1. The main difference between the 2 is that:

  • RANK: If there's a tie, rows are assigned the same rank, with the next ranking skipped. e.g. If you have 3 rows at rank 3, the next rank listed would be 6.
  • DENSE_RANK: Ranks provided are consecutive. e.g. If you have 3 rows at rank 3, the next rank listed would be 4.

© 2016-2024 Julia Tan · Powered by Next JS.