
Ingesting and Processing Data Tables For Strategic Analysis With dbt

#databases #sql

I recently had to work with dbt for the first time and thought I’d write up some notes on it. First off, what even is dbt? From my point of view as a software engineer, it’s a way of cleaning and organising your team’s data so that team members can analyse it in a way that makes sense for the business, rather than based on how the data happens to be stored.

I didn’t really grasp the need for this at first; my naive thought was, “Why not just save the data in a way that makes sense for it to be analysed?”. Whilst that might well be easy for simple apps and data structures, it’s not always possible or efficient for more complex ones. For example:

  • You might need to make multiple API calls to save the data to multiple databases, which would be suboptimal.
  • Reporting requirements might change down the line.
  • Sending and saving the data in a specific format may make it awkward or inefficient to analyse later.

I recognise that this might all sound a little abstract, so it might be simplest to illustrate the point with an example. Let’s say that the raw data saved to the database looks like this.

(Image: the raw table, with the event metadata stored as a single JSON column.)

You can imagine how it might have been easy to send across the JSON for the metadata field in a single API call and save it without having to do much processing. It’s a different matter when it comes to analysing the data though, where we’d ideally want the JSON metadata to be unrolled across multiple columns e.g.

(Image: the same table with the metadata JSON unrolled into separate columns such as country, medium, source, visitor_id, playstation_plan and store_subscription.)
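
To make this concrete, here’s a hypothetical example of what the metadata JSON for a single event might look like. The field names are taken from the staging model further down; the values, product IDs and query string are invented:

{
  "country": "GB",
  "cookies": { "unique_visitor_id": "a1b2c3d4" },
  "products": ["PLAYSTATION_PLUS_12_MONTHS", "PLAY_SUBSCRIPTION_MONTHLY"],
  "queryString": "?utm_source=our%20newsletter&utm_medium=email"
}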

So how does one go about this?

Organising the data into levels

To keep things organised and reusable, it’s likely that you’ll want to do your data cleaning, filtering and processing in different stages. Check out this blog post, which explains the “multi-hop architecture” in more detail and how you might want to think about structuring tables of increasing quality, from bronze through silver to gold.

Relating it to the data we have, it makes sense to perhaps have (there’s a sketch of how these layers might map onto a dbt project just after this list):

  • Bronze table - ingesting the data from the data warehouse e.g. BigQuery. What we need to do at this level is to tell dbt that this table of data exists in our data warehouse.
  • Silver table - this might be split into 2 substages:
    • Staging - for cleaning the data and typecasting into the correct types for each column we want to surface. This is where we’ll define the new columns for country, medium, source, visitor_id, playstation_plan and store_subscription.
    • Intermediate - combining our staging table with other tables that might exist in the organisation to make best use of all the data we have. e.g. LEFT JOINing with a table that contains the prices for the Playstation plans or info about a visitor.
  • Gold - final transformation of data so it can be easily visualised / charted by reporting software like Data Studio or Holistics.
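
As a rough sketch of how these levels might map onto a dbt project: the bronze tables are declared as sources, while the silver and gold layers live as models in folders such as staging, intermediate and marts, with materialisations configured per folder in dbt_project.yml. The project name and folder names below are assumptions, not taken from the original setup:

# dbt_project.yml (illustrative sketch)
models:
  ecommerce_analytics:
    staging:
      +materialized: view
    intermediate:
      +materialized: view
    marts:
      +materialized: table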

Bronze level

At this level, we want to tell dbt what data tables are available for processing. In my case, this meant defining the data sources in a source.yaml file like so (assuming I just have the one source table, called analytics_event, stored in a database called backend-production-database and grouped under the ecommerce-analytics source).

version: 2

sources:
  - name: ecommerce-analytics
    database: backend-production-database
    tables:
      - name: analytics_event
        description: An append-only log of events recorded for visitors to the ecommerce site
        columns:
          - name: id
          - name: created_at
          - name: email
            policy_tags:
              - '{{ var("pii_email_tag") }}'
          - name: metadata

You’ll notice that I tagged the email column with a pii_email_tag policy tag. Policy tags are a feature of BigQuery (Google’s cloud data warehouse service) that enables column-level security, and dbt supports them as a column resource property. In this case, the policy tag is used to obfuscate the email address for privacy and security reasons.
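
For reference, pii_email_tag is a dbt variable, and one way to define it is in dbt_project.yml, pointing at the BigQuery policy tag’s full resource name. The project, location and taxonomy IDs below are placeholders:

# dbt_project.yml (sketch)
vars:
  pii_email_tag: "projects/my-gcp-project/locations/eu/taxonomies/123456789/policyTags/987654321"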

Silver level

I split the processing of the data into two stages here: first the “silver staging” table, followed by the “silver intermediate” table.

Staging

In the staging table, I wanted to unroll the metadata JSON into separate columns (which makes it easier to run analyses down the line). This was done with a series of chained SQL queries. Whilst it may be possible to do all the data cleanup in one step, it’s a lot clearer to split it out over a series of steps.

Note that for the query_string, I’m first stripping out unwanted characters that might be present, then looping over utm_params_list with Jinja templating to pull each UTM parameter out into its own column.

{% set utm_params_list = [
    ('utm_medium', 'medium'),
    ('utm_source', 'source'),
    ]
%}

WITH
analytics_event_unrolled AS
(
    SELECT
        id,
        created_at,
        email,
        JSON_VALUE(metadata, "$.country") AS country,
        JSON_VALUE(metadata, "$.cookies.unique_visitor_id") AS visitor_id,
        JSON_VALUE_ARRAY(metadata, "$.products") AS product_ids,
        REPLACE(REPLACE(JSON_VALUE(metadata, "$.queryString"), '%20', ' '), '"', '') AS query_string
    FROM
        {{ source('ecommerce-analytics', 'analytics_event') }}
),

analytics_event_utm_params AS
(
    SELECT
        *,
        {% for utm_param, utm_column_name in utm_params_list %}
        {{ dbt_utils.get_url_parameter("query_string", utm_param) }} AS {{ utm_column_name }}
        {%- if not loop.last -%},{%- endif -%}
        {% endfor %}
    FROM
        analytics_event_unrolled
),

analytics_event_sanitised_utm_params AS
(
    SELECT
        id,
        created_at,
        email,
        country,
        {% for utm_param, utm_column_name in utm_params_list %}
        REPLACE(LOWER({{ utm_column_name }}), '+', ' ') AS {{ utm_column_name }}
        {%- if not loop.last -%},{%- endif -%}
        {% endfor %},
        visitor_id,
        (SELECT product_id FROM UNNEST(product_ids) AS product_id WHERE product_id LIKE 'PLAYSTATION%') AS playstation_plan,
        (SELECT product_id FROM UNNEST(product_ids) AS product_id WHERE product_id LIKE 'PLAY_SUBSCRIPTION%') AS store_subscription
    FROM
        analytics_event_utm_params
)

SELECT
    *
FROM
    analytics_event_sanitised_utm_params
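
One thing to note: get_url_parameter comes from the dbt_utils package, so it needs to be declared in packages.yml and installed with dbt deps. A minimal sketch (the version range is an assumption; pin whatever suits your project):

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]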

To complete setup at this stage, we also need to include a schema.yaml file for this new table we’ve created.

version: 2

models:
  - name: stg_ecommerce_analytics_event
    description: A derived table from the analytics_event table, where the metadata column has been unrolled.
    columns:
      - name: id
        description: Unique identifier for each event
        tests:
          - unique
          - not_null
      - name: created_at
        description: Timestamp when the event was created
      - name: email
        description: Email entered in the checkout process
        policy_tags:
          - "{{ var('pii_email_tag') }}"
      - name: country
        description: IP geolocation
      - name: medium
        description: Identifies what type of link was used
      - name: source
        description: Identifies which site sent the traffic
      - name: visitor_id
        description: Unique id assigned when a visitor lands on the ecommerce site
      - name: playstation_plan
        description: Playstation payment plan
      - name: store_subscription
        description: Play store subscription plan
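
With the model and its schema in place, you can build and test just this model from the command line with something like:

dbt run --select stg_ecommerce_analytics_event
dbt test --select stg_ecommerce_analytics_event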

Intermediate

At the intermediate level, I wanted to include a new column from another table that already existed at the company’s silver staging level. In this example, that other table is called stg_visitor_identity, and I LEFT JOINed the event table’s visitor_id onto stg_visitor_identity’s primary key id, so that I could SELECT the last_seen field from the second table.

WITH

visitor_identity AS
(
    SELECT
        *
    FROM
        {{ ref('stg_visitor_identity') }}
),

analytics_event AS
(
    SELECT DISTINCT
        event.*,
        visitor_identity.last_seen
    FROM
        {{ ref('stg_ecommerce_analytics_event') }} AS event
    LEFT JOIN visitor_identity ON event.visitor_id = visitor_identity.id
)

SELECT
    *
FROM
    analytics_event

What we’ll then need to do, once again, is include a schema.yaml file to tell dbt about this new intermediate table we’ve created. In essence it’ll be the same as the staging schema file, just with an additional last_seen column.
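
As a sketch of what that might look like (the model name int_ecommerce_analytics_event and the last_seen description are my assumptions):

version: 2

models:
  - name: int_ecommerce_analytics_event
    description: Staging analytics events joined with visitor identity data.
    columns:
      # ...the same columns as in the staging schema file...
      - name: last_seen
        description: Timestamp of the visitor's most recent activity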

Conclusion

The next step from here is the Gold level. This will be highly dependent on what it is your stakeholders actually want to see analysed. I won’t add an example for that here, as I feel that it starts to get intuitive once you’ve got the Bronze and Silver levels going…

So, what do you think? Ever worked with dbt before? I’d love to hear about your experiences!
