Ingesting and Processing Data Tables For Strategic Analysis With dbt
I recently had to work with dbt for the first time and thought I’d write up some notes on it. First off, what even is dbt? From my point of view as a software engineer, it’s a way of cleaning and organising your team’s data so that team members can analyse it in a way that makes sense for the business, rather than based on how the data is stored.
I didn’t really grasp the need for this at first, because my naive thought was just, “Why not save the data in a way that makes sense for it to be analysed?”. Whilst this might well be easy to do for simple apps and data structures, it’s not always possible or efficient for more complex ones. For example:
- You might need to make multiple API calls to save the data to multiple databases, which would be suboptimal.
- Reporting requirements might change down the line.
- The need to send and save data in a specific format may be problematic / inefficient when it comes to analysing the data.
I recognise that this might all sound a little abstract, so it might be simplest to illustrate the point with an example. Let’s say that the raw data saved to the database looks like this.
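For illustration, a raw row might look something like this (the values here are made up, but the shape matches the metadata we’ll unroll later):

id | created_at | email | metadata
1 | 2021-06-01 10:15:00 UTC | jane@example.com | {"country": "GB", "cookies": {"unique_visitor_id": "abc-123"}, "products": ["PLAYSTATION_PLUS_12M", "PLAY_SUBSCRIPTION_MONTHLY"], "queryString": "?utm_medium=email&utm_source=newsletter"}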
You can imagine how it might have been easy to send across the JSON for the metadata field in a single API call and save it without having to do much processing. It’s a different matter when it comes to analysing the data though, where we’d ideally want the JSON metadata to be unrolled across multiple columns e.g.
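(Using the illustrative raw row from above, the unrolled version might look like this.)

id | created_at | email | country | medium | source | visitor_id | playstation_plan | store_subscription
1 | 2021-06-01 10:15:00 UTC | jane@example.com | GB | email | newsletter | abc-123 | PLAYSTATION_PLUS_12M | PLAY_SUBSCRIPTION_MONTHLY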
So how does one go about this?
Organising the data into levels
To keep things organised and reusable, it’s likely that you’ll want to do your data cleaning, filtering and processing in different stages. Check out this blog post which explains the “multi-hop architecture” in more detail and how you might want to think about structuring tables with increasing quality from bronze and silver to gold.
Relating it to the data we have, it makes sense to perhaps have the following (a sketch of how these levels might map onto a dbt project layout follows the list):
- Bronze table - ingesting the data from the data warehouse e.g. BigQuery. What we need to do at this level is to tell dbt that this table of data exists in our data warehouse.
- Silver table - this might be split into 2 substages:
  - Staging - for cleaning the data and typecasting into the correct types for each column we want to surface. This is where we’ll define the new columns for country, medium, source, visitor_id, playstation_plan and store_subscription.
  - Intermediate - combining our staging table with other tables that might exist in the organisation to make best use of all the data we have, e.g. LEFT JOINing with a table that contains the prices for the Playstation plans or info about a visitor.
- Gold - final transformation of data so it can be easily visualised / charted by reporting software like Data Studio or Holistics.
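In a dbt project, one common way to organise these levels on disk is to give each level its own folder of models. The layout below is just a sketch: the staging / intermediate / marts folder names are a popular convention rather than something dbt enforces, and int_ecommerce_analytics_event is simply a stand-in name for the intermediate model we’ll get to later.

models/
  staging/
    ecommerce_analytics/
      source.yaml
      stg_ecommerce_analytics_event.sql
      schema.yaml
  intermediate/
    int_ecommerce_analytics_event.sql
    schema.yaml
  marts/
    (gold-level models for reporting tools would live here)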
Bronze level
At this level, we want to tell dbt what data tables are available for processing. In my case, this meant defining the data sources in a source.yaml file like so (assuming I just have the one data source table, called analytics_event, which is stored in a database called backend-production-database within the ecommerce-analytics namespace).
version: 2

sources:
  - name: ecommerce-analytics
    database: backend-production-database
    tables:
      - name: analytics_event
        description: An append-only log of events recorded for visitors to the ecommerce site
        columns:
          - name: id
          - name: created_at
          - name: email
            policy_tags:
              - '{{ var("pii_email_tag") }}'
          - name: metadata
You’ll notice that I tagged the email column with a pii_email_tag policy tag. Policy tags are a feature of BigQuery (a cloud data warehouse service from Google) that enables column-level security, and dbt exposes them as a column resource property. In this case, we’ve defined this policy tag to obfuscate the email address for privacy and security reasons.
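Since pii_email_tag is referenced via var() above, it also needs to be defined as a project variable. A minimal sketch of what that might look like in dbt_project.yml (the path below is a placeholder; the real value is the resource name of the policy tag created in your BigQuery taxonomy):

vars:
  pii_email_tag: projects/<gcp-project>/locations/<location>/taxonomies/<taxonomy-id>/policyTags/<policy-tag-id>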
Silver level
I split processing of the data into 2 levels here, the first of which is the “silver staging” table, followed by the “silver intermediate” table.
Staging
In the staging table, I wanted to unroll the data in the metadata
JSON into various columns (which makes it easier to do analyses on down the line). This was done by writing a series of SQL queries. Whilst it may be possible to do all the data cleanup in one step, it’s a lot clearer to split it out over a series of steps.
Note that with the query_string
, what I’m essentially doing is getting rid of unwanted characters that might be present, then mapping through the utm_params_list
and saving values to new columns using Jinja templating.
{% set utm_params_list = [
    ('utm_medium', 'medium'),
    ('utm_source', 'source'),
  ]
%}

WITH

-- Unroll the metadata JSON into individual columns and clean up the query string.
analytics_event_unrolled AS
(
  SELECT
    id,
    created_at,
    email,
    JSON_VALUE(metadata, "$.country") AS country,
    JSON_VALUE(metadata, "$.cookies.unique_visitor_id") AS visitor_id,
    JSON_VALUE_ARRAY(metadata, "$.products") AS product_ids,
    -- Strip URL-encoded spaces and stray double quotes from the query string.
    REPLACE(REPLACE(JSON_VALUE(metadata, "$.queryString"), '%20', ' '), '"', '') AS query_string
  FROM
    {{ source('ecommerce-analytics', 'analytics_event') }}
),

-- Pull each UTM parameter out of the query string into its own column.
analytics_event_utm_params AS
(
  SELECT
    *,
    {% for utm_param, utm_column_name in utm_params_list %}
    {{ dbt_utils.get_url_parameter("query_string", utm_param) }} AS {{ utm_column_name }}
    {%- if not loop.last -%},{%- endif -%}
    {% endfor %}
  FROM
    analytics_event_unrolled
),

-- Sanitise the UTM values (lowercase, '+' to space) and pick out the
-- Playstation plan and store subscription product ids.
analytics_event_sanitised_utm_params AS
(
  SELECT
    id,
    created_at,
    email,
    country,
    {% for utm_param, utm_column_name in utm_params_list %}
    REPLACE(LOWER({{ utm_column_name }}), '+', ' ') AS {{ utm_column_name }}
    {%- if not loop.last -%},{%- endif -%}
    {% endfor %},
    visitor_id,
    (SELECT product_id FROM UNNEST(product_ids) AS product_id WHERE product_id LIKE 'PLAYSTATION%') AS playstation_plan,
    (SELECT product_id FROM UNNEST(product_ids) AS product_id WHERE product_id LIKE 'PLAY_SUBSCRIPTION%') AS store_subscription
  FROM
    analytics_event_utm_params
)

SELECT
  *
FROM
  analytics_event_sanitised_utm_params
To complete setup at this stage, we also need to include a schema.yaml
file for this new table we’ve created.
version: 2

models:
  - name: stg_ecommerce_analytics_event
    description: A derived table from the analytics_event table, where the metadata column has been unrolled.
    columns:
      - name: id
        description: Unique identifier for each event
        tests:
          - unique
          - not_null
      - name: created_at
        description: Timestamp when the event was created
      - name: email
        description: Email entered in the checkout process
        policy_tags:
          - "{{ var('pii_email_tag') }}"
      - name: country
        description: IP geolocation
      - name: medium
        description: Identifies what type of link was used
      - name: source
        description: Identifies which site sent the traffic
      - name: visitor_id
        description: Unique id assigned when a visitor lands on the ecommerce site
      - name: playstation_plan
        description: Playstation payment plan
      - name: store_subscription
        description: Play store subscription plan
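With the model and schema defined, the staging table can be built and its tests run from the command line, e.g. (assuming the model file is saved as stg_ecommerce_analytics_event.sql):

dbt run --select stg_ecommerce_analytics_event
dbt test --select stg_ecommerce_analytics_event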
Intermediate
At the intermediate level, I wanted to include a new column from another table that already existed at the company’s silver staging level. In this example, that other table was called stg_visitor_identity, and I did a LEFT JOIN of our staging table’s visitor_id onto stg_visitor_identity’s primary key id field, so that I could SELECT the last_seen field from that second table.
WITH

-- The visitor identity table maintained elsewhere at the silver staging level.
visitor_identity AS
(
  SELECT
    *
  FROM
    {{ ref('stg_visitor_identity') }}
),

-- Enrich our staging events with the visitor's last_seen timestamp.
analytics_event AS
(
  SELECT DISTINCT
    event.*,
    visitor_identity.last_seen
  FROM
    {{ ref('stg_ecommerce_analytics_event') }} AS event
    LEFT JOIN visitor_identity ON event.visitor_id = visitor_identity.id
)

SELECT
  *
FROM
  analytics_event
What we’ll then need to do, once again, is include a schema.yaml file to tell dbt about this new intermediate table we’ve created, which in essence will be the same as the staging schema file, just with an additional last_seen column.
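As a minimal sketch, that file might look something like the below (int_ecommerce_analytics_event is a stand-in name for the intermediate model, and the remaining staging columns would be listed exactly as before):

version: 2

models:
  - name: int_ecommerce_analytics_event
    description: Staging analytics events enriched with visitor identity data.
    columns:
      - name: id
        description: Unique identifier for each event
        tests:
          - unique
          - not_null
      # ...the other staging columns as before, plus:
      - name: last_seen
        description: Timestamp when the visitor was last seen on the ecommerce site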
Conclusion
The next step from here is the Gold level. This will be highly dependent on what it is your stakeholders actually want to see analysed. I won’t add an example for that here, as I feel that it starts to get intuitive once you’ve got the Bronze and Silver levels going…
So, what do you think? Ever worked with dbt before? I’d love to hear about your experiences!