dbt (data build tool) has fundamentally changed how I approach data transformations. It brought software engineering practices—version control, testing, documentation, and CI/CD—to the world of analytics engineering. In this comprehensive guide, I’ll share everything you need to know about dbt, from basic concepts to production patterns.
What is dbt and Why Does It Matter?
dbt is a transformation tool that enables data analysts and engineers to transform, test, and document data directly in their warehouse using SQL. Here’s what makes it special:
The dbt Philosophy
| Traditional ETL | dbt Approach |
|---|---|
| Transform in external tool | Transform in the warehouse |
| Black-box transformations | Transparent SQL transformations |
| No testing | Built-in data testing |
| Poor documentation | Auto-generated documentation |
| Hard to version | Git-friendly modular code |
Why dbt Won the Analytics Engineering Space
- Leverages SQL: Uses the language data teams already know
- Warehouse-native: Pushes compute to Snowflake, BigQuery, Redshift, etc.
- Software engineering practices: Modular, testable, version-controlled code
- Documentation as code: Auto-generates data catalogs
- Community ecosystem: dbt Hub with hundreds of pre-built packages
Core dbt Concepts Explained
Models: The Building Blocks
Models are SELECT statements that transform your data. Each model becomes a table or view in your warehouse:
-- models/staging/stg_orders.sql
{{
config(
materialized='view',
tags=['staging', 'orders']
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
-- IDs
id as order_id,
customer_id,
user_id,
-- Timestamps
ordered_at,
shipped_at,
delivered_at,
-- Financials
subtotal,
tax_paid,
discount_amount,
total_amount,
-- Metadata
status,
payment_method,
shipping_address_id
from source
)
select * from renamed
Key points:
- {{ }}: Denotes Jinja templating
- source(): References source tables defined in YAML
- config(): Sets materialization and metadata
- CTEs: Break complex logic into readable steps
Sources: Defining Raw Data
# models/sources.yml
version: 2
sources:
- name: raw
description: "Raw data imported from source systems"
database: analytics
schema: raw_data
tables:
- name: orders
description: "E-commerce order records"
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: customers
description: "Customer master data"
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 24, period: hour}
columns:
- name: id
tests:
- unique
- not_null
Tests: Ensuring Data Quality
# models/schema.yml
version: 2
models:
- name: stg_orders
description: "Staging model for order data"
columns:
- name: order_id
description: "Unique order identifier"
tests:
- unique
- not_null
- name: total_amount
description: "Total order amount in USD"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
inclusive: true
- name: status
description: "Order fulfillment status"
tests:
- accepted_values:
values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
- name: customer_id
description: "Foreign key to customer"
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: fct_orders
description: "Fact table for order analytics"
columns:
- name: order_id
tests:
- unique
- not_null
Custom SQL tests:
-- tests/assert_order_totals.sql
{{
config(severity='error')
}}
-- Ensure order totals are calculated correctly
with order_calculation as (
select
order_id,
subtotal + tax_paid - discount_amount as calculated_total,
total_amount as recorded_total
from {{ ref('fct_orders') }}
)
select *
from order_calculation
where abs(calculated_total - recorded_total) > 0.01
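To make the assertion concrete, here is the same check as a small Python sketch over hypothetical rows (the tolerance mirrors the 0.01 in the SQL):

```python
# Python sketch of the singular test's logic: flag orders whose recorded total
# deviates from subtotal + tax - discount by more than one cent.
# Sample rows are hypothetical.
orders = [
    {"order_id": 1, "subtotal": 100.0, "tax_paid": 8.0, "discount_amount": 5.0, "total_amount": 103.0},
    {"order_id": 2, "subtotal": 50.0, "tax_paid": 4.0, "discount_amount": 0.0, "total_amount": 99.0},
]

def failing_rows(rows, tolerance=0.01):
    """Return rows where the recalculated total disagrees with the recorded one."""
    return [
        r for r in rows
        if abs((r["subtotal"] + r["tax_paid"] - r["discount_amount"]) - r["total_amount"]) > tolerance
    ]

print([r["order_id"] for r in failing_rows(orders)])  # only order 2 fails
```

A dbt singular test passes when its query returns zero rows; the Python analogue passes when this list is empty.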
Snapshots: Tracking Historical Changes
-- snapshots/customers_snapshot.sql
{% snapshot snp_customers %}
{{
config(
target_database='analytics',
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}
This creates a Type 2 Slowly Changing Dimension (SCD) automatically:
| customer_id | name | email | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_current |
|---|---|---|---|---|---|---|
| 1 | John | john@old.com | 2026-01-01 | 2026-01-01 | 2026-02-15 | false |
| 1 | John | john@new.com | 2026-02-15 | 2026-02-15 | NULL | true |
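Conceptually, the timestamp strategy closes out the current row and opens a new one whenever updated_at advances. A minimal Python sketch of that mechanic, using hypothetical rows (not dbt's actual implementation):

```python
from datetime import datetime

# Conceptual sketch of dbt's timestamp snapshot strategy: when updated_at
# advances for a key, close the current version and append a new open-ended one.
def apply_snapshot(history, incoming, key="customer_id", updated_at="updated_at"):
    for row in incoming:
        current = next(
            (h for h in history if h[key] == row[key] and h["dbt_valid_to"] is None),
            None,
        )
        if current is None:
            history.append({**row, "dbt_valid_from": row[updated_at], "dbt_valid_to": None})
        elif row[updated_at] > current[updated_at]:
            current["dbt_valid_to"] = row[updated_at]  # close the old version
            history.append({**row, "dbt_valid_from": row[updated_at], "dbt_valid_to": None})
    return history

history = []
apply_snapshot(history, [{"customer_id": 1, "email": "john@old.com", "updated_at": datetime(2026, 1, 1)}])
apply_snapshot(history, [{"customer_id": 1, "email": "john@new.com", "updated_at": datetime(2026, 2, 15)}])
# history now holds the closed old version plus one open current version
```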
Seeds: Static Reference Data
-- seeds/country_codes.csv
code,name,region
US,United States,North America
CA,Canada,North America
GB,United Kingdom,Europe
DE,Germany,Europe
FR,France,Europe
JP,Japan,Asia
AU,Australia,Oceania
-- Reference in models
select
o.*,
c.name as country_name,
c.region
from {{ ref('stg_orders') }} o
left join {{ ref('country_codes') }} c
on o.country_code = c.code
Advanced dbt Patterns
Incremental Models: Processing Only New Data
-- models/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
tags=['fact', 'orders']
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- Only process new/updated records
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
),
final as (
select
id as order_id,
customer_id,
total_amount,
status,
ordered_at,
updated_at
from source
)
select * from final
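The effect of the is_incremental() filter plus the merge strategy can be sketched in a few lines of Python, with hypothetical rows; dbt actually generates a warehouse MERGE statement for this:

```python
# Simplified sketch of the incremental pattern: compute the target's
# high-water mark once, then upsert only newer source rows by key.
def incremental_merge(target, source_rows, key="order_id", updated_at="updated_at"):
    high_water = max((r[updated_at] for r in target.values()), default=None)
    for row in source_rows:
        if high_water is None or row[updated_at] > high_water:  # is_incremental() filter
            target[row[key]] = row  # merge on unique_key: update or insert
    return target

target = {1: {"order_id": 1, "status": "pending", "updated_at": 10}}
incremental_merge(target, [
    {"order_id": 1, "status": "shipped", "updated_at": 20},  # updated: newer than high-water mark
    {"order_id": 2, "status": "pending", "updated_at": 15},  # inserted: brand new
    {"order_id": 3, "status": "pending", "updated_at": 5},   # skipped: older than high-water mark
])
```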
Mart Layer: Business-Ready Aggregations
-- models/marts/daily_metrics.sql
{{
config(
materialized='table',
tags=['marts', 'metrics']
)
}}
with orders as (
select * from {{ ref('fct_orders') }}
),
daily_order_metrics as (
select
date_trunc('day', ordered_at) as order_date,
-- Order counts
count(*) as total_orders,
count(distinct customer_id) as unique_customers,
-- Revenue metrics
sum(total_amount) as total_revenue,
avg(total_amount) as avg_order_value,
median(total_amount) as median_order_value,
-- Conversion
sum(case when status = 'delivered' then 1 else 0 end) as delivered_orders,
sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_orders,
-- Time-based comparisons (Snowflake syntax; FILTER clauses are not supported there)
sum(case when ordered_at >= dateadd(day, -7, current_date) then total_amount else 0 end) as revenue_last_7d,
sum(case when ordered_at >= dateadd(day, -30, current_date) then total_amount else 0 end) as revenue_last_30d
from orders
group by 1
)
select * from daily_order_metrics
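For intuition, the core of that aggregation looks like this in plain Python over hypothetical order rows:

```python
from collections import defaultdict

# Plain-Python sketch of the daily aggregation: truncate each timestamp
# to its date, then compute per-day counts, customers, and revenue.
def daily_metrics(orders):
    by_day = defaultdict(list)
    for o in orders:
        by_day[o["ordered_at"][:10]].append(o)  # 'YYYY-MM-DDThh:mm' -> 'YYYY-MM-DD'
    return {
        day: {
            "total_orders": len(rows),
            "unique_customers": len({r["customer_id"] for r in rows}),
            "total_revenue": sum(r["total_amount"] for r in rows),
            "delivered_orders": sum(r["status"] == "delivered" for r in rows),
        }
        for day, rows in by_day.items()
    }

orders = [
    {"ordered_at": "2026-01-01T09:00", "customer_id": 1, "total_amount": 40.0, "status": "delivered"},
    {"ordered_at": "2026-01-01T17:30", "customer_id": 1, "total_amount": 60.0, "status": "cancelled"},
]
print(daily_metrics(orders))
```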
Macros: DRY SQL with Jinja
-- macros/currency_conversion.sql
{% macro convert_to_usd(amount, currency_column) %}
case
when {{ currency_column }} = 'USD' then {{ amount }}
when {{ currency_column }} = 'EUR' then {{ amount }} * 1.08
when {{ currency_column }} = 'GBP' then {{ amount }} * 1.27
when {{ currency_column }} = 'JPY' then {{ amount }} * 0.0067
else {{ amount }}
end
{% endmacro %}
Usage in models:
select
order_id,
{{ convert_to_usd('total_amount', 'currency') }} as total_amount_usd
from {{ ref('stg_orders') }}
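Numerically, the rendered CASE expression behaves like this small Python function, using the same illustrative hard-coded rates as the macro above:

```python
# Python equivalent of the rendered CASE expression. Unknown currencies pass
# through at a rate of 1.0, mirroring the macro's else branch.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27, "JPY": 0.0067}

def convert_to_usd(amount, currency):
    return amount * RATES_TO_USD.get(currency, 1.0)

print(round(convert_to_usd(100, "EUR"), 2))  # 108.0
```

Hard-coding rates in a macro is fine for a demo, but in production rates belong in a seed or an exchange-rate table joined by date, as the fct_monthly_revenue example later in this post does.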
Custom Materializations
-- macros/materializations/slowly_changing_dimension.sql
{% materialization slowly_changing_dimension, adapter='snowflake' %}
-- Custom materialization logic
{% set target_relation = this.incorporate(type='table') %}
{% set intermediate_relation = make_intermediate_relation(target_relation) %}
{% set backup_relation = make_backup_relation(target_relation, 'table') %}
{% set existing_relation = load_relation(this) %}
-- Drop leftover temp/backup tables from prior runs
{% do drop_relation_if_exists(intermediate_relation) %}
{% do drop_relation_if_exists(backup_relation) %}
-- Create new table from the model's compiled SQL, stamping each row
{% call statement('main') %}
create table {{ intermediate_relation }} as
select
*,
current_timestamp as dbt_updated_at
from ({{ sql }})
{% endcall %}
-- Swap tables: move the old one aside, promote the new one
{% if existing_relation is not none %}
{% do rename_relation(existing_relation, backup_relation) %}
{% endif %}
{% do rename_relation(intermediate_relation, target_relation) %}
{% do drop_relation_if_exists(backup_relation) %}
{{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
dbt Project Structure
Here’s my recommended project structure:
dbt_project/
├── dbt_project.yml # Project configuration
├── README.md
├── .gitignore
├── packages.yml # External packages
├── profiles.yml # Connection profiles (usually in ~/.dbt)
├── macros/
│ ├── currency_conversion.sql
│ ├── date_utils.sql
│ └── custom_tests.sql
├── models/
│ ├── sources.yml # Source definitions
│ ├── staging/
│ │ ├── stg_orders.sql
│ │ ├── stg_customers.sql
│ │ └── stg_products.sql
│ ├── intermediate/
│ │ ├── int_order_items.sql
│ │ └── int_customer_metrics.sql
│ └── marts/
│ ├── finance/
│ │ ├── fct_revenue.sql
│ │ └── dim_accounts.sql
│ └── marketing/
│ ├── fct_campaigns.sql
│ └── dim_channels.sql
├── seeds/
│ ├── country_codes.csv
│ └── marketing_campaigns.csv
├── snapshots/
│ ├── customers_snapshot.sql
│ └── products_snapshot.sql
├── tests/
│ ├── generic/
│ │ └── assert_positive_revenue.sql
│ └── singular/
│ └── verify_order_totals.sql
└── analyses/
└── customer_cohort_analysis.sql
CI/CD with dbt
GitHub Actions for dbt
# .github/workflows/dbt-ci.yml
name: dbt CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
dbt-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dbt
run: pip install dbt-snowflake
- name: Install dbt packages
# dbt packages such as dbt_utils come from packages.yml via dbt deps, not pip
run: dbt deps
- name: List modified models
# state:modified+ requires prior-run artifacts passed via --state
run: dbt ls --select state:modified+
env:
DBT_PROFILES_DIR: .
- name: Run tests
run: dbt test --fail-fast
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
- name: Build models
run: dbt build --select state:modified+ --fail-fast
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
- name: Generate docs
run: dbt docs generate
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
Performance Optimization
1. Optimize Model Materialization
# dbt_project.yml
models:
my_project:
staging:
+materialized: view
+tags: ['staging']
intermediate:
+materialized: ephemeral
+tags: ['intermediate']
marts:
finance:
+materialized: table
+tags: ['finance', 'critical']
marketing:
+materialized: incremental
+incremental_strategy: merge
+unique_key: campaign_id
2. Use CTEs Strategically
-- BAD: Multiple references to same model
select * from {{ ref('stg_orders') }} where status = 'shipped'
union all
select * from {{ ref('stg_orders') }} where status = 'delivered'
-- GOOD: Single reference with CTE
with orders as (
select * from {{ ref('stg_orders') }}
)
select * from orders where status = 'shipped'
union all
select * from orders where status = 'delivered'
3. Partition and Cluster
-- models/fct_events.sql
-- Note: partition_by with a granularity dict is BigQuery syntax; on Snowflake rely on clustering keys instead
{{
config(
materialized='incremental',
partition_by={
"field": "event_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['event_type', 'user_id']
)
}}
select
event_id,
user_id,
event_type,
date_trunc('day', event_timestamp) as event_date,
-- ... other columns
from {{ source('raw', 'events') }}
Monitoring and Alerting
dbt Cloud with Slack
# dbt Cloud webhook configuration
notifications:
- type: slack
channels:
- "#data-alerts"
events:
- job_failure
- job_warning
# Custom alerting in models
{{
config(
post_hook="{{ log_data_quality_metrics(this) }}"
)
}}
Data Quality Monitoring Macro
-- macros/log_data_quality_metrics.sql
{% macro log_data_quality_metrics(model) %}
{% set query %}
insert into analytics.data_quality_metrics
(model_name, check_date, row_count, null_rate, freshness_hours)
select
'{{ model.name }}' as model_name,
current_timestamp as check_date,
count(*) as row_count,
sum(case when id is null then 1 else 0 end)::float / count(*) as null_rate,
timestampdiff(hour, max(updated_at), current_timestamp) as freshness_hours
from {{ model }}
{% endset %}
{% do run_query(query) %}
{% endmacro %}
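The metrics the macro logs reduce to three simple calculations, sketched here in Python with hypothetical rows:

```python
from datetime import datetime

# Python sketch of the three metrics the macro logs: row count,
# null rate on id, and hours since the most recent update.
def quality_metrics(rows, now):
    row_count = len(rows)
    null_rate = sum(1 for r in rows if r["id"] is None) / row_count
    freshness_hours = (now - max(r["updated_at"] for r in rows)).total_seconds() / 3600
    return {"row_count": row_count, "null_rate": null_rate, "freshness_hours": freshness_hours}

rows = [
    {"id": 1, "updated_at": datetime(2026, 1, 2, 6, 0)},
    {"id": None, "updated_at": datetime(2026, 1, 1, 12, 0)},
]
print(quality_metrics(rows, datetime(2026, 1, 2, 12, 0)))
# {'row_count': 2, 'null_rate': 0.5, 'freshness_hours': 6.0}
```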
Common dbt Packages
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: dbt-labs/codegen
version: 0.9.0
- package: dbt-labs/audit_helper
version: 0.8.0
- package: calogica/dbt_expectations
version: 0.8.5
- package: elementary-data/elementary
version: 0.12.0
Useful dbt_utils macros:
-- Generate surrogate key
{{ dbt_utils.generate_surrogate_key(['first_name', 'last_name', 'birth_date']) }}
-- Cross-database date truncation (these cross-db macros moved into the dbt namespace in dbt Core 1.2; dbt_utils 1.x no longer ships them)
{{ dbt.date_trunc('day', 'order_timestamp') }}
-- Pivot data
{{ dbt_utils.pivot('status', ['pending', 'shipped', 'delivered']) }}
-- Compare row counts between two models (a generic test; attach it in YAML)
-- tests:
--   - dbt_utils.equal_rowcount:
--       compare_model: ref('model_b')
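generate_surrogate_key hashes the stringified, null-coalesced inputs. A Python approximation (the exact null placeholder constant is an assumption based on dbt_utils' default):

```python
import hashlib

# Approximation of dbt_utils.generate_surrogate_key: md5 over the stringified,
# null-coalesced values joined with '-'. Treat the placeholder as an assumption.
def surrogate_key(*values, null_placeholder="_dbt_utils_surrogate_key_null_"):
    joined = "-".join(null_placeholder if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode()).hexdigest()

key = surrogate_key("Ada", "Lovelace", "1815-12-10")  # deterministic 32-char hex string
```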
Complete Production Example
Here’s a complete model from my production setup:
-- models/marts/finance/fct_monthly_revenue.sql
{{
config(
materialized='table',
tags=['finance', 'revenue', 'monthly'],
post_hook=[
"{{ grant_select(['analytics_bi_role']) }}"
]
)
}}
with orders as (
select * from {{ ref('fct_orders') }}
where status not in ('cancelled', 'refunded')
),
currency_conversion as (
select * from {{ ref('dim_exchange_rates') }}
),
converted_orders as (
select
o.order_id,
o.customer_id,
o.ordered_at,
{{ dbt.date_trunc('month', 'o.ordered_at') }} as revenue_month,
o.total_amount * coalesce(c.rate_to_usd, 1.0) as total_amount_usd,
o.tax_paid * coalesce(c.rate_to_usd, 1.0) as tax_paid_usd,
o.discount_amount * coalesce(c.rate_to_usd, 1.0) as discount_usd
from orders o
left join currency_conversion c
on o.currency = c.currency_code
and date_trunc('month', o.ordered_at) = c.rate_month
),
monthly_aggregation as (
select
revenue_month,
-- Volume metrics
count(distinct order_id) as total_orders,
count(distinct customer_id) as unique_customers,
-- Revenue metrics
sum(total_amount_usd) as gross_revenue,
sum(tax_paid_usd) as tax_revenue,
sum(discount_usd) as total_discounts,
sum(total_amount_usd) - sum(discount_usd) as net_revenue,
-- Average metrics
avg(total_amount_usd) as avg_order_value,
percentile_cont(0.5) within group (order by total_amount_usd) as median_order_value,
-- Customer metrics
count(distinct order_id) * 1.0 / nullif(count(distinct customer_id), 0) as orders_per_customer,
-- Month-over-month comparison
lag(sum(total_amount_usd), 1) over (order by revenue_month) as prev_month_revenue,
(
sum(total_amount_usd) - lag(sum(total_amount_usd), 1) over (order by revenue_month)
) / nullif(lag(sum(total_amount_usd), 1) over (order by revenue_month), 0) * 100 as mom_growth_pct
from converted_orders
group by 1
)
select * from monthly_aggregation
order by revenue_month
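The lag()-based month-over-month window logic at the end reduces to this Python sketch, where the prev == 0 guard plays the role of nullif:

```python
# Percent change of each month's revenue versus the prior month
# (None where no prior month exists or the prior month was zero).
def mom_growth_pct(monthly_revenue):
    growth = [None]  # first month has no prior month
    for prev, cur in zip(monthly_revenue, monthly_revenue[1:]):
        growth.append(None if prev == 0 else (cur - prev) / prev * 100)
    return growth

print(mom_growth_pct([100.0, 120.0, 90.0]))  # [None, 20.0, -25.0]
```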
Key Takeaways
dbt transforms data engineering by:
- Treating SQL as code: Version control, code review, CI/CD
- Testing everything: Schema tests, data tests, custom assertions
- Documenting automatically: Auto-generated data catalogs
- Modular design: Reusable models, macros, and tests
- Community-driven: Hundreds of packages and best practices
Master dbt, and you’ll build data transformations that are reliable, maintainable, and well-documented.
Questions about dbt implementation? Reach out through the contact page or connect on LinkedIn.