dbt (data build tool) has fundamentally changed how I approach data transformations. It brought software engineering practices—version control, testing, documentation, and CI/CD—to the world of analytics engineering. In this comprehensive guide, I’ll share everything you need to know about dbt, from basic concepts to production patterns.
What is dbt and Why Does It Matter?
dbt is a transformation tool that enables data analysts and engineers to transform, test, and document data directly in their warehouse using SQL. Here’s what makes it special:
The dbt Philosophy
| Traditional ETL | dbt Approach |
|---|---|
| Transform in external tool | Transform in the warehouse |
| Black-box transformations | Transparent SQL transformations |
| No testing | Built-in data testing |
| Poor documentation | Auto-generated documentation |
| Hard to version | Git-friendly modular code |
Why dbt Won the Analytics Engineering Space
- Leverages SQL: Uses the language data teams already know
- Warehouse-native: Pushes compute to Snowflake, BigQuery, Redshift, etc.
- Software engineering practices: Modular, testable, version-controlled code
- Documentation as code: Auto-generates data catalogs
- Community ecosystem: dbt Hub with hundreds of pre-built packages
Core dbt Concepts Explained
Models: The Building Blocks
Models are SELECT statements that transform your data. Each model becomes a table or view in your warehouse:
-- models/staging/stg_orders.sql
{{
config(
materialized='view',
tags=['staging', 'orders']
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
),
renamed as (
select
-- IDs
id as order_id,
customer_id,
user_id,
-- Timestamps
ordered_at,
shipped_at,
delivered_at,
-- Financials
subtotal,
tax_paid,
discount_amount,
total_amount,
-- Metadata
status,
payment_method,
shipping_address_id
from source
)
select * from renamed
Key points:
- {{ }}: Denotes Jinja templating
- source(): References source tables defined in YAML
- config(): Sets materialization and metadata
- CTEs: Break complex logic into readable steps
Sources: Defining Raw Data
# models/sources.yml
version: 2
sources:
- name: raw
description: "Raw data imported from source systems"
database: analytics
schema: raw_data
tables:
- name: orders
description: "E-commerce order records"
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 6, period: hour}
error_after: {count: 24, period: hour}
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: customers
description: "Customer master data"
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 24, period: hour}
columns:
- name: id
tests:
- unique
- not_null
Tests: Ensuring Data Quality
# models/schema.yml
version: 2
models:
- name: stg_orders
description: "Staging model for order data"
columns:
- name: order_id
description: "Unique order identifier"
tests:
- unique
- not_null
- name: total_amount
description: "Total order amount in USD"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
inclusive: true
- name: status
description: "Order fulfillment status"
tests:
- accepted_values:
values: ['pending', 'processing', 'shipped', 'delivered', 'cancelled']
- name: customer_id
description: "Foreign key to customer"
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
- name: fct_orders
description: "Fact table for order analytics"
columns:
- name: order_id
tests:
- unique
- not_null
Custom SQL tests:
-- tests/assert_order_totals.sql
{{
config(severity='error')
}}
-- Ensure order totals are calculated correctly
with order_calculation as (
select
order_id,
subtotal + tax_paid - discount_amount as calculated_total,
total_amount as recorded_total
from {{ ref('fct_orders') }}
)
select *
from order_calculation
where abs(calculated_total - recorded_total) > 0.01
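To make the assertion concrete, here is the same check as a small Python sketch over hypothetical rows (the tolerance mirrors the 0.01 in the SQL):

```python
# Python sketch of the singular test's logic: flag orders whose recorded total
# deviates from subtotal + tax - discount by more than one cent.
# Sample rows are hypothetical.
orders = [
    {"order_id": 1, "subtotal": 100.0, "tax_paid": 8.0, "discount_amount": 5.0, "total_amount": 103.0},
    {"order_id": 2, "subtotal": 50.0, "tax_paid": 4.0, "discount_amount": 0.0, "total_amount": 99.0},
]

def failing_rows(rows, tolerance=0.01):
    """Return rows where the recalculated total disagrees with the recorded one."""
    return [
        r for r in rows
        if abs((r["subtotal"] + r["tax_paid"] - r["discount_amount"]) - r["total_amount"]) > tolerance
    ]

print([r["order_id"] for r in failing_rows(orders)])  # only order 2 fails
```

A dbt singular test passes when its query returns zero rows; the Python analogue passes when this list is empty.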
Snapshots: Tracking Historical Changes
-- snapshots/customers_snapshot.sql
{% snapshot snp_customers %}
{{
config(
target_database='analytics',
target_schema='snapshots',
unique_key='customer_id',
strategy='timestamp',
updated_at='updated_at',
)
}}
select * from {{ source('raw', 'customers') }}
{% endsnapshot %}
This creates a Type 2 Slowly Changing Dimension (SCD) automatically:
| customer_id | name | email | updated_at | dbt_valid_from | dbt_valid_to | dbt_is_current |
|---|---|---|---|---|---|---|
| 1 | John | john@old.com | 2026-01-01 | 2026-01-01 | 2026-02-15 | false |
| 1 | John | john@new.com | 2026-02-15 | 2026-02-15 | NULL | true |
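Conceptually, the timestamp strategy closes out the current row and opens a new one whenever updated_at advances. A minimal Python sketch of that mechanic, using hypothetical rows (not dbt's actual implementation):

```python
from datetime import datetime

# Conceptual sketch of dbt's timestamp snapshot strategy: when updated_at
# advances for a key, close the current version and append a new open-ended one.
def apply_snapshot(history, incoming, key="customer_id", updated_at="updated_at"):
    for row in incoming:
        current = next(
            (h for h in history if h[key] == row[key] and h["dbt_valid_to"] is None),
            None,
        )
        if current is None:
            history.append({**row, "dbt_valid_from": row[updated_at], "dbt_valid_to": None})
        elif row[updated_at] > current[updated_at]:
            current["dbt_valid_to"] = row[updated_at]  # close the old version
            history.append({**row, "dbt_valid_from": row[updated_at], "dbt_valid_to": None})
    return history

history = []
apply_snapshot(history, [{"customer_id": 1, "email": "john@old.com", "updated_at": datetime(2026, 1, 1)}])
apply_snapshot(history, [{"customer_id": 1, "email": "john@new.com", "updated_at": datetime(2026, 2, 15)}])
# history now holds the closed old version plus one open current version
```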
Seeds: Static Reference Data
-- seeds/country_codes.csv
code,name,region
US,United States,North America
CA,Canada,North America
GB,United Kingdom,Europe
DE,Germany,Europe
FR,France,Europe
JP,Japan,Asia
AU,Australia,Oceania
-- Reference in models
select
o.*,
c.name as country_name,
c.region
from {{ ref('stg_orders') }} o
left join {{ ref('country_codes') }} c
on o.country_code = c.code
Advanced dbt Patterns
Incremental Models: Processing Only New Data
-- models/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge',
tags=['fact', 'orders']
)
}}
with source as (
select * from {{ source('raw', 'orders') }}
{% if is_incremental() %}
-- Only process new/updated records
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
),
final as (
select
id as order_id,
customer_id,
total_amount,
status,
ordered_at,
updated_at
from source
)
select * from final
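The effect of the is_incremental() filter plus the merge strategy can be sketched in a few lines of Python, with hypothetical rows; dbt actually generates a warehouse MERGE statement for this:

```python
# Simplified sketch of the incremental pattern: compute the target's
# high-water mark once, then upsert only newer source rows by key.
def incremental_merge(target, source_rows, key="order_id", updated_at="updated_at"):
    high_water = max((r[updated_at] for r in target.values()), default=None)
    for row in source_rows:
        if high_water is None or row[updated_at] > high_water:  # is_incremental() filter
            target[row[key]] = row  # merge on unique_key: update or insert
    return target

target = {1: {"order_id": 1, "status": "pending", "updated_at": 10}}
incremental_merge(target, [
    {"order_id": 1, "status": "shipped", "updated_at": 20},  # updated: newer than high-water mark
    {"order_id": 2, "status": "pending", "updated_at": 15},  # inserted: brand new
    {"order_id": 3, "status": "pending", "updated_at": 5},   # skipped: older than high-water mark
])
```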
Mart Layer: Business-Ready Aggregations
-- models/marts/daily_metrics.sql
{{
config(
materialized='table',
tags=['marts', 'metrics']
)
}}
with orders as (
select * from {{ ref('fct_orders') }}
),
daily_order_metrics as (
select
date_trunc('day', ordered_at) as order_date,
-- Order counts
count(*) as total_orders,
count(distinct customer_id) as unique_customers,
-- Revenue metrics
sum(total_amount) as total_revenue,
avg(total_amount) as avg_order_value,
median(total_amount) as median_order_value,
-- Conversion
sum(case when status = 'delivered' then 1 else 0 end) as delivered_orders,
sum(case when status = 'cancelled' then 1 else 0 end) as cancelled_orders,
-- Time-based comparisons (Snowflake syntax; FILTER clauses are not supported there)
sum(case when ordered_at >= dateadd(day, -7, current_date) then total_amount else 0 end) as revenue_last_7d,
sum(case when ordered_at >= dateadd(day, -30, current_date) then total_amount else 0 end) as revenue_last_30d
from orders
group by 1
)
select * from daily_order_metrics
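For intuition, the core of that aggregation looks like this in plain Python over hypothetical order rows:

```python
from collections import defaultdict

# Plain-Python sketch of the daily aggregation: truncate each timestamp
# to its date, then compute per-day counts, customers, and revenue.
def daily_metrics(orders):
    by_day = defaultdict(list)
    for o in orders:
        by_day[o["ordered_at"][:10]].append(o)  # 'YYYY-MM-DDThh:mm' -> 'YYYY-MM-DD'
    return {
        day: {
            "total_orders": len(rows),
            "unique_customers": len({r["customer_id"] for r in rows}),
            "total_revenue": sum(r["total_amount"] for r in rows),
            "delivered_orders": sum(r["status"] == "delivered" for r in rows),
        }
        for day, rows in by_day.items()
    }

orders = [
    {"ordered_at": "2026-01-01T09:00", "customer_id": 1, "total_amount": 40.0, "status": "delivered"},
    {"ordered_at": "2026-01-01T17:30", "customer_id": 1, "total_amount": 60.0, "status": "cancelled"},
]
print(daily_metrics(orders))
```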
Macros: DRY SQL with Jinja
-- macros/currency_conversion.sql
{% macro convert_to_usd(amount, currency_column) %}
case
when {{ currency_column }} = 'USD' then {{ amount }}
when {{ currency_column }} = 'EUR' then {{ amount }} * 1.08
when {{ currency_column }} = 'GBP' then {{ amount }} * 1.27
when {{ currency_column }} = 'JPY' then {{ amount }} * 0.0067
else {{ amount }}
end
{% endmacro %}
Usage in models:
select
order_id,
{{ convert_to_usd('total_amount', 'currency') }} as total_amount_usd
from {{ ref('stg_orders') }}
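Numerically, the rendered CASE expression behaves like this small Python function, using the same illustrative hard-coded rates as the macro above:

```python
# Python equivalent of the rendered CASE expression. Unknown currencies pass
# through at a rate of 1.0, mirroring the macro's else branch.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27, "JPY": 0.0067}

def convert_to_usd(amount, currency):
    return amount * RATES_TO_USD.get(currency, 1.0)

print(round(convert_to_usd(100, "EUR"), 2))  # 108.0
```

Hard-coding rates in a macro is fine for a demo, but in production rates belong in a seed or an exchange-rate table joined by date, as the fct_monthly_revenue example later in this post does.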
Custom Materializations
-- macros/materializations/slowly_changing_dimension.sql
{% materialization slowly_changing_dimension, adapter='snowflake' %}
-- Custom materialization logic
{% set target_relation = this.incorporate(type='table') %}
{% set intermediate_relation = make_intermediate_relation(target_relation) %}
{% set backup_relation = make_backup_relation(target_relation, 'table') %}
{% set existing_relation = load_relation(this) %}
-- Drop leftover temp/backup tables from prior runs
{% do drop_relation_if_exists(intermediate_relation) %}
{% do drop_relation_if_exists(backup_relation) %}
-- Create new table from the model's compiled SQL, stamping each row
{% call statement('main') %}
create table {{ intermediate_relation }} as
select
*,
current_timestamp as dbt_updated_at
from ({{ sql }})
{% endcall %}
-- Swap tables: move the old one aside, promote the new one
{% if existing_relation is not none %}
{% do rename_relation(existing_relation, backup_relation) %}
{% endif %}
{% do rename_relation(intermediate_relation, target_relation) %}
{% do drop_relation_if_exists(backup_relation) %}
{{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
dbt Project Structure
Here’s my recommended project structure:
dbt_project/
├── dbt_project.yml # Project configuration
├── README.md
├── .gitignore
├── packages.yml # External packages
├── profiles.yml # Connection profiles (usually in ~/.dbt)
├── macros/
│ ├── currency_conversion.sql
│ ├── date_utils.sql
│ └── custom_tests.sql
├── models/
│ ├── sources.yml # Source definitions
│ ├── staging/
│ │ ├── stg_orders.sql
│ │ ├── stg_customers.sql
│ │ └── stg_products.sql
│ ├── intermediate/
│ │ ├── int_order_items.sql
│ │ └── int_customer_metrics.sql
│ └── marts/
│ ├── finance/
│ │ ├── fct_revenue.sql
│ │ └── dim_accounts.sql
│ └── marketing/
│ ├── fct_campaigns.sql
│ └── dim_channels.sql
├── seeds/
│ ├── country_codes.csv
│ └── marketing_campaigns.csv
├── snapshots/
│ ├── customers_snapshot.sql
│ └── products_snapshot.sql
├── tests/
│ ├── generic/
│ │ └── assert_positive_revenue.sql
│ └── singular/
│ └── verify_order_totals.sql
└── analyses/
└── customer_cohort_analysis.sql
CI/CD with dbt
GitHub Actions for dbt
# .github/workflows/dbt-ci.yml
name: dbt CI/CD
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
dbt-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.10'
- name: Install dbt
run: pip install dbt-snowflake
- name: Install dbt packages
# dbt packages such as dbt_utils come from packages.yml via dbt deps, not pip
run: dbt deps
- name: List modified models
# state:modified+ requires prior-run artifacts passed via --state
run: dbt ls --select state:modified+
env:
DBT_PROFILES_DIR: .
- name: Run tests
run: dbt test --fail-fast
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
- name: Build models
run: dbt build --select state:modified+ --fail-fast
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
- name: Generate docs
run: dbt docs generate
env:
SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
SNOWFLAKE_USER: ${{ secrets.SNOWFLAKE_USER }}
SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
DBT_PROFILES_DIR: .
Performance Optimization
1. Optimize Model Materialization
# dbt_project.yml
models:
my_project:
staging:
+materialized: view
+tags: ['staging']
intermediate:
+materialized: ephemeral
+tags: ['intermediate']
marts:
finance:
+materialized: table
+tags: ['finance', 'critical']
marketing:
+materialized: incremental
+incremental_strategy: merge
+unique_key: campaign_id
2. Use CTEs Strategically
-- BAD: Multiple references to same model
select * from {{ ref('stg_orders') }} where status = 'shipped'
union all
select * from {{ ref('stg_orders') }} where status = 'delivered'
-- GOOD: Single reference with CTE
with orders as (
select * from {{ ref('stg_orders') }}
)
select * from orders where status = 'shipped'
union all
select * from orders where status = 'delivered'
3. Partition and Cluster
-- models/fct_events.sql
-- Note: partition_by with a granularity dict is BigQuery syntax; on Snowflake rely on clustering keys instead
{{
config(
materialized='incremental',
partition_by={
"field": "event_date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['event_type', 'user_id']
)
}}
select
event_id,
user_id,
event_type,
date_trunc('day', event_timestamp) as event_date,
-- ... other columns
from {{ source('raw', 'events') }}
Monitoring and Alerting
dbt Cloud with Slack
# dbt Cloud webhook configuration
notifications:
- type: slack
channels:
- "#data-alerts"
events:
- job_failure
- job_warning
# Custom alerting in models
{{
config(
post_hook="{{ log_data_quality_metrics(this) }}"
)
}}
Data Quality Monitoring Macro
-- macros/log_data_quality_metrics.sql
{% macro log_data_quality_metrics(model) %}
{% set query %}
insert into analytics.data_quality_metrics
(model_name, check_date, row_count, null_rate, freshness_hours)
select
'{{ model.name }}' as model_name,
current_timestamp as check_date,
count(*) as row_count,
sum(case when id is null then 1 else 0 end)::float / count(*) as null_rate,
timestampdiff(hour, max(updated_at), current_timestamp) as freshness_hours
from {{ model }}
{% endset %}
{% do run_query(query) %}
{% endmacro %}
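The metrics the macro logs reduce to three simple calculations, sketched here in Python with hypothetical rows:

```python
from datetime import datetime

# Python sketch of the three metrics the macro logs: row count,
# null rate on id, and hours since the most recent update.
def quality_metrics(rows, now):
    row_count = len(rows)
    null_rate = sum(1 for r in rows if r["id"] is None) / row_count
    freshness_hours = (now - max(r["updated_at"] for r in rows)).total_seconds() / 3600
    return {"row_count": row_count, "null_rate": null_rate, "freshness_hours": freshness_hours}

rows = [
    {"id": 1, "updated_at": datetime(2026, 1, 2, 6, 0)},
    {"id": None, "updated_at": datetime(2026, 1, 1, 12, 0)},
]
print(quality_metrics(rows, datetime(2026, 1, 2, 12, 0)))
# {'row_count': 2, 'null_rate': 0.5, 'freshness_hours': 6.0}
```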
Common dbt Packages
# packages.yml
packages:
- package: dbt-labs/dbt_utils
version: 1.1.1
- package: dbt-labs/codegen
version: 0.9.0
- package: dbt-labs/audit_helper
version: 0.8.0
- package: calogica/dbt_expectations
version: 0.8.5
- package: elementary-data/elementary
version: 0.12.0
Useful dbt_utils macros:
-- Generate surrogate key
{{ dbt_utils.generate_surrogate_key(['first_name', 'last_name', 'birth_date']) }}
-- Cross-database date truncation (these cross-db macros moved into the dbt namespace in dbt Core 1.2; dbt_utils 1.x no longer ships them)
{{ dbt.date_trunc('day', 'order_timestamp') }}
-- Pivot data
{{ dbt_utils.pivot('status', ['pending', 'shipped', 'delivered']) }}
-- Compare row counts between two models (a generic test; attach it in YAML)
-- tests:
--   - dbt_utils.equal_rowcount:
--       compare_model: ref('model_b')
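generate_surrogate_key hashes the stringified, null-coalesced inputs. A Python approximation (the exact null placeholder constant is an assumption based on dbt_utils' default):

```python
import hashlib

# Approximation of dbt_utils.generate_surrogate_key: md5 over the stringified,
# null-coalesced values joined with '-'. Treat the placeholder as an assumption.
def surrogate_key(*values, null_placeholder="_dbt_utils_surrogate_key_null_"):
    joined = "-".join(null_placeholder if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode()).hexdigest()

key = surrogate_key("Ada", "Lovelace", "1815-12-10")  # deterministic 32-char hex string
```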
Complete Production Example
Here’s a complete model from my production setup:
-- models/marts/finance/fct_monthly_revenue.sql
{{
config(
materialized='table',
tags=['finance', 'revenue', 'monthly'],
post_hook=[
"{{ grant_select(['analytics_bi_role']) }}"
]
)
}}
with orders as (
select * from {{ ref('fct_orders') }}
where status not in ('cancelled', 'refunded')
),
currency_conversion as (
select * from {{ ref('dim_exchange_rates') }}
),
converted_orders as (
select
o.order_id,
o.customer_id,
o.ordered_at,
{{ dbt.date_trunc('month', 'o.ordered_at') }} as revenue_month,
o.total_amount * coalesce(c.rate_to_usd, 1.0) as total_amount_usd,
o.tax_paid * coalesce(c.rate_to_usd, 1.0) as tax_paid_usd,
o.discount_amount * coalesce(c.rate_to_usd, 1.0) as discount_usd
from orders o
left join currency_conversion c
on o.currency = c.currency_code
and date_trunc('month', o.ordered_at) = c.rate_month
),
monthly_aggregation as (
select
revenue_month,
-- Volume metrics
count(distinct order_id) as total_orders,
count(distinct customer_id) as unique_customers,
-- Revenue metrics
sum(total_amount_usd) as gross_revenue,
sum(tax_paid_usd) as tax_revenue,
sum(discount_usd) as total_discounts,
sum(total_amount_usd) - sum(discount_usd) as net_revenue,
-- Average metrics
avg(total_amount_usd) as avg_order_value,
percentile_cont(0.5) within group (order by total_amount_usd) as median_order_value,
-- Customer metrics
count(distinct order_id) * 1.0 / nullif(count(distinct customer_id), 0) as orders_per_customer,
-- Month-over-month comparison
lag(sum(total_amount_usd), 1) over (order by revenue_month) as prev_month_revenue,
(
sum(total_amount_usd) - lag(sum(total_amount_usd), 1) over (order by revenue_month)
) / nullif(lag(sum(total_amount_usd), 1) over (order by revenue_month), 0) * 100 as mom_growth_pct
from converted_orders
group by 1
)
select * from monthly_aggregation
order by revenue_month
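The lag()-based month-over-month window logic at the end reduces to this Python sketch, where the prev == 0 guard plays the role of nullif:

```python
# Percent change of each month's revenue versus the prior month
# (None where no prior month exists or the prior month was zero).
def mom_growth_pct(monthly_revenue):
    growth = [None]  # first month has no prior month
    for prev, cur in zip(monthly_revenue, monthly_revenue[1:]):
        growth.append(None if prev == 0 else (cur - prev) / prev * 100)
    return growth

print(mom_growth_pct([100.0, 120.0, 90.0]))  # [None, 20.0, -25.0]
```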
Key Takeaways
dbt transforms data engineering by:
- Treating SQL as code: Version control, code review, CI/CD
- Testing everything: Schema tests, data tests, custom assertions
- Documenting automatically: Auto-generated data catalogs
- Modular design: Reusable models, macros, and tests
- Community-driven: Hundreds of packages and best practices
Master dbt, and you’ll build data transformations that are reliable, maintainable, and well-documented.
Questions about dbt implementation? Reach out through the contact page or connect on LinkedIn.