dbt Sources, Models, Tests, Snapshots & Documentation — A Practical Deep Dive
dbt (Data Build Tool) has become the de-facto standard for SQL-first data transformation in modern warehouses like Snowflake, BigQuery, Redshift, and Databricks. What makes dbt powerful isn’t just that it runs SQL in your warehouse—it’s that it brings software engineering discipline to analytics: modular code, version control, automated tests, documentation, and reproducibility. In this deep dive, we’ll explore dbt’s core building blocks—sources, models, tests, snapshots, and documentation—then tie them together with best practices, workflows, and CI/CD so you can ship reliable data products.
Why These Components Matter Together
Think of dbt as a pipeline where sources define raw inputs, models transform them into analytics-ready datasets, tests continuously validate quality, snapshots preserve history, and documentation explains what everything means. The result: trusted data, faster iteration, and fewer late-night fire drills.
1) Sources — Declare Your Raw Data Once
A source is a reference to a raw table in your warehouse. Instead of hardcoding database/schema/table strings throughout your SQL, you declare them once in YAML and reference them with source(). Benefits: portability, central governance, freshness checks, and clear lineage.
version: 2

sources:
  - name: raw
    schema: raw_data
    description: "Raw ingested data from apps and vendors."
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: orders
        description: "All orders from the ecommerce platform"
        loaded_at_field: ingested_at
      - name: customers
        description: "Customer master from CRM"
Use a source inside a model like this:
select * from {{ source('raw', 'orders') }}
Add freshness checks to critical sources so failed data loads are caught early, before analytics break.
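You can check declared freshness from the command line; dbt compares the latest loaded_at_field value against the thresholds above:

# Check freshness for all declared sources
dbt source freshness
# Or just one table
dbt source freshness --select source:raw.orders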
2) Models — Staging → Intermediate → Marts
A model is a single SQL SELECT statement saved under models/. dbt compiles the SQL and materializes it as a view or table.
The most productive teams organize models into layers:
- Staging (stg_*): one-table-in, one-table-out; rename columns, cast types, apply light cleaning.
- Intermediate (int_*): business logic, joins, deduplication, window functions.
- Marts (dim_*, fct_*): analytics-ready dimensions and facts for BI and downstream consumers.
-- models/staging/stg_orders.sql
select
id as order_id,
customer_id,
cast(order_timestamp as timestamp) as order_ts,
status,
total_amount::numeric(18,2) as total_amount
from {{ source('raw', 'orders') }}
-- models/marts/fct_orders.sql
with base as (
select
order_id,
customer_id,
date_trunc('day', order_ts) as order_day,
total_amount
from {{ ref('stg_orders') }}
)
select
order_day,
customer_id,
count(*) as orders_cnt,
sum(total_amount) as revenue
from base
group by 1, 2
Materializations & Project Defaults
You can control whether models build as tables or views globally or per folder:
# dbt_project.yml
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
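If a single model needs to deviate from its folder default, a config() block at the top of that model file overrides it; a minimal sketch:

-- at the top of a model file (illustrative override)
{{ config(materialized='table') }}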
3) Tests — Data Quality as Code
dbt tests are SQL assertions that fail when they return any rows. There are two flavors: generic (parameterized in YAML) and singular (custom SQL files in tests/). Running dbt test turns data quality into a first-class CI step.
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders"
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'cancelled']

sources:
  - name: raw
    tables:
      - name: orders
        columns:
          - name: id
            tests: [not_null]
Example of a singular (custom) test:
-- tests/no_future_orders.sql
select *
from {{ ref('stg_orders') }}
where order_ts > current_timestamp
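During development you can scope test runs to a single model instead of the whole project:

# Run only the tests defined on stg_orders
dbt test --select stg_orders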
4) Snapshots — Track How Records Change Over Time
Snapshots implement slowly changing dimensions (SCD Type 2) without specialized ETL tools. dbt writes a history table using a unique_key and either a timestamp strategy or a check (column-comparison) strategy. This is essential for auditing, compliance, and time-travel analytics.
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
Snapshot tables include metadata columns such as dbt_valid_from, dbt_valid_to, and dbt_scd_id.
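Build snapshots with dbt snapshot (they are also included in dbt build). Because dbt_valid_to is null for the current version of each record, querying current state is straightforward; a minimal sketch against the snapshot above, assuming the snapshots target schema from its config:

-- Current version of every customer
select *
from snapshots.customers_snapshot
where dbt_valid_to is null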
5) Documentation — Lineage & Meaning, Auto-Generated
dbt lets you describe projects directly in YAML. Running dbt docs generate produces a static documentation site with a searchable catalog and a lineage graph. This turns tribal knowledge into shared, living documentation.
version: 2

models:
  - name: dim_customers
    description: "Customer dimension used by BI dashboards and churn models."
    columns:
      - name: customer_id
        description: "Primary key."
      - name: country
        description: "Country at signup. ISO-2 where available."
Serve docs locally while developing:
dbt docs generate
dbt docs serve
6) Macros & Packages — Reuse and Standardize
Use macros (Jinja + SQL) to eliminate repetition, and install packages like dbt_utils or dbt-expectations for common patterns and richer tests.
{% macro safe_div(n, d) -%}
    case when {{ d }} = 0 then null else {{ n }} / {{ d }} end
{%- endmacro %}
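Calling the macro from a model keeps the null-safe division logic in one place; an illustrative average-order-value calculation using the staging model from earlier:

select
    customer_id,
    {{ safe_div('sum(total_amount)', 'count(*)') }} as avg_order_value
from {{ ref('stg_orders') }}
group by 1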
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
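Run dbt deps to install declared packages. Package tests are then used like built-in ones; for example, a sketch asserting the grain of fct_orders with dbt_utils.unique_combination_of_columns (the YAML below is illustrative):

# Install packages declared in packages.yml
dbt deps

# schema.yml (illustrative)
models:
  - name: fct_orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_day
            - customer_id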
7) dbt run vs dbt test vs dbt build
| Command | What it does | When to use |
|---|---|---|
| dbt run | Builds models (materializes views/tables) in dependency order. | During development to create/update models quickly. |
| dbt test | Executes data quality tests (generic + singular) against existing relations. | Validate data after models exist; fast feedback on integrity. |
| dbt build | Runs seeds, models, snapshots, and tests together in DAG order. | CI/CD and production jobs to guarantee fresh builds and tests together. |
8) A Practical Workflow (Local → PR → CI → Prod)
- Develop locally: Create/modify models, add tests and docs. Run dbt run and dbt test.
- Open a PR: Colleagues review SQL, naming, tests, and performance implications.
- CI pipeline: On PR, run dbt build (optionally against a scratch schema/environment). Fail the build if any tests fail. See the CI sketch after this list.
- Merge & deploy: A scheduled or on-merge job runs dbt build in production.
- Observe: Monitor job run times, test failure rate, and model freshness. Iterate.
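A minimal CI sketch, assuming GitHub Actions, dbt Core with the Snowflake adapter, and credentials wired through environment variables; the workflow name, target, and secret are placeholders to adapt to your setup:

# .github/workflows/dbt-ci.yml (illustrative)
name: dbt CI
on: pull_request
jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake   # swap for your warehouse's adapter
      - run: dbt deps
      - run: dbt build --target ci       # fails the pipeline if any model or test fails
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}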
9) Best Practices That Prevent Pain Later
- Adopt clear layers and naming: stg_* for staging; int_* for intermediate logic; dim_* and fct_* for marts.
- Keep models small and focused: Prefer several simple models over one mega-query. Compose with ref().
- Test primary keys and relationships: unique + not_null on IDs; relationships for foreign keys.
- Document as you go: Add description fields to models/columns and regenerate docs in PRs.
- Parameterize repetitive logic with macros: DRY code is easier to maintain and optimize.
- Use dbt build in CI: Ensure seeds, snapshots, models, and tests run together.
- Watch warehouse costs: Materialize heavy models as tables, partition/cluster when possible, and prune unused models. See the config sketch after this list.
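To illustrate the cost point, a per-model config can force a table materialization and, on BigQuery, add partitioning and clustering; the keys below reuse columns from the fct_orders example and are assumptions to adapt to your warehouse:

-- models/marts/fct_orders.sql (illustrative, BigQuery-specific options)
{{
    config(
        materialized='table',
        partition_by={'field': 'order_day', 'data_type': 'date'},
        cluster_by=['customer_id']
    )
}}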
10) Real-World Use Cases
Fintech — Transaction Integrity & Ledger Analytics
Stage raw transactions, enforce not_null/unique on transaction IDs, and use relationships to confirm all transactions map to valid accounts.
Build fact tables for balances and fees, and snapshot customer KYC status changes for audit trails.
E-commerce — Customer 360 & Revenue Modeling
Clean orders and line items in staging; join to customers and products in intermediate models; expose a fct_orders mart with daily revenue and AOV. Use accepted_values tests on order statuses, and snapshots to track price changes over time.
SaaS — Product Analytics & Churn
Transform event logs into standardized session and feature usage tables. Build retention and activation cohorts. Tests ensure no future timestamps and valid user IDs. Snapshots capture subscription tier shifts to analyze upgrade/downgrade paths.
11) Putting It All Together
A healthy dbt project feels like a well-run software repo: clean layers, modular SQL, automated tests, generated docs, and predictable builds. You’ll move faster because refactors are safer, stakeholders trust the numbers, and on-call anxiety drops dramatically.
Quick Reference — Common Commands
# Build + test everything end-to-end
dbt build
# Only build models
dbt run
# Only run tests
dbt test
# Generate and serve documentation locally
dbt docs generate
dbt docs serve
# Run a subset (e.g., all staging models)
dbt run --select tag:staging
# Rerun failed tests from the previous run (result: selectors require --state pointing at that run's artifacts)
dbt test --select result:fail --state path/to/prev/artifacts
Pro tip: Use selectors (by tag, directory, resource type) to run targeted subsets for faster feedback loops during development.
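A few selector patterns cover most day-to-day needs (the model names are the ones used in this post):

# A model plus everything it depends on
dbt build --select +fct_orders
# A model plus everything downstream of it
dbt build --select stg_orders+
# All models under a directory
dbt run --select models/marts
# Preview what a selector matches without running anything
dbt ls --select +fct_orders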
Have questions about adapting this structure to your team’s warehouse or CI? Drop them in the comments and I’ll help tailor a setup.