dbt Sources, Models, Tests, Snapshots & Documentation — A Practical Deep Dive
dbt (Data Build Tool) has become the de-facto standard for SQL-first data transformation in modern warehouses like Snowflake, BigQuery, Redshift, and Databricks. What makes dbt powerful isn’t just that it runs SQL in your warehouse—it’s that it brings software engineering discipline to analytics: modular code, version control, automated tests, documentation, and reproducibility. In this deep dive, we’ll explore dbt’s core building blocks—sources, models, tests, snapshots, and documentation—then tie them together with best practices, workflows, and CI/CD so you can ship reliable data products.
Why These Components Matter Together
Think of dbt as a pipeline where sources define raw inputs, models transform them into analytics-ready datasets, tests continuously validate quality, snapshots preserve history, and documentation explains what everything means. The result: trusted data, faster iteration, and fewer late-night fire drills.
1) Sources — Declare Your Raw Data Once
A source is a reference to a raw table in your warehouse. Instead of hardcoding database/schema/table strings throughout your SQL, you declare them once in YAML and reference them with source(). Benefits: portability, central governance, freshness checks, and clear lineage.
version: 2

sources:
  - name: raw
    schema: raw_data
    description: "Raw ingested data from apps and vendors."
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: orders
        description: "All orders from the ecommerce platform"
        loaded_at_field: ingested_at
      - name: customers
        description: "Customer master from CRM"
Use a source inside a model like this:
select * from {{ source('raw', 'orders') }}
Add freshness checks to critical sources so failed data loads are caught early, before analytics break.
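You can check declared freshness from the command line; dbt compares the latest loaded_at_field value against the thresholds above:

# Check freshness for all declared sources
dbt source freshness
# Or just one table
dbt source freshness --select source:raw.orders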
2) Models — Staging → Intermediate → Marts
A model is a single SQL SELECT statement saved under models/. dbt compiles the SQL and materializes it as a view or table.
The most productive teams organize models into layers:
- Staging (stg_*): one-table-in, one-table-out; rename columns, cast types, apply light cleaning.
- Intermediate (int_*): business logic, joins, deduplication, window functions.
- Marts (dim_*, fct_*): analytics-ready dimensions and facts for BI and downstream consumers.
-- models/staging/stg_orders.sql
select
id as order_id,
customer_id,
cast(order_timestamp as timestamp) as order_ts,
status,
total_amount::numeric(18,2) as total_amount
from {{ source('raw', 'orders') }}
-- models/marts/fct_orders.sql
with base as (
select
order_id,
customer_id,
date_trunc('day', order_ts) as order_day,
total_amount
from {{ ref('stg_orders') }}
)
select
order_day,
customer_id,
count(*) as orders_cnt,
sum(total_amount) as revenue
from base
group by 1, 2
Materializations & Project Defaults
You can control whether models build as tables or views globally or per folder:
# dbt_project.yml
models:
  my_project:
    staging:
      +materialized: view
    marts:
      +materialized: table
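If a single model needs to deviate from its folder default, a config() block at the top of that model file overrides it; a minimal sketch:

-- at the top of a model file (illustrative override)
{{ config(materialized='table') }}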
3) Tests — Data Quality as Code
dbt tests are SQL assertions that fail when they return any rows. There are two flavors: generic (parameterized in YAML) and singular (custom SQL files in tests/). Running dbt test turns data quality into a first-class CI step.
version: 2

models:
  - name: stg_orders
    description: "Cleaned orders"
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'shipped', 'cancelled']

sources:
  - name: raw
    tables:
      - name: orders
        columns:
          - name: id
            tests: [not_null]
Example of a singular (custom) test:
-- tests/no_future_orders.sql
select *
from {{ ref('stg_orders') }}
where order_ts > current_timestamp
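During development you can scope test runs to a single model instead of the whole project:

# Run only the tests defined on stg_orders
dbt test --select stg_orders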
4) Snapshots — Track How Records Change Over Time
Snapshots implement slowly changing dimensions (SCD Type 2) without specialized ETL tools. dbt writes a history table using a unique_key and either a timestamp strategy or a check (column-comparison) strategy. This is essential for auditing, compliance, and time-travel analytics.
{% snapshot customers_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='id',
      strategy='timestamp',
      updated_at='updated_at'
    )
}}

select * from {{ source('raw', 'customers') }}

{% endsnapshot %}
Snapshot tables include metadata columns such as dbt_valid_from, dbt_valid_to, and dbt_scd_id.
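Build snapshots with dbt snapshot (they are also included in dbt build). Because dbt_valid_to is null for the current version of each record, querying current state is straightforward; a minimal sketch against the snapshot above, assuming the snapshots target schema from its config:

-- Current version of every customer
select *
from snapshots.customers_snapshot
where dbt_valid_to is null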
5) Documentation — Lineage & Meaning, Auto-Generated
dbt lets you describe projects directly in YAML. Running dbt docs generate produces a static documentation site with a searchable catalog and a lineage graph. This turns tribal knowledge into shared, living documentation.
version: 2

models:
  - name: dim_customers
    description: "Customer dimension used by BI dashboards and churn models."
    columns:
      - name: customer_id
        description: "Primary key."
      - name: country
        description: "Country at signup. ISO-2 where available."
Serve docs locally while developing:
dbt docs generate
dbt docs serve
6) Macros & Packages — Reuse and Standardize
Use macros (Jinja + SQL) to eliminate repetition, and install packages like dbt_utils or dbt-expectations for common patterns and richer tests.
{% macro safe_div(n, d) -%}
    case when {{ d }} = 0 then null else {{ n }} / {{ d }} end
{%- endmacro %}
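Calling the macro from a model keeps the null-safe division logic in one place; an illustrative average-order-value calculation using the staging model from earlier:

select
    customer_id,
    {{ safe_div('sum(total_amount)', 'count(*)') }} as avg_order_value
from {{ ref('stg_orders') }}
group by 1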
# packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
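Run dbt deps to install declared packages. Package tests are then used like built-in ones; for example, a sketch asserting the grain of fct_orders with dbt_utils.unique_combination_of_columns (the YAML below is illustrative):

# Install packages declared in packages.yml
dbt deps

# schema.yml (illustrative)
models:
  - name: fct_orders
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - order_day
            - customer_id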
7) dbt run vs dbt test vs dbt build
| Command | What it does | When to use |
|---|---|---|
| dbt run | Builds models (materializes views/tables) in dependency order. | During development to create/update models quickly. |
| dbt test | Executes data quality tests (generic + singular) against existing relations. | Validate data after models exist; fast feedback on integrity. |
| dbt build | Runs seeds, models, snapshots, and tests together in DAG order. | CI/CD and production jobs to guarantee fresh builds and tests together. |
8) A Practical Workflow (Local → PR → CI → Prod)
- Develop locally: Create/modify models, add tests and docs. Run dbt run and dbt test.
- Open a PR: Colleagues review SQL, naming, tests, and performance implications.
- CI pipeline: On PR, run dbt build (optionally against a scratch schema/environment). Fail the build if any tests fail. See the CI sketch after this list.
- Merge & deploy: A scheduled or on-merge job runs dbt build in production.
- Observe: Monitor job run times, test failure rate, and model freshness. Iterate.
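A minimal CI sketch, assuming GitHub Actions, dbt Core with the Snowflake adapter, and credentials wired through environment variables; the workflow name, target, and secret are placeholders to adapt to your setup:

# .github/workflows/dbt-ci.yml (illustrative)
name: dbt CI
on: pull_request
jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install dbt-snowflake   # swap for your warehouse's adapter
      - run: dbt deps
      - run: dbt build --target ci       # fails the pipeline if any model or test fails
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}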
9) Best Practices That Prevent Pain Later
- Adopt clear layers and naming: stg_* for staging; int_* for intermediate logic; dim_* and fct_* for marts.
- Keep models small and focused: Prefer several simple models over one mega-query. Compose with ref().
- Test primary keys and relationships: unique + not_null on IDs; relationships for foreign keys.
- Document as you go: Add description fields to models/columns and regenerate docs in PRs.
- Parameterize repetitive logic with macros: DRY code is easier to maintain and optimize.
- Use dbt build in CI: Ensure seeds, snapshots, models, and tests run together.
- Watch warehouse costs: Materialize heavy models as tables, partition/cluster when possible, and prune unused models. See the config sketch after this list.
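To illustrate the cost point, a per-model config can force a table materialization and, on BigQuery, add partitioning and clustering; the keys below reuse columns from the fct_orders example and are assumptions to adapt to your warehouse:

-- models/marts/fct_orders.sql (illustrative, BigQuery-specific options)
{{
    config(
        materialized='table',
        partition_by={'field': 'order_day', 'data_type': 'date'},
        cluster_by=['customer_id']
    )
}}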
10) Real-World Use Cases
Fintech — Transaction Integrity & Ledger Analytics
Stage raw transactions, enforce not_null/unique on transaction IDs, and use relationships to confirm all transactions map to valid accounts.
Build fact tables for balances and fees, and snapshot customer KYC status changes for audit trails.
E-commerce — Customer 360 & Revenue Modeling
Clean orders and line items in staging; join to customers and products in intermediate models; expose a fct_orders mart with daily revenue and AOV. Use accepted_values tests on order statuses, and snapshots to track price changes over time.
SaaS — Product Analytics & Churn
Transform event logs into standardized session and feature usage tables. Build retention and activation cohorts. Tests ensure no future timestamps and valid user IDs. Snapshots capture subscription tier shifts to analyze upgrade/downgrade paths.
11) Putting It All Together
A healthy dbt project feels like a well-run software repo: clean layers, modular SQL, automated tests, generated docs, and predictable builds. You’ll move faster because refactors are safer, stakeholders trust the numbers, and on-call anxiety drops dramatically.
Quick Reference — Common Commands
# Build + test everything end-to-end
dbt build
# Only build models
dbt run
# Only run tests
dbt test
# Generate and serve documentation locally
dbt docs generate
dbt docs serve
# Run a subset (e.g., all staging models)
dbt run --select tag:staging
# Rerun failed tests from the previous run (result: selectors require --state pointing at that run's artifacts)
dbt test --select result:fail --state path/to/prev/artifacts
Pro tip: Use selectors (by tag, directory, resource type) to run targeted subsets for faster feedback loops during development.
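A few selector patterns cover most day-to-day needs (the model names are the ones used in this post):

# A model plus everything it depends on
dbt build --select +fct_orders
# A model plus everything downstream of it
dbt build --select stg_orders+
# All models under a directory
dbt run --select models/marts
# Preview what a selector matches without running anything
dbt ls --select +fct_orders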
Have questions about adapting this structure to your team’s warehouse or CI? Drop them in the comments and I’ll help tailor a setup.