Slowly Changing Dimensions: Types, Patterns, and AI Considerations

Slowly changing dimensions (SCDs) are techniques for tracking historical changes in dimensional data. Learn the SCD types, implementation patterns, and how to model them for AI-ready analytics.

7 min read·

Slowly changing dimensions (SCDs) are dimensional modeling techniques for handling data that changes over time in dimension tables. Unlike fact data that captures events at specific moments, dimensional data - like customer attributes, product details, or organizational hierarchies - changes periodically and those changes often need to be tracked historically.

The core challenge is balancing simplicity with historical accuracy. Do you need to know what a customer's segment was when they placed an order six months ago? Or is the current segment sufficient? The answer determines which SCD approach fits your needs and has significant implications for AI analytics accuracy.

Why Slowly Changing Dimensions Matter

Consider a simple scenario: a customer was classified as "SMB" when they made a purchase, but later grew to "Enterprise" status. When analyzing that historical purchase:

  • Current state analysis might attribute the purchase to an Enterprise customer
  • Historical state analysis correctly attributes it to an SMB customer

Both perspectives have valid use cases, but conflating them produces incorrect insights. AI systems querying dimensional data without proper SCD handling will produce results that mix temporal contexts - a subtle but significant form of hallucination.

SCD Types Explained

Type 0: Fixed Dimension

Dimensional attributes never change after initial load. This works for truly immutable data - like date dimensions or original customer acquisition source.

Implementation: Simply don't update the row.

Use cases: Birth date, original signup channel, historical snapshot records.

Pros: Simplest approach, no additional complexity.

Cons: Doesn't reflect reality when attributes do change.

Type 1: Overwrite

When a dimension attribute changes, overwrite the old value with the new one. No history is maintained.

Implementation:

UPDATE customers
SET segment = 'Enterprise'
WHERE customer_id = 12345;

Use cases: Correcting data errors, attributes where history is irrelevant.

Pros: Simple, dimension table stays small.

Cons: History is lost permanently. Historical facts analyzed with current dimensions may be misleading.

Type 2: Add New Row

When a dimension attribute changes, keep the old row and add a new row with the new values. Both rows are marked with effective date ranges.

Implementation:

-- Mark old row as expired
UPDATE customers
SET effective_end_date = '2024-02-09',
    is_current = false
WHERE customer_id = 12345 AND is_current = true;

-- Insert new row
INSERT INTO customers (customer_id, name, segment, effective_start_date, effective_end_date, is_current)
VALUES (12345, 'Acme Corp', 'Enterprise', '2024-02-10', '9999-12-31', true);

Key columns:

  • effective_start_date: When this row version became active
  • effective_end_date: When this row version expired (often '9999-12-31' for current)
  • is_current: Flag for quick current-record lookup
  • surrogate_key: Unique identifier for each row version

Use cases: Any dimension where historical accuracy matters - customer segmentation, product pricing tiers, organizational structure.

Pros: Full historical accuracy, enables point-in-time analysis.

Cons: Table grows with changes, joins require date range logic.

Type 3: Add New Column

Keep both old and new values in separate columns on the same row.

Implementation:

ALTER TABLE customers ADD COLUMN previous_segment VARCHAR(50);

UPDATE customers
SET previous_segment = segment,
    segment = 'Enterprise'
WHERE customer_id = 12345;

Use cases: When only one level of history matters, or for specific attributes with limited change patterns.

Pros: Simple queries for both current and previous values.

Cons: Only tracks one historical value, schema changes needed per tracked attribute.

Type 4: History Table

Maintain a separate history table for slowly changing attributes while keeping the main dimension table current-only.

Implementation:

  • Main table: customers - always current values only
  • History table: customers_history - all historical versions with effective dates

Use cases: When most queries need current values but some require history. Balances query simplicity with historical completeness.

Pros: Main dimension stays simple and small, full history available when needed.

Cons: Two tables to maintain, history queries require join.

Type 6: Hybrid Approach

Combines Types 1, 2, and 3 - maintains full history rows (Type 2) while also storing current values (Type 1) and previous values (Type 3) for frequently compared attributes.

Use cases: Complex requirements where you need easy access to current, previous, and full historical values.

Pros: Maximum flexibility for different query patterns.

Cons: Most complex to implement and maintain, data redundancy.

Implementation Patterns

Surrogate Keys Are Essential

For Type 2 dimensions, natural keys are no longer unique - the same customer can have multiple rows. Surrogate keys (system-generated unique identifiers) provide row-level uniqueness:

CREATE TABLE customers (
  customer_key INT PRIMARY KEY,     -- Surrogate key (unique per row)
  customer_id INT,                  -- Natural key (same across versions)
  name VARCHAR(100),
  segment VARCHAR(50),
  effective_start_date DATE,
  effective_end_date DATE,
  is_current BOOLEAN
);

Fact tables reference surrogate keys to maintain point-in-time accuracy:

-- Fact table points to specific dimension row version
INSERT INTO orders (order_id, customer_key, order_date, amount)
VALUES (9999, 42, '2024-02-15', 1500.00);
-- customer_key 42 is the dimension row that was current on order date

Effective Date Handling

Consistent date range conventions prevent gaps and overlaps:

  • Use closed-open intervals: [start_date, end_date)
  • Start date is inclusive, end date is exclusive
  • Current rows have end date of '9999-12-31' or NULL

Query for current records:

WHERE is_current = true
-- or
WHERE effective_end_date = '9999-12-31'

Query for point-in-time:

WHERE '2024-01-15' >= effective_start_date
  AND '2024-01-15' < effective_end_date

Change Detection

Identifying which dimension records have changed requires comparing source data to existing dimension state:

  • Hash-based comparison: Compare hashes of attribute values
  • Column-by-column comparison: Check each tracked attribute
  • Change data capture: Use database logs to identify changes

Only changes in tracked attributes should trigger new row versions - changes to untracked attributes use Type 1 overwrites.

AI Analytics Considerations

Slowly changing dimensions create complexity that AI systems handle poorly without proper abstraction.

The Point-in-Time Problem

An AI asked "What was revenue by customer segment last year?" must:

  1. Identify that customer segment is a slowly changing dimension
  2. Determine the correct point-in-time for each transaction
  3. Join facts to the dimension row that was current at transaction time

Without explicit guidance, AI systems often join to current dimension values - producing results that retroactively apply current segmentation to historical transactions.

Semantic Layer Solutions

A semantic layer should abstract SCD complexity:

dimension:
  name: Customer Segment
  description: Customer classification based on ARR at time of transaction
  scd_type: 2
  temporal_handling: point_in_time
  effective_date_column: effective_start_date
  expiration_date_column: effective_end_date

The semantic layer then automatically generates correct point-in-time joins, removing this complexity from AI query generation.

Documentation Requirements

For AI-ready SCD modeling, document:

  • Which dimensions are slowly changing and which type
  • Effective date column names and conventions
  • Whether analyses should use point-in-time or current values by default
  • Business meaning of each tracked attribute change

Choosing the Right Approach

Default to Type 2 when historical accuracy matters for analytics. The storage and complexity costs are manageable, and you preserve options for future analysis.

Use Type 1 only when you're certain history is irrelevant - correcting errors, updating display names, or dimensions that truly don't affect analysis.

Consider Type 4 when most queries need current values and historical queries are rare - it optimizes the common case while preserving history.

Avoid Type 3 and Type 6 unless you have specific requirements that simpler approaches can't meet. The added complexity rarely justifies the benefits.

For AI analytics, Type 2 with proper semantic layer abstraction provides the best balance of accuracy and usability. AI systems get correct results without needing to understand SCD mechanics directly.

Questions

Use SCD Type 2 when historical accuracy matters - when you need to know what something was at a point in time. Use Type 1 when only current values matter and historical changes are irrelevant. For AI analytics, Type 2 is often preferred because it enables point-in-time correct analysis.

Related