Hierarchical Data Modeling: Representing Parent-Child Relationships in Analytics

Hierarchical data - organizational structures, product categories, geographic regions - requires special modeling techniques. Learn the patterns for modeling, querying, and maintaining hierarchical data for analytics and AI.

8 min read·

Hierarchical data represents parent-child relationships where entities are organized in tree structures. Common examples include organizational charts (CEO to departments to teams to individuals), geographic regions (country to state to city to zip code), and product categories (electronics to computers to laptops to gaming laptops).

Modeling hierarchies correctly is essential for analytics that roll up or drill down through these structures. Poor hierarchy modeling leads to incorrect aggregations, broken drill-down paths, and AI systems that cannot navigate organizational structures properly.

Hierarchy Characteristics

Depth

Fixed depth: Every branch has the same number of levels

  • Geographic: Country (1) > State (2) > City (3) > Zip (4)
  • Time: Year > Quarter > Month > Day

Variable depth: Branches have different numbers of levels

  • Org chart: CEO may have 3 levels to some roles, 7 to others
  • Product categories: Some products have 2 category levels, others 6

Balance

Balanced hierarchies: All leaf nodes are at the same level

  • Calendar hierarchy: Every day is at level 4 (Year > Quarter > Month > Day)

Ragged hierarchies: Leaf nodes appear at different levels

  • Geography: Some sales go to City level, others only to Country level

Cardinality

Single-parent: Each node has exactly one parent

  • Standard org chart: Each employee has one manager

Multi-parent: Nodes can have multiple parents

  • Product attributes: A laptop might be in both "Electronics > Computers" and "Office > Equipment"

Modeling Patterns

Flattened Columns (Fixed Depth)

For fixed-depth hierarchies, denormalize into separate columns:

CREATE TABLE dim_geography (
  geo_key INT PRIMARY KEY,
  zip_code VARCHAR(10),
  city VARCHAR(100),
  state VARCHAR(50),
  country VARCHAR(50),
  region VARCHAR(50)        -- e.g., "North America", "EMEA"
);

Every level is a column. Queries are simple:

-- Aggregate to country level
SELECT country, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
GROUP BY country;

-- Drill down from country to state
SELECT state, SUM(sales)
FROM facts f
JOIN dim_geography g ON f.geo_key = g.geo_key
WHERE country = 'USA'
GROUP BY state;

Pros:

  • Simple queries - no recursive logic needed
  • Excellent query performance
  • Easy for AI systems to understand

Cons:

  • Schema changes needed for new levels
  • Doesn't handle variable-depth hierarchies
  • Redundant storage of parent values

Adjacency List (Variable Depth)

Each row stores a reference to its parent:

CREATE TABLE dim_org (
  employee_key INT PRIMARY KEY,
  employee_name VARCHAR(100),
  title VARCHAR(100),
  parent_key INT REFERENCES dim_org(employee_key),
  hierarchy_level INT       -- Distance from root
);

Data example:

employee_keyemployee_nameparent_keyhierarchy_level
1Sarah (CEO)NULL0
2Mike (VP Sales)11
3Lisa (Sales Manager)22
4Tom (Sales Rep)33

Querying requires recursion:

-- All employees under Mike (VP Sales)
WITH RECURSIVE subordinates AS (
  SELECT employee_key, employee_name, parent_key, 1 as depth
  FROM dim_org
  WHERE employee_key = 2

  UNION ALL

  SELECT o.employee_key, o.employee_name, o.parent_key, s.depth + 1
  FROM dim_org o
  JOIN subordinates s ON o.parent_key = s.employee_key
)
SELECT * FROM subordinates;

Pros:

  • Handles any depth
  • Simple to update (change parent_key)
  • Natural representation

Cons:

  • Recursive queries are complex
  • Performance degrades with deep hierarchies
  • Not supported in all analytics tools

Path Enumeration (Materialized Path)

Store the full path from root to each node:

CREATE TABLE dim_category (
  category_key INT PRIMARY KEY,
  category_name VARCHAR(100),
  category_path VARCHAR(500),    -- e.g., '/Electronics/Computers/Laptops'
  path_array TEXT[],             -- e.g., ['Electronics', 'Computers', 'Laptops']
  hierarchy_level INT
);

Queries use path matching:

-- All categories under Electronics
SELECT * FROM dim_category
WHERE category_path LIKE '/Electronics/%';

-- All ancestor categories of Laptops
SELECT * FROM dim_category
WHERE '/Electronics/Computers/Laptops' LIKE category_path || '%';

Pros:

  • Fast subtree queries (index on path)
  • No recursion needed
  • Works in all databases

Cons:

  • Path updates cascade to all descendants
  • Path length limits may apply
  • Separator character must not appear in names

Nested Sets

Each node stores left and right bounds that encompass all descendants:

CREATE TABLE dim_product_category (
  category_key INT PRIMARY KEY,
  category_name VARCHAR(100),
  lft INT,                      -- Left bound
  rgt INT,                      -- Right bound
  hierarchy_level INT
);

For a hierarchy:

Electronics (1, 10)
  ├── Computers (2, 7)
  │   ├── Laptops (3, 4)
  │   └── Desktops (5, 6)
  └── Audio (8, 9)

Queries are elegant:

-- All descendants of Computers (lft=2, rgt=7)
SELECT * FROM dim_product_category
WHERE lft > 2 AND rgt < 7;

-- All ancestors of Laptops (lft=3, rgt=4)
SELECT * FROM dim_product_category
WHERE lft < 3 AND rgt > 4;

-- Count of products by category including subcategories
SELECT c.category_name, COUNT(*)
FROM products p
JOIN dim_product_category pc ON p.category_key = pc.category_key
JOIN dim_product_category c ON pc.lft BETWEEN c.lft AND c.rgt
GROUP BY c.category_key, c.category_name;

Pros:

  • Very fast read queries
  • No recursion needed
  • Efficient subtree operations

Cons:

  • Updates are expensive (may require renumbering)
  • Not intuitive to understand
  • Insert/delete complexity

Bridge Tables

A separate table mapping all ancestor-descendant relationships:

CREATE TABLE org_hierarchy_bridge (
  ancestor_key INT,
  descendant_key INT,
  distance INT,               -- Levels between them
  PRIMARY KEY (ancestor_key, descendant_key)
);

For employee 4 (Tom) under manager 3 under VP 2 under CEO 1:

ancestor_keydescendant_keydistance
110
121
132
143
220
231
242
330
341
440

Queries use simple joins:

-- All sales under VP Mike (employee_key = 2)
SELECT SUM(f.sales)
FROM facts f
JOIN org_hierarchy_bridge b ON f.employee_key = b.descendant_key
WHERE b.ancestor_key = 2;

Pros:

  • Fast queries at any aggregation level
  • No recursion
  • Flexible for complex hierarchies

Cons:

  • Storage grows quadratically with hierarchy depth
  • Maintenance complexity on hierarchy changes
  • Bridge must be rebuilt or updated when hierarchy changes

Handling Ragged Hierarchies

When hierarchies have varying depths, several approaches work:

Repeat Values

Repeat the lowest known level up the missing levels:

productcategory_l3category_l2category_l1
Widget AWidgetsWidgetsHardware
Gadget BGadgetsElectronicsHardware

"Widgets" appears at L3 and L2 because there's no intermediate level.

NULL Values

Use NULL for missing levels:

productcategory_l3category_l2category_l1
Widget ANULLWidgetsHardware
Gadget BGadgetsElectronicsHardware

Queries must handle NULLs appropriately.

Skip Levels

Include a level indicator and query conditionally:

SELECT
  CASE level
    WHEN 3 THEN category_l3
    WHEN 2 THEN category_l2
    WHEN 1 THEN category_l1
  END as leaf_category
FROM products;

AI Analytics Considerations

Hierarchical data presents specific challenges for AI systems.

Level Ambiguity

When asked "What are the sales by region?" - does region mean:

  • Geographic region (North America, EMEA)?
  • Sales region (as defined by sales org)?
  • The state/province level?

The semantic layer must define hierarchy levels explicitly:

hierarchy:
  name: geography
  levels:
    - name: region
      description: Continental grouping (North America, EMEA, APAC)
    - name: country
      description: Nation-level
    - name: state
      description: State, province, or equivalent subdivision
    - name: city
      description: Metropolitan area

Aggregate Ambiguity

"Total sales for Electronics" - does this include:

  • Just products directly in Electronics?
  • Products in all subcategories (Computers, Audio)?

Document the default behavior:

dimension:
  name: product_category
  hierarchy_aggregation: inclusive  # includes all descendants

Historical Hierarchies

When org structures change, historical analysis becomes complex:

  • Q1 sales by team - which team structure?
  • If team A split into teams B and C, how to compare?

Consider maintaining hierarchy snapshots or clearly documenting that analyses use current structure.

Query Optimization

Index Strategies

  • Flattened columns: Index each level column
  • Adjacency list: Index parent_key for traversal
  • Path enumeration: Index path for LIKE queries
  • Nested sets: Index (lft, rgt) for range queries
  • Bridge tables: Index on ancestor_key and descendant_key

Materialized Aggregates

Pre-compute aggregates at each hierarchy level:

CREATE TABLE sales_by_category AS
SELECT
  b.ancestor_key as category_key,
  SUM(f.sales) as total_sales
FROM facts f
JOIN category_bridge b ON f.category_key = b.descendant_key
GROUP BY b.ancestor_key;

Refresh as data or hierarchy changes.

Flattening for Analytics

Even if you maintain normalized hierarchies for flexibility, consider flattened views for analytics:

CREATE VIEW dim_org_flat AS
WITH RECURSIVE org_tree AS (
  -- Build hierarchy recursively, then flatten
  ...
)
SELECT
  employee_key,
  employee_name,
  manager_name,
  director_name,
  vp_name,
  ceo_name
FROM org_tree;

AI and BI tools query the simple view; the underlying structure remains flexible.

Hierarchical data modeling requires matching the approach to hierarchy characteristics and query patterns. Fixed-depth hierarchies suit flattened columns. Variable-depth hierarchies need adjacency lists, path enumeration, or bridge tables. Document hierarchy semantics clearly in the semantic layer so AI systems can navigate structures correctly.

Questions

It depends on hierarchy characteristics. For fixed-depth hierarchies (Country > State > City), flattened columns work well. For variable-depth or frequently changing hierarchies (org charts, product taxonomies), adjacency lists or path enumeration are more flexible. Consider query patterns too - drill-down queries favor different structures than aggregate-up queries.

Related