Auto-Ingestion: From Database to Semantic Models¶

TL;DR: SLayer can introspect a database schema and automatically generate models with dimensions, measures, and join relationships matching the schema.

This notebook shows what happens under the hood.

We use the Jaffle Shop dataset — a synthetic e-commerce schema with 7 tables and foreign key relationships between them.

Prerequisites: pip install motley-slayer

See also: Auto-Ingestion docs | Models docs

In [1]:

Copied!

import os
import sys

import duckdb
import sqlalchemy as sa

sys.path.insert(0, os.path.join(os.getcwd(), "..", "..", ".."))
sys.path.insert(0, os.path.join(os.getcwd(), "..", "jaffle_data"))

from setup_jaffle import ensure_jaffle_shop, DB_PATH

engine, storage, models = ensure_jaffle_shop()
import os
import sys

import duckdb
import sqlalchemy as sa

sys.path.insert(0, os.path.join(os.getcwd(), "..", "..", ".."))
sys.path.insert(0, os.path.join(os.getcwd(), "..", "jaffle_data"))

from setup_jaffle import ensure_jaffle_shop, DB_PATH

engine, storage, models = ensure_jaffle_shop()

The Jaffle Shop Schema¶

The database has 7 tables connected by foreign keys:

customers  <──  orders  ──>  stores
                  │
              items  ──>  products  <──  supplies

customers  <──  tweets

Let's peek at the data:

In [2]:

Copied!





conn = duckdb.connect(DB_PATH, read_only=True)
for table in ["customers", "stores", "products", "orders", "items", "supplies", "tweets"]:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table:>15}: {count:>8,} rows")
conn.close()
conn = duckdb.connect(DB_PATH, read_only=True)
for table in ["customers", "stores", "products", "orders", "items", "supplies", "tweets"]:
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table:>15}: {count:>8,} rows")
conn.close()

      customers:    2,590 rows
         stores:        6 rows
       products:       10 rows
         orders:  661,146 rows
          items:  981,268 rows
       supplies:       65 rows
         tweets:  215,535 rows

Step 1: FK Graph Discovery¶

The first thing auto-ingestion does is introspect foreign key constraints and build a directed dependency graph. Each edge means "this table has a FK pointing to that table."

The graph must be acyclic — if cycles are found, SLayer raises a RollupGraphError.

For each table, SLayer computes the transitive closure: all tables reachable by following FK chains. This is used for column introspection (knowing which tables' columns might be queried), but joins are only created for direct FKs — multi-hop paths are resolved at query time by walking the join graph.

In [3]:

Copied!





from slayer.core.models import DatasourceConfig
from slayer.engine.ingestion import _build_fk_graph, _compute_transitive_closure

ds = DatasourceConfig(name="jaffle_shop", type="duckdb", database=DB_PATH)
sa_engine = sa.create_engine(ds.resolve_env_vars().get_connection_string())
inspector = sa.inspect(sa_engine)
table_names = inspector.get_table_names()

fk_graph = _build_fk_graph(inspector=inspector, table_names=table_names, schema=None)

print("FK Graph (table -> tables it references):")
for table in sorted(fk_graph):
    refs = sorted(fk_graph[table])
    print(f"  {table} -> {refs}")

print()
print("Tables with no FK references (leaf tables):")
for t in sorted(table_names):
    if t not in fk_graph:
        print(f"  {t}")
from slayer.core.models import DatasourceConfig
from slayer.engine.ingestion import _build_fk_graph, _compute_transitive_closure

ds = DatasourceConfig(name="jaffle_shop", type="duckdb", database=DB_PATH)
sa_engine = sa.create_engine(ds.resolve_env_vars().get_connection_string())
inspector = sa.inspect(sa_engine)
table_names = inspector.get_table_names()

fk_graph = _build_fk_graph(inspector=inspector, table_names=table_names, schema=None)

print("FK Graph (table -> tables it references):")
for table in sorted(fk_graph):
    refs = sorted(fk_graph[table])
    print(f"  {table} -> {refs}")

print()
print("Tables with no FK references (leaf tables):")
for t in sorted(table_names):
    if t not in fk_graph:
        print(f"  {t}")

FK Graph (table -> tables it references):
  items -> ['orders', 'products']
  orders -> ['customers', 'stores']
  supplies -> ['products']
  tweets -> ['customers']

Tables with no FK references (leaf tables):
  customers
  products
  stores

In [4]:

Copied!





# Transitive closure: what tables are reachable from items?
for table in ["orders", "items"]:
    reachable = _compute_transitive_closure(fk_graph, table)
    print(f"{table} can reach: {sorted(reachable)}")

sa_engine.dispose()
# Transitive closure: what tables are reachable from items?
for table in ["orders", "items"]:
    reachable = _compute_transitive_closure(fk_graph, table)
    print(f"{table} can reach: {sorted(reachable)}")

sa_engine.dispose()

orders can reach: ['customers', 'stores']
items can reach: ['customers', 'orders', 'products', 'stores']

Notice that items transitively reaches customers and stores through orders, even though it has no direct FK to them. At query time, SLayer walks the join graph through orders to reach these tables.

Step 2: Join Generation¶

Each table's own FK relationships become ModelJoin objects — one per FK. Each join specifies:

target_model: the table being joined
join_pairs: [[source_column, target_column]] — the columns to join on

Only direct FKs are stored. Tables reachable via multiple hops (e.g. items → orders → customers) are resolved at query time by walking each intermediate model's own joins.

In [5]:

Copied!





orders_model = next(m for m in models if m.name == "orders")
items_model = next(m for m in models if m.name == "items")

print("=== orders model joins (direct FKs only) ===")
for j in orders_model.joins:
    src, tgt = j.join_pairs[0]
    print(f"  -> {j.target_model}  ON {src} = {tgt}")

print()
print("=== items model joins (direct FKs only) ===")
for j in items_model.joins:
    src, tgt = j.join_pairs[0]
    print(f"  -> {j.target_model}  ON {src} = {tgt}")

print()
print("Note: items has no join to customers or stores —")
print("those are reached at query time via orders.joins.")
orders_model = next(m for m in models if m.name == "orders")
items_model = next(m for m in models if m.name == "items")

print("=== orders model joins (direct FKs only) ===")
for j in orders_model.joins:
    src, tgt = j.join_pairs[0]
    print(f"  -> {j.target_model}  ON {src} = {tgt}")

print()
print("=== items model joins (direct FKs only) ===")
for j in items_model.joins:
    src, tgt = j.join_pairs[0]
    print(f"  -> {j.target_model}  ON {src} = {tgt}")

print()
print("Note: items has no join to customers or stores —")
print("those are reached at query time via orders.joins.")

=== orders model joins (direct FKs only) ===
  -> customers  ON customer_id = id
  -> stores  ON store_id = id

=== items model joins (direct FKs only) ===
  -> orders  ON order_id = id
  -> products  ON sku = sku

Note: items has no join to customers or stores —
those are reached at query time via orders.joins.

Step 3: Dimension & Measure Generation¶

For each table, SLayer generates:

Dimensions for every column. See the joins on how to refer to joined dimensions and measures.
Measures: count always; one measure per numeric non-ID column (with sql set to the column name); non-numeric non-ID columns also get a measure

In [6]:

Copied!





# Iterate the unified `columns` list — v2 merges v1's separate dimensions and
# row-level measures into one list. Each Column carries data type, primary_key,
# allowed_aggregations (whitelist), and filter (CASE-WHEN at aggregation time).
print("=== orders model columns ===")
print(f"{'name':<25} {'sql':<25} {'type':<10}")
print("-" * 70)
for col in orders_model.columns:
    pk = " [PK]" if col.primary_key else ""
    sql_str = col.sql or col.name
    print(f"{col.name:<25} {sql_str:<25} {str(col.type):<10}{pk}")
# Iterate the unified `columns` list — v2 merges v1's separate dimensions and
# row-level measures into one list. Each Column carries data type, primary_key,
# allowed_aggregations (whitelist), and filter (CASE-WHEN at aggregation time).
print("=== orders model columns ===")
print(f"{'name':<25} {'sql':<25} {'type':<10}")
print("-" * 70)
for col in orders_model.columns:
    pk = " [PK]" if col.primary_key else ""
    sql_str = col.sql or col.name
    print(f"{col.name:<25} {sql_str:<25} {str(col.type):<10}{pk}")

=== orders model columns ===
name                      sql                       type      
----------------------------------------------------------------------
id                        id                        string     [PK]
customer_id               customer_id               string    
ordered_at                ordered_at                date      
store_id                  store_id                  string    
subtotal                  subtotal                  number    
tax_paid                  tax_paid                  number    
order_total               order_total               number

In [7]:

Copied!





# In v2 `model.measures` holds *named formulas* (ModelMeasure objects), not
# row-level measure definitions. After auto-ingest the list is empty by
# default; agents/users populate it via edit_model with reusable formulas
# like `{"name": "aov", "formula": "revenue:sum / *:count"}`.
print("=== orders model measures (named formulas) ===")
print(f"{len(orders_model.measures)} formula(s) defined.")
# In v2 `model.measures` holds *named formulas* (ModelMeasure objects), not
# row-level measure definitions. After auto-ingest the list is empty by
# default; agents/users populate it via edit_model with reusable formulas
# like `{"name": "aov", "formula": "revenue:sum / *:count"}`.
print("=== orders model measures (named formulas) ===")
print(f"{len(orders_model.measures)} formula(s) defined.")

=== orders model measures (named formulas) ===
0 formula(s) defined.

Joined dimensions aren't stored on the model — they're resolved at query time.

Use dot syntax to walk the join graph, including through multiple hops.

In [8]:

Copied!





# 1-hop: items -> products
result = engine.execute_sync(
    query={
        "source_model": "items",
        "measures": ["*:count"],
        "dimensions": ["products.name"],
        "order": [{"column": {"name": "_count"}, "direction": "desc"}],
        "limit": 5,
    }
)

print("=== 1-hop: items -> products ===")
for row in result.data:
    print(
        f"  {row['items.products.name']}: {row['items._count']} items"
    )

# Multi-hop: items -> orders -> customers
result = engine.execute_sync(
    query={
        "source_model": "items",
        "measures": ["*:count"],
        "dimensions": ["orders.customers.name"],
        "order": [{"column": {"name": "_count"}, "direction": "desc"}],
        "limit": 5,
    }
)

print("\n=== Multi-hop: items -> orders -> customers ===")
for row in result.data:
    print(
        f"  {row['items.orders.customers.name']}: {row['items._count']} items"
    )
# 1-hop: items -> products
result = engine.execute_sync(
    query={
        "source_model": "items",
        "measures": ["*:count"],
        "dimensions": ["products.name"],
        "order": [{"column": {"name": "_count"}, "direction": "desc"}],
        "limit": 5,
    }
)

print("=== 1-hop: items -> products ===")
for row in result.data:
    print(
        f"  {row['items.products.name']}: {row['items._count']} items"
    )

# Multi-hop: items -> orders -> customers
result = engine.execute_sync(
    query={
        "source_model": "items",
        "measures": ["*:count"],
        "dimensions": ["orders.customers.name"],
        "order": [{"column": {"name": "_count"}, "direction": "desc"}],
        "limit": 5,
    }
)

print("\n=== Multi-hop: items -> orders -> customers ===")
for row in result.data:
    print(
        f"  {row['items.orders.customers.name']}: {row['items._count']} items"
    )

=== 1-hop: items -> products ===
  chai and mighty: 151699 items
  adele-ade: 151175 items
  vanilla ice: 150853 items
  for richer or pourover : 150795 items
  tangaroo: 150723 items

=== Multi-hop: items -> orders -> customers ===
  Nicole Jackson: 1706 items
  Jacob Brown: 1464 items
  Michael Conner: 1440 items
  Kyle Evans: 1431 items
  Timothy Evans: 1390 items

Summary¶

Auto-ingestion discovers the database structure and generates a complete semantic layer:

FK graph identifies table relationships and validates they're acyclic
Transitive closure determines which tables are reachable (used for column introspection)
Direct joins — one ModelJoin per FK on the source table; multi-hop paths are resolved at query time by walking the graph
Dimensions and measures are generated with proper naming conventions

No join-related SQL is baked into the models — JOINs are constructed dynamically at query time from each model's direct join metadata.

Next: See the Example Queries notebook to query this data, or the Joins notebook for a deep dive into join mechanics.