Columns and Aggregations Are Separate Things

Most semantic layers force you to bake the aggregation into the column definition. You want revenue? Define revenue_sum. Want average revenue too? Define revenue_avg. Five aggregation types per numeric column, times twenty columns, and you're staring at a hundred definitions before you've written a single query.

SLayer takes a different approach: a column is just a named SQL expression — a row-level fact about your data. The aggregation — how you want to roll it up — is specified when you query, not when you define the model.

What this looks like

A model defines columns as bare expressions:

columns:
  - name: subtotal
    sql: subtotal
    type: number
  - name: tax_paid
    sql: tax_paid
    type: number
  - name: order_total
    sql: order_total
    type: number

No type: sum or type: avg. Just what the column is.

At query time, you pick the aggregation with colon syntax:

{
  "source_model": "orders",
  "measures": ["subtotal:sum", "subtotal:avg", "order_total:min", "order_total:max"],
  "dimensions": ["stores.name"]
}

subtotal:sum means "take the subtotal column and SUM it." order_total:min means "take the order_total column and find the MIN." One column definition, as many aggregations as you need.

COUNT(*) and the star measure

COUNT(*) doesn't aggregate a specific column — it counts rows. In SLayer, * is the "all rows" placeholder:

{
  "measures": ["*:count", "revenue:sum"]
}

*:count produces COUNT(*). Result column: orders._count (the underscore prefix distinguishes it from any dimension that might happen to be called count).

Note: * can only be used with count. Combinations like *:sum or *:avg are invalid — use a named measure instead.

You can also count non-null values of a specific column: email:count produces COUNT(email). And customer_id:count_distinct gives you COUNT(DISTINCT customer_id).

Built-in aggregations

These are always available — no definition needed:

Aggregation What it does
sum SUM(expr)
avg AVG(expr)
sum(window='90d') / avg(window='90d') trailing range SUM/AVG ending at each output bucket
min / max MIN/MAX(expr)
count COUNT(expr), or COUNT(*) with *
count_distinct COUNT(DISTINCT expr)
first / last Value from the earliest/latest record per group (by time)
weighted_avg SUM(expr * weight) / SUM(weight)
median PERCENTILE_CONT(0.5) — see database support below
percentile PERCENTILE_CONT(p) — specify p as an argument; see database support below
stddev_samp / stddev_pop Sample / population standard deviation
var_samp / var_pop Sample / population variance
corr price:corr(other=quantity) — Pearson correlation between two columns
covar_samp / covar_pop price:covar_samp(other=quantity) — sample / population covariance

Database support for median / percentile

Engine Supported? How
Postgres yes Native PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x).
DuckDB yes sqlglot rewrites ordered-set percentiles to DuckDB's QUANTILE_CONT(x, p ORDER BY x) syntax.
SQLite yes Python aggregate UDFs registered on every connection by SLayer.
ClickHouse yes Native median(x) and parametric quantile(p)(x).
MySQL no No native function and no Python-UDF mechanism — SLayer raises NotImplementedError. Use MariaDB or compute client-side.

sum and avg can take a trailing time window when the query has a time dimension:

{
  "measures": [
    {"formula": "revenue:sum(window='30d')", "name": "revenue_30d"},
    {"formula": "revenue:avg(window='1y2m3w5d6h7min8s')", "name": "avg_window"}
  ],
  "time_dimensions": [{"dimension": "created_at", "granularity": "month"}]
}

Duration units are y, m, w, d, h, min, and s. The window is computed over raw source rows, so it may be larger, equal to, or smaller than the query's time granularity.

Custom aggregations

The built-ins cover common cases. When they don't, define your own:

aggregations:
  - name: trimmed_mean
    formula: "AVG(CASE WHEN {value} BETWEEN {lo} AND {hi} THEN {value} END)"
    params:
      - name: lo
        sql: "0"
      - name: hi
        sql: "1000"

{value} is the measure's SQL expression. {lo} and {hi} are parameters with defaults that can be overridden at query time:

{"formula": "score:trimmed_mean(lo=10, hi=90)"}

You can also override built-in aggregation defaults. If weighted_avg should default to a specific weight column in your model:

aggregations:
  - name: weighted_avg
    params:
      - name: weight
        sql: subtotal

Now tax_rate:weighted_avg uses subtotal as the weight without you specifying it every time. But you can still override: tax_rate:weighted_avg(weight=order_total).

Controlling which aggregations apply

Not every aggregation makes sense for every column. customer_id:avg? Probably not useful. The allowed_aggregations field lets you whitelist:

columns:
  - name: customer_id
    sql: customer_id
    type: number
    allowed_aggregations: [count, count_distinct]
  - name: revenue
    sql: amount
    type: number
    allowed_aggregations: [sum, avg, min, max, weighted_avg]

SLayer validates this at query time and at model creation — if you try customer_id:sum, you get a clear error listing the valid options.

first and last

first and last return the value from the earliest or latest record in each group, ordered by a time column. They need a time dimension to know what "earliest" and "latest" mean:

{
  "measures": ["balance:last", "balance:first"],
  "time_dimensions": [{"dimension": "updated_at", "granularity": "month"}]
}

If you want to use a specific time column (overriding the query's time dimension), pass it as an argument:

{"formula": "balance:last(created_at)"}

This explicit time argument takes priority over everything — query-level time_dimensions, main_time_dimension, and the model's default_time_dimension.

Don't confuse the last aggregation (balance:last) with the last() transform (last(revenue:sum)). The aggregation picks the latest record's value within each time bucket. The transform broadcasts the latest time bucket's aggregated value to every row. Different operations, different use cases.

Percentiles

median is built in, but you might want the 95th percentile, or Q1/Q3:

{
  "measures": [
    "latency:median",
    "latency:percentile(p=0.95)",
    "latency:percentile(p=0.25)"
  ]
}

Composing with transforms and arithmetic

Arithmetic:

{"formula": "revenue:sum / *:count", "name": "aov"}

Transforms:

{"formula": "cumsum(revenue:sum)"}
{"formula": "change(revenue:sum)"}
{"formula": "time_shift(revenue:sum, -1, 'year')"}

Cross-model:

{"formula": "customers.*:count"}
{"formula": "cumsum(customers.*:count)"}

Result column naming

The colon becomes an underscore in result keys:

Formula Result key
revenue:sum orders.revenue_sum
*:count orders._count
revenue:avg orders.revenue_avg
customers.*:count orders.customers._count

When a query is saved as a model (create_model with a query parameter), these canonical names become the new model's column names.


See the companion notebook for runnable code demonstrating all of the above.