Continuous Distributions

ProvSQL extends the probabilistic-database setting from discrete Bernoulli inputs (see Probabilities) to first-class continuous random variables. Columns can carry distributions such as Normal(μ, σ), Uniform(a, b), or Exponential(λ); arithmetic and comparison work natively; the planner rewrites WHERE, JOIN and UNION on random-variable columns transparently; and expected, variance, moment, support, rv_sample and rv_histogram query the resulting distributions, conditioning on filter predicates when asked.

Note

Continuous-distribution support requires shared_preload_libraries = 'provsql' in postgresql.conf; the planner hook performs all of the rewrites described below transparently. Discrete probabilities (Probabilities) and continuous random variables coexist in the same circuit and the same query.

Introduction

A random-variable column stores, in each row, a token referring to a probability distribution rather than a single value. The token is a random_variable, a thin wrapper around the UUID of a provenance gate, that fits in any CREATE TABLE:

CREATE TABLE sensor_readings(
  id      int PRIMARY KEY,
  reading random_variable);

INSERT INTO sensor_readings VALUES
  (1, normal(2.5, 0.5)),
  (2, uniform(1, 3)),
  (3, exponential(0.4));

SELECT add_provenance('sensor_readings');

The remainder of this chapter uses this sensors example as a running motivator. Each row carries a different kind of noise:

  • sensor 1 is a calibrated unit with Gaussian measurement error centred at 2.5;

  • sensor 2 is a cheap unit whose reading is uniformly distributed between 1 and 3;

  • sensor 3 is a drift-prone unit whose reading is exponentially distributed with rate 0.4.

Filtering against a numeric threshold, the planner rewrites the WHERE clause into a conditioning event on each row’s reading:

SELECT id FROM sensor_readings WHERE reading > 2;
-- id 1, 2, 3 selected with respective probabilities
--   1 - Φ((2 - 2.5) / 0.5)  ≈ 0.84   (Normal CDF)
--   (3 - 2) / (3 - 1)       =  0.50   (Uniform CDF)
--   exp(-0.4 · 2)           ≈ 0.45   (Exponential survival)

The numeric value of these probabilities is recovered through the provenance circuit (see probability_evaluate and provenance); the query itself is written and read as ordinary SQL.

Distribution Constructors

The constructors below all return a random_variable. Each call mints a fresh, independent random variable: two calls to normal(0, 1) produce two uncorrelated draws, not the same draw. Use mixture (below) when you need shared underlying randomness.

normal (mu, sigma)

Normal(μ, σ) with mean μ and standard deviation σ. Both arguments must be finite; σ must be non-negative. The degenerate σ = 0 case is silently routed through as_random to share the constant-token gate. See Wikipedia.

uniform (a, b)

Uniform[a, b] with bounds a b. Both bounds must be finite. The degenerate a = b case is routed through as_random. See Wikipedia.

exponential (lambda)

Exponential(λ) with rate λ > 0 and mean 1/λ. There is no degenerate form: λ = 0 raises. See Wikipedia.

erlang (k, lambda)

Erlang(k, λ) is the sum of k independent Exponential(λ) draws (equivalently, the Gamma with integer shape). It is materialised as a single gate_rv so the closed-form CDF and moments fire directly, rather than the sampler having to draw k exponentials per iteration. The degenerate k = 1 case is routed through exponential to share its gate. See Wikipedia.

categorical (probs, outcomes)

Discrete distribution over the values in outcomes with the corresponding probabilities in probs. Both arrays must have the same length, every probability must be in [0, 1], and the probabilities must sum to 1 within 1e-9. A single-outcome categorical reduces to as_random at construction. See Wikipedia.

mixture (p, x, y) (two overloads)

Bernoulli-weighted choice between two random variables: with probability p the result samples x, otherwise y.

  • The first overload takes p as the UUID of an existing Boolean gate (input / mulinput / plus / times / cmp / ). Two mixtures that share the same p token are coupled: they sample the same underlying coin per Monte-Carlo iteration. Use this overload when you want a family of mixtures that all switch together.

  • The second overload takes p as a plain double precision probability in [0, 1] and mints a fresh anonymous gate_input with that probability under the hood. Each call mints a new coin, so two mixture(0.5, X, Y) calls draw independently.

See Wikipedia.

as_random (c)

Lift a numeric constant into a deterministic random_variable (a Dirac point mass at c). Three overloads exist (double precision, integer, numeric) and the same constants share a UUID. c = -0.0 is canonicalised to +0.0 so the two zeros refer to the same gate. See Wikipedia.

Implicit casts integer random_variable, numeric random_variable and double precision random_variable are installed. Writing WHERE reading > 2 works without an explicit as_random(2) wrapper.

Arithmetic on Random Variables

The four arithmetic operators +, -, *, / and unary - are declared on (random_variable, random_variable) and return a fresh random_variable whose underlying gate is a gate_arith over the operand UUIDs. Mixing scalars and random variables resolves through the implicit casts above:

-- All of these are well-typed random_variable expressions.
SELECT reading + 1            FROM sensor_readings;
SELECT 2 * reading - 0.5      FROM sensor_readings;
SELECT -reading               FROM sensor_readings;
SELECT r1.reading / r2.reading
  FROM sensor_readings r1, sensor_readings r2
 WHERE r1.id < r2.id;

The arithmetic operators are structural: they record the operation in the circuit without evaluating it. Evaluation happens later, when the value is queried via expected, variance, moment, probability_evaluate, rv_sample, or rv_histogram. Two paths exist:

  • Closed-form, when ProvSQL’s hybrid evaluator recognises a family-preserving combination: a sum of independent normals is another normal; a scalar shift, scale, or negation of a normal preserves the family; the sum of k i.i.d. exponentials with the same rate is an Erlang; a linear combination of disjoint random variables has closed-form mean and variance. The result is exact.

  • Monte Carlo fallback, when no closed form applies, e.g. a product of two non-trivial random variables. The sampler draws independent values from each leaf, evaluates the arithmetic expression per iteration, and aggregates the results. See Configuration of the Monte Carlo sampler below for the controlling GUCs.

Comparison operators <, <=, =, <>, >=, > on (random_variable, random_variable) return boolean syntactically, but the planner hook intercepts every such operator at planning time and rewrites it into a conditioning event on the row’s provenance. End users never invoke the comparison procedures directly; the rewriter routes them through gate_cmp.

Probabilistic Queries

Filter predicates, joins, and unions on random_variable columns are rewritten transparently into operations on the provenance circuit. The user writes ordinary SQL:

SELECT id, provenance() AS prov
FROM sensor_readings
WHERE reading > 2;

The rewriter recognises reading > 2 as a comparison on an RV column, mints a gate_cmp for the comparison, and conjoins its UUID into the row’s provsql column. Querying the result returns one row per source row whose underlying random-variable event is satisfiable; the corresponding probability is recovered through:

SELECT id, probability_evaluate(provenance()) AS p
FROM sensor_readings
WHERE reading > 2;
--  id |    p
-- ----+--------
--   1 | 0.8413
--   2 | 0.5000
--   3 | 0.4493

Comparisons between two random-variable columns work the same way, with the rewriter conjoining a gate_cmp whose two children are the two operand gates. JOIN predicates on RV columns follow the standard ProvSQL rewriting, with the join condition contributing a gate_cmp to the joined row’s provenance. UNION ALL over RV-bearing relations produces the natural gate_plus over the two source rows’ provenance.

Configuration of the Monte Carlo Sampler

Two GUCs control the Monte Carlo fallback path. See Configuration Reference for the full configuration reference.

provsql.monte_carlo_seed (default: -1)

Seed for std::mt19937_64. The default -1 seeds from std::random_device for non-deterministic sampling. Any other value (including 0) is used as a literal seed and makes every Monte-Carlo result reproducible across runs and across the Bernoulli / continuous sampling paths.

provsql.rv_mc_samples (default: 10000)

Default sample count used by analytical fallbacks (expected, variance, moment, rv_histogram, rv_sample under conditioning) when they cannot decompose a sub-circuit and must fall back to Monte Carlo. Set to 0 to disable the fallback entirely: callers will raise rather than sample.

The sample count for probability_evaluate(..., 'monte-carlo', 'n') is independent and explicit in the third argument (the sample count is passed as a string, like every other probability_evaluate parameter).

Closed-Form Evaluation

Three pieces work together to keep evaluation analytical where possible. They run in this order, so each later pass benefits from the rewrites produced by the earlier ones:

  • HybridEvaluator (simplifier pass) rewrites the in-memory circuit first: linear closure on normals (a·X + b·Y + c is a single normal when X, Y are independent normals), i.i.d. exponentials sum to Erlang, affine shift on a single uniform (a·U(p, q) + c is a single uniform), closed-form negation of a bare Normal or Uniform (-N(μ, σ) = N(-μ, σ), -U(a, b) = U(-b, -a)), subtraction shapes (A B) canonicalise to addition (A + (−B)) so the same pipeline handles N c, c N, U c, c U, c·X-style shifts and scales thread through mixtures and categoricals, single-child arith roots and semiring identities collapse. Running first means the later passes see bare gate_rv leaves wherever a closed-form fold applied, instead of multi-gate gate_arith subtrees.

  • RangeCheck propagates support intervals through gate_arith and tests every gate_cmp against the propagated interval. A comparator that is decidable from the support alone collapses to a Bernoulli gate_input with probability 0 or 1; the rest of the circuit sees a plain Boolean leaf. Joint WHERE clauses are intersected per random variable: reading > 1 AND reading < 3 constrains a single normal once and runs the conjunction as one analytic CDF call. Equality and inequality on continuous distributions collapse here (X = X is identically true, X <> X identically false; X = Y is identically false whenever at least one side has a continuous distribution, even when neither is a bare leaf, including composites like Exp(λ_1) + Exp(λ_2) with distinct rates, or a Bernoulli mixture over two continuous arms). When both sides have statically-known discrete masses (categoricals, mixtures of as_random branches, …) and are independent, P(X = Y) is computed exactly by summing the per-outcome mass products; the disjoint-outcome case is the boundary where the sum is 0.

  • AnalyticEvaluator computes the exact CDF of a single distribution’s gate_cmp (e.g. Normal > 2, Uniform <= 1.5, Exponential >= λ⁻¹) via the standard CDFs of the supported families, for any gate_cmp that RangeCheck could not decide from the support alone.

A residual HybridEvaluator decomposer pass runs between RangeCheck and AnalyticEvaluator for continuous-island gate_cmp gates that no closed form has resolved; it marginalises them by Monte Carlo into Bernoulli leaves so the downstream Boolean machinery (independent / tree-decomposition / external compilers) becomes available.

provsql.simplify_on_load (default: on) folds the universal peephole pass at the moment a circuit is read into memory, so every downstream consumer (semiring evaluators, Monte Carlo, view_circuit, PROV export, ProvSQL Studio) sees the simplified form. Toggle it off only to inspect raw gate-creation structure for debugging.

Moments and Support

Five polymorphic dispatchers cover the moment surface; they accept random_variable, plain uuid, numeric, and agg_token inputs and dispatch internally.

expected (input [, prov [, method [, arguments]]])

Expectation E[input | prov]. For a random_variable, runs the Expectation semiring with structural-independence detection on gate_arith TIMES; for an agg_token, evaluates the discrete expectation over the gate’s underlying inclusion-indicator world. Defaults to the unconditional expectation when prov is omitted (the default is gate_one()).

variance (input [, prov [, method [, arguments]]])

Variance Var[input | prov]. The random_variable path computes the central moment of order two analytically when the closed form is available, falling back to Monte Carlo otherwise.

moment (input, k [, prov [, method [, arguments]]])

Raw moment E[input^k | prov]. k must be a non-negative integer. k = 0 returns 1; k = 1 is equivalent to expected.

central_moment (input, k [, prov [, method [, arguments]]])

Central moment E[(input E[input | prov])^k | prov]. k = 0 returns 1; k = 1 returns 0; k = 2 is equivalent to variance.

support (input [, prov [, method [, arguments]]])

Support interval [lo, hi]. For a random_variable, propagates each leaf’s support through gate_arith via interval arithmetic and intersects per-variable bounds from prov; for plain numeric input, returns the degenerate point [c, c]; for an agg_token, returns the closed-form support of the aggregation function.

End-to-end on the sensors fixture:

SELECT id,
       expected(reading)   AS mean,
       variance(reading)   AS var,
       support(reading)    AS supp
FROM sensor_readings;

The expectation, variance, and support of normal(2.5, 0.5) come out exactly as 2.5, 0.25, and (-Infinity, +Infinity); the uniform’s as 2, 1/3, and (1, 3); the exponential’s as 2.5, 6.25, and (0, +Infinity).

Structural-independence shortcuts. Sums of independent random variables have exact expectation and variance; products of random variables with disjoint footprints (the set of base gate_rv leaves reachable from each operand) have exact expectation (E[XY] = E[X]·E[Y]). The hybrid evaluator detects both shapes through a per-evaluation FootprintCache; otherwise it falls back to Monte Carlo with budget provsql.rv_mc_samples.

Conditional Inference

The moment dispatchers above all accept an optional prov uuid argument that conditions the moment on the provenance event prov. The natural source of prov in a tracked query is the provenance pseudo-column: every WHERE filter on a random-variable column has already been lifted into the row’s provenance, so passing provenance() conditions on the filter:

SELECT id,
       expected(reading, provenance()) AS cond_mean,
       variance(reading, provenance()) AS cond_var
FROM sensor_readings
WHERE reading > 2;

For sensor 1 (Normal(2.5, 0.5) truncated to > 2), the conditional mean is the textbook Mills-ratio formula μ + σ · φ(α) / (1 Φ(α)) with α = (2 μ)/σ; for sensor 2 (Uniform[1, 3] truncated to > 2), the conditional distribution is Uniform[2, 3] with mean 2.5; for sensor 3 (Exponential(0.4) truncated to > 2), the memoryless property gives conditional mean 2 + 1/0.4 = 4.5.

Three closed-form paths are wired:

  • Normal, truncated to any one-sided or two-sided interval, via the Mills-ratio formula and integration by parts.

  • Uniform, on the intersection of the support and the conditioning interval (mean and variance trivial in closed form).

  • Exponential, by memorylessness when the conditioning event is a lower bound, and by truncation to a finite interval.

When no closed form applies, the joint circuit between input and prov is loaded with shared gate_rv leaves correctly coupled, and the conditional moment is estimated by rejection sampling. The sample count is provsql.rv_mc_samples; if fewer than n accepted samples land within the budget (because the conditioning event is rare), a NOTICE is emitted suggesting the user widen the budget. Setting provsql.rv_mc_samples = 0 turns the notice into an error.

Passing gate_one() (the default) as prov is equivalent to the unconditional moment, so an unconditional call has no extra cost.

Sampling and Histograms

Two functions expose raw and binned samples for inspection or downstream analytics.

rv_sample (token, n [, prov]) RETURNS SETOF float8

Draw n samples from the scalar sub-circuit rooted at token, conditioning on the provenance event prov (defaulting to unconditional). The function is a set-returning function. Shared gate_rv leaves between token and prov are loaded into a single joint circuit so the conditioning event’s draw and the value’s draw share their per-iteration state.

When the root is a bare gate_rv of a supported family (Uniform / Normal / Exponential) and the event reduces to an interval constraint on it, the conditional distribution is sampled directly in closed form (uniform on the truncated interval; memoryless shift for exponential one-sided tails; inverse-CDF transform for two-sided exponential and normal). 100% acceptance: exactly n samples are returned even when the event is a tight tail like X > 9.5 over U(0, 10) that would degrade the rejection budget.

Otherwise the rejection path runs: provsql.rv_mc_samples iterations attempt to satisfy the event; a NOTICE is emitted when fewer than n accept, and the SRF returns whatever samples were accepted so the caller can proceed with a smaller batch.

rv_histogram (token, bins [, prov]) RETURNS jsonb

Empirical histogram of the same scalar sub-circuit as rv_sample, returned as a JSON array of {bin_lo, bin_hi, count} objects. The number of bins is bins (default 30); the bin range covers the observed [min, max] of the draws; the sample count is provsql.rv_mc_samples. Pin provsql.monte_carlo_seed for reproducibility.

Accepted root gates are the scalar ones: gate_value (single bin), gate_rv, and gate_arith. Any other gate kind raises.

The same closed-form truncated sampler as rv_sample applies when the shape qualifies, so a tight provsql.rv_mc_samples budget no longer fails with conditional MC accepted 0 of N on conditioning events that the closed-form path can handle.

Example, drawing 200 samples from the truncated sensor-1 reading (conditioned on reading > 2.5, which the planner lifts into the row’s provenance as a gate_cmp):

SET provsql.monte_carlo_seed = 42;
SELECT s
FROM (SELECT reading, provenance() AS prov
        FROM sensor_readings
       WHERE id = 1 AND reading > 2.5) q,
     LATERAL rv_sample(q.reading, 200, q.prov) AS t(s);

Mixtures and Categorical Random Variables

Probabilistic mixtures and categorical random variables are the discrete-by-mixture side of the surface.

A Bernoulli mixture selects between two random variables based on a Boolean coin. The two overloads of mixture differ in whether the coin is shared:

-- Two mixtures coupled through a shared coin: they always
-- pick the same side per Monte-Carlo iteration.
WITH coin AS (
  SELECT create_input_gate(0.3) AS p)
SELECT
  mixture((SELECT p FROM coin),
          normal(0, 1),
          normal(10, 1))   AS shared_a,
  mixture((SELECT p FROM coin),
          uniform(-1, 1),
          uniform(9, 11))  AS shared_b;

-- Two ad-hoc mixtures: each mints its own fresh coin.
SELECT
  mixture(0.3, normal(0, 1),
               normal(10, 1)) AS independent_a,
  mixture(0.3, uniform(-1, 1),
               uniform(9, 11)) AS independent_b;

A categorical random variable assigns explicit probabilities to a list of outcomes:

-- 0 with probability 0.2, 1 with probability 0.5, 2 with 0.3
SELECT categorical(
         ARRAY[0.2, 0.5, 0.3]::double precision[],
         ARRAY[0, 1, 2]::double precision[]);

Each categorical(probs, outcomes) call mints a fresh block anchor, so two calls with the same arrays produce two independent categorical draws. (Exception: a single-outcome categorical, where exactly one entry of probs is positive, collapses to as_random of the corresponding outcome at construction time; two such calls with the same outcome value then share the v5-keyed as_random gate.)

Note

The simplifier does not auto-collapse a cascade of Dirac mixtures into a single categorical: that conversion is reserved for explicit user calls to categorical. If you want a categorical, ask for one; if you build a tower of mixtures, the circuit keeps the tower shape so its structural sharing remains intact.

Aggregation Over Random Variables

Three aggregates lift the standard arithmetic aggregates from deterministic scalars to random_variable columns:

sum (random_variable) RETURNS random_variable

Provenance-weighted sum \sum_i \mathbf{1}\{\varphi_i\} \cdot X_i, materialised as a single gate_arith PLUS over the per-row mixture gates. The empty-group identity is as_random (0) (the additive identity).

avg (random_variable) RETURNS random_variable

Provenance-weighted average (\sum_i \mathbf{1}\{\varphi_i\} \cdot X_i) /
(\sum_i \mathbf{1}\{\varphi_i\}), materialised as a single gate_arith DIV over two gate_arith PLUS subtrees. The empty-group identity is SQL NULL (matching standard SQL AVG).

product (random_variable) RETURNS random_variable

Provenance-weighted product \prod_{i : \varphi_i} X_i, materialised as a gate_arith TIMES over per-row mixtures whose else-branch is as_random (1) (the multiplicative identity, so rows with false provenance contribute 1). The empty-group identity is as_random (1).

Note

AVG returns NaN when every row’s provenance is false (zero divided by zero). The numerator and denominator are structurally correct; the result is the natural floating-point 0/0 rather than an error. If you need NULL on empty effective groups, filter by probability_evaluate(provenance()) > 0 before averaging.

COUNT over a tracked random_variable column goes through the standard COUNT path on the provsql UUID column.

Studio Integration

ProvSQL Studio (ProvSQL Studio) surfaces three Circuit-mode features specifically for continuous distributions:

  • Distribution profile: μ and σ² headline stats with an inline-SVG histogram, a PDF/CDF toggle, per-bar tooltip, and wheel zoom. Backed server-side by rv_histogram.

  • Conditioning input with row-provenance auto-preset: clicking a result cell stamps the row’s provenance into the Condition on input so every subsequent moment, sample or histogram evaluates the conditional shape automatically. Toggle the Conditioned by badge off to fall back to the unconditional answer.

  • Simplified-circuit rendering driven by provsql.simplify_on_load so the in-memory peephole-folded graph is what you see.

See ProvSQL Studio for the full feature surface.

Out of Scope / Open Follow-ups

The following are deliberately out of scope at the time of writing and tracked as separate follow-ups:

  • EXCEPT and SELECT DISTINCT on relations that carry random_variable columns.

  • MIN, MAX, percentile aggregates over random_variable, and the broader covariance family (covar_pop, stddev …).

  • Where-provenance crossed with random variables (the column-level tracking layered on top of an RV-bearing query is not yet defined).

  • An in-Studio distribution editor.