Continuous Distributions
ProvSQL extends the probabilistic-database setting from discrete
Bernoulli inputs (see Probabilities) to first-class
continuous random variables. Columns can carry distributions such
as Normal(μ, σ), Uniform(a, b), or Exponential(λ);
arithmetic and comparison work natively; the planner rewrites
WHERE, JOIN and UNION on random-variable columns
transparently; and expected, variance, moment,
support, rv_sample and rv_histogram query the resulting
distributions, conditioning on filter predicates when asked.
Note
Continuous-distribution support requires
shared_preload_libraries = 'provsql' in postgresql.conf;
the planner hook performs all of the rewrites described below
transparently. Discrete probabilities (Probabilities) and
continuous random variables coexist in the same circuit and the
same query.
Introduction
A random-variable column stores, in each row, a token referring to a
probability distribution rather than a single value. The token is a
random_variable, a thin wrapper around the UUID of a provenance
gate, that fits in any CREATE TABLE:
CREATE TABLE sensor_readings(
id int PRIMARY KEY,
reading random_variable);
INSERT INTO sensor_readings VALUES
(1, normal(2.5, 0.5)),
(2, uniform(1, 3)),
(3, exponential(0.4));
SELECT add_provenance('sensor_readings');
The remainder of this chapter uses this sensors example as a running motivator. Each row carries a different kind of noise:
sensor
1is a calibrated unit with Gaussian measurement error centred at2.5;sensor
2is a cheap unit whose reading is uniformly distributed between1and3;sensor
3is a drift-prone unit whose reading is exponentially distributed with rate0.4.
Filtering against a numeric threshold, the planner rewrites the
WHERE clause into a conditioning event on each row’s reading:
SELECT id FROM sensor_readings WHERE reading > 2;
-- id 1, 2, 3 selected with respective probabilities
-- 1 - Φ((2 - 2.5) / 0.5) ≈ 0.84 (Normal CDF)
-- (3 - 2) / (3 - 1) = 0.50 (Uniform CDF)
-- exp(-0.4 · 2) ≈ 0.45 (Exponential survival)
The numeric value of these probabilities is recovered through the
provenance circuit (see probability_evaluate and
provenance); the query itself is written and read as
ordinary SQL.
Distribution Constructors
The constructors below all return a random_variable. Each call
mints a fresh, independent random variable: two calls to
normal(0, 1) produce two uncorrelated draws, not the
same draw. Use mixture (below) when you need
shared underlying randomness.
normal(mu, sigma)Normal(μ, σ)with meanμand standard deviationσ. Both arguments must be finite;σmust be non-negative. The degenerateσ = 0case is silently routed throughas_randomto share the constant-token gate. See Wikipedia.uniform(a, b)Uniform[a, b]with boundsa ≤ b. Both bounds must be finite. The degeneratea = bcase is routed throughas_random. See Wikipedia.exponential(lambda)Exponential(λ)with rateλ > 0and mean1/λ. There is no degenerate form:λ = 0raises. See Wikipedia.erlang(k, lambda)Erlang(k, λ)is the sum ofkindependentExponential(λ)draws (equivalently, the Gamma with integer shape). It is materialised as a singlegate_rvso the closed-form CDF and moments fire directly, rather than the sampler having to drawkexponentials per iteration. The degeneratek = 1case is routed throughexponentialto share its gate. See Wikipedia.categorical(probs, outcomes)Discrete distribution over the values in
outcomeswith the corresponding probabilities inprobs. Both arrays must have the same length, every probability must be in[0, 1], and the probabilities must sum to1within1e-9. A single-outcome categorical reduces toas_randomat construction. See Wikipedia.mixture(p, x, y)(two overloads)Bernoulli-weighted choice between two random variables: with probability
pthe result samplesx, otherwisey.The first overload takes
pas the UUID of an existing Boolean gate (input/mulinput/plus/times/cmp/…). Two mixtures that share the sameptoken are coupled: they sample the same underlying coin per Monte-Carlo iteration. Use this overload when you want a family of mixtures that all switch together.The second overload takes
pas a plaindouble precisionprobability in[0, 1]and mints a fresh anonymousgate_inputwith that probability under the hood. Each call mints a new coin, so twomixture(0.5, X, Y)calls draw independently.
See Wikipedia.
as_random(c)Lift a numeric constant into a deterministic
random_variable(a Dirac point mass atc). Three overloads exist (double precision,integer,numeric) and the same constants share a UUID.c = -0.0is canonicalised to+0.0so the two zeros refer to the same gate. See Wikipedia.
Implicit casts integer → random_variable, numeric →
random_variable and double precision → random_variable
are installed. Writing WHERE reading > 2 works without an
explicit as_random(2) wrapper.
Arithmetic on Random Variables
The four arithmetic operators +, -, *, / and unary
- are declared on (random_variable, random_variable) and
return a fresh random_variable whose underlying gate is a
gate_arith over the operand UUIDs. Mixing scalars and random
variables resolves through the implicit casts above:
-- All of these are well-typed random_variable expressions.
SELECT reading + 1 FROM sensor_readings;
SELECT 2 * reading - 0.5 FROM sensor_readings;
SELECT -reading FROM sensor_readings;
SELECT r1.reading / r2.reading
FROM sensor_readings r1, sensor_readings r2
WHERE r1.id < r2.id;
The arithmetic operators are structural: they record the
operation in the circuit without evaluating it. Evaluation happens
later, when the value is queried via
expected, variance, moment,
probability_evaluate, rv_sample, or
rv_histogram. Two paths exist:
Closed-form, when ProvSQL’s hybrid evaluator recognises a family-preserving combination: a sum of independent normals is another normal; a scalar shift, scale, or negation of a normal preserves the family; the sum of
ki.i.d. exponentials with the same rate is an Erlang; a linear combination of disjoint random variables has closed-form mean and variance. The result is exact.Monte Carlo fallback, when no closed form applies, e.g. a product of two non-trivial random variables. The sampler draws independent values from each leaf, evaluates the arithmetic expression per iteration, and aggregates the results. See Configuration of the Monte Carlo sampler below for the controlling GUCs.
Comparison operators <, <=, =, <>, >=, >
on (random_variable, random_variable) return boolean
syntactically, but the planner hook intercepts every such
operator at planning time and rewrites it into a conditioning
event on the row’s provenance. End users never invoke the
comparison procedures directly; the rewriter routes them through
gate_cmp.
Probabilistic Queries
Filter predicates, joins, and unions on random_variable
columns are rewritten transparently into operations on the
provenance circuit. The user writes ordinary SQL:
SELECT id, provenance() AS prov
FROM sensor_readings
WHERE reading > 2;
The rewriter recognises reading > 2 as a comparison on an
RV column, mints a gate_cmp for the comparison, and conjoins
its UUID into the row’s provsql column. Querying the result
returns one row per source row whose underlying random-variable
event is satisfiable; the corresponding probability is recovered
through:
SELECT id, probability_evaluate(provenance()) AS p
FROM sensor_readings
WHERE reading > 2;
-- id | p
-- ----+--------
-- 1 | 0.8413
-- 2 | 0.5000
-- 3 | 0.4493
Comparisons between two random-variable columns work the same
way, with the rewriter conjoining a gate_cmp whose two children
are the two operand gates. JOIN predicates on RV columns
follow the standard ProvSQL rewriting, with the join condition
contributing a gate_cmp to the joined row’s provenance.
UNION ALL over RV-bearing relations produces the natural
gate_plus over the two source rows’ provenance.
Configuration of the Monte Carlo Sampler
Two GUCs control the Monte Carlo fallback path. See Configuration Reference for the full configuration reference.
provsql.monte_carlo_seed(default:-1)Seed for
std::mt19937_64. The default-1seeds fromstd::random_devicefor non-deterministic sampling. Any other value (including0) is used as a literal seed and makes every Monte-Carlo result reproducible across runs and across the Bernoulli / continuous sampling paths.provsql.rv_mc_samples(default:10000)Default sample count used by analytical fallbacks (
expected,variance,moment,rv_histogram,rv_sampleunder conditioning) when they cannot decompose a sub-circuit and must fall back to Monte Carlo. Set to0to disable the fallback entirely: callers will raise rather than sample.
The sample count for probability_evaluate(..., 'monte-carlo',
'n') is independent and explicit in the third argument (the
sample count is passed as a string, like every other
probability_evaluate parameter).
Closed-Form Evaluation
Three pieces work together to keep evaluation analytical where possible. They run in this order, so each later pass benefits from the rewrites produced by the earlier ones:
HybridEvaluator (simplifier pass) rewrites the in-memory circuit first: linear closure on normals (
a·X + b·Y + cis a single normal whenX,Yare independent normals), i.i.d. exponentials sum to Erlang, affine shift on a single uniform (a·U(p, q) + cis a single uniform), closed-form negation of a bare Normal or Uniform (-N(μ, σ) = N(-μ, σ),-U(a, b) = U(-b, -a)), subtraction shapes (A − B) canonicalise to addition (A + (−B)) so the same pipeline handlesN − c,c − N,U − c,c − U,c·X-style shifts and scales thread through mixtures and categoricals, single-child arith roots and semiring identities collapse. Running first means the later passes see baregate_rvleaves wherever a closed-form fold applied, instead of multi-gategate_arithsubtrees.RangeCheck propagates support intervals through
gate_arithand tests everygate_cmpagainst the propagated interval. A comparator that is decidable from the support alone collapses to a Bernoulligate_inputwith probability0or1; the rest of the circuit sees a plain Boolean leaf. JointWHEREclauses are intersected per random variable:reading > 1 AND reading < 3constrains a single normal once and runs the conjunction as one analytic CDF call. Equality and inequality on continuous distributions collapse here (X = Xis identically true,X <> Xidentically false;X = Yis identically false whenever at least one side has a continuous distribution, even when neither is a bare leaf, including composites likeExp(λ_1) + Exp(λ_2)with distinct rates, or a Bernoulli mixture over two continuous arms). When both sides have statically-known discrete masses (categoricals, mixtures ofas_randombranches, …) and are independent,P(X = Y)is computed exactly by summing the per-outcome mass products; the disjoint-outcome case is the boundary where the sum is 0.AnalyticEvaluator computes the exact CDF of a single distribution’s
gate_cmp(e.g.Normal > 2,Uniform <= 1.5,Exponential >= λ⁻¹) via the standard CDFs of the supported families, for anygate_cmpthat RangeCheck could not decide from the support alone.
A residual HybridEvaluator decomposer pass runs between
RangeCheck and AnalyticEvaluator for continuous-island gate_cmp
gates that no closed form has resolved; it marginalises them by
Monte Carlo into Bernoulli leaves so the downstream Boolean
machinery (independent / tree-decomposition / external
compilers) becomes available.
provsql.simplify_on_load (default: on) folds the universal
peephole pass at the moment a circuit is read into memory, so
every downstream consumer (semiring evaluators, Monte Carlo,
view_circuit, PROV export, ProvSQL Studio) sees the
simplified form. Toggle it off only to inspect raw gate-creation
structure for debugging.
Moments and Support
Five polymorphic dispatchers cover the moment surface; they accept
random_variable, plain uuid, numeric, and agg_token
inputs and dispatch internally.
expected(input [, prov [, method [, arguments]]])Expectation
E[input | prov]. For arandom_variable, runs theExpectationsemiring with structural-independence detection ongate_arith TIMES; for anagg_token, evaluates the discrete expectation over the gate’s underlying inclusion-indicator world. Defaults to the unconditional expectation whenprovis omitted (the default isgate_one()).variance(input [, prov [, method [, arguments]]])Variance
Var[input | prov]. Therandom_variablepath computes the central moment of order two analytically when the closed form is available, falling back to Monte Carlo otherwise.moment(input, k [, prov [, method [, arguments]]])Raw moment
E[input^k | prov].kmust be a non-negative integer.k = 0returns1;k = 1is equivalent toexpected.central_moment(input, k [, prov [, method [, arguments]]])Central moment
E[(input − E[input | prov])^k | prov].k = 0returns1;k = 1returns0;k = 2is equivalent tovariance.support(input [, prov [, method [, arguments]]])Support interval
[lo, hi]. For arandom_variable, propagates each leaf’s support throughgate_arithvia interval arithmetic and intersects per-variable bounds fromprov; for plain numeric input, returns the degenerate point[c, c]; for anagg_token, returns the closed-form support of the aggregation function.
End-to-end on the sensors fixture:
SELECT id,
expected(reading) AS mean,
variance(reading) AS var,
support(reading) AS supp
FROM sensor_readings;
The expectation, variance, and support of normal(2.5,
0.5) come out exactly as 2.5, 0.25, and
(-Infinity, +Infinity); the uniform’s as 2, 1/3, and
(1, 3); the exponential’s as 2.5, 6.25, and
(0, +Infinity).
Structural-independence shortcuts. Sums of independent random
variables have exact expectation and variance; products of
random variables with disjoint footprints (the set of base
gate_rv leaves reachable from each operand) have exact
expectation (E[XY] = E[X]·E[Y]). The hybrid evaluator detects
both shapes through a per-evaluation FootprintCache; otherwise
it falls back to Monte Carlo with budget
provsql.rv_mc_samples.
Conditional Inference
The moment dispatchers above all accept an optional prov uuid
argument that conditions the moment on the provenance event
prov. The natural source of prov in a tracked query is
the provenance pseudo-column: every WHERE filter
on a random-variable column has already been lifted into the
row’s provenance, so passing provenance() conditions on the
filter:
SELECT id,
expected(reading, provenance()) AS cond_mean,
variance(reading, provenance()) AS cond_var
FROM sensor_readings
WHERE reading > 2;
For sensor 1 (Normal(2.5, 0.5) truncated to > 2),
the conditional mean is the textbook Mills-ratio formula
μ + σ · φ(α) / (1 − Φ(α)) with α = (2 − μ)/σ; for sensor
2 (Uniform[1, 3] truncated to > 2), the conditional
distribution is Uniform[2, 3] with mean 2.5; for sensor
3 (Exponential(0.4) truncated to > 2), the
memoryless property gives conditional mean 2 + 1/0.4 = 4.5.
Three closed-form paths are wired:
Normal, truncated to any one-sided or two-sided interval, via the Mills-ratio formula and integration by parts.
Uniform, on the intersection of the support and the conditioning interval (mean and variance trivial in closed form).
Exponential, by memorylessness when the conditioning event is a lower bound, and by truncation to a finite interval.
When no closed form applies, the joint circuit between input
and prov is loaded with shared gate_rv leaves correctly
coupled, and the conditional moment is estimated by rejection
sampling. The sample count is provsql.rv_mc_samples; if
fewer than n accepted samples land within the budget
(because the conditioning event is rare), a NOTICE is
emitted suggesting the user widen the budget. Setting
provsql.rv_mc_samples = 0 turns the notice into an error.
Passing gate_one() (the default) as prov is equivalent to
the unconditional moment, so an unconditional call has no extra
cost.
Sampling and Histograms
Two functions expose raw and binned samples for inspection or downstream analytics.
rv_sample(token, n [, prov])RETURNS SETOF float8Draw
nsamples from the scalar sub-circuit rooted attoken, conditioning on the provenance eventprov(defaulting to unconditional). The function is a set-returning function. Sharedgate_rvleaves betweentokenandprovare loaded into a single joint circuit so the conditioning event’s draw and the value’s draw share their per-iteration state.When the root is a bare
gate_rvof a supported family (Uniform / Normal / Exponential) and the event reduces to an interval constraint on it, the conditional distribution is sampled directly in closed form (uniform on the truncated interval; memoryless shift for exponential one-sided tails; inverse-CDF transform for two-sided exponential and normal). 100% acceptance: exactlynsamples are returned even when the event is a tight tail likeX > 9.5overU(0, 10)that would degrade the rejection budget.Otherwise the rejection path runs:
provsql.rv_mc_samplesiterations attempt to satisfy the event; aNOTICEis emitted when fewer thannaccept, and the SRF returns whatever samples were accepted so the caller can proceed with a smaller batch.rv_histogram(token, bins [, prov])RETURNS jsonbEmpirical histogram of the same scalar sub-circuit as
rv_sample, returned as a JSON array of{bin_lo, bin_hi, count}objects. The number of bins isbins(default30); the bin range covers the observed[min, max]of the draws; the sample count isprovsql.rv_mc_samples. Pinprovsql.monte_carlo_seedfor reproducibility.Accepted root gates are the scalar ones:
gate_value(single bin),gate_rv, andgate_arith. Any other gate kind raises.The same closed-form truncated sampler as
rv_sampleapplies when the shape qualifies, so a tightprovsql.rv_mc_samplesbudget no longer fails withconditional MC accepted 0 of Non conditioning events that the closed-form path can handle.
Example, drawing 200 samples from the truncated sensor-1 reading
(conditioned on reading > 2.5, which the planner lifts into
the row’s provenance as a gate_cmp):
SET provsql.monte_carlo_seed = 42;
SELECT s
FROM (SELECT reading, provenance() AS prov
FROM sensor_readings
WHERE id = 1 AND reading > 2.5) q,
LATERAL rv_sample(q.reading, 200, q.prov) AS t(s);
Mixtures and Categorical Random Variables
Probabilistic mixtures and categorical random variables are the discrete-by-mixture side of the surface.
A Bernoulli mixture selects between two random variables based
on a Boolean coin. The two overloads of mixture
differ in whether the coin is shared:
-- Two mixtures coupled through a shared coin: they always
-- pick the same side per Monte-Carlo iteration.
WITH coin AS (
SELECT create_input_gate(0.3) AS p)
SELECT
mixture((SELECT p FROM coin),
normal(0, 1),
normal(10, 1)) AS shared_a,
mixture((SELECT p FROM coin),
uniform(-1, 1),
uniform(9, 11)) AS shared_b;
-- Two ad-hoc mixtures: each mints its own fresh coin.
SELECT
mixture(0.3, normal(0, 1),
normal(10, 1)) AS independent_a,
mixture(0.3, uniform(-1, 1),
uniform(9, 11)) AS independent_b;
A categorical random variable assigns explicit probabilities to a list of outcomes:
-- 0 with probability 0.2, 1 with probability 0.5, 2 with 0.3
SELECT categorical(
ARRAY[0.2, 0.5, 0.3]::double precision[],
ARRAY[0, 1, 2]::double precision[]);
Each categorical(probs, outcomes) call mints a fresh block
anchor, so two calls with the same arrays produce two
independent categorical draws. (Exception: a single-outcome
categorical, where exactly one entry of probs is positive,
collapses to as_random of the corresponding outcome at
construction time; two such calls with the same outcome value
then share the v5-keyed as_random gate.)
Note
The simplifier does not auto-collapse a cascade of Dirac
mixtures into a single categorical: that conversion is
reserved for explicit user calls to
categorical. If you want a categorical, ask
for one; if you build a tower of mixtures, the circuit keeps
the tower shape so its structural sharing remains intact.
Aggregation Over Random Variables
Three aggregates lift the standard arithmetic aggregates from
deterministic scalars to random_variable columns:
sum(random_variable)RETURNS random_variableProvenance-weighted sum
, materialised as a single
gate_arith PLUSover the per-row mixture gates. The empty-group identity isas_random(0)(the additive identity).avg(random_variable)RETURNS random_variableProvenance-weighted average
, materialised as a single
gate_arith DIVover twogate_arith PLUSsubtrees. The empty-group identity is SQLNULL(matching standard SQLAVG).product(random_variable)RETURNS random_variableProvenance-weighted product
, materialised as a
gate_arith TIMESover per-row mixtures whose else-branch isas_random(1)(the multiplicative identity, so rows with false provenance contribute1). The empty-group identity isas_random(1).
Note
AVG returns NaN when every row’s provenance is false
(zero divided by zero). The numerator and denominator are
structurally correct; the result is the natural floating-point
0/0 rather than an error. If you need NULL on empty
effective groups, filter by probability_evaluate(provenance())
> 0 before averaging.
COUNT over a tracked random_variable column goes through
the standard COUNT path on the provsql UUID column.
Studio Integration
ProvSQL Studio (ProvSQL Studio) surfaces three Circuit-mode features specifically for continuous distributions:
Distribution profile:
μandσ²headline stats with an inline-SVG histogram, a PDF/CDF toggle, per-bar tooltip, and wheel zoom. Backed server-side byrv_histogram.Conditioning input with row-provenance auto-preset: clicking a result cell stamps the row’s provenance into the Condition on input so every subsequent moment, sample or histogram evaluates the conditional shape automatically. Toggle the Conditioned by badge off to fall back to the unconditional answer.
Simplified-circuit rendering driven by
provsql.simplify_on_loadso the in-memory peephole-folded graph is what you see.
See ProvSQL Studio for the full feature surface.
Out of Scope / Open Follow-ups
The following are deliberately out of scope at the time of writing and tracked as separate follow-ups:
EXCEPTandSELECT DISTINCTon relations that carryrandom_variablecolumns.MIN,MAX, percentile aggregates overrandom_variable, and the broader covariance family (covar_pop,stddev…).Where-provenance crossed with random variables (the column-level tracking layered on top of an RV-bearing query is not yet defined).
An in-Studio distribution editor.