Case Study: ProvSQL as a Probability Calculator
This case study uses ProvSQL as an exact, correlation-aware probability calculator that you drive in SQL. Classic probability problems – base rates, correlated events, conditional expectation, truncated distributions, denial constraints – become ordinary queries, and the answers are computed exactly (not by sampling, unless you ask) and correlation-aware (the provenance circuit tracks shared events, so joint and conditional probabilities come out right without independence assumptions or hand-rolled inclusion–exclusion).
The thread tying the problems together is the conditioning operator |:
once a model is loaded, A | B reads as “A given B”, for discrete
events, for continuous random variables, and for probabilistic aggregates
alike.
Tip
Best run as a notebook. This case study is meant to be run interactively in ProvSQL Studio in notebook mode – try it in the ProvSQL Playground, no install, as a runnable notebook and step through the cells. See the Playground note.
The Scenario
An epidemiology desk at a public-health agency keeps a small probabilistic model of a screening programme and reaches for ProvSQL whenever a question needs probabilistic evaluation. Five such questions follow; each is a recognisable textbook problem, and we work through each one step by step, building its model and then asking the calculator.
This case study assumes a working ProvSQL installation (see Getting ProvSQL). It needs no setup script: each problem creates its own little model as part of the story. Start a fresh database and enable the extension once,
createdb cs8
psql -d cs8
CREATE EXTENSION IF NOT EXISTS provsql CASCADE;
SET search_path TO public, provsql;
then paste the blocks below as you read.
Problem 1: The Base-Rate Fallacy
Screening tests are imperfect, but a good one is quite accurate. Ours catches 90% of true cases (its sensitivity) and correctly clears 95% of healthy people (its specificity); the disease itself is rare, affecting 1% of the population.
A patient tests positive. What is the probability they actually have the
disease? The tempting answer – “around 90%, since the test is that
accurate” – is wrong, and the mistake is common enough to have a name: the
base-rate fallacy. It
forgets how rare the disease is. Among the 99% who are healthy, even a 5%
false-positive rate produces a flood of positives, and that flood swamps
the genuine cases – so most positive results are in fact false alarms. The
quantity we really want, ,
comes out far below 90%. Let us compute it exactly.
The model is the joint sample space: the four possible worlds for one screened patient – diseased or healthy, each crossed with a positive or negative test. We build it in small steps. First, keep the test’s three numbers in one place, named once and easy to change:
DROP TABLE IF EXISTS params CASCADE;
CREATE TABLE params(prevalence float, sensitivity float, specificity float);
INSERT INTO params VALUES (0.01, 0.90, 0.95);
Now spell out the four worlds. Each world’s probability is the prevalence of its health state times the test’s rate in that state – the chain rule, written once per world:
DROP TABLE IF EXISTS screening CASCADE;
CREATE TABLE screening(disease boolean, positive boolean, p float);
INSERT INTO screening
SELECT true, true, prevalence * sensitivity FROM params -- true positive
UNION ALL
SELECT true, false, prevalence * (1 - sensitivity) FROM params -- false negative
UNION ALL
SELECT false, true, (1 - prevalence) * (1 - specificity) FROM params -- false positive
UNION ALL
SELECT false, false, (1 - prevalence) * specificity FROM params; -- true negative
Have a look: each row carries the probability of its world, and the four sum to 1.
SELECT * FROM screening;
These four worlds are mutually exclusive: for any one patient exactly one
of them is the true state of affairs. repair_key is how ProvSQL
records that – it turns a set of rows into a block of mutually-exclusive
alternatives (a “repair” of a key), exactly one of which holds at a time.
Passing an empty key makes the whole table one such block, and
set_prob then gives each world the probability in its p
column:
SELECT repair_key('screening', '');
SELECT disease, positive, set_prob(provenance(), p) FROM screening;
The model is loaded. Now let us see what a probabilistic event looks like
to ProvSQL. Every row a query returns over a provenance-tracked table
carries a provenance token in a hidden provsql column: a handle to
the little circuit explaining how that row came to be. Ask for the distinct
values of positive:
SELECT DISTINCT positive FROM screening;
Two rows come back, true and false, each with a token. Click the
token on the true row. You stay in Notebook mode; Studio just adds
a cell below showing that token’s circuit – a single plus gate joining
the two mutually-exclusive worlds in which the test reads positive (the
diseased true positive and the healthy false positive). That little circuit
is the event “the test is positive”.
On that circuit cell, press Evaluate to add an evaluation cell bound to the circuit above. Leave the scheme on Probability and press Run: ProvSQL walks the circuit and returns about 0.0585 – a positive result is uncommon.
That click-path is one SQL query. probability_evaluate takes a
token and returns its probability, and provenance hands back the
token of a query’s rows – so the same number comes from one cell:
SELECT probability_evaluate(provenance())
FROM (SELECT DISTINCT positive FROM screening) AS s
WHERE positive;
The question we actually care about is conditional: given a positive
test, how likely is the disease? ProvSQL writes “A given B” as
A | B. We name both events – diseased and positive – and ask for the
two conditionals (GROUP BY () collapses each event’s worlds into a
single token, just as DISTINCT did above):
WITH e AS (
SELECT (SELECT provenance() FROM screening WHERE disease GROUP BY ()) AS d,
(SELECT provenance() FROM screening WHERE positive GROUP BY ()) AS pos)
SELECT probability_evaluate(pos | d) AS pos_given_disease,
probability_evaluate(d | pos) AS disease_given_pos
FROM e;
The first number is about 0.9: recovers the test’s sensitivity, as it must. The second is
the punchline –
is only
about 0.15, not 0.9. Most positive results really are false alarms,
exactly as the base-rate fallacy warned. ProvSQL computed
– Bayes’ rule – over the shared circuit, with no
arithmetic on your part.
Problem 2: Correlation That Matters
Risk factors rarely act alone. Obesity, for instance, raises the risk of both type 2 diabetes and hypertension. Because the two conditions share that common cause, they are correlated: a patient who has one is more likely to have the other. Suppose we want the probability that a patient develops at least one of the two. Treating them as independent overestimates it – let us see by how much.
Note
This model – the factors, the probabilities, and the “obesity plus one extra trigger” rule for each condition – is invented and deliberately oversimplified to illustrate correlation. It is not medical guidance.
Model three independent base factors: obesity (the shared cause) and one
extra trigger for each condition, each present with its own probability.
add_provenance makes the table provenance-tracked (one
independent token per row) and set_prob reads each row’s
probability from its p column:
DROP TABLE IF EXISTS risk CASCADE;
CREATE TABLE risk(factor text, p float);
INSERT INTO risk VALUES
('obesity', 0.5), -- the shared common cause
('insulin_resistance', 0.6), -- extra trigger for diabetes
('high_salt', 0.7); -- extra trigger for hypertension
SELECT add_provenance('risk');
SELECT factor, set_prob(provenance(), p) FROM risk;
A patient develops diabetes when they are both obese and
insulin-resistant. In SQL “both” is a join – ask for the two facts at once
– and ProvSQL records the result’s provenance as the product of their
tokens (a provenance_times, formed for you). Its probability:
SELECT probability_evaluate(provenance())
FROM risk o, risk x
WHERE o.factor = 'obesity' AND x.factor = 'insulin_resistance';
0.3 (that is ). Hypertension is the same query with
high_salt in place of insulin_resistance – 0.35. Now “diabetes
or hypertension”: match a patient who is obese and has either trigger,
and fold the two matches into one event with GROUP BY (). The provenance
of a set of alternatives is their sum (a provenance_plus), so
this single query is the “at least one condition” event. We do not even
need to select anything in particular – the provsql token rides along in
every result row, so a constant label is enough to read:
SELECT 'diabetes or hypertension' AS event
FROM risk o, risk x
WHERE o.factor = 'obesity'
AND x.factor IN ('insulin_resistance', 'high_salt')
GROUP BY ();
Click the token to see its circuit. obesity appears as a single
gate feeding both branches, not two separate copies: ProvSQL addresses
every gate by its content, so the shared cause is literally the same node
everywhere – and that overlap is exactly what an independence assumption
ignores. Press Evaluate and Run: about
0.44.
Compare that with the independence estimate, plain arithmetic
:
SELECT 1 - (1 - 0.3) * (1 - 0.35) AS independence_estimate;
0.545 – a 24% overstatement. ProvSQL’s 0.44 is correct because the
disjunction circuit accounted for the shared obesity on its own. You never
had to write the inclusion–exclusion correction .
The same shared gate also lets us turn the question around and condition on
it. Problem 1 conditioned one event on another with the binary A | B;
the unary | B (the prefix form of given) conditions a
whole query’s rows at once – a directive in the SELECT list that is
stripped from the output, leaving each row’s provenance conditioned on B.
Condition every factor on obesity being present:
SELECT factor,
round(probability_evaluate(provenance())::numeric, 4) AS p_given_obesity
FROM (
SELECT factor, | (SELECT provenance() FROM risk WHERE factor = 'obesity')
FROM risk
) s
ORDER BY factor;
obesity comes back at 1.0 – an event given itself is certain – while
insulin_resistance (0.6) and high_salt (0.7) are unchanged from
their priors: they share no gate with obesity, so conditioning on it tells us
nothing about them. The same content-addressed circuit that made the
disjunction correct makes the conditional correct.
See what conditioning built. Select one conditioned tuple on its own and
click its provsql token:
SELECT factor, | (SELECT provenance() FROM risk WHERE factor = 'obesity')
FROM risk WHERE factor = 'insulin_resistance';
You stay in Notebook mode; Studio adds a circuit cell below showing a ∣
(conditioned) gate with three labelled children – the target event
(insulin_resistance), the evidence (obesity), and their joint
target ⊗ evidence. Its probability is read straight off that gate as
–
Bayes’ rule over the shared circuit, the same ratio Problem 1 used. The
|
operator, in both its binary and unary forms (with the function spellings
cond and given), is documented in
the conditioning chapter.
Problem 3: The Right Method, Chosen for You
A probability can be computed from a circuit in many ways – some exact
(factoring out independent parts, inclusion–exclusion,
knowledge compilation,
tree decomposition),
some by random sampling. You rarely choose: the evaluation strip (and
probability_evaluate under it) picks the cheapest method that
fits the circuit – and “fits” matters, because not every method works on
every circuit.
Re-run the “at least one condition” query from Problem 2 and click its token to bring the circuit back up:
SELECT 'diabetes or hypertension' AS event
FROM risk o, risk x
WHERE o.factor = 'obesity'
AND x.factor IN ('insulin_resistance', 'high_salt')
GROUP BY ();
Press Evaluate, leave the method on its default, and Run: 0.44, and the strip notes via possible-worlds – the exact method the chooser judged cheapest here. Now change the method dropdown in the strip and re-run:
Monte-Carlo works on any circuit, but only approximately: it samples at random, so its estimate lands near 0.44, not exactly on it. Give it an accuracy target in the strip’s arguments box –
eps=0.1, delta=0.05– and it reports the guarantee it met: the interval the true probability lies within, at the requested confidence (95% here).independentrefuses –Not an independent circuit– because it assumes no input is shared between the parts it multiplies, and hereobesityfeeds both conditions (the shared gate in that circuit).
That refusal is the point. The chooser never offered independent for
this circuit; the bare default stays correct as your circuits gain
structure, and a method that does not fit is refused, never quietly wrong.
Problem 4: A Continuous Quantity
Every event so far was discrete – true or false. But a lab biomarker is a continuous number, and
ProvSQL represents one as a random_variable: a
value that is not a single number but a whole distribution. Conditioning on
it works exactly as | did for discrete events, and the result is again
a distribution, whose mean, variance, and range you can read off with
expected, variance, and support.
We want exact, closed-form answers, not sampled ones. In the Config
panel, set provsql.rv_mc_samples to 0: that turns off the Monte-Carlo
fallback, so ProvSQL either answers a continuous query in closed form or
tells you it cannot, instead of silently sampling. The panel applies the
setting to every cell you run (outside Studio, the equivalent is
SET provsql.rv_mc_samples = 0).
Build a biomarker that follows a Normal distribution with mean 20 and standard deviation 5, and look at it:
SELECT normal(20, 5) AS biomarker;
One row comes back with a random_variable token. Click it, then on
the circuit cell press Circuit mode (this leaves
the notebook; you will come back through the Notebook
tab in the mode switcher at the top). In Circuit mode, press
Run on the evaluation strip: for a continuous variable the default
evaluation is its distribution profile, and Studio draws the bell curve
of Normal(20, 5), centred on 20. Have a look, then return through the
Notebook tab – the notebook reopens where you left it.
Its mean is no surprise:
SELECT expected(normal(20, 5));
20. Now suppose a patient is referred – their biomarker came back above a
threshold of 25. Conditioning the variable on that event, x | (x > 25),
chops off everything below 25 and renormalises; the result is itself a
random_variable. Its mean is no longer 20:
WITH r AS (SELECT normal(20, 5) AS x)
SELECT expected(x | (x > 25)) FROM r;
About 27.6 – a referred patient’s biomarker is expected well above the threshold, not at the population average. The conditioned variable has a spread and a range too:
WITH r AS (SELECT normal(20, 5) AS x)
SELECT variance(x | (x > 25)) AS variance,
(support(x | (x > 25))).lo AS lowest_value
FROM r;
The variance has shrunk to about 5 (from 25 for the unconditioned Normal), and the lowest value the conditioned biomarker can take is exactly 25 – truncation moved the floor up to the threshold.
We took those moments one at a time, but the conditioned biomarker is a
single object – a random_variable you can hand onward, store, or
inspect whole. Print it:
WITH r AS (SELECT normal(20, 5) AS x)
SELECT (x | (x > 25)) AS referred_biomarker
FROM r;
Click its token, open Circuit mode, and press Run. Its distribution profile is the truncated bell curve – everything below 25 cut away, the rest renormalised – with its support, mean, and variance reported alongside: the same conditioned variable you just took moments of, now seen as a whole distribution.
Problem 5: A Probabilistic Total
The last carrier is an aggregate. When the rows you add up are themselves
uncertain, their total is a random quantity, and ProvSQL tracks it as an
agg_token – the aggregate counterpart of a provenance token.
Each row of cases reports a day’s case count for a region, present only
with some probability:
DROP TABLE IF EXISTS cases CASCADE;
CREATE TABLE cases(day int, region text, n int, p float);
INSERT INTO cases VALUES (1, 'North', 3, 0.5), (1, 'North', 4, 0.5), (1, 'South', 2, 0.8);
SELECT add_provenance('cases');
SELECT day, region, n, p, set_prob(provenance(), p) FROM cases;
Sum each region’s counts with an ordinary GROUP BY. Because the inputs
are uncertain, each total comes out as an agg_token rather than a plain
number – look at the per-region totals:
DROP TABLE IF EXISTS casesum CASCADE;
CREATE TABLE casesum AS SELECT region, sum(n) AS total FROM cases GROUP BY region;
SELECT * FROM casesum;
The North has two possible contributions, 3 and 4, each present with probability 0.5. Its expected total – the average over all the ways the days could turn out – is:
SELECT expected(total) FROM casesum WHERE region = 'North';
3.5 (that is ). Now condition on an
observation: suppose we know the high-count day (
n = 4) really
happened. The expected total given that:
SELECT expected(total | (SELECT provenance() FROM cases WHERE n = 4))
FROM casesum WHERE region = 'North';
5.5 – a certain 4 plus the still-uncertain 3 (worth on average). The
| operator conditioned the aggregate exactly as it
conditioned the discrete events in Problem 1 and the continuous variable in
Problem 4.
Like the conditioned biomarker, the conditioned total is a value in its own
right – here an agg_token. Select it directly to get its token:
SELECT total | (SELECT provenance() FROM cases WHERE n = 4) AS conditioned_total
FROM casesum WHERE region = 'North';
Click that token to see its circuit – the same ∣ gate as before,
now over the aggregate and the evidence. In
Circuit mode, the evaluation strip defaults to the distribution profile;
switch it to the moment evaluator (order k = 1, raw) and press
Run: its expected value comes back as 5.5, the same answer
as above. A moment of an aggregate is computed exactly – by enumerating the
rows’ contributions and weighting each by its probability – so it needs no
sampling and is unaffected by the provsql.rv_mc_samples = 0 you set in
Problem 4 (its full distribution and individual samples, by contrast, are
estimated by Monte Carlo, and would need sampling switched back on).
Problem 6: Ruling Worlds Out
repair_key in Problem 1 imposed one kind of constraint – at most
one row of a key group is real. Many real rules are not keys but relations
between pairs of rows: “no two reported doses of one vaccine fall within 21
days”, “no two outbreak cases at one site are reported within the incubation
window”. No key declaration captures these. They are denial constraints –
a query describing a forbidden pattern – and ProvSQL conditions on their
non-occurrence with the event-negation operator !
(provenance_not).
An immunization registry merges dose reports from several sources, so each reported dose is only probably a real administration. A data-quality rule says two doses of the same vaccine must be at least 21 days apart. Here are four uncertain dose reports for one patient – the date each was administered and the probability it is genuine. The Mar 14 report sits close to two others: it is within 21 days of both Mar 4 (10 days earlier) and Mar 28 (14 days later).
DROP TABLE IF EXISTS doses CASCADE;
CREATE TABLE doses(id int, administered date, p float);
INSERT INTO doses VALUES
(1, '2024-03-04', 0.5),
(2, '2024-03-14', 0.5),
(3, '2024-03-28', 0.5),
(4, '2024-04-30', 0.8);
SELECT add_provenance('doses');
SELECT id, administered, p, set_prob(provenance(), p) FROM doses;
The forbidden pattern – “some two doses are fewer than 21 days apart” – is an
ordinary self-join. Materialise it, collapsing all the witnessing pairs into a
single DISTINCT row: that one row’s provenance is the violation event
W, “the record has a too-close pair”. Two pairs qualify – (Mar 4, Mar 14)
and (Mar 14, Mar 28) – and they share the Mar 14 dose, so W is not a
simple product of independent pairs; ProvSQL tracks the shared gate and gets
the overlap right (the same correlation-awareness as Problem 2). So
provenance() over the violation table is W, and !provenance() is
the complementary “valid record” event:
DROP TABLE IF EXISTS violation;
CREATE TEMP TABLE violation AS
SELECT DISTINCT 1
FROM doses a JOIN doses b
ON a.id < b.id AND abs(a.administered - b.administered) < 21;
SELECT probability_evaluate(provenance()) AS p_violation,
probability_evaluate(!provenance()) AS p_valid
FROM violation;
The clash has probability 0.375, so a valid record (!provenance()) has
probability 0.625.
Now condition each dose on the record being valid – one row per dose. Prior and
posterior are the same row token, provenance(), evaluated two ways:
unconditioned, and conditioned on !W. Each row’s own provenance stays the
dose itself; the violation event is pulled in by an inert
(SELECT provenance() FROM violation) – naming W once, without coupling
it into the row’s lineage:
SELECT d.id,
probability_evaluate(provenance()) AS prior,
probability_evaluate(provenance() | !(SELECT provenance() FROM violation)) AS posterior
FROM doses d
ORDER BY d.id;
Same prior, different posterior. A valid record is evidence against exactly the
doses the constraint could have caught, in proportion to how implicated each is:
the Mar 14 dose (dose 2), which would clash with either neighbour, drops the
furthest – from 0.5 to 0.2 – while doses 1 and 3, each in only one of the
two possible violations, drop to 0.4; dose 4, far from the rest and in no
possible clash, is untouched at 0.8. The constraint here was an arbitrary
query: ! turns any forbidden pattern into evidence, the way
repair_key turns a key into mutual exclusion, but without being
limited to keys.
Recap
The six problems used one operator, |, with a single meaning
throughout – conditional probability, – over three kinds of value: discrete events (Problems 1-3
and 6), a continuous
random_variable (Problem 4), and a probabilistic
aggregate
agg_token (Problem 5). A few mechanics recurred:
Each model was built and stored in the database.
add_provenanceregisters a table for tuple-independent tracking andset_probattaches a probability to each row;repair_keyis the alternative registration, making the rows of a key group mutually exclusive outcomes, and the negation operator!conditions on the non-occurrence of an arbitrary forbidden pattern (Problem 6) – a denial constraint beyond keys. The model is ordinary SQL data: it persists, and is queried and updated, across sessions.Provenance is recorded per result row as a circuit of gates, and equal sub-expressions are the same gate. A base event shared between two queries is therefore one node, so joint, disjoint, and conditional probabilities come out consistent without assuming independence. That is the gap between ProvSQL’s 0.44 and the independence estimate 0.545 in Problem 2.
probability_evaluatereturns an exact probability and selects the evaluation method itself; Studio’s evaluation strip reports which one ran. Sampling appears only where no exact method applies (the Monte-Carlo paths in Problems 4 and 5), and then with a stated error guarantee.
None of this required leaving SQL: the questions were ordinary queries, and the probabilistic answers were functions applied to their provenance.