Case Study: The Wildlife Photo Archive
This case study, loosely inspired by a ProvSQL demonstration at EDBT 2025
[Yunus et al., 2025], applies ProvSQL to a database of wildlife
photographs annotated by a species-detection model. It demonstrates the
VALUES clause, repair_key and the mulinput gate, ranking
by probability versus thresholding, EXCEPT, common table expressions, and
expected aggregates.
The Scenario
A naturalist organisation maintains a database of wildlife photographs taken at four field stations in the Scottish Highlands. Each photo has been processed by a species-detection model that draws one or more bounding boxes around things it thinks are animals, and for each box reports a list of candidate species with a confidence score. A box can therefore appear with several species candidates (e.g. a partly-occluded shape might score 0.40 as red deer and 0.30 as roe deer); a photo can contain several boxes of the same species (e.g. three deer in a meadow shot).
Your tasks:
find photos that contain specific combinations of species,
rank results by the probability that the combination is truly present,
compare probabilistic ranking against naive confidence thresholding,
exclude photos that contain unwanted species,
compute expected species counts per photo.
Setup
This case study assumes a working ProvSQL installation (see
Getting ProvSQL). Download setup.sql
and load it into a fresh PostgreSQL database:
psql -d mydb -f setup.sql
This creates three tables:
photo– 30 wildlife photographs, each tagged with a station name (Loch Torridon, Glen Affric, Rannoch Moor, or Cairngorms) and a datespecies– 13 species across mammals, birds, and reptilesdetection– about 60 model-produced (bounding-box, species) candidate rows, each linking a photo and a bounding-box index to a candidate species with a confidence score; multiple rows for the same (photo_id,bbox_id) pair represent the classifier’s alternative species hypotheses for that single bounding box
Step 1: Explore the Database
At the start of every session, set the search path:
SET search_path TO public, provsql;
Inspect the tables. Note that detection is not keyed on
(photo_id, bbox_id): a single bounding box can appear in
several rows, one per candidate species the classifier considered.
SELECT * FROM photo ORDER BY id LIMIT 5;
SELECT * FROM species ORDER BY id;
SELECT photo_id, bbox_id, species_id, confidence
FROM detection
WHERE photo_id IN (5, 9, 14, 22)
ORDER BY photo_id, bbox_id, confidence DESC;
Step 2: Enable Provenance and Create a Name Mapping
Enable provenance tracking on detection. Each row receives a UUID
circuit token that propagates through any downstream query.
SELECT add_provenance('detection');
To get readable formulas, we want to associate each detection’s
provenance token with its species name. A provenance mapping in
ProvSQL is nothing more than a regular table with two columns named
value and provenance (plus, for performance, an index on
provenance). The convenience function create_provenance_mapping
builds such a table from one column of a provenance-enabled relation,
but nothing prevents us from constructing the table by hand:
CREATE TABLE species_mapping AS
SELECT s.name AS value, d.provsql AS provenance
FROM detection d JOIN species s ON s.id = d.species_id;
SELECT remove_provenance('species_mapping');
CREATE INDEX ON species_mapping(provenance);
The CREATE TABLE AS query inherits a provsql column from
detection via ProvSQL’s planner hook;
remove_provenance strips that extra column so only the
(value, provenance) pair remains. Because the schema is fully
under our control, we can populate the table from any expression –
combine columns, filter rows, derive computed values – and any
semiring-evaluation function (sr_formula, sr_why, …) will
happily consume the result.
Step 3: Inline Lookup with VALUES
A VALUES clause defines an inline relation directly inside a query.
ProvSQL’s planner hook treats it like any other source: rows it produces
have no provenance of their own, but they carry through joins so that the
result inherits provenance from the joined provenance-enabled rows.
Here we use VALUES to define a small ad-hoc watchlist: a couple of
species we want to look up by hand together with a human-readable
label, instead of pulling them from the species table. Suppose we
are interested in Red Deer (species_id 1, the dominant grazer
whose density we want to track) and Red Fox (species_id 3, a
generalist predator), and we want to tag the rows accordingly:
SELECT v.label, p.id, p.station, p.date,
sr_formula(provenance(), 'species_mapping') AS formula
FROM (VALUES (1, 'mammal of interest'),
(3, 'predator of interest')) AS v(species_id, label),
detection d, photo p
WHERE d.species_id = v.species_id AND d.photo_id = p.id
ORDER BY p.id, v.label;
Each output row carries the provenance of the underlying detection
row alone: the formula is a single species token, even though the row
also references photo and VALUES. Tables without provenance –
including the VALUES rows – contribute no tokens.
Step 4: Conjunctive Query (Naive)
Find photos that contain both Red Deer (species_id 1) and Red Fox
(species_id 3) using a self-join on detection:
SELECT p.id, p.station, p.date,
sr_formula(provenance(), 'species_mapping') AS formula
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
GROUP BY p.id, p.station, p.date
ORDER BY p.id;
Look at photo 5: the classifier produced three Red Deer candidate rows
(in three different bounding boxes) and two Red Fox candidate rows (in
two more boxes). Its formula is the ⊕-sum of all six (deer, fox) pair
products – every candidate row is an independent input gate. This
matches the structure of the underlying detection table but
mis-models the data: each bounding box can correspond to at most one
real animal, so candidate rows that share a (photo_id, bbox_id)
pair should be mutually exclusive rather than independent. Nothing in
the schema enforces that today, and the formula reflects the
mismatch.
Step 5: Mutually Exclusive Candidates with repair_key
repair_key rewrites the provenance so that rows sharing a key
become alternatives under a single mulinput (multivalued input) gate
– i.e. exactly one of them is true. Applied with the key
(photo_id, bbox_id), every bounding box becomes one mulinput
variable whose values are the candidate species the classifier
considered for that box.
repair_key takes a single key attribute, so add a synthetic
key combining photo and bounding-box index first. repair_key
reinstalls the provsql column itself, so also drop the old mapping
(whose tokens are about to become stale):
DROP TABLE species_mapping;
SELECT remove_provenance('detection');
ALTER TABLE detection ADD COLUMN photo_bbox text;
UPDATE detection SET photo_bbox = photo_id || '/' || bbox_id;
SELECT repair_key('detection', 'photo_bbox');
CREATE TABLE species_mapping AS
SELECT s.name AS value, d.provsql AS provenance
FROM detection d JOIN species s ON s.id = d.species_id;
SELECT remove_provenance('species_mapping');
CREATE INDEX ON species_mapping(provenance);
Re-running the conjunctive query from Step 4 with sr_formula
would not be illuminating: mutually exclusive events have no meaningful
representation in the symbolic-formula semiring (each mulinput just
collapses to 𝟙). To visualize them we use sr_boolexpr
instead, which renders the underlying Boolean formula with internal
variable names and exposes each mulinput explicitly:
SELECT p.id, sr_boolexpr(provenance()) AS bexpr
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3 AND p.id IN (2, 5)
GROUP BY p.id
ORDER BY p.id;
Each input now appears as a mulinput (the {i=v}[p] notation
denotes “variable i takes value v with probability p”). In
this query every variable happens to have a single value, so the
mutually-exclusive structure is not visible yet. We will see a genuine
multi-valued mulinput in the next step, where the deer/roe-deer
candidates of one bounding box compose under the mutex constraint.
Note
view_circuit cannot render mulinput gates either and
refuses to evaluate. Use sr_boolexpr to inspect circuits
that contain mulinput gates, and probability_evaluate
to score them.
Step 6: Assign Probabilities and Verify Mutual Exclusion
Each candidate row’s confidence becomes the probability that that classifier candidate is the true species for its bounding box:
DO $$ BEGIN
PERFORM set_prob(provenance(), confidence) FROM detection;
END $$;
To see that repair_key made a numerical difference, ask: what is
the probability that bounding box 1 of photo 5 corresponds to a
deer-like animal (species_id 1 = Red Deer or 2 = Roe Deer)? In the
data, that bounding box has both candidate species recorded with
confidences 0.40 and 0.30:
SELECT photo_id, bbox_id,
sr_boolexpr(provenance()) AS bexpr,
ROUND(probability_evaluate(provenance(), 'tree-decomposition')::numeric, 4) AS p
FROM detection
WHERE photo_id = 5 AND bbox_id = 1 AND species_id IN (1, 2)
GROUP BY photo_id, bbox_id;
The Boolean expression is now {0=1}[0.400000] ∨ {0=2}[0.300000]:
a single mulinput variable (variable 0) with two mutually exclusive
values, value 1 (Red Deer) with probability 0.40 and value 2
(Roe Deer) with probability 0.30. Probability evaluation gives
0.7000, the sum of the two confidences, since combining mutually
exclusive events with ⊕ is just addition. Had we kept the original
add_provenance setup with each row as an independent input
gate, the same query would have given
1 - (1 - 0.40) × (1 - 0.30) = 0.58 instead. The 0.12 gap is the
practical effect of telling the engine “these candidates cannot both
be true at once”.
Step 7: Probabilistic Ranking vs. Threshold Filtering
Run the conjunctive query under two ranking strategies. First, by probability that both species are truly present:
SELECT p.id, p.station, p.date,
ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
GROUP BY p.id, p.station, p.date
ORDER BY prob DESC, p.id;
Second, by raw confidence threshold (every contributing detection must score at least 0.5):
SELECT DISTINCT p.id, p.station, p.date
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
AND d1.confidence >= 0.5 AND d2.confidence >= 0.5
ORDER BY p.id;
Photo 5 is missed by thresholding (every individual candidate there
scores below 0.5) but ranks reasonably under
probability_evaluate, because the weak deer candidates in
several boxes plus the weak fox candidates combine into a non-trivial
probability that some deer and some fox are truly there.
Conversely, a photo whose top candidates only barely cross 0.5 passes
the threshold but ends up low in the probability ranking.
Step 8: Absence Constraint with EXCEPT
Find photos that contain a Red Deer but no Domestic Dog (species_id
13). EXCEPT is implemented in ProvSQL via the ⊖ (monus) operator on
the provenance circuit:
SELECT p.id, p.station, p.date,
ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM (
SELECT photo_id FROM detection WHERE species_id = 1
EXCEPT
SELECT photo_id FROM detection WHERE species_id = 13
) t
JOIN photo p ON p.id = t.photo_id
GROUP BY p.id, p.station, p.date
ORDER BY prob DESC, p.id;
Photos that contain a dog still appear in the output – EXCEPT is
not a hard filter. Photo 9, with a high-confidence dog detection,
ranks lower because the monus discounts strongly. Photo 14, where the
dog detection has very low confidence, ranks higher: it is probably
in the result, but not certainly. ProvSQL preserves both possibilities
in the circuit and lets probability_evaluate weigh them.
Step 9: Multi-Condition Query via a CTE
Combine Steps 7 and 8: photos with both Red Deer and Red Fox, with no Domestic Dog, ranked by probability. The query has three logical layers (co-occurrence, absence, ranking) and reads naturally as a CTE:
WITH deer_and_fox AS (
SELECT d1.photo_id
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
GROUP BY d1.photo_id
),
no_dogs AS (
SELECT photo_id FROM deer_and_fox
EXCEPT
SELECT photo_id FROM detection WHERE species_id = 13
)
SELECT p.id, p.station, p.date,
ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM no_dogs t
JOIN photo p ON p.id = t.photo_id
ORDER BY prob DESC, p.id;
ProvSQL’s planner hook fires on the expanded query: CTEs are inlined and provenance propagates through them transparently. The same answer can be written with nested subqueries; the CTE form is purely a readability choice.
Step 10: Expected Species Counts with expected
How many distinct detections do we expect to be true positives in each photo?
SELECT p.id, p.station,
ROUND(expected(COUNT(*))::numeric, 4) AS exp_detections
FROM detection d
JOIN photo p ON p.id = d.photo_id
GROUP BY p.id, p.station
ORDER BY exp_detections DESC, p.id;
By linearity of expectation, expected(COUNT(*)) over a group is
\(\sum_i P(\text{detection}_i \text{ is true})\). The same linearity
applies to SUM aggregates: the expected total confidence mass per
photo:
SELECT p.id, p.station,
ROUND(expected(SUM(d.confidence))::numeric, 4) AS exp_total_conf
FROM detection d
JOIN photo p ON p.id = d.photo_id
GROUP BY p.id, p.station
ORDER BY exp_total_conf DESC, p.id;
Both queries use ProvSQL’s expected operator, which computes
the expected value of a SQL aggregate over the probabilistic database
defined by the per-row probabilities set in Step 6. Photos with many
high-confidence detections rank highest on both metrics.