Case Study: The Wildlife Photo Archive

This case study, loosely inspired by a ProvSQL demonstration at EDBT 2025 [Yunus et al., 2025], applies ProvSQL to a database of wildlife photographs annotated by a species-detection model. It demonstrates the VALUES clause, repair_key and the mulinput gate, ranking by probability versus thresholding, EXCEPT, common table expressions, and expected aggregates.

The Scenario

A naturalist organisation maintains a database of wildlife photographs taken at four field stations in the Scottish Highlands. Each photo has been processed by a species-detection model that draws one or more bounding boxes around things it thinks are animals, and for each box reports a list of candidate species with a confidence score. A box can therefore appear with several species candidates (e.g. a partly-occluded shape might score 0.40 as red deer and 0.30 as roe deer); a photo can contain several boxes of the same species (e.g. three deer in a meadow shot).

Your tasks:

find photos that contain specific combinations of species,
rank results by the probability that the combination is truly present,
compare probabilistic ranking against naive confidence thresholding,
exclude photos that contain unwanted species,
compute expected species counts per photo.

Setup

This case study assumes a working ProvSQL installation (see Getting ProvSQL). Download setup.sql and load it into a fresh PostgreSQL database:

psql -d mydb -f setup.sql

This creates three tables:

photo – 30 wildlife photographs, each tagged with a station name (Loch Torridon, Glen Affric, Rannoch Moor, or Cairngorms) and a date
species – 13 species across mammals, birds, and reptiles
detection – about 60 model-produced (bounding-box, species) candidate rows, each linking a photo and a bounding-box index to a candidate species with a confidence score; multiple rows for the same (photo_id, bbox_id) pair represent the classifier’s alternative species hypotheses for that single bounding box

Step 1: Explore the Database

At the start of every session, set the search path:

SET search_path TO public, provsql;

Inspect the tables. Note that detection is not keyed on (photo_id, bbox_id): a single bounding box can appear in several rows, one per candidate species the classifier considered.

SELECT * FROM photo  ORDER BY id LIMIT 5;
SELECT * FROM species ORDER BY id;
SELECT photo_id, bbox_id, species_id, confidence
FROM detection
WHERE photo_id IN (5, 9, 14, 22)
ORDER BY photo_id, bbox_id, confidence DESC;

Step 2: Enable Provenance and Create a Name Mapping

Enable provenance tracking on detection. Each row receives a UUID circuit token that propagates through any downstream query.

SELECT add_provenance('detection');

To get readable formulas, we want to associate each detection’s provenance token with its species name. A provenance mapping in ProvSQL is nothing more than a regular table with two columns named value and provenance (plus, for performance, an index on provenance). The convenience function create_provenance_mapping builds such a table from one column of a provenance-enabled relation, but nothing prevents us from constructing the table by hand:

CREATE TABLE species_mapping AS
  SELECT s.name AS value, d.provsql AS provenance
  FROM detection d JOIN species s ON s.id = d.species_id;

SELECT remove_provenance('species_mapping');
CREATE INDEX ON species_mapping(provenance);

The CREATE TABLE AS query inherits a provsql column from detection via ProvSQL’s planner hook; remove_provenance strips that extra column so only the (value, provenance) pair remains. Because the schema is fully under our control, we can populate the table from any expression – combine columns, filter rows, derive computed values – and any semiring-evaluation function (sr_formula, sr_why, …) will happily consume the result.

Step 3: Inline Lookup with `VALUES`

A VALUES clause defines an inline relation directly inside a query. ProvSQL’s planner hook treats it like any other source: rows it produces have no provenance of their own, but they carry through joins so that the result inherits provenance from the joined provenance-enabled rows.

Here we use VALUES to define a small ad-hoc watchlist: a couple of species we want to look up by hand together with a human-readable label, instead of pulling them from the species table. Suppose we are interested in Red Deer (species_id 1, the dominant grazer whose density we want to track) and Red Fox (species_id 3, a generalist predator), and we want to tag the rows accordingly:

SELECT v.label, p.id, p.station, p.date,
       sr_formula(provenance(), 'species_mapping') AS formula
FROM (VALUES (1, 'mammal of interest'),
             (3, 'predator of interest')) AS v(species_id, label),
     detection d, photo p
WHERE d.species_id = v.species_id AND d.photo_id = p.id
ORDER BY p.id, v.label;

Each output row carries the provenance of the underlying detection row alone: the formula is a single species token, even though the row also references photo and VALUES. Tables without provenance – including the VALUES rows – contribute no tokens.

Step 4: Conjunctive Query (Naive)

Find photos that contain both Red Deer (species_id 1) and Red Fox (species_id 3) using a self-join on detection:

SELECT p.id, p.station, p.date,
       sr_formula(provenance(), 'species_mapping') AS formula
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
GROUP BY p.id, p.station, p.date
ORDER BY p.id;

Look at photo 5: the classifier produced three Red Deer candidate rows (in three different bounding boxes) and two Red Fox candidate rows (in two more boxes). Its formula is the ⊕-sum of all six (deer, fox) pair products – every candidate row is an independent input gate. This matches the structure of the underlying detection table but mis-models the data: each bounding box can correspond to at most one real animal, so candidate rows that share a (photo_id, bbox_id) pair should be mutually exclusive rather than independent. Nothing in the schema enforces that today, and the formula reflects the mismatch.

Step 5: Mutually Exclusive Candidates with `repair_key`

repair_key rewrites the provenance so that rows sharing a key become alternatives under a single mulinput (multivalued input) gate – i.e. exactly one of them is true. Applied with the key (photo_id, bbox_id), every bounding box becomes one mulinput variable whose values are the candidate species the classifier considered for that box.

repair_key takes a single key attribute, so add a synthetic key combining photo and bounding-box index first. repair_key reinstalls the provsql column itself, so also drop the old mapping (whose tokens are about to become stale):

DROP TABLE species_mapping;
SELECT remove_provenance('detection');

ALTER TABLE detection ADD COLUMN photo_bbox text;
UPDATE detection SET photo_bbox = photo_id || '/' || bbox_id;

SELECT repair_key('detection', 'photo_bbox');

CREATE TABLE species_mapping AS
  SELECT s.name AS value, d.provsql AS provenance
  FROM detection d JOIN species s ON s.id = d.species_id;
SELECT remove_provenance('species_mapping');
CREATE INDEX ON species_mapping(provenance);

Re-running the conjunctive query from Step 4 with sr_formula would not be illuminating: mutually exclusive events have no meaningful representation in the symbolic-formula semiring (each mulinput just collapses to 𝟙). To visualize them we use sr_boolexpr instead, which renders the underlying Boolean formula with internal variable names and exposes each mulinput explicitly:

SELECT p.id, sr_boolexpr(provenance()) AS bexpr
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3 AND p.id IN (2, 5)
GROUP BY p.id
ORDER BY p.id;

Each input now appears as a mulinput (the {i=v}[p] notation denotes “variable i takes value v with probability p”). In this query every variable happens to have a single value, so the mutually-exclusive structure is not visible yet. We will see a genuine multi-valued mulinput in the next step, where the deer/roe-deer candidates of one bounding box compose under the mutex constraint.

Note

view_circuit cannot render mulinput gates either and refuses to evaluate. Use sr_boolexpr to inspect circuits that contain mulinput gates, and probability_evaluate to score them.

Step 6: Assign Probabilities and Verify Mutual Exclusion

Each candidate row’s confidence becomes the probability that that classifier candidate is the true species for its bounding box:

DO $$ BEGIN
  PERFORM set_prob(provenance(), confidence) FROM detection;
END $$;

To see that repair_key made a numerical difference, ask: what is the probability that bounding box 1 of photo 5 corresponds to a deer-like animal (species_id 1 = Red Deer or 2 = Roe Deer)? In the data, that bounding box has both candidate species recorded with confidences 0.40 and 0.30:

SELECT photo_id, bbox_id,
    sr_boolexpr(provenance()) AS bexpr,
    ROUND(probability_evaluate(provenance(), 'tree-decomposition')::numeric, 4) AS p
FROM detection
WHERE photo_id = 5 AND bbox_id = 1 AND species_id IN (1, 2)
GROUP BY photo_id, bbox_id;

The Boolean expression is now {0=1}[0.400000] ∨ {0=2}[0.300000]: a single mulinput variable (variable 0) with two mutually exclusive values, value 1 (Red Deer) with probability 0.40 and value 2 (Roe Deer) with probability 0.30. Probability evaluation gives 0.7000, the sum of the two confidences, since combining mutually exclusive events with ⊕ is just addition. Had we kept the original add_provenance setup with each row as an independent input gate, the same query would have given 1 - (1 - 0.40) × (1 - 0.30) = 0.58 instead. The 0.12 gap is the practical effect of telling the engine “these candidates cannot both be true at once”.

Step 7: Probabilistic Ranking vs. Threshold Filtering

Run the conjunctive query under two ranking strategies. First, by probability that both species are truly present:

SELECT p.id, p.station, p.date,
       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
GROUP BY p.id, p.station, p.date
ORDER BY prob DESC, p.id;

Second, by raw confidence threshold (every contributing detection must score at least 0.5):

SELECT DISTINCT p.id, p.station, p.date
FROM detection d1
JOIN detection d2 ON d1.photo_id = d2.photo_id
JOIN photo p ON p.id = d1.photo_id
WHERE d1.species_id = 1 AND d2.species_id = 3
  AND d1.confidence >= 0.5 AND d2.confidence >= 0.5
ORDER BY p.id;

Photo 5 is missed by thresholding (every individual candidate there scores below 0.5) but ranks reasonably under probability_evaluate, because the weak deer candidates in several boxes plus the weak fox candidates combine into a non-trivial probability that some deer and some fox are truly there. Conversely, a photo whose top candidates only barely cross 0.5 passes the threshold but ends up low in the probability ranking.

Step 8: Absence Constraint with `EXCEPT`

Find photos that contain a Red Deer but no Domestic Dog (species_id 13). EXCEPT is implemented in ProvSQL via the ⊖ (monus) operator on the provenance circuit:

SELECT p.id, p.station, p.date,
       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM (
    SELECT photo_id FROM detection WHERE species_id = 1
  EXCEPT
    SELECT photo_id FROM detection WHERE species_id = 13
) t
JOIN photo p ON p.id = t.photo_id
GROUP BY p.id, p.station, p.date
ORDER BY prob DESC, p.id;

Photos that contain a dog still appear in the output – EXCEPT is not a hard filter. Photo 9, with a high-confidence dog detection, ranks lower because the monus discounts strongly. Photo 14, where the dog detection has very low confidence, ranks higher: it is probably in the result, but not certainly. ProvSQL preserves both possibilities in the circuit and lets probability_evaluate weigh them.

Step 9: Multi-Condition Query via a CTE

Combine Steps 7 and 8: photos with both Red Deer and Red Fox, with no Domestic Dog, ranked by probability. The query has three logical layers (co-occurrence, absence, ranking) and reads naturally as a CTE:

WITH deer_and_fox AS (
  SELECT d1.photo_id
  FROM detection d1
  JOIN detection d2 ON d1.photo_id = d2.photo_id
  WHERE d1.species_id = 1 AND d2.species_id = 3
  GROUP BY d1.photo_id
),
no_dogs AS (
  SELECT photo_id FROM deer_and_fox
  EXCEPT
  SELECT photo_id FROM detection WHERE species_id = 13
)
SELECT p.id, p.station, p.date,
       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM no_dogs t
JOIN photo p ON p.id = t.photo_id
ORDER BY prob DESC, p.id;

ProvSQL’s planner hook fires on the expanded query: CTEs are inlined and provenance propagates through them transparently. The same answer can be written with nested subqueries; the CTE form is purely a readability choice.

Step 10: Expected Species Counts with `expected`

How many distinct detections do we expect to be true positives in each photo?

SELECT p.id, p.station,
       ROUND(expected(COUNT(*))::numeric, 4) AS exp_detections
FROM detection d
JOIN photo p ON p.id = d.photo_id
GROUP BY p.id, p.station
ORDER BY exp_detections DESC, p.id;

By linearity of expectation, expected(COUNT(*)) over a group is \(\sum_i P(\text{detection}_i \text{ is true})\). The same linearity applies to SUM aggregates: the expected total confidence mass per photo:

SELECT p.id, p.station,
       ROUND(expected(SUM(d.confidence))::numeric, 4) AS exp_total_conf
FROM detection d
JOIN photo p ON p.id = d.photo_id
GROUP BY p.id, p.station
ORDER BY exp_total_conf DESC, p.id;

Both queries use ProvSQL’s expected operator, which computes the expected value of a SQL aggregate over the probabilistic database defined by the per-row probabilities set in Step 6. Photos with many high-confidence detections rank highest on both metrics.