Case Study: The Intelligence Agency

This case study, largely extending the scenario introduced in [Senellart et al., 2018], demonstrates ProvSQL’s custom semiring capability, where-provenance, probability computation with multiple algorithms, and circuit export through a security-classification scenario.

The Scenario

An intelligence agency maintains a database of seven employees spread across three cities. Every employee holds a security clearance ranging from unclassified to top secret. Your tasks:

identify which cities are served by more than one agent,
determine the minimum clearance level needed to infer each result,
find cities with exactly one agent (sensitive: if the city leaks, the sole agent is exposed),
track where in the database each output value originated,
compute the probability that a city remains a single-agent post after accounting for possible-world uncertainty.

Setup

This case study assumes a working ProvSQL installation (see Getting ProvSQL). Download setup.sql and load it into a fresh PostgreSQL database:

psql -d mydb -f setup.sql

This creates:

classification_level – an ordered ENUM (unclassified < restricted < confidential < secret < top_secret < unavailable) where unavailable is a sentinel representing the semiring 𝟘 (no derivation possible)
personnel – 7 agents with name, position, city, and clearance level

Step 1: Explore the Database

At the start of every session, set the search path so that ProvSQL functions can be called without the provsql. prefix:

SET search_path TO public, provsql;

Inspect the personnel table:

SELECT * FROM personnel ORDER BY id;

You should see seven rows: Juma (Director, Nairobi), Paul (Janitor, Nairobi), David (Analyst, Paris), Ellen (Field agent, Beijing), Aaheli (Double agent, Paris), Nancy (HR, Paris), and Jing (Analyst, Beijing).

Step 2: Enable Provenance and Create a Name Mapping

Enable provenance tracking on personnel and create a mapping so that provenance tokens can be labelled with agent names:

SELECT add_provenance('personnel');
SELECT create_provenance_mapping('personnel_name', 'personnel', 'name');

After add_provenance, every row of personnel has a unique UUID token in its hidden provsql column. The mapping personnel_name associates each token with the corresponding agent’s name.

Step 3: Cities Shared by Multiple Agents

Which cities have at least two agents? Use a self-join with an id inequality to generate all unordered pairs, then apply sr_formula to see which agents contribute to each city:

SELECT p1.city,
       sr_formula(provenance(), 'personnel_name') AS formula
FROM personnel p1
JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
GROUP BY p1.city
ORDER BY p1.city;

The formula for Nairobi is Juma ⊗ Paul: both agents must be present for Nairobi to appear. Beijing similarly shows Ellen ⊗ Jing. Paris, with three agents, shows all three pairwise products joined by ⊕.

Step 4: Minimum Security Clearance (sr_minmax)

For each shared city, what is the minimum clearance level required to have inferred that the city has multiple agents? An analyst who knows the city only needs to see the lowest-cleared agent there.

This is the security shape of the min-max m-semiring, computed by the compiled function sr_minmax over the classification_level enum:

⊕ (OR combination) = min: to infer either agent was involved, you only need clearance for the less-classified one (one witness suffices to establish the disjunction).
⊗ (AND/join) = max: to confirm both agents are present, you need clearance for the more-classified one (you must be able to access both records to establish the join).

The third argument is a sample value of the carrier enum (used only for type inference); its value is ignored:

SELECT create_provenance_mapping('personnel_level',
                                 'personnel', 'classification');

SELECT p1.city,
       sr_minmax(provenance(), 'personnel_level',
                 'unclassified'::classification_level) AS min_clearance
FROM personnel p1
JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
GROUP BY p1.city
ORDER BY p1.city;

Results: Nairobi requires restricted (Paul is the more-classified of the two agents, and both must be accessed to confirm the pair). Beijing requires secret (both Ellen and Jing hold the same level). Paris requires confidential: the pair David–Nancy has MAX clearance confidential, which is the lowest maximum among all Paris pairs, so confidential clearance suffices to confirm at least one pair.

Step 5: Cities with Exactly One Agent (EXCEPT / Monus)

A city with a single agent is sensitive: knowing the city immediately identifies the agent. Find cities where all agents are alone using EXCEPT:

SELECT city,
       sr_formula(provenance(), 'personnel_name') AS formula
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
ORDER BY city;

Note

ProvSQL’s EXCEPT uses the monus operator ⊖ of the provenance semiring rather than plain set difference. Every city appears in the result with a provenance formula; the formula evaluates to 𝟘 for cities that are definitely shared and to a non-trivial expression for cities that could be single-agent in some possible world. Nairobi’s formula (Juma ⊕ Paul) ⊖ (Juma ⊗ Paul) reads: “at least one of Juma or Paul is present, minus the event where both are.”

Step 6: Where-Provenance

Where-provenance tracks which column of which input row produced each output value. Enable it and re-run the shared-city query:

SET provsql.where_provenance = on;

SELECT p1.city,
       where_provenance(provenance()) AS source
FROM personnel p1
JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
GROUP BY p1.city
ORDER BY p1.city;

SET provsql.where_provenance = off;

Each city output value is traced back to the city column (column 4) of the personnel table for every agent in that city. The notation [personnel:〈token〉:4;...] shows the token and column index of each contributing row; the trailing [] is the untracked source column itself.

Step 7: Assign Probabilities

Suppose the existence of each agent in the database is uncertain. Assign each agent a probability equal to id / 10.0:

ALTER TABLE personnel ADD COLUMN probability DOUBLE PRECISION;
UPDATE personnel SET probability = id / 10.0;

DO $$ BEGIN
  PERFORM set_prob(provenance(), probability) FROM personnel;
END $$;

Now Juma has probability 0.1, Paul 0.2, …, Jing 0.7.

Step 8: Probability – Exact

Compute the exact probability that each city is a single-agent city:

SELECT city,
       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
ORDER BY city;

Nairobi (agents with probabilities 0.1 and 0.2) has probability 0.1 × 0.8 + 0.9 × 0.2 = 0.26 of having exactly one agent. Beijing (0.4 and 0.7) scores 0.54. Paris (0.3, 0.5, 0.6) gives 0.41.

Step 9: Probability – Monte Carlo

For larger circuits, exact evaluation can be expensive. Monte Carlo sampling gives an approximate answer:

SELECT city,
       probability_evaluate(
           provenance(), 'monte-carlo', '10000') AS prob
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
ORDER BY city;

With 10 000 samples the result is accurate to roughly ±0.01 (see Margin of error). Results will vary slightly between runs due to sampling. Use \timing in psql to compare the runtime against the exact method.

Step 10: Probability – Knowledge Compiler

Note

This step requires an external knowledge compiler such as d4 or dsharp to be installed and on your PATH (or in a directory listed in the provsql.tool_search_path GUC, see Configuration Reference). Skip it if neither is available.

A knowledge compiler converts the provenance circuit to a d-DNNF representation, which enables efficient exact probability evaluation on large circuits of specific forms:

SELECT city,
       ROUND(probability_evaluate(
           provenance(), 'compilation', 'd4')::numeric, 4) AS prob
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
ORDER BY city;

Compare the runtime with \timing against the possible-worlds and Monte Carlo methods. On this small example, the external knowledge compiler will be slower than the other methods: invoking an external process and compiling the circuit carries significant overhead that only pays off on much larger circuits.

Step 11: Visualise a Provenance Circuit

view_circuit renders the provenance circuit as an ASCII box-art diagram using graph-easy (must be installed and on your PATH):

SELECT city, view_circuit(provenance(), 'personnel_name') AS circuit
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
WHERE city = 'Nairobi';

The result shows the monus gate at the top, with Juma and Paul as leaf inputs, rendered in box-art notation in the terminal.

Step 12: Export to XML

to_provxml serialises the provenance circuit to PROV-XML, the W3C standard for provenance interchange, which can be processed by any standard XML tool:

SELECT city,
       to_provxml(provenance(), 'personnel_name') AS provxml
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
WHERE city = 'Nairobi';

Step 13: Large Circuit Benchmark

To compare the three probability algorithms at scale, create a synthetic 100 × 100 random-probability matrix and enable provenance tracking on it:

CREATE TABLE matrix AS
SELECT ones.n + 10 * tens.n AS x,
       other.n + 10 * tens2.n AS y,
       random() AS prob
FROM (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) ones(n),
     (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) tens(n),
     (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) other(n),
     (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) tens2(n);

SELECT add_provenance('matrix');
DO $$ BEGIN
  PERFORM set_prob(provenance(), prob) FROM matrix;
END $$;

Now run the same path query with each method in turn (use \timing in psql to record runtimes):

-- Default method (independent evaluation, tree decomposition, or d4)
SELECT m1.x, m2.y,
       probability_evaluate(provenance()) AS prob
FROM matrix m1, matrix m2
WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90
GROUP BY m1.x, m2.y
ORDER BY m1.x, m2.y;

-- Exact enumeration over all possible worlds
SELECT m1.x, m2.y,
       probability_evaluate(provenance(), 'possible-worlds') AS prob
FROM matrix m1, matrix m2
WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90
GROUP BY m1.x, m2.y
ORDER BY m1.x, m2.y;

-- Monte Carlo sampling (9604 samples ≈ 1 % error at 95 % confidence)
SELECT m1.x, m2.y,
       probability_evaluate(provenance(), 'monte-carlo', '9604') AS prob
FROM matrix m1, matrix m2
WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90
GROUP BY m1.x, m2.y
ORDER BY m1.x, m2.y;

-- Tree-decomposition-based exact compilation (built-in, no external tool)
SELECT m1.x, m2.y,
       probability_evaluate(provenance(), 'tree-decomposition') AS prob
FROM matrix m1, matrix m2
WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90
GROUP BY m1.x, m2.y
ORDER BY m1.x, m2.y;

-- Knowledge compilation via external tool d4
SELECT m1.x, m2.y,
       probability_evaluate(provenance(), 'compilation', 'd4') AS prob
FROM matrix m1, matrix m2
WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90
GROUP BY m1.x, m2.y
ORDER BY m1.x, m2.y;

The Monte Carlo query uses 9604 samples, which gives roughly 1 % additive error with 95 % confidence (by the formula \(n = z^2 / (4\varepsilon^2)\) with \(z = 1.96\), \(\varepsilon = 0.01\); see Margin of error).

The 'tree-decomposition' method is exact and built into ProvSQL (no external binary required). It is often the fastest exact method on simple queries, but it fails on circuits with high treewidth – when that happens, fall back to 'compilation' or one of the other methods.

Step 14: The Boolean Expression Behind a Token

sr_boolexpr returns the abstract Boolean formula of a provenance circuit. Without a mapping it uses internal variable names x0, x1, …; with an optional second argument naming a provenance mapping table the leaves are labelled by the mapping’s value column instead. This is the same expression ProvSQL hands to its d-DNNF compilers internally to compute probabilities.

SELECT city, sr_boolexpr(provenance()) AS boolexpr
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
WHERE city = 'Nairobi';

For Nairobi, the result is the circuit (Juma ⊕ Paul) ⊖ (Juma ⊗ Paul) from Step 5, interpreted in the Boolean function semiring – every provenance gate is mapped to its Boolean counterpart (⊕ to ∨, ⊗ to ∧, ⊖ to ∧¬) – and the resulting Boolean function rendered as a formula over anonymous variables. Unlike sr_formula, the provenance mapping is optional: the expression captures the circuit’s logical structure independently of any naming, and a mapping can be supplied later if you want the leaves labelled.

Step 15: Programmatic Circuit Inspection

What view_circuit renders, you can also walk programmatically with the low-level circuit API. Capture Nairobi’s monus token first:

CREATE TEMP TABLE nairobi_token AS
SELECT provenance() AS prov
FROM (
    SELECT DISTINCT city FROM personnel
  EXCEPT
    SELECT p1.city
    FROM personnel p1
    JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id
    GROUP BY p1.city
) t
WHERE city = 'Nairobi';

get_nb_gates reports how many gates have been materialized in the current database’s circuit:

SELECT get_nb_gates();

get_gate_type and get_children give a single-step view of the gate structure: they return the operator and direct children of a gate.

SELECT get_gate_type(prov)            AS root_type,
       get_children(prov)             AS root_children
FROM nairobi_token;

For Nairobi, the root is a monus gate with two children: the ⊕ sub-circuit (Juma ⊕ Paul) and the ⊗ sub-circuit (Juma ⊗ Paul). Recurse to inspect the children:

SELECT (get_children(prov))[1]                        AS plus_token,
       get_gate_type((get_children(prov))[1])         AS plus_type,
       get_children((get_children(prov))[1])          AS plus_children
FROM nairobi_token;

The leaves of the circuit are input gates that originate from the personnel table. identify_token performs a reverse lookup, returning the table and column count for an input token:

SELECT identify_token(child) AS source
FROM nairobi_token, unnest(get_children((get_children(prov))[1])) AS child;

Both leaves resolve to (personnel, 6) – the personnel table with its six non-provenance columns (id, name, position, city, classification, and the probability column added in Step 7). This is exactly the traversal view_circuit performs to render the box-art diagram.