Knowledge Compilation

ProvSQL builds provenance circuits from SQL queries transparently, through its planner hook. When Boolean provenance is required, whether for probability evaluation, for Shapley value computation, or simply at the user’s request, a Boolean circuit of a particularly convenient form is obtained. This chapter follows the full pipeline behind probability evaluation and shows how to inspect every intermediate artifact:

SQL query  →  provenance circuit  →  CNF  →  d-DNNF  →  probability

Each arrow is exposed by a SQL function, and each artifact can be rendered as GraphViz DOT or printed as DIMACS text. The same surfaces drive the knowledge-compilation panels of ProvSQL Studio.

Note

Because this chapter is entirely about knowledge compilation over Boolean provenance, you will usually want ProvSQL’s Boolean-only optimisations switched on for the whole session. They are gated behind the provsql.boolean_provenance GUC (off by default): issue SET provsql.boolean_provenance = on;, or pick Boolean on ProvSQL Studio’s Provenance scheme switch next to the query box.

The circuit

Every probabilistic query result carries a provenance token (a UUID) naming the root of its provenance circuit (see Adding Provenance to a Table and Querying with Provenance). The circuit is a DAG over input leaves and plus / times / monus gates. view_circuit renders it as DOT; Studio’s circuit mode renders it interactively. The remaining sections take such a token and walk it down to a probability.

For the examples below, assign probabilities to the inputs first (see Probabilities):

SELECT set_prob(provenance(), reliability) FROM sightings;

From circuit to CNF

Before invoking an external compiler, ProvSQL encodes the Boolean circuit into a CNF formula by the Tseytin transformation: one fresh variable per gate, plus clauses asserting that each gate variable is equivalent to the Boolean combination of its inputs. This is the exact encoding the extension streams to d4 / c2d / minic2d / dsharp on a temporary file; tseytin_cnf returns it as text instead:

SELECT tseytin_cnf(provenance()) FROM suspects WHERE id = 1;

The output is DIMACS. With the default mapping => true it opens with one c input comment line per input variable, the self-documenting variable mapping described below; the p cnf <variables> <clauses> header and the clauses follow. With the default weighted => true, the literal-weight lines required by weighted model counters are appended, one per input literal:

c input 1 7f3a2b1c-… 0.5
c input 2 9b2c4d5e-… 0.6
c input 3 a1b2c3d4-… 0.7
p cnf 6 7
-4 0
5 0
-6 1 0
...
w 1 0.5
w -1 0.5
w 2 0.6
w -2 0.4

Pass weighted => false to obtain the bare CNF (no w lines), for plain (unweighted) model counting or for feeding a compiler that reads weights separately:

SELECT tseytin_cnf(provenance(), weighted => false)
FROM suspects WHERE id = 1;

Reading a model back: the variable mapping

A DIMACS variable is just an integer, so a satisfying assignment or weighted count returned by an external tool is meaningless until you know which provenance input each variable stands for. This is what the c input lines at the top of the output above record, one per input variable: its DIMACS number, the originating provenance UUID, and its probability. They make the CNF self-documenting at no cost to the solver: model counters and compilers ignore c comment lines, so the file stays valid DIMACS. Pass mapping => false to omit them. Only input variables appear (the auxiliary Tseytin variables, one per gate, are not provenance inputs).

The same information is available as a table through tseytin_cnf_mapping, which is what you want when reading a tool’s output back programmatically:

SELECT * FROM tseytin_cnf_mapping(
  '00000000-0000-0000-0000-000000000000');

Feed it a concrete token, either a literal UUID or one materialised in a plain table. You cannot pull the token inline from a provenance-tracked relation in the same statement (FROM suspects s, tseytin_cnf_mapping(s.provsql) m): while provsql.active is on the planner hook refuses to rewrite a multi-column function applied to a tracked relation (FROM function with multiple output attributes not supported). Set provsql.active off for that pattern.

The variable column matches the DIMACS numbering, gate is the original-circuit input UUID, and probability its weight. In ProvSQL Studio, the Tseytin CNF panel goes one step further and annotates each c input line with the source tuple the variable came from (resolved from the tracked relations), so you can see at a glance that, say, variable 3 is suspects(5, …).

From CNF to d-DNNF

An external knowledge compiler turns the CNF into a deterministic, decomposable negation normal form (d-DNNF), on which weighted model counting (hence probability) is linear-time. compile_to_ddnnf_dot runs the requested compiler and returns the resulting d-DNNF as a GraphViz digraph over AND / OR / NOT / input gates:

SELECT compile_to_ddnnf_dot(provenance(), 'd4')
FROM suspects WHERE id = 1;

The second argument names the compiler. ProvSQL ships bindings for:

'd4'

Decision-DNNF compiler with backtracking [Lagniez and Marquis, 2017]; the default for the compilation probability method.

'd4v2'

The rewritten d4 [Lagniez and Marquis, 2021].

'c2d'

The original dtree-guided compiler [Darwiche, 2004].

'minic2d'

Top-down CNF-to-SDD compiler [Oztok and Darwiche, 2015].

'dsharp'

DPLL-style compiler built on sharpSAT [Muise et al., 2012].

'panini-obdd', 'panini-obdd-and', 'panini-decdnnf'

Three target languages of KCBox’s Panini compiler [Lai et al., 2024]:

  • OBDD is the canonical ordered Boolean decision diagram, a strict subset of d-DNNF whose decisions are restricted to a global variable order.

  • OBDD[AND] augments OBDD with internal AND nodes [Lai et al., 2017], recovering a more compact representation while keeping polynomial Apply.

  • Decision-DNNF drops the variable order, retaining only the decomposability + determinism of d-DNNF; it is the canonical target of d4.

Panini also ships R2-D2 and CCDD target languages [Lai et al., 2022]. ProvSQL does not expose them: both emit K (kernelize) nodes encoding literal-equivalence constraints over a shared kernel variable, which break the decomposability invariant of a d-DNNF. A direct AND-translation gives silently-wrong probabilities; a correct translation requires case-splitting on the kernel variables and is not yet implemented.

Each compiler must be installed and reachable on the PostgreSQL server’s PATH, or in a directory listed in the provsql.tool_search_path GUC (see Configuration Reference). The same query, compiled by different tools, yields different d-DNNFs for the same Boolean function: comparing their size and sharing is instructive, and is one of the things the Studio knowledge-compilation panel makes visual.

Exporting the d-DNNF as text

compile_to_ddnnf_dot is for the eye. To get the compiled circuit as a machine-readable artifact – to feed it to an external d-DNNF reasoner or verifier, or to archive it – compile_to_ddnnf returns the same circuit in the standard c2d / d4 .nnf text format:

SELECT compile_to_ddnnf(provenance(), 'd4')
FROM suspects WHERE id = 1;

The output opens with an nnf <#nodes> <#edges> <#vars> header, then one line per node: L <lit> for a literal leaf, A <k> for an AND over k earlier nodes, O <j> <k> for an OR (j is the decision variable, 0 when ProvSQL records none). It accepts the same compiler / meta-route names as compile_to_ddnnf_dot.

The literal variables use the same numbering as tseytin_cnf, even when an external compiler renumbered them internally, so a .cnf and a .nnf of the same circuit cross-reference and tseytin_cnf_mapping interprets both. In Studio the Compiled d-D (NNF text) eval-strip option shows the NNF and the copy button yields it verbatim.

Measuring the compiled d-DNNF

To make that comparison quantitative rather than visual, ddnnf_stats returns the structural statistics of the d-DNNF a given compiler produces, as a jsonb object:

SELECT jsonb_pretty(ddnnf_stats(provenance(), 'd4'))
FROM suspects WHERE id = 1;

The object reports nodes and edges, the and / or / not / inputs gate-type split, whether the result is smooth (every OR gate’s children share their variable set), the longest-path depth, the circuit treewidth (null when it exceeds the supported limit), and the compile_ms wall-clock compile time. It accepts the same compiler names as compile_to_ddnnf_dot, including the in-process interpret-as-dd and tree-decomposition routes, so a single string change re-measures a different compiler on the same circuit:

{
    "and": 3, "or": 4, "not": 0, "inputs": 5,
    "nodes": 12, "edges": 14, "depth": 6,
    "smooth": true, "treewidth": 3,
    "compiler": "d4", "compile_ms": 1.83
}

In Studio, the compiled-d-DNNF canvas shows a gates / edges / depth summary in its subtitle, and the probability benchmark (below) adds a d-DNNF (n/e) column so the size of every compiler’s output sits beside its run time in one table.

The in-process routes

ProvSQL also ships two built-in compilers that need no external tool. Both are accepted wherever a compiler name is, by compile_to_ddnnf_dot, compile_to_ddnnf, ddnnf_stats, and the matching probability_evaluate methods.

interpret-as-dd reinterprets the provenance circuit directly as a d-DNNF, with no compilation step, reading each times as an AND of independent children and each plus as an independent OR, the latter rewritten by De Morgan into a NOT over an AND of NOTs (¬(¬a ¬b …)) so the result stays a genuine d-DNNF. It is therefore exact only on circuits whose gates are genuinely independent, the shape an independent or read-once query produces; it does not try to certify that plus gates are deterministic (mutually exclusive), which would be expensive to assert. Gate types it cannot read this way raise an unsupported-gate error. It is the cheapest route, and the one the default method tries first.

tree-decomposition is the structural fallback: it builds a tree decomposition of the Boolean circuit and compiles it to a d-DNNF in time linear in the circuit and singly-exponential in the treewidth [Amarilli et al., 2020]. It fails, raising a Treewidth greater than N error, when the treewidth exceeds the supported limit; the default method then falls through to an external compiler (the provsql.fallback_compiler GUC, default d4).

tree_decomposition_dot exposes the underlying min-fill decomposition as DOT, so the variable-elimination order that yields the in-process d-DNNF is itself inspectable. Its first line is a comment carrying the treewidth:

// treewidth=3
SELECT tree_decomposition_dot(provenance())
FROM suspects WHERE id = 1;

Weighted model counters: the wmc umbrella

The wmc method dispatches to a family of exact weighted model counters, each consuming the same DIMACS CNF emitted by tseytin_cnf (with the literal-weight c p weight lines required by the MCC 2024 input format):

'ganak'

A projected weighted model counter from the meelgroup [Sharma et al., 2019]. Compact and self-contained; requires a single ganak binary on PATH.

'sharpsat-td'

Tree-decomposition-guided exact counter [Korhonen and Järvisalo, 2021]. Needs the flow_cutter_pace17 helper alongside sharpsat-td so the counter can shell out to its tree-decomposer.

'dpmc'

Two-stage planner + executor pipeline [Dudek et al., 2020]: htb builds a project-join tree, dmc consumes it. Both binaries must be installed.

'weightmc'

The approximate ApproxMC variant (kept under the umbrella for discoverability; see Probabilities for its epsilon;delta knob).

Invocation goes through probability_evaluate:

SELECT probability_evaluate(provenance(), 'wmc', 'ganak')
FROM suspects WHERE id = 1;

Switching between counters is a single-string change, which is what the benchmark below exploits to compare them on the same circuit.

Inspecting the in-memory circuit

Before any compilation runs, the provsql.simplify_on_load GUC may have already rewritten parts of the circuit: identity / absorber collapses, gate_cmp resolutions for circuits over random variables, hybrid-evaluator simplifications (see Probabilities). simplified_circuit_subgraph returns the post-simplification DAG, rooted at the given token, as a JSON adjacency list. This is the exact shape the probability evaluators traverse:

SELECT jsonb_pretty(simplified_circuit_subgraph(provenance()))
FROM suspects WHERE id = 1;

Each node carries its gate type, an inline extra field for typed leaves (random variables, cmp thresholds), and the longest-path depth from the root. The longest path is the canonical circuit-depth notion: it tracks the deepest chain of operators between the node and the output, which is what governs evaluation cost and matches the depth a renderer would draw.

Comparing probability methods

probability_benchmark times every probability-evaluation method on a single circuit token and returns one row per method with its wall-clock duration and result. It runs independent, possible-worlds, tree-decomposition, monte-carlo, each external compiler through the compilation method, and each weighted model counter under the wmc umbrella. Methods that cannot apply (an uninstalled compiler, a non-independent circuit, a treewidth blow-up, …) are captured per row in an error column rather than aborting the whole comparison.

probability_benchmark takes a concrete token, so feed it a literal UUID or one materialised in a plain table, exactly like tseytin_cnf_mapping above (and with the same restriction: do not call it inline over a provenance-tracked relation while provsql.active is on):

SET provsql.monte_carlo_seed = 42;   -- reproducible Monte-Carlo row

SELECT method, args,
       ROUND(probability::numeric, 6) AS prob,
       ROUND(milliseconds::numeric, 1) AS ms,
       error
FROM probability_benchmark('00000000-0000-0000-0000-000000000000')
ORDER BY method, args NULLS FIRST;

The exact methods agree to numerical precision; the approximate ones (monte-carlo, weightmc) land within their confidence band. The second and third arguments tune the Monte-Carlo sample count (default 10000) and the epsilon;delta forwarded to weightmc (default '0.8;0.2').

Checking tool availability

Because the compilers and model counters are optional external binaries, tool_available reports whether a given tool resolves to an executable on the backend’s effective PATH, including the provsql.tool_search_path prefix, exactly as a subsequent compilation or view_circuit call would see it:

SELECT tool_available('d4'), tool_available('c2d');

A bare name is resolved through the shell; a name containing a slash is tested as a path. Studio uses this to grey out compilers that are not installed rather than letting them fail at run time.

In ProvSQL Studio

The knowledge-compilation pipeline is available without leaving the browser. In Studio’s evaluation strip, the same artifacts surface as inline panels: the DIMACS CNF, the compiled d-DNNF rendered beside the original circuit, the tree decomposition with its treewidth, and a one-click benchmark across all probability methods. Compilers that tool_available reports as missing are filtered out of the pickers, and the Studio benchmark skips those methods altogether: unlike the SQL probability_benchmark, which still emits a row per method and records the failure in its error column, their rows simply do not appear.

See also