ProvSQL C/C++ API
Adding support for provenance and uncertainty management to PostgreSQL databases
Loading...
Searching...
No Matches
MonteCarloSampler.h
Go to the documentation of this file.
1/**
2 * @file MonteCarloSampler.h
3 * @brief Monte Carlo sampling over a @c GenericCircuit, RV-aware.
4 *
5 * Drop-in replacement for @c BooleanCircuit::monteCarlo for circuits
6 * that contain continuous random variables (@c gate_rv) or arithmetic
7 * over RVs (@c gate_arith). Operates directly on the
8 * @c GenericCircuit produced by @c CircuitFromMMap, so the
9 * BoolExpr-semiring translation that drops non-Boolean gates is not
10 * needed.
11 *
12 * Gate handling:
13 * - @c gate_input (and @c gate_update) — Bernoulli draw at @c getProb,
14 * memoised per iteration (so the same input feeding two children
15 * produces the same draw).
16 * - @c gate_plus / @c gate_times / @c gate_monus — Boolean OR / AND /
17 * AND-NOT.
18 * - @c gate_zero / @c gate_one — false / true.
19 * - @c gate_cmp with scalar (@c gate_rv / @c gate_arith / @c gate_value)
20 * children — compare two scalar samples per the comparison-operator
21 * OID stored in @c info1. Aggregate-vs-constant @c gate_cmp gates
22 * from HAVING semantics are handled by the existing
23 * @c BooleanCircuit path and are not reached here.
24 * - @c gate_value — parse @c extra as @c float8.
25 * - @c gate_rv — fresh draw from the distribution serialised in
26 * @c extra (memoised per iteration so the SAME RV inside an
27 * arithmetic expression uses the same draw, per the thesis's
28 * SampleOne).
29 * - @c gate_arith — recurse on scalar children, combine per the
30 * operator tag in @c info1 (@c provsql_arith_op enum: PLUS / TIMES
31 * are n-ary; MINUS / DIV are binary; NEG is unary).
32 *
33 * The RNG is seeded from the @c provsql.monte_carlo_seed GUC: zero
34 * (default) seeds non-deterministically from @c std::random_device,
35 * any other value is a literal seed shared across the Bernoulli and
36 * continuous paths so a single GUC pins the whole computation.
37 */
38#ifndef PROVSQL_MONTE_CARLO_SAMPLER_H
39#define PROVSQL_MONTE_CARLO_SAMPLER_H
40
41#include <optional>
42#include <vector>
43
44#include "GenericCircuit.h"
45
46extern "C" {
47#include "provsql_utils.h"
48}
49
50namespace provsql {
51
52/**
53 * @brief Run Monte Carlo on a circuit that may contain @c gate_rv leaves.
54 *
55 * @param gc The circuit (loaded from the mmap store via
56 * @c CircuitFromMMap).
57 * @param root Gate to evaluate as a Boolean expression.
58 * @param samples Number of independent worlds to sample.
59 * @return Estimated probability that @p root is true.
60 *
61 * @throws CircuitException on malformed circuits (unknown gate kind in
62 * a Boolean position, malformed @c extra, unknown comparison
63 * operator, etc.).
64 */
65double monteCarloRV(const GenericCircuit &gc, gate_t root, unsigned samples);
66
67/**
68 * @brief Walk the circuit reachable from @p root looking for any @c gate_rv.
69 *
70 * Used by @c probability_evaluate to dispatch between the existing
71 * @c BooleanCircuit path and the RV-aware sampler in this file.
72 */
73bool circuitHasRV(const GenericCircuit &gc, gate_t root);
74
75/**
76 * @brief Estimate the joint distribution of @p cmps via Monte Carlo.
77 *
78 * For each of @p samples worlds, samples the underlying continuous
79 * island once (shared @c gate_rv leaves use the same per-iteration
80 * draw, per @c monteCarloRV's evalScalar) and evaluates each
81 * comparator in @p cmps; the @c k = @p cmps.size() resulting bits
82 * form a single word @c w with bit @c i = result of @c cmps[i]. The
83 * returned vector has size @c 2^k; entry @c w is the empirical
84 * probability that the joint outcome @c w occurred.
85 *
86 * Used by the multi-cmp half of the hybrid evaluator's island
87 * decomposer to inline a categorical distribution over the @c k cmps
88 * that share an island; @p cmps must all sit over a continuous
89 * island whose scalar evaluation reuses common @c gate_rv leaves so
90 * the cmp draws are correctly correlated.
91 *
92 * @c k is capped at 30 (the result vector size is @c 2^30) to keep
93 * memory bounded; the decomposer enforces a much tighter cap
94 * (@c k_max in @c HybridEvaluator.cpp) so this is purely a safety
95 * limit. Throws @c CircuitException above the cap.
96 *
97 * @param gc The circuit.
98 * @param cmps The comparators jointly evaluated.
99 * @param samples Number of independent worlds.
100 * @return Vector of joint probabilities, indexed by the bit
101 * word @c w (bit @c i = @c cmps[i] outcome).
102 */
103std::vector<double> monteCarloJointDistribution(
104 const GenericCircuit &gc,
105 const std::vector<gate_t> &cmps,
106 unsigned samples);
107
108/**
109 * @brief Sample a scalar sub-circuit @p samples times and return the draws.
110 *
111 * @p root must yield a scalar (@c gate_value, @c gate_rv, or @c gate_arith
112 * over scalar children); otherwise a @c CircuitException is thrown. Each
113 * iteration uses a fresh per-iteration memo cache so that repeated
114 * occurrences of the same @c gate_rv UUID inside an arithmetic expression
115 * share their draw within an iteration but not across iterations.
116 *
117 * The RNG is seeded from @c provsql.monte_carlo_seed exactly like
118 * @c monteCarloRV; pinning the GUC makes the returned vector reproducible.
119 *
120 * Used as the universal MC fallback by the analytical evaluators
121 * (@c Expectation, @c HybridEvaluator) when structural shortcuts cannot
122 * decide a sub-expression. Returning the raw draws (rather than a
123 * single statistic) lets callers compute any combination of moments
124 * from a single sampling pass.
125 */
126std::vector<double> monteCarloScalarSamples(
127 const GenericCircuit &gc, gate_t root, unsigned samples);
128
129/**
130 * @brief Outcome of a conditional Monte Carlo sampling pass.
131 *
132 * @c accepted holds the @c root values from the iterations where
133 * @c event_root evaluated to @c true (the rest are rejected).
134 * @c attempted is the total number of iterations -- equal to @c samples
135 * unless the pass was interrupted -- so the caller can derive the
136 * empirical acceptance rate as
137 * <tt>accepted.size() / attempted</tt> for diagnostics.
138 */
140 std::vector<double> accepted;
141 unsigned attempted;
142};
143
144/**
145 * @brief Rejection-sample @p root conditioned on @p event_root.
146 *
147 * For each of @p samples iterations, the shared @c Sampler resets its
148 * per-iteration cache, then:
149 * 1. evaluates @p event_root as a Boolean (populating @c bool_cache_
150 * and @c scalar_cache_ for every @c gate_rv / @c gate_input touched);
151 * 2. if the indicator is @c true, evaluates @p root as a scalar
152 * using the SAME caches, so any shared @c gate_t leaf produces
153 * one draw that the indicator and the value both observe;
154 * 3. otherwise rejects the iteration.
155 *
156 * This coupling is the entire point of routing the conditional path
157 * through one joint circuit: a @c gate_rv reachable from both
158 * @p root and @p event_root has the same @c gate_t and therefore
159 * shares its per-iteration draw between the indicator (which decides
160 * acceptance) and the value (which we record). The accepted draws
161 * are samples from the conditional distribution
162 * @f$X \mid A@f$ where @c X = @p root and @c A = @p event_root.
163 *
164 * @param gc Circuit (typically from @c getJointCircuit).
165 * @param root Scalar gate whose value we sample.
166 * @param event_root Boolean gate that the iteration must satisfy.
167 * @param samples Number of iterations to attempt.
168 */
170 const GenericCircuit &gc, gate_t root, gate_t event_root, unsigned samples);
171
172/**
173 * @brief Try to draw @p n exact samples from the conditional
174 * distribution of @p root @b given @p event_root via closed-form
175 * truncation, bypassing MC rejection.
176 *
177 * Fires only when @p root is a bare @c gate_rv whose family admits a
178 * closed-form truncation (@c Uniform / @c Exponential / @c Normal)
179 * and @c collectRvConstraints can extract a sound interval from
180 * @p event_root. Other shapes (arith composites, mixtures, Erlang,
181 * un-extractable events) return @c std::nullopt so the caller can fall
182 * back to @c monteCarloConditionalScalarSamples.
183 *
184 * Sampling kernels:
185 * - <b>Uniform(a, b)</b>: @c collectRvConstraints already intersects
186 * with @c [a, b], so the draw is a plain @c U(lo, hi) on the
187 * intersected interval. 100% acceptance.
188 * - <b>Exponential(λ)</b>, one-sided @c X > c: memorylessness yields
189 * @c c + Exp(λ). Two-sided @c lo < X < hi: inverse-CDF via
190 * @c std::log1p / @c std::expm1 for numerical accuracy near the
191 * support boundary.
192 * - <b>Normal(μ, σ)</b>: inverse-CDF transform. Forward CDF uses
193 * @c std::erf (matching @c AnalyticEvaluator::cdfAt); inverse uses
194 * the Beasley-Springer-Moro rational approximation (~1e-7 accuracy,
195 * ample for sampling).
196 *
197 * Empty / degenerate truncations (@c lo >= @c hi after intersection)
198 * also return @c std::nullopt so the caller's MC fallback can emit
199 * its usual "accepted 0" diagnostic.
200 *
201 * The RNG is seeded from @c provsql.monte_carlo_seed identically to
202 * @c monteCarloScalarSamples, so a pinned seed gives reproducible
203 * output on either path.
204 */
205std::optional<std::vector<double>>
207 gate_t event_root, unsigned n);
208
209} // namespace provsql
210
211#endif // PROVSQL_MONTE_CARLO_SAMPLER_H
gate_t
Strongly-typed gate identifier.
Definition Circuit.h:49
Semiring-agnostic in-memory provenance circuit.
In-memory provenance circuit with semiring-generic evaluation.
std::vector< double > monteCarloJointDistribution(const GenericCircuit &gc, const std::vector< gate_t > &cmps, unsigned samples)
Estimate the joint distribution of cmps via Monte Carlo.
ConditionalScalarSamples monteCarloConditionalScalarSamples(const GenericCircuit &gc, gate_t root, gate_t event_root, unsigned samples)
Rejection-sample root conditioned on event_root.
std::vector< double > monteCarloScalarSamples(const GenericCircuit &gc, gate_t root, unsigned samples)
Sample a scalar sub-circuit samples times and return the draws.
double monteCarloRV(const GenericCircuit &gc, gate_t root, unsigned samples)
Run Monte Carlo on a circuit that may contain gate_rv leaves.
std::optional< std::vector< double > > try_truncated_closed_form_sample(const GenericCircuit &gc, gate_t root, gate_t event_root, unsigned n)
Try to draw n exact samples from the conditional distribution of root given event_root via closed-for...
bool circuitHasRV(const GenericCircuit &gc, gate_t root)
Walk the circuit reachable from root looking for any gate_rv.
Core types, constants, and utilities shared across ProvSQL.
Outcome of a conditional Monte Carlo sampling pass.