ProvSQL C/C++ API
Adding support for provenance and uncertainty management to PostgreSQL databases
Loading...
Searching...
No Matches
classify_query.c File Reference

Query-time TID / BID / OPAQUE classifier. More...

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/bitmapset.h"
#include "nodes/nodes.h"
#include "nodes/parsenodes.h"
#include "nodes/pg_list.h"
#include "optimizer/tlist.h"
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "classify_query.h"
#include "provsql_utils.h"
Include dependency graph for classify_query.c:

Go to the source code of this file.

Functions

static const char * kind_label (provsql_table_kind k)
 Map a provsql_table_kind to its uppercase user-facing label.
static bool classify_fromlist_shape_ok (Node *n)
 Decide whether a jointree fromlist entry has a shape the classifier can certify : a plain RangeTblRef, or a JoinExpr with jointype == JOIN_INNER whose larg and rarg recursively satisfy the same predicate.
static void classify_walk (Query *q, ProvSQLClassification *out, bool *shape_ok, int *n_meta, Oid *sole_relid)
 Recursive walker shared by the top-level entry point and the RTE_SUBQUERY descent.
static bool collect_union_all_legs (Node *node, Query *parent, List **legs)
 Walk a SetOperationStmt tree, collecting each leaf leg's Query body into legs.
static bool try_classify_union_all (Query *q, ProvSQLClassification *out)
 Promote a fully-UNION-ALL Query to TID when each leg classifies as TID and the leg source-relid sets are pairwise disjoint.
static bool try_classify_multi_source_tid (ProvSQLClassification *out)
 Conservative multi-source promotion: when every tracked source in out->source_relids is TID and the registered ancestor sets are pairwise disjoint, promote the classification to TID.
static bool resolve_var_to_base (Query *q, Index varno, AttrNumber attno, Oid *out_relid, AttrNumber *out_attno)
 Resolve a base-level (varno, attno) pair in q transitively through RTE_SUBQUERY layers until reaching the underlying RTE_RELATION column.
static bool bid_block_key_preserved (Query *q, Oid source_relid, const ProvenanceTableInfo *info)
 Decide whether every block-key column of info survives in q's target list – resolved transitively through RTE_SUBQUERY descents so the same check works on SELECT k FROM bid_t and on SELECT k FROM (SELECT k FROM bid_t).
static Node * resolve_through_group_rte (Query *q, Node *e)
 PG 18+ helper: when q has a synthetic RTE_GROUP entry (set parseCheckAggregates() appends it for every GROUP BY query), Vars in groupClause's TLE expressions point at the RTE_GROUP rather than the underlying source.
static bool try_classify_groupby_block_key (Query *q, ProvSQLClassification *out)
 Pre-dispatch special case for GROUP BY on a single BID source's block-key columns.
void provsql_classify_query (Query *q, ProvSQLClassification *out)
 Classify the result relation of a parsed top-level Query.
void provsql_classify_emit_notice (const ProvSQLClassification *c)
 Render the result of provsql_classify_query as a NOTICE.

Variables

bool provsql_classify_top_level = false
 Backing storage for the provsql.classify_top_level GUC.

Detailed Description

Query-time TID / BID / OPAQUE classifier.

Invoked by provsql_planner on the top-level Query when the provsql.classify_top_level GUC is on. Emits a NOTICE carrying the certified kind and the set of provenance-tracked base relations the query touches.

Scope :

  • Single-source classification : a flat fromlist of RangeTblRefs, with no kind-altering features (SubLinks, modifying CTEs, cteList, DISTINCT, GROUP BY, HAVING, aggregates, window functions, set-returning functions in the target list). Either zero or one provenance-tracked base relations are reached either directly (RTE_RELATION) or through any depth of subqueries (RTE_SUBQUERY – view bodies after PG rewriting and inline FROM subqueries). The recorded kind of the sole tracked source (TID / BID / OPAQUE) is preserved verbatim ; zero tracked sources is trivially TID.
  • UNION ALL specialisation : a fully-UNION-ALL tree of leg subqueries each of which classifies independently as TID over a base-relid set that is disjoint from every other leg's. The union is then TID with the cumulative source list. Anything that doesn't fit (INTERSECT, EXCEPT, UNION DISTINCT, mixed kinds, overlapping leg sources) falls back to OPAQUE.

Everything else is reported as OPAQUE. Independent-TID join inference, BID block-key preservation under projection and GROUP BY, and the per-relation base-ancestor registry the disjointness check consults all live in this same file ; see the helpers below.

Definition in file classify_query.c.

Function Documentation

◆ bid_block_key_preserved()

bool bid_block_key_preserved ( Query * q,
Oid source_relid,
const ProvenanceTableInfo * info )
static

Decide whether every block-key column of info survives in q's target list – resolved transitively through RTE_SUBQUERY descents so the same check works on SELECT k FROM bid_t and on SELECT k FROM (SELECT k FROM bid_t).

Renamed projections (SELECT k AS b ...) still count as preserving – the match is on the underlying Var, not the output column's name. resjunk entries in the outer targetList are ignored.

Definition at line 456 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ classify_fromlist_shape_ok()

bool classify_fromlist_shape_ok ( Node * n)
static

Decide whether a jointree fromlist entry has a shape the classifier can certify : a plain RangeTblRef, or a JoinExpr with jointype == JOIN_INNER whose larg and rarg recursively satisfy the same predicate.

ANSI-syntax inner joins (INNER JOIN, CROSS JOIN; PG normalises both to JOIN_INNER) preserve TID per-row independence because every output row corresponds to exactly one pair of source rows – the row's provenance is the AND of the two tokens. Outer joins (LEFT / RIGHT / FULL) introduce NULL-padding rows whose provenance is the negation of the inner-match disjunction, so they break the per-row TID property and stay OPAQUE. Semi / anti joins are also rejected (the planner uses JOIN_SEMI / JOIN_ANTI for sublink-driven joins, which our hasSubLinks shape gate has already filtered out upstream, but the explicit check here keeps the predicate self-contained).

Definition at line 86 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ classify_walk()

void classify_walk ( Query * q,
ProvSQLClassification * out,
bool * shape_ok,
int * n_meta,
Oid * sole_relid )
static

Recursive walker shared by the top-level entry point and the RTE_SUBQUERY descent.

The shape gate, source enumeration, and recursion live here so the outer entry point can decide TID / BID / OPAQUE from the cumulative n_meta / sole_relid pair after the whole tree has been walked. Tracked RTE_RELATION entries reachable through any depth of RTE_SUBQUERY (view bodies after PG rewriting, inline FROM-clause subqueries) contribute to the accumulator.

The recursion is stack-bounded by the SQL parser's own nesting limit ; no explicit depth cap is needed at this layer.

Definition at line 115 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ collect_union_all_legs()

bool collect_union_all_legs ( Node * node,
Query * parent,
List ** legs )
static

Walk a SetOperationStmt tree, collecting each leaf leg's Query body into legs.

The tree is a binary structure : interior SetOperationStmt nodes carry the operator and an all flag, leaf RangeTblRefs point at RTE_SUBQUERY entries in parent->rtable whose subquery field is the leg's parsed Query. Returns false if any interior node is not a UNION ALL or any leaf is not a subquery RTE, so that the dispatcher can fall back to OPAQUE (only fully-UNION-ALL trees qualify for the TID promotion).

Definition at line 217 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ kind_label()

const char * kind_label ( provsql_table_kind k)
static

Map a provsql_table_kind to its uppercase user-facing label.

Definition at line 59 of file classify_query.c.

Here is the caller graph for this function:

◆ provsql_classify_emit_notice()

void provsql_classify_emit_notice ( const ProvSQLClassification * c)

Render the result of provsql_classify_query as a NOTICE.

Formats :

*   ProvSQL: query result is TID (sources: schema.t1, schema.t2)
*   ProvSQL: query result is BID (sources: schema.t1)
*   ProvSQL: query result is TID (no provenance-tracked sources)
*   ProvSQL: query result is OPAQUE
* 

The OPAQUE form omits the parenthetical because the rtable walk only reaches syntactically visible sources : when the shape gate trips on a sublink, set operation, GROUP BY, ... the list would be partial and falsely suggest completeness.

Definition at line 713 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ provsql_classify_query()

void provsql_classify_query ( Query * q,
ProvSQLClassification * out )

Classify the result relation of a parsed top-level Query.

Scope :

  • Single-source classification : a flat fromlist of RangeTblRefs, with no kind-altering features (SubLinks, modifying CTEs, cteList, DISTINCT, GROUP BY, HAVING, aggregates, window functions, set-returning functions in the target list). Zero or one provenance-tracked base relation reached either directly (RTE_RELATION) or through any depth of RTE_SUBQUERY entries (view bodies after PG rewriting, inline FROM subqueries). The PG 18 virtual RTE_GROUP is skipped transparently. When a single tracked base relation is reached, the source's recorded kind is preserved verbatim. ORDER BY, LIMIT, OFFSET do not change row lineages and are therefore transparent.
  • UNION ALL specialisation : a fully-UNION-ALL tree of subquery legs each of which classifies as TID over a base- relid set that is disjoint from every other leg's promotes to TID with the cumulative source list.
  • Zero tracked sources : trivially deterministic, reported as TID with an empty source list.
  • Everything else is reported as OPAQUE.
  • Multi-source TID promotion : n_meta >= 2 promotes to TID when every classifier-reported source is TID and the registered ancestor sets are pairwise disjoint.
  • BID projection preservation : the single-source BID branch walks the outer target list (transitively through any depth of RTE_SUBQUERY TLE projection) and downgrades to OPAQUE when any block-key column is dropped.
  • BID GROUP BY block-key promotion : a pre-dispatch special case promotes SELECT k FROM bid_t GROUP BY k to TID (each output row is one block).
  • ANSI INNER / CROSS JOIN : the shape gate accepts JoinExpr fromlist entries when jointype == JOIN_INNER, recursing into both arms.
Parameters
qParsed Query to classify. Read-only; not mutated.
outOutput struct. source_relids is built in the current memory context.

Definition at line 638 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ resolve_through_group_rte()

Node * resolve_through_group_rte ( Query * q,
Node * e )
static

PG 18+ helper: when q has a synthetic RTE_GROUP entry (set parseCheckAggregates() appends it for every GROUP BY query), Vars in groupClause's TLE expressions point at the RTE_GROUP rather than the underlying source.

Resolve them through the RTE_GROUP's groupexprs list so the BID-block-key check below sees the source Var. No-op on PG < 18 and on queries without hasGroupRTE.

Definition at line 501 of file classify_query.c.

Here is the caller graph for this function:

◆ resolve_var_to_base()

bool resolve_var_to_base ( Query * q,
Index varno,
AttrNumber attno,
Oid * out_relid,
AttrNumber * out_attno )
static

Resolve a base-level (varno, attno) pair in q transitively through RTE_SUBQUERY layers until reaching the underlying RTE_RELATION column.

Returns true on success and writes the base relid / base attno to out_relid / out_attno. Returns false when any intermediate TLE is not a plain Var (possibly through RelabelType wrappers), when an outer-scope reference is hit, or when the chain ends on a non-relation rtekind.

Definition at line 401 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ try_classify_groupby_block_key()

bool try_classify_groupby_block_key ( Query * q,
ProvSQLClassification * out )
static

Pre-dispatch special case for GROUP BY on a single BID source's block-key columns.

The generic shape gate rejects groupClause != NIL up front, but SELECT k FROM bid_t GROUP BY k (and the multi-column-key generalisation) has a well-defined per-row provenance : each output row's block_key value uniquely identifies one BID block, and the OR over that block's mulinput slots reduces to the block's key token (an independent gate_input). So the output is per-row independent – TID – with the cumulative source list narrowed to the single BID source.

Conservative: requires no aggregates / window functions / sublinks / SRFs / CTEs / set operations / HAVING / DISTINCT / sortClause-with-side-effects, no LIMIT / OFFSET, a flat fromlist of exactly one RangeTblRef pointing at a BID RTE_RELATION, and a groupClause whose resolved Vars match the source's block-key set exactly (no extra columns, no missing ones). When all met, returns true with out populated. Any failure leaves out untouched ; the caller proceeds to the generic dispatcher path.

Definition at line 563 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ try_classify_multi_source_tid()

bool try_classify_multi_source_tid ( ProvSQLClassification * out)
static

Conservative multi-source promotion: when every tracked source in out->source_relids is TID and the registered ancestor sets are pairwise disjoint, promote the classification to TID.

Mirrors the disjointness check the safe-query rewriter runs in is_safe_query_candidate, just at the classifier layer so a multi-source query no longer collapses to OPAQUE before the rewriter even sees it. The hierarchical-CQ structure is NOT inspected here – the rewriter runs the full check downstream, and the classifier's job is only to certify the per-row independence the user-visible NOTICE pill advertises.

Returns true and sets out->kind on success. On failure (any source non-TID, no registry entry, or any pair of ancestor sets overlapping), returns false and leaves out unchanged so the caller falls through to OPAQUE.

Definition at line 339 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

◆ try_classify_union_all()

bool try_classify_union_all ( Query * q,
ProvSQLClassification * out )
static

Promote a fully-UNION-ALL Query to TID when each leg classifies as TID and the leg source-relid sets are pairwise disjoint.

Returns true and populates out with TID + the cumulative source list on success ; returns false to let the dispatcher fall through to the conservative OPAQUE path on any failure (non-UNION-ALL operator, a leg that classifies as BID/OPAQUE, or overlapping leg sources – the gate-level atoms of a relid that appears in two legs are not disjoint, so the multiset union no longer satisfies the TID property).

Pairwise disjointness is checked at the relid level only. A future correlation registry will refine this by also rejecting legs whose base ancestors overlap (e.g. two views sharing the same underlying TID table), which the syntactic check cannot detect.

BID legs are deliberately not promoted, even when every leg is BID with the same block-key column projected at the same target- list position. The UNION ALL output's "rows with the same block-key value" set spans both legs, but a row from leg A and a row from leg B sharing a block-key value are independent (the legs are different relations), not mutually exclusive. Calling the result BID under that column would falsely advertise an invariant the rows don't satisfy. Recovering BID-ness would require either certifying disjoint block-key values between legs (not knowable from the query text) or emitting a synthetic composite block key (leg_id, k) in the output and recording it in provsql_table_info ; both paths are documented as conservative-by-design in the safe-query follow-ups.

Definition at line 275 of file classify_query.c.

Here is the call graph for this function:
Here is the caller graph for this function:

Variable Documentation

◆ provsql_classify_top_level

bool provsql_classify_top_level = false

Backing storage for the provsql.classify_top_level GUC.

GUC: emit a classification NOTICE on every top-level SELECT.

Definition at line 56 of file classify_query.c.