Case Study: Île-de-France Public Transit

This case study, extending the scenario introduced in [Senellart et al., 2018], applies ProvSQL to the real-world GTFS dataset for Île-de-France public transit, demonstrating Boolean provenance at scale for wheelchair accessibility reasoning.

The Scenario

The STIF GTFS dataset describes hundreds of transit routes and tens of thousands of stops. You want to answer: starting from Bagneux station (served by RER B and several bus lines), which stops and routes are reachable, and is the entire journey from Bagneux to each destination fully wheelchair-accessible?

Boolean provenance answers the second question: a result token evaluates to true if and only if every record along the path has the wheelchair flag set.

Warning

This case study requires external data files. The dataset is not bundled with ProvSQL due to its size (the compressed download is several hundred megabytes). Download instructions are in Setup below.

Setup

Download the GTFS data. Obtain the Île-de-France GTFS archive from data.gouv.fr (direct download: IDFM-gtfs.zip). Extract the archive; you will need the four text files routes.txt, stops.txt, trips.txt, and stop_times.txt.

Load the schema and data. Download setup.sql and run it from the directory containing the four GTFS files:

cd /path/to/gtfs-files
psql -d mydb -f /path/to/setup.sql

This creates four tables:

  • routes – transit lines (RER A, B, M1, bus 91, …)

  • stops – individual stop points with GPS coordinates and a wheelchair_boarding flag

  • trips – individual scheduled journeys, each with a wheelchair_accessible flag

  • stop_times – arrival and departure times at each stop for each trip

The script also adds provenance tracking and creates a combined wheelchair mapping table from both the trip and stop wheelchair columns.

Note

The setup script already creates the most important indexes (on stop_id, trip_id, and parent station), which are essential for acceptable performance on this large dataset.

Step 1: Explore the Database

At the start of every session, set the search path so that ProvSQL functions can be called without the provsql. prefix:

SET search_path TO public, provsql;

Inspect the four tables:

SELECT COUNT(*) FROM routes;
SELECT COUNT(*) FROM stops;
SELECT COUNT(*) FROM trips;
SELECT COUNT(*) FROM stop_times;

To find the stop IDs for Bagneux station and its platforms:

SELECT * FROM stops WHERE stop_name = 'Bagneux';

Step 2: Provenance and Wheelchair Mapping

Provenance has already been added by setup.sql. The wheelchair mapping table combines wheelchair_accessible from trips and wheelchair_boarding from stops. A result token evaluates to true (1) under sr_boolean if every contributing row has its wheelchair column set to 1.

Inspect the mapping:

SELECT * FROM wheelchair LIMIT 10;

Step 3: Reachable Stops from Bagneux

Find all stops reachable from Bagneux on the same trip and later in the sequence – in other words, stops you can reach by boarding a vehicle at Bagneux without changing:

SELECT DISTINCT s2.stop_name, r2.route_long_name
FROM stops s0
JOIN stops      s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
                  AND t1.stop_sequence < t2.stop_sequence
JOIN stops      s2 ON s2.stop_id = t2.stop_id
JOIN trips      u2 ON u2.trip_id = t2.trip_id
JOIN routes     r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
ORDER BY r2.route_long_name, s2.stop_name;

This returns several dozen distinct (stop, route) pairs covering the reachable network (the exact number depends on the GTFS dataset version).

Step 4: Boolean Provenance – Full Wheelchair Accessibility

Add Boolean provenance evaluation to mark which results are fully wheelchair-accessible along every leg. Because the query returns one row per trip (each with its own provenance circuit), materialize the result first and then aggregate per destination:

CREATE TEMP TABLE bagneux_b AS
  SELECT s2.stop_name,
         r2.route_long_name,
         sr_boolean(provenance(), 'wheelchair') AS accessible
  FROM stops s0
  JOIN stops      s1 ON s1.parent_station = s0.stop_id
  JOIN stop_times t1 ON s1.stop_id = t1.stop_id
  JOIN stop_times t2 ON t1.trip_id = t2.trip_id
                    AND t1.stop_sequence < t2.stop_sequence
  JOIN stops      s2 ON s2.stop_id = t2.stop_id
  JOIN trips      u2 ON u2.trip_id = t2.trip_id
  JOIN routes     r2 ON r2.route_id = u2.route_id
  WHERE s0.stop_name = 'Bagneux';

SELECT stop_name, route_long_name, bool_or(accessible) AS accessible
FROM bagneux_b
GROUP BY stop_name, route_long_name
ORDER BY route_long_name, stop_name;

sr_boolean evaluates the provenance token under the Boolean semiring, looking up each leaf token in the wheelchair table. A result of true means every record along some path from Bagneux to that stop has the wheelchair flag set; false means no fully accessible path exists.

Step 5: Inspect Individual Results with sr_formula

For a stop that is not fully accessible, use sr_formula to identify which specific trip or stop is responsible. Here we inspect the Paul Bert stop on route 391 as an example:

SELECT s2.stop_name,
       sr_formula(provenance(), 'wheelchair') AS formula
FROM stops s0
JOIN stops      s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
                  AND t1.stop_sequence < t2.stop_sequence
JOIN stops      s2 ON s2.stop_id = t2.stop_id
JOIN trips      u2 ON u2.trip_id = t2.trip_id
JOIN routes     r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
  AND r2.route_long_name = '391'
  AND s2.stop_name = 'Paul Bert'
LIMIT 1;

The formula shows which token carries a 0 wheelchair value, pinpointing the accessibility barrier. For example:

stop_name | formula
----------+-----------------
Paul Bert | 1 ⊗ 1 ⊗ 0 ⊗ 1

The four factors correspond to the four provenance-enabled table instances in the join: the Bagneux station record (stops, 1), its platform record (stops, 1), the Paul Bert stop record (stops, 0), and the trip record (trips, 1). The 0 on the third factor pinpoints the specific Paul Bert stop served by route 391 as the accessibility barrier. Note that there are several stops named Paul Bert in the dataset; the one served by route 391 has wheelchair_boarding = 0, as we can verify:

SELECT DISTINCT s2.stop_name, s2.wheelchair_boarding
FROM stops s0
JOIN stops      s1 ON s1.parent_station = s0.stop_id
JOIN stop_times t1 ON s1.stop_id = t1.stop_id
JOIN stop_times t2 ON t1.trip_id = t2.trip_id
                  AND t1.stop_sequence < t2.stop_sequence
JOIN stops      s2 ON s2.stop_id = t2.stop_id
JOIN trips      u2 ON u2.trip_id = t2.trip_id
JOIN routes     r2 ON r2.route_id = u2.route_id
WHERE s0.stop_name = 'Bagneux'
  AND r2.route_long_name = '391'
  AND s2.stop_name = 'Paul Bert';