Skip to content

3 min read

paper-trail-ph: cross-referencing four Philippine government databases

  • civic-tech
  • graph
  • case-study

Philippine government procurement, audit, and election data are public. Each agency publishes in its own format and system. PhilGEPS publishes Excel files of bid awards. The Commission on Audit publishes PDFs of audit reports. Open Congress publishes a REST API of legislative records. The Philippine Statistics Authority publishes geographic codes as CSV. None of them know about each other.

That separation is, in a sense, the point. Each agency owns its dataset. It is also what makes patterns invisible. A company winning a contract, the company’s owner sharing a surname with the local mayor, and a COA flag for concentration in that same procurement office are three records in three separate databases. Nobody sees the connection unless they manually cross-reference everything.

paper-trail-ph automates the cross-referencing. It collects from PhilGEPS, Open Congress, COA, and PSA, resolves entities across the four sources, derives implicit relationships that aren’t in any single dataset, loads everything into a Neo4j graph, and runs analyzers over the result.

Where the work actually is

The hard part is not the graph database. The hard part is entity resolution. The same construction firm appears in PhilGEPS as five spelling variants across five contract awards, in COA reports as a sixth, and in the registry under a different parent corporation. Jaro-Winkler fuzzy matching with a 0.92 auto-merge cutoff catches the obvious cases. A 0.85 to 0.91 manual-review band catches the rest. Below 0.85 is left alone.

Once entities are resolved, the relationships you can derive go beyond what any source explicitly publishes. Co-bidding patterns, where companies always bid together against the same agency. Below-threshold contract clustering, where a single procurement budget gets split into ten awards just under the open-bidding ceiling. Surname matching between contractor owners and politicians representing the same district. Phoenix companies, where the same owner reappears under a new corporate shell with a fresh contract. Circular subcontracting, where A subcontracts to B, B to C, and C back to A.

What the analyzers report

The repo ships thirteen automated red-flag detectors and a concentration pass that computes Herfindahl-Hirschman across each procurement office. Concentration alone catches a class of regulatory capture that’s easy to miss in any single audit cycle. An agency where one contractor wins 78% of awards by value isn’t necessarily corrupt, but it’s worth a human looking at.

A red flag is not a finding. The output is a list of cases that warrant a person walking through them with the underlying records. The graph narrows the search space. It doesn’t replace the judgment.

Stack

Neo4j Community for the graph. FastAPI for the API. Next.js with Sigma.js for the WebGL graph explorer. GraphRAG over Neo4j’s native HNSW vector index for natural-language queries against the graph. The pipeline is plain Python: requests for collection, pandas for transform, the async Neo4j driver for load.

PhilGEPS has no API, so the collector downloads and diffs weekly Excel exports. That part is unglamorous and probably the most fragile component of the system. Most of the cleanup work pays for itself only because the pipeline runs every week.

Repo: github.com/xmpuspus/paper-trail-ph.