The country adapter pattern for civic-tech
Every civic-tech project I have shipped points at one country’s data. That data is published in one country’s format. The format will change next year because a new administration moves to a new platform, or the agency redesigns the page, or the Excel template gets a column added. None of this has anything to do with the analysis the project actually does.
The country adapter pattern is the way I have settled on for separating the per-country plumbing from the core pipeline.
The shape
A core pipeline that knows nothing about the country. It accepts records in a canonical schema, runs whatever analysis the project does, and emits results in a canonical schema. ghostwatch’s core takes records of the form {project_id, location, start, end, declared_status} and emits {project_id, classification, confidence, indices}. paper-trail-ph’s core takes resolved entities and known relationships, runs the analyzers, and emits flag rows. floodwatch’s core takes labeled flood-event polygons and a region polygon, samples the AlphaEarth embedding at labeled points, and emits a calibrated classifier.
A country adapter that knows everything about the country. It pulls from PhilGEPS, PHIVOLCS, DPWH, BetterGovPH, COA, the PSA PXWeb endpoint, whatever the country publishes. It deals with the Excel column that got renamed, the breadcrumb-as-advisory bug, the address that resolves to three different PSGC codes. It outputs records in the canonical schema. That is its whole job.
The boundary between the two is the canonical schema. If a project is doing this right, the core pipeline has zero string literals that mention a country, an agency, a column name, or a URL.
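A minimal sketch of that boundary in Python, assuming the ghostwatch-style input record described above. The class names, the `CountryAdapter` interface, the raw column names ("Contract ID", "Target Completion", and so on), and the `(lat, lon)` reading of `location` are all hypothetical and are not taken from any of the repos.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple


@dataclass(frozen=True)
class ProjectRecord:
    """Canonical input record; field names follow the ghostwatch example."""
    project_id: str
    location: Tuple[float, float]  # assumed (lat, lon); the post does not say
    start: date
    end: date
    declared_status: str


class CountryAdapter(Protocol):
    """Hypothetical interface: anything that yields canonical records."""

    def records(self) -> Iterator[ProjectRecord]: ...


class ExampleRowsAdapter:
    """Illustrative adapter; the raw column names here are made up."""

    def __init__(self, raw_rows: Iterable[Dict[str, str]]):
        self._raw_rows = raw_rows

    def records(self) -> Iterator[ProjectRecord]:
        # All source-specific column names live here, never in the core.
        for row in self._raw_rows:
            yield ProjectRecord(
                project_id=row["Contract ID"],
                location=(float(row["Lat"]), float(row["Long"])),
                start=date.fromisoformat(row["Start Date"]),
                end=date.fromisoformat(row["Target Completion"]),
                declared_status=row["Status"].strip().lower(),
            )


def run_core(adapter: CountryAdapter) -> List[dict]:
    """Stand-in for the core pipeline: it sees only canonical records."""
    return [
        {"project_id": r.project_id, "duration_days": (r.end - r.start).days}
        for r in adapter.records()
    ]
```

In this sketch `run_core` contains no column names, agency names, or URLs. Supporting another source would mean writing another class with a `records()` method and leaving `run_core` untouched.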
What goes in config, what goes in code
Anything that varies per country, per source, or per release goes in config. Stoplists go in config (paper-trail-ph’s “city”, “barangay” stoplist for the location join). Source URLs go in config. Auth credentials go in config. The proximity threshold for “already mapped” goes in config (solar-map-ph: 200 m, same as DeepSolar and SPECTRUM). The HAND-style flood-plausible-terrain mask threshold goes in config.
Anything that defines the analysis itself goes in code. The five-bin spectral classification in ghostwatch goes in code. The Otsu-plus-change-gate-plus-permanent-water pipeline in floodwatch goes in code. The CLIP-encoder-plus-logistic-head architecture in solar-map-ph goes in code. If the analysis changes, the version number changes; if config changes, the data is refreshed.
The test is: if you want to point the project at a new country, do you change config, or do you change code? If the answer is config, the boundary is in the right place. If the answer is code, the country adapter is leaking into the core pipeline.
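As a rough illustration of the split, the sketch below keeps per-country values in a JSON config and the logic in a function that only ever reads that config. The key names, the `AdapterConfig` class, and the placeholder URL are assumptions made for the example. The stoplist and the 200 m threshold come from two different projects and are combined here only to keep the example short.

```python
import json
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class AdapterConfig:
    """Hypothetical per-country settings; the key names are illustrative."""
    source_url: str
    location_stoplist: FrozenSet[str]
    already_mapped_threshold_m: float


def load_config(text: str) -> AdapterConfig:
    """Parse a JSON config string into a typed, immutable object."""
    raw = json.loads(text)
    return AdapterConfig(
        source_url=raw["source_url"],
        location_stoplist=frozenset(w.lower() for w in raw["location_stoplist"]),
        already_mapped_threshold_m=float(raw["already_mapped_threshold_m"]),
    )


def location_tokens(name: str, config: AdapterConfig) -> List[str]:
    """Tokens for a location join, minus the configured stopwords."""
    tokens = name.lower().replace(",", " ").split()
    return [t for t in tokens if t not in config.location_stoplist]


# Config for one country. The URL is a placeholder, not a real endpoint.
PH_CONFIG = """
{
  "source_url": "https://example.invalid/ph/export.xlsx",
  "location_stoplist": ["city", "barangay"],
  "already_mapped_threshold_m": 200
}
"""

config = load_config(PH_CONFIG)
print(location_tokens("Barangay San Roque, Quezon City", config))
# prints ['san', 'roque', 'quezon']
```

Pointing this at another country would mean supplying a different JSON document. `location_tokens` itself contains no country-specific strings.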
Where the adapter pattern breaks
Some sources are too weird to wedge into a clean adapter. PhilGEPS publishes weekly Excel exports with no documented schema; paper-trail-ph’s PhilGEPS collector is a separate, unglamorous module that downloads and diffs, and is probably the most fragile component of the system. PAGASA’s advisory feed had the breadcrumb-as-advisory bug for entire versions of ph-civic-data-mcp; the fix was a strict text classifier specific to the PAGASA HTML, not a generic improvement.
The temptation in both cases is to abstract harder. Build a “generic civic-data collector framework”, parameterize the Excel dialect, write a DSL for parsing government navigation chrome. Resist that. The right call is a per-source script, named for the source, that does the ugly thing the source requires and outputs canonical records. The adapter pattern is for the 80% case where the source actually has a stable shape. The other 20% gets a script.
Concrete handoff
The piece of the country adapter that pays for itself most often is the entity resolver. The same construction firm appears in PhilGEPS as five spelling variants across five contract awards, in COA reports as a sixth, and in the registry under a different parent corporation. paper-trail-ph’s resolver scores candidate pairs with Jaro-Winkler similarity. Scores of 0.92 and above are the obvious cases and auto-merge. Scores from 0.85 up to that cutoff go to manual review. Anything below 0.85 is left alone.
That resolver is in the country adapter, not the core pipeline. The core pipeline takes resolved entities as input and trusts them. When the resolver gets a new alias from a manual-review session, the resolver is updated and the pipeline is re-run. The core pipeline never learns about Jaro-Winkler.
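The banding can be sketched as follows. This is a from-scratch textbook Jaro-Winkler (prefix scale 0.1, prefix capped at 4 characters), not paper-trail-ph’s actual code, and the function names are hypothetical. It also assumes the manual-review band runs from 0.85 up to, but not including, the 0.92 cutoff.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


AUTO_MERGE = 0.92     # cutoff from the post
MANUAL_REVIEW = 0.85  # lower edge of the review band, from the post


def classify(score: float) -> str:
    """Map a similarity score to one of the three resolver outcomes."""
    if score >= AUTO_MERGE:
        return "auto_merge"
    if score >= MANUAL_REVIEW:
        return "manual_review"
    return "leave_alone"


score = jaro_winkler("MARTHA", "MARHTA")  # standard textbook pair
print(round(score, 3), classify(score))
# prints 0.961 auto_merge
```

A real resolver would presumably normalize case, punctuation, and corporate suffixes before scoring. The sketch leaves that out.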
The location resolver in ph-civic-data-mcp is the same shape: PSGC alias handling for “City of Manila” vs “Manila City”, for city-of-X-in-Y addresses like “Sta. Mesa, Manila”, for all 81 provinces. It sits in the country adapter. The core MCP server takes resolved PSGC codes and trusts them.
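A deliberately tiny sketch of that alias handling, not the ph-civic-data-mcp implementation. The lookup value is a placeholder string, not a real PSGC code, and resolving an address by its last comma-separated component is an assumption that only gets you to city level.

```python
import re
from typing import Optional

# Placeholder identifier, NOT a real PSGC code.
CITY_CODES = {
    "manila": "PLACEHOLDER-CODE-MANILA",
}


def normalize_city(name: str) -> str:
    """Collapse 'City of X' and 'X City' to a bare lowercase 'x'."""
    text = " ".join(name.lower().split())
    text = re.sub(r"^city of ", "", text)
    text = re.sub(r" city$", "", text)
    return text


def resolve(address: str) -> Optional[str]:
    """Resolve by the last comma-separated component (assumed to be the city)."""
    city = address.split(",")[-1]
    return CITY_CODES.get(normalize_city(city))


for raw in ("City of Manila", "Manila City", "Sta. Mesa, Manila"):
    print(raw, "->", resolve(raw))
```

All three inputs land on the same placeholder code. Anything the table does not know returns `None` and never reaches the core.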
Why this matters
A civic-tech project is in a slow arms race with the agency publishing the source data. The agency will change the format. The address scheme will be revised. The Excel template will get a new column. The MCP server will get a new authentication wall in front of the WMS endpoint. None of this is a research question, and none of it should require you to revisit the analysis.
If the country adapter pattern is doing its job, the change is one file. If it is not doing its job, the change is the whole pipeline.
Repos that exemplify this in different forms: ghostwatch, paper-trail-ph, ph-civic-data-mcp, floodwatch-ph, solar-map-ph.