Most CRE intelligence tools either resell the same proprietary feeds everyone else has, or scrape a single dataset (permits, demographics, traffic counts) and call it intelligence. Locus is built on a different premise: that hundreds of public-record streams, normalized and joined at the block level, contain a richer signal than any private dataset can match — if you can do the spatial joins, dedupe the entities, and resolve the messy strings.
This is the architecture post. If you've ever opened the Explorer and wondered what is actually under the score, this is the answer.
The data backbone
Locus ingests from 80+ public sources across federal, state, and municipal systems. The volume is meaningful, but the breadth is what matters — the goal is triangulation across independent signal sources, not depth on any single one.
Every record lands in Supabase Postgres with PostGIS for geometry and the APRS envelope from Axiom Codex (record_id URN, source_uri, schema_version, acl_tier, occurred_at). The envelope is what lets us tell — three pipelines downstream — that a permit, an FDA letter, and a Section 108 grant all reference the same address.
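As a minimal sketch, the envelope looks something like the following. The field names come from the post; the types, the URN layout, and the example values are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AprsEnvelope:
    """Sketch of the APRS envelope. Field names from the post;
    types and the URN layout are assumptions."""
    record_id: str       # URN, e.g. "urn:axiom:permit:nyc:12345" (hypothetical layout)
    source_uri: str      # pointer back to the original public record
    schema_version: str
    acl_tier: str
    occurred_at: datetime

env = AprsEnvelope(
    record_id="urn:axiom:permit:nyc:12345",
    source_uri="https://data.cityofnewyork.us/resource/example",
    schema_version="2024.1",
    acl_tier="public",
    occurred_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
# Downstream entity resolution keys on record_id/source_uri, so two
# pipelines can agree they are looking at the same underlying record.
```

The point of the envelope is that every record, regardless of source, carries the same identity and provenance fields, which is what makes cross-pipeline joins tractable.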
Why H3 hexagons (and not census tracts)
Tracts are political artifacts: variable in area, drawn for population counts, and frozen between decennial updates. They're terrible for spatial analytics. We score on H3 hexagons — Uber's hierarchical hex index — at resolution 8 (≈0.74 km² per cell, edge length ~461 m, a few city blocks) for primary analysis, and resolution 6 for metro summaries.
Hexagons solve three problems tracts can't: (1) uniform area, so densities are comparable across cities; (2) clean hierarchical aggregation, so a metro view is just a cellToParent() away; (3) consistent neighbor relationships, which matter when you're running density clustering.
Every cell-level table in the database carries an h3_index TEXT column computed app-side with h3-js v4. The h3 Postgres extension isn't available on Supabase, so we precompute and index instead — fast lookups, no PL/pgSQL dependency.
Permit clustering: HDBSCAN on 929K records
Building permits are the rawest signal Locus has. Every cosmetic remodel, restaurant build-out, and ground-up tower files one. The challenge is that permits as point data are noise. The intelligence is in spatial-temporal clusters: where are permits arriving in dense bursts that don't fit the city's existing pattern?
We run HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) on the permit corpus, parameterized per metro to account for very different baseline densities. HDBSCAN was chosen over k-means and DBSCAN for two reasons: it doesn't require pre-specifying cluster count, and it handles wildly variable cluster densities — which is exactly what real urban geography looks like.
| Metro | Permits processed | Clusters detected | Density vs metro mean |
|---|---|---|---|
| NYC | 187,402 | 1,243 | 4.1× |
| LA | 143,818 | 978 | 3.8× |
| Chicago | 92,640 | 712 | 3.5× |
| Houston | 81,294 | 654 | 3.2× |
| Phoenix | 68,117 | 541 | 3.6× |
| Nashville | 24,930 | 287 | 3.9× |
Causal scoring across 8 dimensions
A cluster of permits tells you something is happening. The scoring layer tells you whether what's happening matters and to whom. Every H3 cell gets evaluated across 8 dimensions, each backed by independent signal:
| Dimension | Primary signals | Update cadence |
|---|---|---|
| Demographic momentum | IRS SOI migration, ACS 5-yr, USPS NCOA | Annual + quarterly |
| Commercial activity | Building permits, BLS QCEW, liquor licenses | Weekly |
| Environmental | EPA brownfields, FEMA flood, Sentinel-2 | Monthly |
| Safety & livability | 311 complaints, FBI NIBRS, crash data | Daily–weekly |
| Mobility | LEHD origin-destination, GTFS transit | Quarterly |
| Education | GreatSchools, NCES enrollment | Annual |
| Walkability | OSM density, POI clustering, sidewalks | Monthly |
| Job market | Job postings, salary gap, BLS LAUS | Weekly |
The scores aren't a black-box average. Each dimension produces an explainable signal vector — a list of the specific records (with source_uri back to the original) that moved the needle on that cell this quarter. When the Explorer shows you that a hex's commercial score jumped 18 points, you can click through to the 47 permits and 9 liquor licenses that caused it.
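A minimal sketch of that signal-vector idea, with hypothetical class names, weights, and URIs (the actual contribution model isn't described in the post):

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    source_uri: str   # link back to the original public record
    weight: float     # contribution to the dimension score (hypothetical scale)

@dataclass
class DimensionScore:
    dimension: str
    signals: list[Signal] = field(default_factory=list)

    @property
    def score_delta(self) -> float:
        # The quarterly movement is the sum of contributing records,
        # so every point of the score is attributable to a record.
        return sum(s.weight for s in self.signals)

commercial = DimensionScore("commercial_activity", [
    Signal("https://data.cityofnewyork.us/permits/example", 12.5),
    Signal("https://data.ny.gov/liquor-licenses/example", 5.5),
])
print(commercial.score_delta)  # each point traces back to a source_uri
```

The design choice is that explainability isn't a post-hoc report: the score is literally the aggregate of its evidence, so the click-through in the Explorer is just a walk over the signal list.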
Self-healing pipelines
Public records APIs break constantly. Schemas drift. Endpoints rate-limit without warning. NYC OpenData changes a column name. Phoenix moves to a new ArcGIS feature server. The scout agent — a 57-source autonomous collector running on Railway — is built around this reality.
Every loader carries: (1) a schema fingerprint computed at last successful run; (2) row-count tolerance bands; (3) a quarantine path for records that fail validation; (4) an LLM-mediated triage that classifies failures as transient (retry), structural (alert + auto-PR), or definitional (escalate to a human). The result is that source breakage rarely propagates to scoring — we get a Resend digest on Monday with what's drifted and a draft fix already in review.
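The fingerprint-and-triage loop can be sketched like this. The hashing recipe and thresholds are assumptions, and the rule-based triage below is a simplified stand-in for the LLM-mediated classifier:

```python
import hashlib

def schema_fingerprint(columns: dict[str, str]) -> str:
    """Order-independent hash of (column, type) pairs — one plausible
    recipe for the per-loader fingerprint; the real one may differ."""
    canon = "|".join(f"{k}:{v}" for k, v in sorted(columns.items()))
    return hashlib.sha256(canon.encode()).hexdigest()[:16]

def triage(prev_fp: str, curr_fp: str, prev_rows: int, curr_rows: int,
           tolerance: float = 0.25) -> str:
    """Simplified rule-based stand-in for the LLM triage step."""
    if curr_fp != prev_fp:
        return "structural"   # schema drifted: alert + draft auto-PR
    if abs(curr_rows - prev_rows) > tolerance * prev_rows:
        return "structural"   # row count outside the tolerance band
    return "ok"

old = schema_fingerprint({"permit_no": "text", "issued": "date"})
new = schema_fingerprint({"permit_number": "text", "issued": "date"})  # column renamed
print(triage(old, new, 10_000, 10_100))
```

The key property is that a renamed column changes the fingerprint before it ever reaches scoring, so drift is caught at ingest rather than discovered as a silent score anomaly.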
The unified intelligence timeline
The data backbone's most valuable output isn't any single score — it's axiom_events: a 517K-row unified timeline where every permit, FDA letter, vessel arrival, sanctions designation, and infrastructure grant lives in one temporally-ordered table with consistent geometry. Locus and Overwatch (our maritime product) both read from it. Codex normalizes it. Drift will run causal inference on it.
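A toy version of that table makes the shape concrete — heterogeneous event types, one temporal order, consistent geometry. The column names and values here are illustrative assumptions, not the production schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE axiom_events (
        record_id   TEXT PRIMARY KEY,
        event_type  TEXT,   -- 'permit', 'fda_letter', 'vessel_arrival', ...
        occurred_at TEXT,
        h3_index    TEXT    -- consistent geometry across all sources
    )
""")
db.executemany(
    "INSERT INTO axiom_events VALUES (?, ?, ?, ?)",
    [
        ("urn:1", "permit",         "2024-03-02", "8828308281fffff"),
        ("urn:2", "fda_letter",     "2024-01-15", "8828308281fffff"),
        ("urn:3", "vessel_arrival", "2024-02-20", "88283082e1fffff"),
    ],
)
# Locus and Overwatch both read the same temporally ordered stream.
timeline = db.execute(
    "SELECT event_type FROM axiom_events ORDER BY occurred_at"
).fetchall()
print([t[0] for t in timeline])
```

Because every product queries the same ordered stream, a new source type added for one product is immediately visible to the others — that's the leverage described below.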
If you've ever wondered what 'platform leverage' actually means: it's that adding USPTO patent grants to Locus's facility intelligence also made Overwatch's port-of-arrival enrichment more accurate. Same backbone, two products, one event stream.