Most CRE intelligence tools either resell the same proprietary feeds everyone else has, or scrape a single dataset (permits, demographics, traffic counts) and call it intelligence. Locus is built on a different premise: that hundreds of public-record streams, normalized and joined at the block level, contain a richer signal than any private dataset can match — if you can do the spatial joins, dedupe the entities, and resolve the messy strings.
This is the architecture post. If you've ever opened the Explorer and wondered what is actually under the score, this is the answer.
The data backbone
Locus ingests from 80+ public sources across federal, state, and municipal systems. The volume is meaningful, but the breadth is what matters — the goal is triangulation across independent signal sources, not depth on any single one.
Every record lands in Supabase Postgres with PostGIS for geometry and the APRS envelope from Axiom Codex (record_id URN, source_uri, schema_version, acl_tier, occurred_at). The envelope is what lets us tell — three pipelines downstream — that a permit, an FDA letter, and a Section 108 grant all reference the same address.
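As a minimal sketch, the envelope looks something like the following. The field names come from the post; the types, the URN layout, and the example values are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AprsEnvelope:
    """Sketch of the APRS envelope. Field names from the post;
    types and the URN layout are assumptions."""
    record_id: str       # URN, e.g. "urn:axiom:permit:nyc:12345" (hypothetical layout)
    source_uri: str      # pointer back to the original public record
    schema_version: str
    acl_tier: str
    occurred_at: datetime

env = AprsEnvelope(
    record_id="urn:axiom:permit:nyc:12345",
    source_uri="https://data.cityofnewyork.us/resource/example",
    schema_version="2024.1",
    acl_tier="public",
    occurred_at=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
# Downstream entity resolution keys on record_id/source_uri, so two
# pipelines can agree they are looking at the same underlying record.
```

The point of the envelope is that every record, regardless of source, carries the same identity and provenance fields, which is what makes cross-pipeline joins tractable.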
Why H3 hexagons (and not census tracts)
Tracts are political artifacts: variable in area, drawn for population counts, and frozen between decennial updates. They're terrible for spatial analytics. We score on H3 hexagons — Uber's hierarchical hex index — at resolution 8 (≈0.74 km² per cell, edge length ~461 m, a few city blocks) for primary analysis, and resolution 6 for metro summaries.
Hexagons solve three problems tracts can't: (1) uniform area, so densities are comparable across cities; (2) clean hierarchical aggregation, so a metro view is just a cellToParent() away; (3) consistent neighbor relationships, which matter when you're running density clustering.
Every cell-level table in the database carries an h3_index TEXT column computed app-side with h3-js v4. The h3 Postgres extension isn't available on Supabase, so we precompute and index instead — fast lookups, no PL/pgSQL dependency.
Permit clustering: HDBSCAN on 929K records
Building permits are the rawest signal Locus has. Every cosmetic remodel, restaurant build-out, and ground-up tower files one. The challenge is that permits as point data are noise. The intelligence is in spatial-temporal clusters: where are permits arriving in dense bursts that don't fit the city's existing pattern?
We run HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) on the permit corpus, parameterized per metro to account for very different baseline densities. HDBSCAN was chosen over k-means and DBSCAN for two reasons: it doesn't require pre-specifying cluster count, and it handles wildly variable cluster densities — which is exactly what real urban geography looks like.
| Metro | Permits processed | Clusters detected | Density vs metro mean |
|---|---|---|---|
| NYC | 187,402 | 1,243 | 4.1× |
| LA | 143,818 | 978 | 3.8× |
| Chicago | 92,640 | 712 | 3.5× |
| Houston | 81,294 | 654 | 3.2× |
| Phoenix | 68,117 | 541 | 3.6× |
| Nashville | 24,930 | 287 | 3.9× |
Causal scoring across 8 dimensions
A cluster of permits tells you something is happening. The scoring layer tells you whether what's happening matters and to whom. Every H3 cell gets evaluated across 8 dimensions, each backed by independent signal:
| Dimension | Primary signals | Update cadence |
|---|---|---|
| Demographic momentum | IRS SOI migration, ACS 5-yr, USPS NCOA | Annual + quarterly |
| Commercial activity | Building permits, BLS QCEW, liquor licenses | Weekly |
| Environmental | EPA brownfields, FEMA flood, Sentinel-2 | Monthly |
| Safety & livability | 311 complaints, FBI NIBRS, crash data | Daily–weekly |
| Mobility | LEHD origin-destination, GTFS transit | Quarterly |
| Education | GreatSchools, NCES enrollment | Annual |
| Walkability | OSM density, POI clustering, sidewalks | Monthly |
| Job market | Job postings, salary gap, BLS LAUS | Weekly |
The scores aren't a black-box average. Each dimension produces an explainable signal vector — a list of the specific records (with source_uri back to the original) that moved the needle on that cell this quarter. When the Explorer shows you that a hex's commercial score jumped 18 points, you can click through to the 47 permits and 9 liquor licenses that caused it.
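A minimal sketch of that signal-vector idea, with hypothetical class names, weights, and URIs (the actual contribution model isn't described in the post):

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    source_uri: str   # link back to the original public record
    weight: float     # contribution to the dimension score (hypothetical scale)

@dataclass
class DimensionScore:
    dimension: str
    signals: list[Signal] = field(default_factory=list)

    @property
    def score_delta(self) -> float:
        # The quarterly movement is the sum of contributing records,
        # so every point of the score is attributable to a record.
        return sum(s.weight for s in self.signals)

commercial = DimensionScore("commercial_activity", [
    Signal("https://data.cityofnewyork.us/permits/example", 12.5),
    Signal("https://data.ny.gov/liquor-licenses/example", 5.5),
])
print(commercial.score_delta)  # each point traces back to a source_uri
```

The design choice is that explainability isn't a post-hoc report: the score is literally the aggregate of its evidence, so the click-through in the Explorer is just a walk over the signal list.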
Self-healing pipelines
Public records APIs break constantly. Schemas drift. Endpoints rate-limit without warning. NYC OpenData changes a column name. Phoenix moves to a new ArcGIS feature server. The scout agent — a 57-source autonomous collector running on Railway — is built around this reality.
Every loader carries: (1) a schema fingerprint computed at last successful run; (2) row-count tolerance bands; (3) a quarantine path for records that fail validation; (4) an LLM-mediated triage that classifies failures as transient (retry), structural (alert + auto-PR), or definitional (escalate to a human). The result is that source breakage rarely propagates to scoring — we get a Resend digest on Monday with what's drifted and a draft fix already in review.
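The fingerprint-and-triage loop can be sketched like this. The hashing recipe and thresholds are assumptions, and the rule-based triage below is a simplified stand-in for the LLM-mediated classifier:

```python
import hashlib

def schema_fingerprint(columns: dict[str, str]) -> str:
    """Order-independent hash of (column, type) pairs — one plausible
    recipe for the per-loader fingerprint; the real one may differ."""
    canon = "|".join(f"{k}:{v}" for k, v in sorted(columns.items()))
    return hashlib.sha256(canon.encode()).hexdigest()[:16]

def triage(prev_fp: str, curr_fp: str, prev_rows: int, curr_rows: int,
           tolerance: float = 0.25) -> str:
    """Simplified rule-based stand-in for the LLM triage step."""
    if curr_fp != prev_fp:
        return "structural"   # schema drifted: alert + draft auto-PR
    if abs(curr_rows - prev_rows) > tolerance * prev_rows:
        return "structural"   # row count outside the tolerance band
    return "ok"

old = schema_fingerprint({"permit_no": "text", "issued": "date"})
new = schema_fingerprint({"permit_number": "text", "issued": "date"})  # column renamed
print(triage(old, new, 10_000, 10_100))
```

The key property is that a renamed column changes the fingerprint before it ever reaches scoring, so drift is caught at ingest rather than discovered as a silent score anomaly.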
The unified intelligence timeline
The data backbone's most valuable output isn't any single score — it's axiom_events: a 517K-row unified timeline where every permit, FDA letter, vessel arrival, sanctions designation, and infrastructure grant lives in one temporally-ordered table with consistent geometry. Locus and Overwatch (our maritime product) both read from it. Codex normalizes it. Drift will run causal inference on it.
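A toy version of that table makes the shape concrete — heterogeneous event types, one temporal order, consistent geometry. The column names and values here are illustrative assumptions, not the production schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE axiom_events (
        record_id   TEXT PRIMARY KEY,
        event_type  TEXT,   -- 'permit', 'fda_letter', 'vessel_arrival', ...
        occurred_at TEXT,
        h3_index    TEXT    -- consistent geometry across all sources
    )
""")
db.executemany(
    "INSERT INTO axiom_events VALUES (?, ?, ?, ?)",
    [
        ("urn:1", "permit",         "2024-03-02", "8828308281fffff"),
        ("urn:2", "fda_letter",     "2024-01-15", "8828308281fffff"),
        ("urn:3", "vessel_arrival", "2024-02-20", "88283082e1fffff"),
    ],
)
# Locus and Overwatch both read the same temporally ordered stream.
timeline = db.execute(
    "SELECT event_type FROM axiom_events ORDER BY occurred_at"
).fetchall()
print([t[0] for t in timeline])
```

Because every product queries the same ordered stream, a new source type added for one product is immediately visible to the others — that's the leverage described below.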
If you've ever wondered what 'platform leverage' actually means: it's that adding USPTO patent grants to Locus's facility intelligence also made Overwatch's port-of-arrival enrichment more accurate. Same backbone, two products, one event stream.