Manifest
Every record accounted for on the way into one identity layer.
A PostgreSQL-staged pipeline that ingests Airtable client records, deduplicates them across four signal paths, and resolves them into a single master-contact identity layer.
- language
- Python 3.11, PostgreSQL (JSONB, generated columns, views)
- pipeline
- JSONB ingest, 10 typed SQL views, linking job
- dedup
- 4 signal paths (primary email, billing email, proposal ID, external IDs)
- validated
- 792 records, 27 fields field-by-field
- repo
- Private (shares a repo with Throughline)
The problem
A PR company kept its operational CRM in two Airtable tables full of cross-system IDs, disconnected from its analytical database, with inconsistent emails and duplicate records across tables. Manifest unified them into one contact identity layer with verified cross-links.
How it is built
-
Staged JSONB ingest with typed views
Raw Airtable records land as JSONB, then ten SQL views extract and type the fields server-side. Ingest stays schema-agnostic while downstream consumers work with typed columns. A validation run confirmed all 27 documented field extractions were correct against the Airtable schema.
-
Four-signal deduplication into a master contact
A candidate view unions four independent match paths, so a record links if any one resolves: normalized primary email, billing email, PandaDoc proposal ID, or external system IDs. Matches set role flags (billing contact, contract signer) with priority scores, and unmatched rows surface as candidates instead of being dropped.
-
Dry-run before it writes
The linking job has a --dry-run mode that reports every candidate and proposed mutation without touching master_contacts, so an operator can see exactly what a new Airtable sync would change before committing it.
By the numbers
Where this honestly stands
Shipped: the Airtable staging and identity-linking path. A HubSpot destination was scoped but never built, so this is framed as CRM identity resolution, not a HubSpot migration.
Want the parts that are not in a public repo? I will walk you through the architecture and the decisions on a call.