td-e15271

Phase 1: Build aggregation-map report (read-only, no writes)

in_progress task P2 Parent: td-35a62e
Created Apr 16, 2026 7:49 PM Updated Apr 16, 2026 9:45 PM
Description
Produce an offline report that clusters existing MKB profiles and Intercom contacts into proposed merchants. No writes to profiles yet; this phase only produces the mapping table for review.
Steps:
1. canonicalize_store_url(raw) helper: strip scheme, www, trailing slash, and path. Return the bare host, or '' on failure.
2. Walk merchants/profiles/*.json. For each profile, pull the Intercom contact's custom_attributes.store_url (via the existing intercom_metadata.external_id lookup) and canonicalize it.
3. Group profiles by canonical store_url. Output a CSV/JSON: {store_url_host: [profile_ids, primary_email, org_name, count_interactions]}.
4. Flag consumer-domain emails (gmail/yahoo/etc.); these stay unaggregated.
5. Flag RC/Anafore/test-account contacts; these get excluded entirely.
Output: aggregation-proposal.jsonl in merchants/ for human review before Phase 2 writes anything.
Estimated: 1-2 hrs.
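A minimal sketch of steps 1, 3, and 4, for discussion only. The record shape (profile_id, store_url, primary_email keys), the CONSUMER_DOMAINS set, and the function names other than canonicalize_store_url are assumptions made for illustration. The real profile schema and the full consumer-domain list are not specified in this ticket.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed starter list; the ticket only says "gmail/yahoo/etc."
CONSUMER_DOMAINS = {"gmail.com", "yahoo.com"}


def canonicalize_store_url(raw):
    """Strip scheme, www, path, and trailing slash; return bare host or ''."""
    if not isinstance(raw, str):
        return ""
    text = raw.strip()
    if not text:
        return ""
    # urlsplit only populates the host when a scheme or leading '//' is present.
    if "://" not in text:
        text = "//" + text.lstrip("/")
    try:
        host = urlsplit(text).hostname or ""
    except ValueError:
        return ""
    host = host.lower().rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    # Treat anything without a dot, or containing whitespace, as a failure.
    if "." not in host or any(ch.isspace() for ch in host):
        return ""
    return host


def is_consumer_email(email):
    """True when the email's domain is in the assumed consumer-domain list."""
    if not isinstance(email, str) or "@" not in email:
        return False
    domain = email.rpartition("@")[2].strip().lower()
    return domain in CONSUMER_DOMAINS


def group_by_store_host(records):
    """Cluster hypothetical profile records by canonical store host.

    Returns (clusters, unresolved): clusters maps host -> list of profile ids,
    unresolved lists the profile ids with no usable store_url.
    """
    clusters = defaultdict(list)
    unresolved = []
    for rec in records:
        host = canonicalize_store_url(rec.get("store_url"))
        if host:
            clusters[host].append(rec["profile_id"])
        else:
            unresolved.append(rec["profile_id"])
    return dict(clusters), unresolved
```

For example, canonicalize_store_url("https://www.Example.com/shop/") and canonicalize_store_url("example.com") both yield "example.com", so the two profiles would land in the same proposed cluster.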
Handoff
Done
Session Log (3 entries)
Apr 16, 9:20 PM
lg-e0b9505f ses_aa8a2c
progress
Started work
Apr 16, 9:20 PM
lg-4e41b9f1 ses_aa8a2c
progress
Starting Phase 1: aggregation-map report (read-only). Builds merchants/aggregation-proposal.jsonl clustering existing profiles by canonical store_url. Output goes to human review before Phase 2 writes anything.
Apr 16, 9:44 PM
lg-07f7347a ses_aa8a2c
progress
Codex review P1+P2+P3 all addressed:
(P1) proposal gitignored; rebuild from cache locally.
(P2a) gaps logic now checks Redshift.url usability, not just presence.
(P2b) confidence downgrade fixed (multi-source-disagreement name).
(P3) malformed profiles skipped with a warning, not an abort.
Final output: 2099 clusters, 6 merge candidates, 295 off-domain-founder-inferred singletons, 46 multi-source-disagreement cases, 359 unresolved (246 no-contact + 113 no-store-signal), 2 RC/test excluded.
Summary artifact at projects/mkb-scope-expansion/2026-04-16-phase1-summary.md.
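The P3 behavior (skip a malformed profile with a warning instead of aborting the run) could look roughly like the sketch below. The function name load_profiles, the logger name, and the rule that a profile must be a JSON object are assumptions for illustration, not the code that was actually committed.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("aggregation_report")  # hypothetical logger name


def load_profiles(profile_dir):
    """Read every *.json profile in profile_dir, skipping unreadable ones.

    Returns (profiles, skipped): parsed profile dicts and the file names
    that were skipped with a warning.
    """
    profiles = []
    skipped = []
    for path in sorted(Path(profile_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
            log.warning("skipping malformed profile %s: %s", path.name, exc)
            skipped.append(path.name)
            continue
        if not isinstance(data, dict):
            log.warning("skipping profile %s: expected a JSON object", path.name)
            skipped.append(path.name)
            continue
        profiles.append(data)
    return profiles, skipped
```

Returning the skipped file names alongside the parsed profiles lets the report count them explicitly rather than dropping them silently.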
Git State
Started d4b227a (master) Current b7b8f8e (master)
Sessions Involved