progress
WS6 + WS7 drafted. 7 files written to projects/onboarding/eval-scenarios/:
- 2026-04-17-ws6-audit-and-sweep.md (audit memo: 01/08/14/15 REWRITE, 22 KEEP + ambient fix; 16-file sweep with prescribed replacement)
- 2026-04-17-28 through -32 (NS-1 through NS-4 + regression control; scenario IDs 28-32)
- 2026-04-17-ws7-judge-rubric.md (8-label LLM-judge rubric with operational definitions, PASS/FAIL examples, edge cases, aggregate scoring, calibration tests)
Key decisions logged:
- Scenario 15 flagged REWRITE-heavy: the unsolicited conservative alternative after merchant accept is the exact gatekeeper behavior the fix is removing; the Why-It-Works narrative needs a full rewrite, not just a pass-criteria tweak.
- Scenario 22 KEEP — merchant raised concern first, agent's response is mechanism-grounded (closed channel, conversion-trigger), satisfies new platform_fraud_flagged_only_if_present label.
- Ambient sweep: 16/27 scenarios carry 'margins and buying behavior' opening; prescribed replacement 'goals and your customers'.
- Regression control picks math-contradicts-goal (not platform-infeasible) since it's the cleanest test of WS1 allowable-counter case 3.
Not executed in this run:
- Ambient sweep patch (mechanical replacement across 16 files)
- Per-scenario rewrites for 01/08/14/15
- Baseline pass-rate capture against current agent
These are gated on Zach confirming the scenario decisions before any existing scenario is edited; the scenario 15 REWRITE-heavy call in particular could be contested.
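For when the ambient sweep is unblocked, a minimal dry-run sketch of the mechanical replacement. Assumptions, not confirmed by the audit memo: the scenario files are .md files sitting directly in the eval-scenarios directory, and the opening phrase appears verbatim (same case, single line) in each affected file. The function names and the apply flag are hypothetical.

```python
from pathlib import Path

# Phrases taken from the audit memo; verbatim matching is an assumption.
OLD_PHRASE = "margins and buying behavior"
NEW_PHRASE = "goals and your customers"


def replace_phrase(text: str, old: str = OLD_PHRASE, new: str = NEW_PHRASE) -> tuple[str, int]:
    """Return the rewritten text and the number of occurrences replaced."""
    count = text.count(old)
    return text.replace(old, new), count


def sweep(directory: Path, apply: bool = False) -> dict[str, int]:
    """Report every .md file containing the old phrase; rewrite only if apply=True."""
    hits: dict[str, int] = {}
    for path in sorted(directory.glob("*.md")):
        original = path.read_text(encoding="utf-8")
        updated, count = replace_phrase(original)
        if count == 0:
            continue  # file does not carry the ambient opening
        hits[path.name] = count
        if apply:
            path.write_text(updated, encoding="utf-8")
    return hits
```

Calling sweep(Path("projects/onboarding/eval-scenarios")) with the default apply=False only lists matches. If the count of matching files differs from the 16 in the audit, the phrase likely varies in wording somewhere and those files need a manual look before anything is written.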