Scope discussion

Three asks on the table: Data Retention, Edits, Downloadables. Below — what each one means, current state, options on the menu, and what I'd recommend as P0 / P1 / P2 today. Push back wherever you disagree.

For: Landis Current
From: Sid Dani
Meeting: Wed 2026-05-20 · 10:15 PT · 45 min
Companion doc: 01-operating-model.html

Read me first. Each ask below has a "My take" callout with a tentative priority. These are anchors for the conversation, not commitments — the goal of the meeting is for you to walk out with a P0/P1/P2 ranking that matches your roadmap as the dashboard's owner. The scoring table at the bottom is meant to be filled in live.

Ask 1

Data Retention

How long do we keep raw + processed campaign data, and where? Mostly already in place; small policy decisions remain.

Ask 2

Edits

Curation control — the ability to override what shows up in the dashboard. Largest architectural impact. Needs definition first.

Ask 3

Downloadables

Export views to Excel / CSV / PDF so MSCI can pull data into their own tools. Moderate work, well-scoped, achievable.

Ask 1

Data Retention

My take: P2 · already mostly solved

What we keep, where we keep it, and for how long. The good news: most of this is already wired and the policies are sensible. Decisions remain on (a) raw archive duration and (b) whether to set a BigQuery expiry policy.

Current state

Layer	Retention today	Why
Raw API JSON (GCS)	30 days, auto-deleted	Audit trail for debugging recent runs; quickly outgrows usefulness
`fact_campaigns_raw` (BQ)	Indefinite, day-partitioned	Lossless record of everything we pulled, ever
`fact_campaigns` (BQ)	Indefinite, day-partitioned	Time-series of clean published campaign metrics
`audit_log` (BQ)	Indefinite	One row per run forever — small, cheap, important for forensics
HTML dashboard	Last ~10 deploys	Cloudflare default; rollback available

Options on the menu

Option	What changes	Effort	Cost impact
1a. Keep as-is	Nothing. Current policy is fine for now.	0	baseline
1b. Extend GCS to 90d	3× more raw JSON kept; lets you re-investigate older anomalies without re-running pipelines.	5 min config	~$0.50/mo extra at current volume
1c. Add BQ partition expiry (e.g., 2 years)	Old daily partitions auto-expire; storage and unfiltered-scan costs shrink, but historical comparisons beyond the window become impossible without a backup.	5 min config	slightly lower BQ storage cost
1d. Snapshot BQ to GCS monthly for cold storage	Cheap long-term backup; recoverable but not queryable directly.	30 min one-off	~$0.10/mo per snapshot

My take. Defaults are sensible. The only question worth deciding today is whether compliance, legal, or contracts dictate a specific retention floor or ceiling. If MSCI's data contract with measurement clients requires deleting campaign data after N months, we need 1c sized to that. Otherwise, leave it.

Recommendation: P2 for the meeting. Confirm there's no compliance constraint we're missing, then revisit in Q3.

Architectural implication: Essentially none — these are configuration changes, not code changes. The pipeline keeps running the same way; only the storage policies shift.
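
To make options 1b and 1c concrete, here is a minimal sketch of what the two settings could look like. The JSON follows the GCS lifecycle configuration format and the DDL uses BigQuery's `partition_expiration_days` table option. The helper names and the table path `my_project.my_dataset.fact_campaigns` are placeholders, not our real identifiers.

```python
import json


def gcs_lifecycle_config(max_age_days: int) -> str:
    """Build a GCS lifecycle config that deletes objects older than max_age_days."""
    config = {
        "rule": [
            {"action": {"type": "Delete"}, "condition": {"age": max_age_days}}
        ]
    }
    return json.dumps(config, indent=2)


def bq_partition_expiry_ddl(table: str, years: int) -> str:
    """Build the DDL that sets partition expiry on a day-partitioned table."""
    days = years * 365
    return (
        f"ALTER TABLE `{table}` "
        f"SET OPTIONS (partition_expiration_days = {days})"
    )


# Option 1b: keep raw JSON for 90 days instead of 30.
lifecycle_90d = gcs_lifecycle_config(90)

# Option 1c: expire partitions after 2 years (table path is a placeholder).
expiry_ddl = bq_partition_expiry_ddl("my_project.my_dataset.fact_campaigns", 2)
```

Either output would be applied once through the usual bucket and table admin tooling; nothing in the pipeline code needs to change.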

Open questions for Landis

Are there compliance / legal / MSCI-contract reasons we need a specific retention policy?
Has anyone ever asked to query data older than 6 months? If not, partition expiry is safe.
Do you want the v1 March snapshot (still serving at trf-benchmark.pages.dev) preserved permanently as a baseline?

Ask 2

Edits

My take: needs definition first

"Edits to the data and how it is curated." This is the biggest architectural ask of the three — it changes the dashboard from a one-way view of upstream data into a system of record with mutation rights. Before scoping, we need to land what "edit" actually means.

Three things "edit" could mean

Flavor	What you'd be able to do	Blast radius
2a. Annotations	Add notes / tags / comments to campaigns — overlay UI. The metrics themselves stay read-only. ("This campaign had a measurement anomaly, see notes.")	Small — new table, joined at view time, no upstream contact
2b. Industry / taxonomy overrides	Manually classify campaigns that fall to "Uncategorized" (currently 88% of the universe). Override the v1-BQ industry join.	Medium — new override table, modify `normalize.py` to consult it
2c. Data corrections	Override actual metric values (reach, frequency, on-target). Change which campaigns are included or excluded from the published set.	Large — requires audit trail, "who edited what when why" system, breaks data lineage from msci-mcp
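
For flavor 2a in the table above, "joined at view time" could look roughly like the sketch below: annotations live in their own table and are attached to campaign rows only when a view is rendered, so the metrics are never mutated. The field names (`campaign_id`, `reach`, `note`) and the function name are hypothetical.

```python
from collections import defaultdict


def attach_annotations(campaigns, annotations):
    """Return view rows: each campaign plus its notes; source rows stay untouched."""
    notes_by_campaign = defaultdict(list)
    for ann in annotations:
        notes_by_campaign[ann["campaign_id"]].append(ann["note"])
    return [
        {**c, "notes": notes_by_campaign.get(c["campaign_id"], [])}
        for c in campaigns
    ]


campaigns = [
    {"campaign_id": "c1", "reach": 1200},
    {"campaign_id": "c2", "reach": 800},
]
annotations = [
    {"campaign_id": "c1", "note": "Measurement anomaly, see notes."},
]
view_rows = attach_annotations(campaigns, annotations)
```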

My take. What you described — "edits to the data and how it is curated" — sounds most like 2b + 2c combined. That's the largest scope. Before committing to it, two things are worth getting on the table:

(1) If we go to 2c, the dashboard stops being a transparent view of msci-mcp data and becomes its own source of truth — MSCI consumers need to know that. We'd want an "edited" badge on changed values + a side-by-side "API value vs override value" view, otherwise the same chart on two screens gives different answers.

(2) A lot of what people ask "edits" for can actually be solved by fixing the upstream classification (2b) — and that's a much smaller, safer change. If 80% of the "edit" need is "uncategorized campaigns need to be categorized," let's do 2b and see if that absorbs the demand before going to 2c.
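
As a rough illustration of point (2), a 2b override could be a lookup that normalization consults before falling back to the upstream industry join. This is a sketch under assumptions, not the current `normalize.py`: the function names, field names, and sample data are all hypothetical.

```python
UNCATEGORIZED = "Uncategorized"


def resolve_industry(campaign, overrides):
    """Pick the industry for a campaign: manual override first, upstream join second."""
    override = overrides.get(campaign["campaign_id"])
    if override is not None:
        return override, "override"
    upstream = campaign.get("industry") or UNCATEGORIZED
    return upstream, "upstream"


def uncategorized_share(campaigns, overrides):
    """Fraction of campaigns still uncategorized after overrides are applied."""
    if not campaigns:
        return 0.0
    remaining = sum(
        1 for c in campaigns if resolve_industry(c, overrides)[0] == UNCATEGORIZED
    )
    return remaining / len(campaigns)


campaigns = [
    {"campaign_id": "c1", "industry": None},
    {"campaign_id": "c2", "industry": "Automotive"},
    {"campaign_id": "c3", "industry": None},
    {"campaign_id": "c4", "industry": None},
]
overrides = {"c1": "Retail", "c3": "Finance"}

share_before = uncategorized_share(campaigns, {})
share_after = uncategorized_share(campaigns, overrides)
```

Tracking the uncategorized share before and after overrides would give us a simple measure of whether 2b is absorbing the "edit" demand.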

Recommendation: disambiguate live. Then likely P1 for 2b (industry overrides) and P2 with design first for 2c (data corrections — let's not just build it, let's design it).

Architectural implication: 2a is additive (~1 week). 2b is moderate (~2 weeks; touches normalize.py + needs a small admin UI). 2c is its own project (~1–2 months; needs auth, audit log, conflict resolution, MSCI comms). Whichever flavor we pick should be on a separate roadmap, not bundled into the v2 ship.
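
To show why 2c is heavier, here is a minimal sketch of the record it would need: every override carries who, when, and why, and the API value is kept alongside it so an "edited" badge and a side-by-side view are possible. The class, field names, and sample values are hypothetical, and auth and conflict resolution are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class MetricOverride:
    campaign_id: str
    metric: str
    value: float
    edited_by: str
    reason: str
    edited_at: datetime


def displayed_metric(campaign_id, metric, api_value, overrides):
    """Return the value to show plus what a badge and side-by-side view need."""
    matching = [
        o for o in overrides
        if o.campaign_id == campaign_id and o.metric == metric
    ]
    # The most recent edit wins; the API value is never discarded.
    latest: Optional[MetricOverride] = max(
        matching, key=lambda o: o.edited_at, default=None
    )
    return {
        "api_value": api_value,
        "override_value": latest.value if latest else None,
        "shown_value": latest.value if latest else api_value,
        "edited": latest is not None,
    }


edit = MetricOverride(
    campaign_id="c1",
    metric="reach",
    value=1500.0,
    edited_by="analyst@example.com",
    reason="Vendor restated reach",
    edited_at=datetime(2026, 5, 1, tzinfo=timezone.utc),
)
edited_view = displayed_metric("c1", "reach", 1200.0, [edit])
untouched_view = displayed_metric("c2", "reach", 800.0, [edit])
```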

Open questions for Landis

Walk me through a specific case where you wanted to edit something — what would you have changed, and why?
Who else on the MSCI team needs edit rights? Just you? Multiple analysts? Or is Engineering read-only, with you as approver?
If a campaign's reach number is edited, should the API value remain queryable (audit) or be replaced entirely (cleaner)?
Would you trade some 2c scope (full mutation) for faster 2b (taxonomy fixes) shipping in 3 weeks?

Ask 3

Downloadables

My take: P1 · ship soon

Let MSCI consumers pull dashboard data into their own tools — Excel, Tableau, internal reports. The data exists; we just don't surface a download path today. Moderate work, well-scoped, no architectural risk.

Options on the menu

Option	What it gives MSCI	Effort	Format
3a. CSV download button per view	"Download the current filter set as CSV" — same data as on screen, in their spreadsheet	~2-3 days	CSV
3b. Excel export with formatting	Multi-sheet workbook matching v1 xlsx shape (Master + per-window + per-industry tabs) — what they're used to	~1 week	XLSX
3c. Pre-built PDF report (monthly)	Auto-generated executive summary PDF emailed monthly with hero charts + commentary	~2 weeks	PDF
3d. Direct BQ access for power users	MSCI analysts query `fact_campaigns` directly with their own tools (Looker, Looker Studio (formerly Data Studio), Sheets connector)	~1 day (IAM only)	BQ-native

My take. The honest framing: MSCI lived in the v1 xlsx world for months — they will instinctively want 3b. But 3a + 3d together is faster to ship, more flexible for power users, and doesn't lock us into matching the xlsx shape forever.

Recommendation: P1 = 3a + 3d (CSV button + BQ access for analysts). P2 = 3b (Excel export later, only if 3a doesn't satisfy demand). P3 / strike = 3c (PDF report — overkill until someone specifically asks).

Architectural implication: Zero risk. The pipeline doesn't change. CSV download is a small frontend addition; BQ access is an IAM policy. Both can ship without touching the daily ingestion job.
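
The core of 3a is small. The sketch below serializes whatever rows the current filter set produces into CSV text with a fixed column order. It is written in Python for illustration only; the real button would likely live in the frontend, and the column and field names are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, columns):
    """Serialize the rows currently on screen to CSV text with a fixed column order."""
    buffer = io.StringIO()
    # extrasaction="ignore" drops fields that are not part of the exported view.
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


all_rows = [
    {"campaign_id": "c1", "industry": "Retail", "reach": 1200, "internal_flag": True},
    {"campaign_id": "c2", "industry": "Automotive", "reach": 800, "internal_flag": False},
]
# Apply the same filter the user has on screen, then export only visible columns.
filtered = [r for r in all_rows if r["industry"] == "Automotive"]
csv_text = rows_to_csv(filtered, ["campaign_id", "industry", "reach"])
```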

Open questions for Landis

Who specifically asked for downloads? Single-analyst use case, or whole-team workflow?
Do they use Sheets / Looker / Tableau today? That answers 3d feasibility.
For the v1 xlsx users: do they edit it after download, or just consume? (If edit → 3b is necessary; if consume → 3a is enough.)
Should downloads be gated (audit log who downloaded what when) or fully open within @samba.com?

Live scoring — fill this in together

Walk through each ask, decide priority, capture owner + rough timing. Anything that doesn't get a P0/P1 commit defaults to P2.

Ask	Priority	Owner	Target window
1. Data Retention (pick 1a/1b/1c/1d or a combo)	P0 / P1 / P2 / TBD		
2a. Edits: annotations	P0 / P1 / P2 / TBD		
2b. Edits: industry overrides	P0 / P1 / P2 / TBD		
2c. Edits: data corrections	P0 / P1 / P2 / TBD		
3a. Downloads: CSV per view	P0 / P1 / P2 / TBD		
3b. Downloads: Excel multi-sheet	P0 / P1 / P2 / TBD		
3c. Downloads: monthly PDF report	P0 / P1 / P2 / TBD		
3d. Downloads: direct BQ access	P0 / P1 / P2 / TBD		

Before we close. The v2 pipeline itself has an open architectural issue: the first full-universe production run hit a memory ceiling and was OOM-killed at ~3,800 of 6,874 campaigns. Cloud Scheduler is paused. Three recovery paths are on the table (bump memory, limit scope, or refactor to stream). This is independent of the three asks above, but it's worth knowing: the dashboard you're inheriting has a known fix in flight and is not yet a polished, finished product. The fix is straightforward and needs no new architecture.
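
For reference, the "refactor to stream" path could look roughly like the sketch below: fetch and write one batch at a time so memory is bounded by the batch size rather than by the full universe. This assumes the OOM comes from accumulating all campaigns in memory, which has not been confirmed here; `fetch`, `write`, and the batch size of 500 are hypothetical stand-ins for the real pipeline steps.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def run_streaming(campaign_ids, fetch, write, batch_size=500):
    """Fetch and write one batch at a time so memory is bounded by batch_size."""
    written = 0
    for batch in batched(campaign_ids, batch_size):
        rows = [fetch(cid) for cid in batch]
        write(rows)  # flush this batch before fetching the next one
        written += len(rows)
    return written


# Stand-ins for the real API call and the real BigQuery write.
batch_sizes = []
total = run_streaming(
    range(6874),
    fetch=lambda cid: {"campaign_id": cid},
    write=lambda rows: batch_sizes.append(len(rows)),
)
```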

Companion document with full operating model: 01-operating-model.html.