Cadence — Phase 2 pre-registration

Pre-Registration — Phase 2 (Brand/Category) + Nowcast · France

Locked 13 June 2026, before any outcome data is inspected. Frozen 13 June 2026 (step 4) — outcome panel data-ready on the primary (Wikipedia) series; no model has run; no further edits. Supersedes the brand-backtest pre-registration of 28 April 2026 (folds it in, adds the nowcast test, and locks the expanded brand universe).

0. Why this exists, and the discipline

Phase 1 (April 2026) tested whether music forecasts French consumer confidence at 30–90 day lags. It failed its pre-registered criterion — 0/32 features survived Bonferroni; adding music degraded the CCI forecast — and that claim was withdrawn. This document does not reopen that claim. It tests two new and distinct questions, both of which Phase 1 left open:

Test A — Nowcast. Can the behavioural read estimate the current month's confidence before Eurostat publishes it — i.e. lead the official print, not forecast the future?
Test B — Brand/Category. Does the national mood signal predict category- and brand-level demand proxies (weekly search), where the theory says less macro absorption should let signal survive?

The rules that make the result trustworthy, win or lose:

Locked before inspection. Brand universe, outcomes, features, lags, baselines, and pass criteria are fixed in this document. No post-hoc changes except logged amendments made before execution.
Germany and Japan are held out. No brand-outcome data is collected for them until the France result is committed. Generalisation is a separate, later test.
Published either way. Pass or fail, the result and this pre-registration ship together. The failure of Phase 1 is the reason anyone should believe a Phase 2 pass — do not break that.
Walk-forward, out-of-sample, multiple-comparison-corrected. A curve-fit "win" is worse than a loss; it would torch the receipts wedge that is the whole brand.

1. The hard constraint that shapes both tests

Audio features are frozen at November 2024. Spotify deprecated the endpoint; the licensed corpus has chart positions through ~Jan 2026 but the validated mood features (tempo, mode, valence, …) only exist through Nov 2024. Therefore:

Validation runs on the 2021-01 → 2024-11 audio-feature window (≈ 47 months / ≈ 200 weeks). This is what we can honestly test today.
Live deployment of either product is blocked on the in-house extractor (Essentia/Musicnn). A live nowcast in June 2026 needs fresh mood features that don't exist until that build ships. So the sequence is: validate now on history → if it passes, build Essentia → deploy. The pre-registration and the extractor build are complementary, not competing.

This constraint is stated up front so no result is quietly read as more current than it is.

2. Data to collect (the build)

2.1 Brand universe — LOCKED (6 categories × 6 brands = 36)

Frozen now, before any outcome relationship is examined. Chosen for category coverage and baseline prominence, not for any 2021–2024 relationship to mood.

Category	Brands
Grocery / FMCG	Leclerc · Carrefour · Lidl · Intermarché · Auchan · Monoprix
Luxury	Louis Vuitton · Hermès · Chanel · Dior · Cartier · Gucci
Automotive	Renault · Peugeot · Dacia · Citroën · Tesla · BMW
Finance / banking	BNP Paribas · Crédit Agricole · Société Générale · Boursorama · Revolut · Qonto
Leisure / travel	SNCF · Airbnb · Booking · Center Parcs · Air France · Club Med
Tech / lifestyle	Apple · Samsung · Nike · Adidas · Decathlon · Sony

The 5 existing brands (Carrefour, Hermès, Renault, Boursorama, Club Med) are already collected 2019–2025; the remaining 31 require collection for 2021-01 → 2025-03 (the audio-feature window plus an 8-week lag tail).

2.2 Outcome series to collect, per brand (re-tiered — see A1.6)

Primary: Wikipedia weekly pageviews (fr.wikipedia, canonical-title-resolved). Chosen as primary because it is the series we can reliably and reproducibly collect on demand — a requirement for a "Measured" claim in a reproducibility-wedged product. It measures curiosity/attention, not purchase; the claim it supports is therefore "mood predicts brand attention," read as deltas (levels are structural — brand size/newsworthiness).
Corroborating, collect-when-possible: Google Trends weekly index. Closer to demand/search-intent, but throttle-fragile (the datacenter IP is 429-blocked under sustained load; collection requires a residential IP / production runner and still isn't guaranteed). Treated as a corroborating layer, never a load-bearing input to a Measured claim — same tier as social.
Robustness: GDELT weekly brand-tone.

Why Wikipedia and not a social feed. The reproducible layer of this stack is the attention layer, by nature not by luck. The closer-to-demand social feeds are each unusable as a primary: X is paywalled; TikTok Creative Center is a curated highlight, not a representative national sample; Reddit is discourse-texture, not sentiment. Wikipedia is the one demand-adjacent series we can collect cleanly and repeatedly — so it is primary, and the program's honest ceiling is attention-foresight until a reliable demand proxy exists.

2.3 Google Trends batching — the methodological requirement

Trends normalises within a request of ≤5 terms, so 36 brands are not directly comparable. Pre-registered design: each request contains ≤4 brand terms plus one fixed high-volume anchor term (locked: "météo" — stable, seasonal-only, category-neutral, high volume). All batches are rescaled to the anchor so brand series are comparable across batches. The anchor is fixed before collection and never changed.
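
The anchor rescaling described above can be sketched as follows. This is a minimal illustration, not the collection pipeline. The function name `rescale_batch`, the toy numbers, and the choice to divide each series by the anchor's mean level within its batch are all assumptions; the pre-registration fixes only the anchor term and the batch shape (≤4 brands plus the anchor).

```python
from statistics import mean

ANCHOR = "météo"  # locked anchor term from the pre-registration


def rescale_batch(batch, anchor=ANCHOR):
    """Express each brand series in units of the anchor's mean level in this batch.

    Assumption: the scaling constant is the anchor's mean over the window.
    The pre-registration does not fix this detail.
    """
    anchor_level = mean(batch[anchor])
    if anchor_level <= 0:
        raise ValueError("anchor series is all zero; batch cannot be rescaled")
    return {
        term: [value / anchor_level for value in series]
        for term, series in batch.items()
        if term != anchor
    }


# Hypothetical Trends output: each request is normalised to its own maximum,
# so raw values are not comparable across batches.
batch_1 = {"météo": [50, 50, 50, 50], "Carrefour": [100, 80, 90, 70]}
batch_2 = {"météo": [100, 100, 100, 100], "Hermès": [20, 25, 30, 25]}

# After rescaling, both brands are measured against the same anchor.
panel = {**rescale_batch(batch_1), **rescale_batch(batch_2)}
```

In this toy case Carrefour's raw index looks five times Hermès's, but because the anchor sits at half the level in the first batch, the rescaled gap is ten times. That is the cross-batch distortion the anchor is there to correct.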

2.4 Predictors (already held, via Athena)

National monthly and weekly position-weighted aggregates of the validated mood features, computed on the 2021–2024 audio-feature window: tempo, mode_major, valence, energy, danceability, acousticness, local_share, catalogue_age — 8 features, locked.
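
A position-weighted aggregate of one feature for one chart week can be sketched as below. The weighting scheme is an assumption: this document does not state how chart position maps to weight, so the sketch uses 1/position purely for illustration, and `position_weighted_mean` is a hypothetical name.

```python
def position_weighted_mean(chart):
    """Weighted mean of a feature over one chart week.

    chart: list of (position, feature_value) pairs, position 1 = top of chart.
    Assumption: weight = 1 / position (not specified in the pre-registration).
    """
    weights = [1.0 / position for position, _ in chart]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, chart))
    return weighted_sum / sum(weights)


# Hypothetical valence values for three charting tracks.
week = [(1, 0.80), (2, 0.50), (4, 0.20)]
national_valence = position_weighted_mean(week)
```

Whatever scheme is used, the aggregate stays within the range of the track-level values and leans toward the top of the chart.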

2.5 Controls / baselines (already held)

Eurostat CCI + savings-intent (monthly, interpolated to weekly for Test B), CAC 40 close, EUR/USD (weekly). Open-Meteo Paris weather as a confound control, regressed out, never reported.
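
The monthly-to-weekly interpolation of the CCI controls could look like the sketch below, assuming plain linear interpolation between monthly observations. Two details are assumptions: the date each monthly value is stamped at (mid-month here) and the interpolation method itself. Note also that interpolating toward the next month's value uses information from after the week in question, so a walk-forward run may prefer carrying the last published value forward; which applies is not stated here.

```python
from datetime import date, timedelta


def interpolate_to_weekly(monthly, week_dates):
    """Linearly interpolate (date, value) monthly points onto weekly dates.

    Weeks outside the observed range get None rather than an extrapolation.
    Assumes the monthly dates are distinct.
    """
    points = sorted(monthly)
    weekly = []
    for week in week_dates:
        value = None
        for (d0, v0), (d1, v1) in zip(points, points[1:]):
            if d0 <= week <= d1:
                share = (week - d0).days / (d1 - d0).days
                value = v0 + share * (v1 - v0)
                break
        weekly.append(value)
    return weekly


# Hypothetical CCI values, stamped mid-month (an assumption).
monthly_cci = [(date(2021, 1, 15), -10.0), (date(2021, 2, 15), -13.1)]
weeks = [date(2021, 1, 8) + timedelta(weeks=k) for k in range(5)]
weekly_cci = interpolate_to_weekly(monthly_cci, weeks)
```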

3. TEST A — The Nowcast

Question. By mid-month, using behavioural data available to date, can we estimate the current month's French confidence index more accurately than the naïve "it'll be the same as last month" — i.e. beat the official print to the punch?

Element	Locked specification
Outcomes	Primary: `EUROSTAT_CCI_SAVINGS_INTENT` (the sub-index that validated in Phase 1). Secondary: `EUROSTAT_CCI` headline.
Reference timing	Eurostat consumer-confidence flash for month M releases ~day 20–23 of M. Nowcast date = day 15 of M, using behavioural data through day 14. We estimate CCI(M) ≈ 5–8 days before the flash.
Features	Month-to-date aggregates of the 8 mood features, plus daily behavioural signals available by day 14 (Wikipedia category views, GDELT national tone, Trends category aggregates), month-to-date averaged.
Model	Primary: ElasticNet (handles the mood-feature collinearity). Secondary: Random Forest. Expanding-window walk-forward — train on months ≤ M−1, estimate M.
Baselines (revised — the bar that matters)	Primary baseline = AR(p) on CCI itself + all contemporaneously-available public predictors that arrive before the Eurostat print: the INSEE monthly business-climate / consumer survey, CAC 40, EUR/USD, retail. Music must beat the cheap public stuff a buyer could already use — not a coin flip. Random walk and AR(1) are retained only as secondary sanity floors, not the pass bar.
Window & n & power (stated)	Train from 2021-01; walk-forward test the final 18 months of the audio-feature window (2023-06 → 2024-11) → n ≈ 18 out-of-sample nowcasts. This is underpowered and we say so explicitly: at n≈18 a Diebold-Mariano test can only detect a large, consistent forecast-error edge (roughly a standardised loss-differential of ~0.6–0.7, i.e. a substantial and stable RMSE gap); it cannot confirm a subtle one. Pre-committed interpretation: a non-significant result is read as "could not confirm a signal at this n" — NOT "music carries no sentiment information." Only a significant DM result is reported as a pass; a null is reported as inconclusive-pending-more-data, never as a disproof.
PASS criterion (revised)	Both required: (1) out-of-sample RMSE on the primary outcome lower than the primary (AR + public-predictor) baseline, and (2) a Diebold-Mariano test on the forecast-error differential vs that baseline significant at α = 0.05 (n stated above). Point-estimate RMSE reduction is reported but does not pass on its own. Secondary: directional accuracy, reported, not gating.
Generalisation bar (locked now)	Each market is tested separately at the identical bar (RMSE-beat + DM-significant). A "pass" requires France AND ≥1 held-out market (Germany or Japan) each clearing independently. A directionally-positive but non-significant held-out result is not a pass. A pooled three-market model (which would raise n and power) is a secondary, reported analysis only — it does not substitute for the per-market bar, because pooling can let one strong market carry two weak ones.
If it passes	Not primarily a SKU — a one-week lead on the French CCI print is not, by itself, a product anyone pays much for (the print isn't market-moving). What it is: the validated receipt that the mood signal carries genuine, non-redundant sentiment information beyond public data — which is what licenses the brand product's claims. The saleable nowcast is the brand/category one (Test B), not the macro print.
If it fails	The mood signal does not add beyond public predictors; the read stays descriptive/contemporaneous, no lead-time claim anywhere. Reported as such.
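
The walk-forward window, the Diebold-Mariano comparison, and the stated detectable effect size can be checked with a short sketch. It is illustrative only, and the function names are hypothetical. Two assumptions are not in the specification above: the minimum detectable effect is computed at 80% power, and the test uses the normal approximation rather than a small-sample correction, which at n ≈ 18 makes the p-value somewhat optimistic.

```python
from math import sqrt
from statistics import NormalDist, mean


def month_range(start, end):
    """Inclusive list of 'YYYY-MM' labels between two (year, month) pairs."""
    year, month = start
    labels = []
    while (year, month) <= end:
        labels.append(f"{year:04d}-{month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return labels


def expanding_splits(months, n_test):
    """Yield (training months, target month): train on months <= M-1, estimate M."""
    for i in range(len(months) - n_test, len(months)):
        yield months[:i], months[i]


def diebold_mariano(err_model, err_baseline):
    """DM statistic on squared-error loss for one-step-ahead forecasts.

    Returns (statistic, two-sided p-value). A negative statistic means the
    model's loss is lower than the baseline's. Normal approximation (assumption).
    """
    d = [em ** 2 - eb ** 2 for em, eb in zip(err_model, err_baseline)]
    n = len(d)
    d_bar = mean(d)
    variance = sum((x - d_bar) ** 2 for x in d) / (n - 1)
    if variance == 0:
        raise ValueError("loss differential is constant; statistic undefined")
    stat = d_bar / sqrt(variance / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


def min_detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest standardised mean loss differential detectable at this power."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)


months = month_range((2021, 1), (2024, 11))
splits = list(expanding_splits(months, n_test=18))
effect = min_detectable_effect(n=18)

# Hypothetical forecast errors, only to exercise the function.
stat, p_value = diebold_mariano([0.5, -0.4, 0.6, -0.5], [1.0, -0.9, 1.1, -1.0])
```

The splits reproduce the locked window: 47 months in total, with 18 targets running from 2023-06 to 2024-11. Under the assumed 80% power the detectable standardised differential comes out at roughly 0.66, inside the ~0.6–0.7 range stated in the table.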

Deployment gate: feature parity (separate from the nowcast test, pre-registered). Validation runs on real Spotify features (2021–2024); live deployment will run on the in-house extractor (Essentia/Musicnn). These are different feature distributions, so a nowcast pass on Spotify features does not validate the live product. Before any live deployment, on a held-out set of ≥500 tracks that have both Spotify and extractor features, each feature must clear its threshold independently. There is no averaging across features, because a high pooled correlation can hide weak parity on the exact feature the claim rests on. Continuous features (tempo, valence, energy, danceability, acousticness) must each reach Spearman ρ ≥ 0.80; mode_major must reach ≥ 0.90 class agreement. The mode bar is set higher because it is a noisier binary classification and it is half the locked brand-test spec, so weak mode parity would silently undermine Test B's headline. This gate is independent of the nowcast result: "it works" must be true of the thing we actually ship, not just the thing we validated.
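
A per-feature check of the parity gate might look like the sketch below. The thresholds and feature names come from the gate as written; the function names, the data layout (feature name mapped to a per-track list, same track order in both sources), and the toy data are assumptions. Spearman's ρ is computed by hand, as the Pearson correlation of average ranks, to keep the example dependency-free.

```python
from math import sqrt
from statistics import mean

CONTINUOUS = ("tempo", "valence", "energy", "danceability", "acousticness")
RHO_MIN, MODE_AGREEMENT_MIN, MIN_TRACKS = 0.80, 0.90, 500


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def parity_gate(spotify, extractor):
    """Return a per-check pass/fail report; nothing is averaged across features."""
    n = len(spotify["mode_major"])
    report = {f: spearman(spotify[f], extractor[f]) >= RHO_MIN for f in CONTINUOUS}
    matches = sum(a == b for a, b in zip(spotify["mode_major"], extractor["mode_major"]))
    report["mode_major"] = matches / n >= MODE_AGREEMENT_MIN
    report["enough_tracks"] = n >= MIN_TRACKS
    return report


# Toy data: four features track perfectly, valence is deliberately broken.
n = 500
base = [i / n for i in range(n)]
spotify = {f: base for f in CONTINUOUS}
spotify["mode_major"] = [i % 2 for i in range(n)]
extractor = {f: [0.9 * v + 0.05 for v in base] for f in CONTINUOUS}
extractor["valence"] = base[::-1]
extractor["mode_major"] = [1 - m if i < 40 else m
                           for i, m in enumerate(spotify["mode_major"])]

report = parity_gate(spotify, extractor)
gate_clears = all(report.values())
```

In the toy data four of five continuous features agree perfectly and mode agreement is 92%, yet the gate does not clear because valence alone fails. A pooled average would have hidden that.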

4. TEST B — Brand / Category Backtest

Question. Does national music mood predict category- and brand-level demand proxies beyond an autoregressive + macro baseline — at the granularity Phase 1's audit suggested signal might survive?

Causal chain assumed (stated, not hidden): mood → category demand → brand search. Because the link is cleanest at category level, category is the primary unit; brand is secondary/granular.

Element	Locked specification
Primary unit	Category (6): brand outcomes aggregated to a category-level index (Wikipedia attention as the primary outcome; see the Outcome row).
Secondary unit	Brand (36) — same test per brand; reported as fraction passing, exploratory.
Outcome	Weekly Wikipedia attention (primary — reliably collectable); Google Trends (corroborating, collect-when-possible, throttle-fragile); GDELT tone (robustness). Read as deltas, not levels.
Primary specification — LOCKED (closes the forking-paths hole)	The single confirmatory test is fixed now, so Bonferroni-across-brands can't be undone by researcher choice upstream: features = `tempo` and `mode_major` (the two that survived the Phase-1/H2 validation), single lag = 4 weeks, transformation = first difference (week-over-week change) of the z-scored series — named explicitly: not linear de-trend, not STL, not levels. Everything else — the other 6 features, the other lags (1/2/8), alternative transforms — is explicitly exploratory and reported separately, never as the headline.
Baseline	Per-category AR(4) + linear time trend + week-of-year seasonal dummies (so shared seasonality is not attributed to music) + macro controls (weekly CCI-interpolated, CAC 40, EUR/USD). Music must add beyond this.
Primary method	Pooled panel regression, category fixed effects, Newey-West HAC SE (4 lags). Incremental explanatory power of the locked music features tested by nested F-test / incremental R².
Multiple comparison	The locked primary test is a single hypothesis. The exploratory grid (8 features × 4 lags) is reported with Bonferroni (α = 0.05/32 = 0.00156) and Benjamini-Hochberg FDR, AND uncorrected p-values alongside — so a Bonferroni "fail" is interpretable (real-but-over-corrected vs genuinely absent) rather than a black box.
Window	Weekly, 2021-01 → 2024-11; walk-forward out-of-sample on the final 52 weeks.
PASS criterion (both arms required)	(1) Significance: ≥2 music feature-lags survive Bonferroni in the pooled category model, sign-consistent, no Phase-1-style sign reversal between lags. (2) Forecast value: walk-forward out-of-sample RMSE ≥3% lower than baseline on the category outcome index in ≥4 of 6 categories.
If it passes	Music adds predictive value at brand-category granularity. The league tables become predictive, not descriptive; a premium tier reopens with the backtest attached.
If it fails	Brand-level prediction does not hold either. The league tables remain descriptive share-of-attention only; no forecasting claim anywhere.
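
The locked transformation, the 4-week lag alignment, and the two multiple-comparison corrections from the table above can be sketched as follows. Function names and the example p-values are hypothetical. One simplification is an assumption worth flagging: the z-score here uses the full-series mean and standard deviation, whereas a walk-forward run would presumably estimate them on the training window only.

```python
from statistics import mean, pstdev


def locked_transform(series):
    """First difference (week-over-week change) of the z-scored series."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma for x in series]
    return [later - earlier for earlier, later in zip(z, z[1:])]


def lag_pairs(feature, outcome, lag=4):
    """Pair the feature at week t with the outcome at week t + lag."""
    return list(zip(feature[:-lag], outcome[lag:]))


def bonferroni(p_values, alpha=0.05):
    """Reject where p is at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p <= alpha * rank / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff
    return rejected


changes = locked_transform([1.0, 2.0, 3.0, 4.0, 5.0])
pairs = lag_pairs(list(range(10)), list(range(10)), lag=4)

# Hypothetical p-values for the 8 x 4 exploratory grid (32 tests).
grid_p = [0.0004, 0.001, 0.003, 0.02] + [0.5] * 28
bonferroni_hits = sum(bonferroni(grid_p))
bh_hits = sum(benjamini_hochberg(grid_p))
```

On these made-up p-values Bonferroni (threshold 0.05/32 = 0.0015625) keeps two results and Benjamini-Hochberg keeps three, which illustrates why reporting both alongside the uncorrected values makes a Bonferroni "fail" easier to interpret.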

5. Decision matrix → product implications

Read Test A as the signal-validation / credibility test and Test B as the saleable product test — that division is deliberate (a one-week lead on French CCI is not itself a SKU; brand-category foresight is).

Test A (nowcast = validation)	Test B (brand/category = product)	What we can honestly sell
Pass	Pass	Strongest: the mood signal is validated to carry non-redundant sentiment info and that translates into brand-category demand foresight. Predictive premium tier, with the pre-registered receipt. (Live deployment gated on Essentia + the feature-parity gate.)
Pass	Fail	The signal is validated as real, but doesn't yet translate to a saleable brand forecast. Lead with the credibility receipt ("our mood signal is externally validated to carry genuine sentiment information") on a descriptive product. No predictive SKU.
Fail	Pass	Brand/category foresight works — the actual product — even though the macro-validation didn't clear the higher bar. A brand-demand product; descriptive macro context.
Fail	Fail	Descriptive only, confirmed. The read is context, not foresight. Price as a provocative one-off, shelve predictive language 12+ months, lean entirely on freshness + receipts + the corpus.

Any outcome is a result, not a setback — the published failure of a clean test is itself the asset.

6. Build sequence (to data-ready, then run)

(1) Collect the 31 new brands' Trends (anchor-batched), Wikipedia, GDELT for 2021-01 → 2025-03. (~2–4 days; rate-limit-bound on Trends.)
(2) Aggregate to weekly per brand and per category; rescale Trends to the météo anchor.
(3) Assemble the analysis panel (mood features × outcomes × controls), audio-feature window only.
(4) Freeze the panel; re-confirm this pre-registration is unchanged; timestamp.
(5) Run Test A then Test B end-to-end; write results memo (signed, like Phase 1).
(6) Publish result + this pre-registration. If pass → scope Essentia for live deployment.

7. Amendments log

What was and was not known at the moment of this amendment (the disclosure that keeps it honest). A sceptic's first question about any amendment is "what did you already know when you made it." Answered precisely:

Already computed and published (Phase 1, April 2026), on the 2021–2024 window: the contemporaneous, feature-level correlations of mood features against CCI and its sub-indices — this is the origin of tempo × savings-intent r = −0.52 and mode_major × CCI r = +0.43 — and the failed country-level forecast backtest. So the choice of tempo and mode_major as the locked brand-test features is NOT blind — it is deliberately informed by those prior published findings. That is legitimate pre-registration practice (you may pre-specify informed by prior work), but it must be stated, and it is: the spec is informed by Phase 1's contemporaneous results, blind to the Test A and Test B results.
Not computed, by anyone, as of this lock: (a) the nowcast (Test A) in any form; (b) the brand/category backtest (Test B), whose outcome panel is not yet assembled and which no model has touched (collection status: Wikipedia collected 13 Jun for the full 36-brand universe; GDELT settled at 23/36 brands, robustness-tier; Trends throttled to zero); (c) any walk-forward / out-of-sample estimate on either test; (d) anything at all on Germany or Japan. These remain genuinely unseen.

The legitimacy claim is therefore the narrow, true one: blind to the results of the two tests being pre-registered, not blind to all prior France analysis.

Six changes are logged, all made before either test was run. The first five are hardening changes made in response to an external adversarial review, each tightening the bar; the sixth (A1.6) re-tiers the outcome series after a collection-reliability finding:

(1) Nowcast baseline raised from random-walk/AR(1) to AR(p) + contemporaneously-available public predictors (INSEE survey, markets, retail). The honest question is "does music beat the cheap public data," not "beat a coin flip."
(2) Significance test + power stated: Diebold-Mariano at α=0.05, n ≈ 18 stated, the detectable effect size stated, and a null pre-committed as "inconclusive at this n," not "no signal."
(3) Generalisation bar locked: France and ≥1 held-out market, each tested separately at the same bar; pooling is secondary only.
(4) Brand test primary spec locked (tempo + mode_major, 4-week lag, first-difference of z-scored series) to close researcher degrees-of-freedom; seasonal dummies added; uncorrected reported alongside corrected.
(5) Feature-parity deployment gate: Spotify vs in-house extractor, per-feature (no averaging), continuous ρ ≥ 0.80, mode ≥ 0.90 agreement.
(6) Outcome series re-tiered for reproducibility: Test B's primary outcome moved from Google Trends to Wikipedia attention, after the 13 Jun collection proved Trends is throttle-blocked from our environment (zero brand rows committed) while Wikipedia collected cleanly (67,017 rows). A primary outcome must be one we can reproduce on demand; Trends is demoted to corroborating. This change is forced by a collection-reliability fact discovered before execution, not by inspecting any test result.

These are legitimate only because they were made before either test was executed and are logged here with the disclosure above.

Companion: Phase 1 results (2026-04-28-phase1-bridge-test-results.md), Data Book monetisation page, external critique 13 Jun 2026. The nowcast test was added in direct response to that critique's "make the read front-run something" challenge — answered honestly as nowcasting, not the forecasting that already failed.