Locked 13 June 2026, before any outcome data is inspected. Frozen 13 June 2026 (step 4) — outcome panel data-ready on the primary (Wikipedia) series; no model has run; no further edits. Supersedes the brand-backtest pre-registration of 28 April 2026 (folds it in, adds the nowcast test, and locks the expanded brand universe).
Phase 1 (April 2026) tested whether music forecasts French consumer confidence at 30–90 day lags. It failed its pre-registered criterion — 0/32 features survived Bonferroni; adding music degraded the CCI forecast — and that claim was withdrawn. This document does not reopen that claim. It tests two new and distinct questions, both of which Phase 1 left open:
The rules that make the result trustworthy, win or lose:
Audio features are frozen at November 2024. Spotify deprecated the endpoint; the licensed corpus has chart positions through ~Jan 2026 but the validated mood features (tempo, mode, valence, …) only exist through Nov 2024. Therefore:
This constraint is stated up front so no result is quietly read as more current than it is.
Frozen now, before any outcome relationship is examined. Chosen for category coverage and baseline prominence, not for any 2021–2024 relationship to mood.
| Category | Brands |
|---|---|
| Grocery / FMCG | Leclerc · Carrefour · Lidl · Intermarché · Auchan · Monoprix |
| Luxury | Louis Vuitton · Hermès · Chanel · Dior · Cartier · Gucci |
| Automotive | Renault · Peugeot · Dacia · Citroën · Tesla · BMW |
| Finance / banking | BNP Paribas · Crédit Agricole · Société Générale · Boursorama · Revolut · Qonto |
| Leisure / travel | SNCF · Airbnb · Booking · Center Parcs · Air France · Club Med |
| Tech / lifestyle | Apple · Samsung · Nike · Adidas · Decathlon · Sony |
The 5 existing brands (Carrefour, Hermès, Renault, Boursorama, Club Med) are already collected 2019–2025; the remaining 31 require collection for 2021-01 → 2025-03 (the audio-feature window plus an 8-week lag tail).
Why Wikipedia and not a social feed. The reproducible layer of this stack is the attention layer, by nature not by luck. The closer-to-demand social feeds are each unusable as a primary: X is paywalled; TikTok Creative Center is a curated highlight, not a representative national sample; Reddit is discourse-texture, not sentiment. Wikipedia is the one demand-adjacent series we can collect cleanly and repeatedly — so it is primary, and the program's honest ceiling is attention-foresight until a reliable demand proxy exists.
Google Trends normalises each series within a single request of ≤5 terms,
so the 36 brands, which have to be collected across several requests, are
not directly comparable as returned. Pre-registered design: each
request contains ≤4 brand terms plus one fixed high-volume
anchor term (locked: "météo", chosen as stable,
seasonal-only, category-neutral and high-volume). All batches are rescaled
to the anchor so brand series are comparable across batches. The anchor
is fixed before collection and never changed.
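The anchor rescaling can be sketched as follows. This is a minimal illustration, not the collection pipeline: the function names and the dict-of-lists layout are hypothetical, and dividing by the anchor's mean over the window (rather than week by week) is an assumption made here so that the anchor's own seasonality is not injected into the brand series. The text above only fixes the anchor term itself.

```python
from statistics import mean

ANCHOR = "météo"  # the locked anchor term

def rescale_batch(batch, anchor=ANCHOR):
    """Express each brand series in one request as a multiple of the
    anchor's mean level in that same request (assumed rescaling rule)."""
    anchor_mean = mean(batch[anchor])
    if anchor_mean <= 0:
        raise ValueError("anchor has no volume in this batch; cannot rescale")
    return {
        term: [value / anchor_mean for value in series]
        for term, series in batch.items()
        if term != anchor
    }

def merge_batches(batches, anchor=ANCHOR):
    """Combine rescaled batches into one comparable panel.
    A brand present in several batches keeps its last rescaled series."""
    panel = {}
    for batch in batches:
        panel.update(rescale_batch(batch, anchor))
    return panel

# Two requests whose raw 0-100 scales differ because the second one
# contains a much larger term.
batch_1 = {"météo": [100.0, 100.0], "Carrefour": [50.0, 50.0]}
batch_2 = {"météo": [50.0, 50.0], "Lidl": [5.0, 5.0], "Leclerc": [100.0, 100.0]}
panel = merge_batches([batch_1, batch_2])
```

After rescaling, every series is in "multiples of the anchor", so Carrefour from the first request and Lidl from the second can be compared even though their raw values sat on different scales.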
National monthly and weekly position-weighted
aggregates of the validated mood features, computed on the 2021–2024
audio-feature window: tempo, mode_major,
valence, energy, danceability,
acousticness, local_share,
catalogue_age — 8 features, locked.
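A position-weighted aggregate for one chart period might look like the sketch below. The weighting rule is an assumption: the text says "position-weighted" without giving the formula, so a simple linear weight (`chart_size + 1 - position`) stands in here, and `weighted_mood` and the entry layout are hypothetical names.

```python
LOCKED_FEATURES = (
    "tempo", "mode_major", "valence", "energy",
    "danceability", "acousticness", "local_share", "catalogue_age",
)

def position_weight(position, chart_size=200):
    """Assumed linear weight: rank 1 gets chart_size, the last rank gets 1."""
    return float(chart_size + 1 - position)

def weighted_mood(entries, features=LOCKED_FEATURES, chart_size=200):
    """Position-weighted mean of each feature over one period's chart entries.
    Entries missing a feature are skipped for that feature only."""
    totals = {f: 0.0 for f in features}
    weights = {f: 0.0 for f in features}
    for entry in entries:
        w = position_weight(entry["position"], chart_size)
        for f in features:
            value = entry.get(f)
            if value is None:
                continue
            totals[f] += w * value
            weights[f] += w
    return {f: (totals[f] / weights[f] if weights[f] else None) for f in features}

# Toy three-track chart, two features only.
week = [
    {"position": 1, "tempo": 120.0, "mode_major": 1},
    {"position": 2, "tempo": 100.0, "mode_major": 0},
    {"position": 3, "tempo": 90.0, "mode_major": 1},
]
aggregate = weighted_mood(week, features=("tempo", "mode_major"), chart_size=3)
```

Whatever weighting is actually used, it needs to be fixed together with the feature list, since a different weight curve yields a different national series.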
Eurostat CCI + savings-intent (monthly, interpolated to weekly for Test B), CAC 40 close, EUR/USD (weekly). Open-Meteo Paris weather as a confound control, regressed out, never reported.
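"Regressed out" can be read as replacing a series with its residual from an ordinary least-squares fit on the weather control. The sketch below assumes a single weather regressor (for example weekly mean temperature); the actual set of weather variables is not specified above, and `residualise` is a hypothetical helper.

```python
from statistics import mean

def residualise(y, x):
    """Return y minus its OLS fit on a single control x (with intercept)."""
    if len(y) != len(x):
        raise ValueError("series must be aligned and of equal length")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("control has no variance; nothing to regress out")
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Toy example: an outcome series and a temperature-like control.
temperature = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [1.0, 3.0, 2.0, 5.0, 4.0]
cleaned = residualise(outcome, temperature)
```

By construction the residual has zero mean and zero sample correlation with the control, which is what makes the control's own coefficient uninteresting to report.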
Question. By mid-month, using behavioural data available to date, can we estimate the current month's French confidence index more accurately than the naïve "it'll be the same as last month" — i.e. beat the official print to the punch?
| Element | Locked specification |
|---|---|
| Outcomes | Primary: EUROSTAT_CCI_SAVINGS_INTENT (the sub-index that validated in Phase 1). Secondary: EUROSTAT_CCI headline. |
| Reference timing | Eurostat consumer-confidence flash for month M releases ~day 20–23 of M. Nowcast date = day 15 of M, using behavioural data through day 14. We estimate CCI(M) ≈ 5–8 days before the flash. |
| Features | Month-to-date aggregates of the 8 mood features, plus daily behavioural signals available by day 14 (Wikipedia category views, GDELT national tone, Trends category aggregates), month-to-date averaged. |
| Model | Primary: ElasticNet (handles the mood-feature collinearity). Secondary: Random Forest. Expanding-window walk-forward — train on months ≤ M−1, estimate M. |
| Baselines (revised — the bar that matters) | Primary baseline = AR(p) on CCI itself + all contemporaneously-available public predictors that arrive before the Eurostat print: the INSEE monthly business-climate / consumer survey, CAC 40, EUR/USD, retail. Music must beat the cheap public stuff a buyer could already use — not a coin flip. Random walk and AR(1) are retained only as secondary sanity floors, not the pass bar. |
| Window & n & power (stated) | Train from 2021-01; walk-forward test the final 18 months of the audio-feature window (2023-06 → 2024-11) → n ≈ 18 out-of-sample nowcasts. This is underpowered and we say so explicitly: at n≈18 a Diebold-Mariano test can only detect a large, consistent forecast-error edge (roughly a standardised loss-differential of ~0.6–0.7, i.e. a substantial and stable RMSE gap); it cannot confirm a subtle one. Pre-committed interpretation: a non-significant result is read as "could not confirm a signal at this n" — NOT "music carries no sentiment information." Only a significant DM result is reported as a pass; a null is reported as inconclusive-pending-more-data, never as a disproof. |
| PASS criterion (revised) | Both required: (1) out-of-sample RMSE on the primary outcome lower than the primary (AR + public-predictor) baseline, and (2) a Diebold-Mariano test on the forecast-error differential vs that baseline significant at α = 0.05 (n stated above). Point-estimate RMSE reduction is reported but does not pass on its own. Secondary: directional accuracy, reported, not gating. |
| Generalisation bar (locked now) | Each market is tested separately at the identical bar (RMSE-beat + DM-significant). A "pass" requires France AND ≥1 held-out market (Germany or Japan) each clearing independently. A directionally-positive but non-significant held-out result is not a pass. A pooled three-market model (which would raise n and power) is a secondary, reported analysis only — it does not substitute for the per-market bar, because pooling can let one strong market carry two weak ones. |
| If it passes | Not primarily a SKU — a one-week lead on the French CCI print is not, by itself, a product anyone pays much for (the print isn't market-moving). What it is: the validated receipt that the mood signal carries genuine, non-redundant sentiment information beyond public data — which is what licenses the brand product's claims. The saleable nowcast is the brand/category one (Test B), not the macro print. |
| If it fails | The mood signal does not add beyond public predictors; the read stays descriptive/contemporaneous, no lead-time claim anywhere. Reported as such. |
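
The evaluation loop in the table (expanding-window walk-forward, then RMSE and a Diebold-Mariano comparison against the baseline) can be sketched in plain Python. Assumptions, none of which are fixed by the table: squared-error loss, a one-step horizon, and the Harvey-Leybourne-Newbold small-sample correction, under which the statistic at horizon 1 reduces to a paired t-statistic on the loss differential. The function names and the `(features, target)` row layout are hypothetical, and the random-walk model is shown only because it is one of the sanity floors named above.

```python
from math import sqrt
from statistics import mean, variance

def walk_forward(rows, fit, first_test):
    """Expanding window: for test month t, fit on rows[:t] only, then
    predict rows[t]. Each row is (features, target)."""
    forecasts = []
    for t in range(first_test, len(rows)):
        predict = fit(rows[:t])          # months strictly before t
        forecasts.append(predict(rows[t][0]))
    return forecasts

def random_walk_fit(train):
    """Sanity-floor model: this month equals last month."""
    last_target = train[-1][1]
    return lambda features: last_target

def rmse(errors):
    return sqrt(mean([e * e for e in errors]))

def dm_statistic(model_errors, baseline_errors):
    """Diebold-Mariano statistic on squared-error loss, horizon 1.
    Positive values favour the model over the baseline."""
    d = [b * b - m * m for m, b in zip(model_errors, baseline_errors)]
    if len(d) < 2:
        raise ValueError("need at least two out-of-sample forecasts")
    s2 = variance(d)                     # sample variance (n - 1 denominator)
    if s2 == 0:
        raise ValueError("loss differential is constant; statistic undefined")
    return mean(d) / sqrt(s2 / len(d))

rows = [((), float(v)) for v in [1, 2, 3, 4, 5, 6]]
forecasts = walk_forward(rows, random_walk_fit, first_test=3)
errors = [rows[3 + i][1] - f for i, f in enumerate(forecasts)]
```

With roughly 18 out-of-sample points, the statistic would be compared against a Student t reference with n − 1 degrees of freedom rather than the normal; a statistics library would supply that p-value.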
Deployment gate — feature parity (separate from the nowcast
test, pre-registered). Validation runs on real Spotify
features (2021–2024); live deployment will run on the in-house
extractor (Essentia/Musicnn). These are different feature
distributions, so a nowcast pass on Spotify features does not
validate the live product. Before any live deployment: on a
held-out set of ≥500 tracks that have both, each feature must
clear its threshold independently — no averaging across features,
because a high pooled correlation can hide a weak parity on the exact
feature the claim rests on. Continuous features at
Spearman ρ ≥ 0.80 each (tempo, valence, energy,
danceability, acousticness); mode_major at ≥ 0.90
class agreement — set higher precisely because it is a
noisier binary classification and it is half the locked brand-test spec,
so a weak mode parity would silently undermine Test B's headline. This
gate is independent of the nowcast result — "it works" must be true of
the thing we actually ship, not just the thing we validated.
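The per-feature gate can be sketched as below, using only the thresholds stated above (Spearman ρ ≥ 0.80 for each continuous feature, ≥ 0.90 class agreement for mode_major, ≥ 500 tracks). The function names and the dict-of-aligned-lists layout are hypothetical; Spearman is computed here as Pearson on average ranks, which is the standard tie-aware definition.

```python
from math import sqrt
from statistics import mean

CONTINUOUS = ("tempo", "valence", "energy", "danceability", "acousticness")

def average_ranks(values):
    """1-based ranks, ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("constant input; correlation undefined")
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sqrt(sxx * syy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def parity_gate(reference, candidate, min_tracks=500):
    """Each feature must clear its own threshold; no pooling across features.
    Inputs map feature name -> values, aligned track by track."""
    n = len(reference["mode_major"])
    if n < min_tracks:
        raise ValueError(f"need at least {min_tracks} tracks, got {n}")
    results = {}
    for f in CONTINUOUS:
        rho = spearman(reference[f], candidate[f])
        results[f] = (rho, rho >= 0.80)
    agreement = sum(
        a == b for a, b in zip(reference["mode_major"], candidate["mode_major"])
    ) / n
    results["mode_major"] = (agreement, agreement >= 0.90)
    return results, all(ok for _, ok in results.values())
```

Returning the per-feature results alongside the overall verdict keeps a failure attributable to the specific feature that missed its threshold.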
Question. Does national music mood predict category- and brand-level demand proxies beyond an autoregressive + macro baseline — at the granularity Phase 1's audit suggested signal might survive?
Causal chain assumed (stated, not hidden): mood → category demand → brand search. Because the link is cleanest at category level, category is the primary unit; brand is secondary/granular.
| Element | Locked specification |
|---|---|
| Primary unit | Category (6) — brand outcomes aggregated to a category search index. |
| Secondary unit | Brand (36) — same test per brand; reported as fraction passing, exploratory. |
| Outcome | Weekly Wikipedia attention (primary — reliably collectable); Google Trends (corroborating, collect-when-possible, throttle-fragile); GDELT tone (robustness). Read as deltas, not levels. |
| Primary specification, LOCKED (closes the forking-paths hole) | The single confirmatory test is fixed now, so Bonferroni-across-brands can't be undone by researcher choice upstream: features = tempo and mode_major (the two that survived the Phase-1/H2 validation), single lag = 4 weeks, transformation = first difference (week-over-week change) of the z-scored series, named explicitly: not linear de-trend, not STL, not levels. Everything else (the other 6 features, the other lags of 1, 2 and 8 weeks, alternative transforms) is explicitly exploratory and reported separately, never as the headline. |
| Baseline | Per-category AR(4) + linear time trend + week-of-year seasonal dummies (so shared seasonality is not attributed to music) + macro controls (weekly CCI-interpolated, CAC 40, EUR/USD). Music must add beyond this. |
| Primary method | Pooled panel regression, category fixed effects, Newey-West HAC SE (4 lags). Incremental explanatory power of the locked music features tested by nested F-test / incremental R². |
| Multiple comparison | The locked primary test is a single hypothesis. The exploratory grid (8 features × 4 lags) is reported with Bonferroni (α = 0.05/32 = 0.00156) and Benjamini-Hochberg FDR, AND uncorrected p-values alongside — so a Bonferroni "fail" is interpretable (real-but-over-corrected vs genuinely absent) rather than a black box. |
| Window | Weekly, 2021-01 → 2024-11; walk-forward out-of-sample on the final 52 weeks. |
| PASS criterion (both arms required) | (1) Significance: ≥2 music feature-lags survive Bonferroni in the pooled category model, sign-consistent, no Phase-1-style sign reversal between lags. (2) Forecast value: walk-forward out-of-sample RMSE ≥3% lower than baseline on category search in ≥4 of 6 categories. |
| If it passes | Music adds predictive value at brand-category granularity. The league tables become predictive, not descriptive; a premium tier reopens with the backtest attached. |
| If it fails | Brand-level prediction does not hold either. The league tables remain descriptive share-of-attention only; no forecasting claim anywhere. |
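
The locked transformation and lag from the table (z-score, then week-over-week first difference, feature lagged 4 weeks against the outcome) can be written down in a few lines. This is a sketch under assumptions: z-scoring with the population standard deviation over the supplied window is a choice made here, and the helper names are hypothetical. In a walk-forward setting the mean and standard deviation would presumably be estimated on the training window only, to avoid look-ahead.

```python
from statistics import mean, pstdev

LOCKED_FEATURES = ("tempo", "mode_major")
LOCKED_LAG_WEEKS = 4

def zscore(series):
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-scored")
    return [(v - mu) / sigma for v in series]

def first_difference(series):
    """Week-over-week change; the output is one element shorter."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

def locked_transform(series):
    """First difference of the z-scored series (not de-trend, STL, or levels)."""
    return first_difference(zscore(series))

def align_lagged(feature, outcome, lag=LOCKED_LAG_WEEKS):
    """Pair the feature at week t - lag with the outcome at week t."""
    if len(feature) != len(outcome):
        raise ValueError("series must cover the same weeks")
    return [(feature[t - lag], outcome[t]) for t in range(lag, len(outcome))]

# Toy weekly series.
tempo_delta = locked_transform([1.0, 2.0, 3.0, 4.0, 5.0])
pairs = align_lagged(list(range(10)), list(range(10, 20)))
```

Both the feature and the outcome would pass through `locked_transform` before alignment, so the regression relates changes to changes rather than levels to levels.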
Read Test A as the signal-validation / credibility test and Test B as the saleable product test — that division is deliberate (a one-week lead on French CCI is not itself a SKU; brand-category foresight is).
| Test A (nowcast = validation) | Test B (brand/category = product) | What we can honestly sell |
|---|---|---|
| Pass | Pass | Strongest: the mood signal is validated to carry non-redundant sentiment info and that translates into brand-category demand foresight. Predictive premium tier, with the pre-registered receipt. (Live deployment gated on Essentia + the feature-parity gate.) |
| Pass | Fail | The signal is validated as real, but doesn't yet translate to a saleable brand forecast. Lead with the credibility receipt ("our mood signal is externally validated to carry genuine sentiment information") on a descriptive product. No predictive SKU. |
| Fail | Pass | Brand/category foresight works — the actual product — even though the macro-validation didn't clear the higher bar. A brand-demand product; descriptive macro context. |
| Fail | Fail | Descriptive only, confirmed. The read is context, not foresight. Price as a provocative one-off, shelve predictive language 12+ months, lean entirely on freshness + receipts + the corpus. |
Any outcome is a result, not a setback — the published failure of a clean test is itself the asset.
météo anchor.

What was and was not known at the moment of this amendment (the disclosure that keeps it honest). A sceptic's first question about any amendment is "what did you already know when you made it." Answered precisely:
tempo × savings-intent r = −0.52 and
mode_major × CCI r = +0.43 — and the failed
country-level forecast backtest. So the choice of
tempo and mode_major as the locked brand-test
features is NOT blind — it is deliberately informed by those prior
published findings. That is legitimate pre-registration
practice (you may pre-specify informed by prior work), but it must be
stated, and it is: the spec is informed by Phase 1's contemporaneous
results, blind to the Test A and Test B results.

The legitimacy claim is therefore the narrow, true one: blind to the results of the two tests being pre-registered, not blind to all prior France analysis.
The five hardening changes (all made before either test was run, in response to an external adversarial review), all tightening the bar:
These are legitimate only because they were made before either test was executed and are logged here with the disclosure above.
Companion: Phase 1 results
(2026-04-28-phase1-bridge-test-results.md), Data Book
monetisation page, external critique 13 Jun 2026. The nowcast test was
added in direct response to that critique's "make the read front-run
something" challenge — answered honestly as nowcasting, not the
forecasting that already failed.