Run 13 June 2026 against the pre-registration locked the same day
(2026-06-13-phase2-nowcast-preregistration.md). France arm
only — see §"What this is and is not." Signed: analysis
executed by Claude (Opus 4.8) on local data; scripts
analysis/phase2_test_a.py,
analysis/phase2_test_b.py; raw outputs
data/phase2_test_a_results.json,
data/phase2_test_b_results.json,
data/phase2_test_b_grid.csv.
Read this before the results, because it is the reason to trust them.
The first walk-forward implementation contained a silent failure:
predict() returned a pandas Series indexed by date, so indexing it
with [0] raised KeyError, and an over-broad
except clause caught the error and made both the
baseline and the music model fall back to predicting
mean(training y) at every step. The consequence was a
byte-identical fake null: all six categories showed
base RMSE == full RMSE to five decimal places.
A fake null still kills the predictive claim, and "we tried, it
didn't work" is the easy story. It was caught anyway, because the
nulls were identical to 5 dp and that is implausible for two genuinely
different models — i.e. by distrusting a result that was too tidy
in the direction of the prior. The bug was fixed
(np.asarray(b.predict(...))[0]) and the test re-run. Every
Test B number in this memo is post-fix. The fix changed the result from
"identical, fake" to "music degrades the forecast in 4/6 categories,
real." Any future walk-forward must be checked the same way: if
results are suspiciously identical, suspect a swallowed exception, not a
clean null.
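
The check described above can be sketched in a few lines. This is a minimal illustration, not the code in analysis/phase2_test_b.py: the helper names first_prediction and flag_identical_rmse are hypothetical, and the "FakeNull" entry is made-up data included only to trigger the flag.

```python
import numpy as np
import pandas as pd


def first_prediction(pred):
    """Return the first forecast as a float, by position, whatever the index is."""
    # np.asarray drops the date index, so [0] is always positional.
    return float(np.asarray(pred)[0])


def flag_identical_rmse(results, decimals=5):
    """Return categories whose base and full RMSE agree to `decimals` places."""
    return [
        category
        for category, (base, full) in results.items()
        if round(base, decimals) == round(full, decimals)
    ]


# A one-step forecast indexed by date, as a fitted model's predict() might return it.
pred = pd.Series([0.42], index=pd.to_datetime(["2025-01-06"]))
value = first_prediction(pred)

# One real pair from the Arm 2 table plus one hypothetical identical pair.
results = {"Grocery": (1.0997, 1.1043), "FakeNull": (0.81234, 0.81234)}
suspicious = flag_identical_rmse(results)
print(value, suspicious)
```

Any category returned by flag_identical_rmse is a prompt to look for a swallowed exception before the null is accepted. Narrowing the except clause to the specific exceptions that are expected would stop the fallback from masking the error in the first place.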
Test B used the France slice (audio
features in data/france_chart_raw.parquet, 165,668 of
179,569 chart-rows featured; the ~14k gap is the post-Nov-2024 freeze,
visible in the bytes exactly where §1 of the pre-reg said). Test B
failed in France on its merits and more markets won't rescue a
near-zero, wrong-signed OOS coefficient — it was not re-run on
DE/JP.
Test A was first run France-only (n=18), then — once Athena proved reachable the same day — powered across France + Germany (n≈45 each), which does engage the locked generalisation bar. So:
Locked primary spec: features = tempo +
mode_major; single lag = 4 weeks; transform = first
difference of the z-scored series; pooled category panel (6 categories ×
205 weeks, n=1,194), category fixed effects, week-of-year dummies, AR(4)
+ linear trend + macro controls (CCI/CAC40/EUR-USD), Newey-West HAC (4
lags).
Arm 1 (significance):

| Locked feature-lag | β | p (uncorrected) | Bonferroni p (×32) | survives? |
|---|---|---|---|---|
| tempo_l4 | −0.0024 | 0.954 | 1.000 | no |
| mode_major_l4 | +0.097 | 0.012 | 0.382 | no |

Joint Wald F=6.33, p=0.042 (marginal, uncorrected); incremental R²=+0.0037. 0/2 locked feature-lags survive Bonferroni; 0/28 exploratory grid cells survive. Several cells are nominally significant uncorrected (tempo·lag8 p=0.006, danceability·lag1 p=0.008, valence·lag1 p=0.010, mode_major·lag4 p=0.012) — this is the pre-reg's "real-but-over-corrected vs genuinely absent" diagnostic: a whiff of in-sample association at the feature level (consistent with Phase 1's contemporaneous correlations) that does not survive multiple-comparison correction. Sign-consistency across lags: fails. Arm 1 FAIL.
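
The Bonferroni step for the two locked feature-lags is simple enough to reproduce by hand. The sketch below assumes the correction is the plain "multiply by the number of tests, cap at 1" form with m = 32 as in the table header, and a 0.05 threshold, which this memo does not state and is an assumption here.

```python
def bonferroni(p, m):
    """Bonferroni-adjusted p-value: p times the number of tests, capped at 1."""
    return min(1.0, p * m)


M_TESTS = 32   # multiplier shown in the Arm 1 table header
ALPHA = 0.05   # assumed threshold, not stated in this memo

# Uncorrected p-values for the locked feature-lags, as rounded in the table.
locked = {"tempo_l4": 0.954, "mode_major_l4": 0.012}

adjusted = {name: bonferroni(p, M_TESTS) for name, p in locked.items()}
survivors = [name for name, p in adjusted.items() if p < ALPHA]
print(adjusted, survivors)
```

With the rounded input 0.012 this gives about 0.384 rather than the reported 0.382; the small gap is presumably because the table was computed from the unrounded p-value. Either way, neither locked feature-lag survives.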
Arm 2 (forecast value; per-category expanding-window walk-forward, OOS = final 52 weeks):

| Category | RMSE base | RMSE +music | Δ |
|---|---|---|---|
| Grocery | 1.0997 | 1.1043 | −0.42% |
| Luxury | 0.6841 | 0.6859 | −0.27% |
| Automotive | 0.7522 | 0.7602 | −1.07% |
| Finance | 0.7752 | 0.7795 | −0.55% |
| Leisure | 0.8517 | 0.8372 | +1.70% |
| Tech | 0.7476 | 0.7432 | +0.59% |

0/6 categories reach the +3% bar; music degrades the forecast in 4 of 6 and helps marginally in 2. This mirrors Phase 1's finding that adding music degraded the CCI forecast. Arm 2 FAIL.
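
The Δ column and the +3% bar can be recomputed from the reported RMSEs. This sketch assumes Δ is the percentage RMSE reduction relative to the baseline, (base − full) / base, which matches the signs and magnitudes in the Arm 2 table; the variable names are illustrative.

```python
# RMSE pairs (base, +music) copied from the Arm 2 table.
rmse = {
    "Grocery": (1.0997, 1.1043),
    "Luxury": (0.6841, 0.6859),
    "Automotive": (0.7522, 0.7602),
    "Finance": (0.7752, 0.7795),
    "Leisure": (0.8517, 0.8372),
    "Tech": (0.7476, 0.7432),
}

BAR_PCT = 3.0  # the +3% improvement bar named in the text


def pct_gain(base, full):
    """Percentage RMSE reduction relative to baseline; positive means music helps."""
    return 100.0 * (base - full) / base


gains = {cat: pct_gain(base, full) for cat, (base, full) in rmse.items()}
passing = [cat for cat, g in gains.items() if g >= BAR_PCT]
degraded = [cat for cat, g in gains.items() if g < 0]

for cat, g in gains.items():
    print(f"{cat:<11}{g:+.2f}%")
print(len(passing), "of 6 reach the bar;", len(degraded), "of 6 degrade")
```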
These numbers are post-fix. See the Integrity note above — the first implementation produced a fake byte-identical null via a swallowed
KeyError; the real result is that music degrades the OOS forecast in 4/6 categories.
Estimate month M's CCI at day-15 (data through day-14), before the ~day-20 flash. Baseline = AR(2) on CCI + public predictors available by day-15 (CAC40 MTD, EUR/USD MTD, retail t-2, unemployment t-2, inflation t-1). Full = ElasticNetCV on baseline + 8 month-to-date mood features + Wikipedia national attention + GDELT national tone. Walk-forward final 18 months (2023-06 → 2024-11), n=18.
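
The walk-forward comparison can be illustrated with a small, self-contained sketch. Several things here are assumptions rather than the memo's method: plain OLS (via numpy) stands in for both the AR(2) baseline and the ElasticNetCV model, the data are synthetic with a deliberately strong planted signal, and the function name walk_forward_rmse is hypothetical. What it does show is the expanding-window discipline: each forecast is fitted only on rows that precede it.

```python
import numpy as np


def walk_forward_rmse(y, X, n_oos):
    """Expanding-window one-step OLS forecasts over the last n_oos observations."""
    errors = []
    for t in range(len(y) - n_oos, len(y)):
        # Fit on rows strictly before t so no future information leaks in.
        design = np.column_stack([np.ones(t), X[:t]])
        coef, *_ = np.linalg.lstsq(design, y[:t], rcond=None)
        pred = np.concatenate([[1.0], X[t]]) @ coef
        errors.append(y[t] - pred)
    errors = np.asarray(errors)
    return float(np.sqrt(np.mean(errors ** 2))), errors


# Synthetic series: y depends on its own lag and on a lagged external signal.
rng = np.random.default_rng(0)
n = 120
signal = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 2.0 * signal[t - 1] + 0.1 * rng.normal()

target = y[1:]
X_base = y[:-1].reshape(-1, 1)                   # lagged target only
X_full = np.column_stack([y[:-1], signal[:-1]])  # lagged target plus signal

rmse_base, e_base = walk_forward_rmse(target, X_base, n_oos=18)
rmse_full, e_full = walk_forward_rmse(target, X_full, n_oos=18)
print(f"base {rmse_base:.3f}  full {rmse_full:.3f}")
```

Because the signal is planted, the full model wins easily here; the point of the sketch is the loop structure, not the size of the gain.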
| Target | RMSE baseline | RMSE full (ENet) | RMSE full (RF) | point Δ (ENet vs baseline; − = lower RMSE) | DM stat | DM p |
|---|---|---|---|---|---|---|
| CCI savings-intent (primary) | 2.831 | 2.692 | 2.677 | −4.9% | +0.50 | 0.62 |
| CCI headline (secondary) | 2.628 | 2.019 | 2.088 | −23.2% | +1.52 | 0.15 |
The signal-augmented model beats the public baseline on point estimate for both targets, and directional accuracy is 0.67 / 0.61. But the Diebold-Mariano test is not significant at either target. Pre-committed reading: inconclusive — "could not confirm at this n," not disproof.
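
For readers unfamiliar with the Diebold-Mariano test, a minimal one-step version is sketched below. It assumes squared-error loss and a standard-normal reference distribution; the memo does not say which loss, small-sample correction, or reference distribution its scripts use, so the p-values here would not necessarily match the table. The sign convention (positive statistic when the full model has smaller errors) is chosen to agree with the positive DM stats reported alongside point improvements.

```python
import math

import numpy as np


def diebold_mariano(e_base, e_full):
    """One-step DM test on squared-error loss.

    Returns (stat, p). A positive stat means the full model's errors are smaller.
    """
    d = np.asarray(e_base, dtype=float) ** 2 - np.asarray(e_full, dtype=float) ** 2
    n = len(d)
    # For one-step-ahead forecasts no autocovariance terms are added.
    stat = float(d.mean() / math.sqrt(d.var(ddof=1) / n))
    # Two-sided p-value under the standard normal approximation.
    p = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p


# Illustrative forecast errors (made up): the full model's errors are half as large.
e_base = [1.0, 2.0, 1.0, 2.0]
e_full = [0.5, 1.0, 0.5, 1.0]
stat, p = diebold_mariano(e_base, e_full)
print(f"DM stat {stat:+.2f}, p {p:.4f}")
```

At n=18 the normal approximation is loose, so a small-sample adjustment or a t reference distribution is often preferred. That choice affects the exact p-value but not the logic of the comparison.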
Do not over-weight the −23%. It is a secondary outcome (headline CCI) at p=0.15, while the pre-specified primary outcome — savings-intent, the one Phase 1 actually validated contemporaneously — came in at −4.9% and p=0.62, i.e. nothing. A large point estimate at p=0.15 on n=18 is exactly the profile that excites a hopeful reader and then evaporates as n grows; the honest base rate for "striking point gain, non-significant, small n, secondary outcome" becoming a real effect is low. The thread is worth resolving, not because a pass is expected, but because resolution (either way) has value. Verdict outcome is pre-committed to savings-intent (primary). A powered re-run that moves only headline CCI while savings-intent stays null is not "the nowcast validated" — it is "one secondary macro series may carry a weak signal," and must be reported as such.
Test A inconclusive (not a clean fail) / Test B fail. The honest cell is between "Fail | Fail" (descriptive only, confirmed) and the open verdict that the nowcast question is unresolved for lack of power, not closed. Practically:
Per Alex: Germany and Japan chart data 2019–2025 are already in Athena (same 5 tables); France extended back to 2019. This is decisive for the only live predictive thread:
This does not reopen Test B — brand/category prediction failed in France on its merits (music degraded the OOS forecast), and more history won't rescue a signal that's near-zero and wrong-signed out-of-sample. It reopens Test A only, and only as "worth properly powering," not "expected to pass."
After amendment A2 was locked, the Athena data proved directly
reachable (charts_poc_{fr,de,jp}_v1, audio features,
2019-01→2026-03). All three chart sets were pulled; German macro was
collected from Eurostat (ei_bsco_m, country DE); the
powered nowcast ran the same day. Sequence preserved: amendment
locked → data pulled → test run. Japan's CCI is not Eurostat
(Cabinet Office/ESRI) and is not yet collected — but it cannot change
the verdict (see below).
Verdict outcome = savings-intent (primary), per A2.1. Both COVID variants reported, per A2.3.
| Market | n | RMSE base | RMSE full | point Δ (RMSE reduction; + = better) | DM p | clears? |
|---|---|---|---|---|---|---|
| France · 2020-control | 45 | 2.790 | 2.621 | +6.1% | 0.496 | no |
| France · excl-2020 | 33 | 2.814 | 2.537 | +9.9% | 0.332 | no |
| Germany · 2020-control | 47 | 2.528 | 2.487 | +1.7% | 0.763 | no |
| Germany · excl-2020 | 35 | 2.579 | 2.485 | +3.6% | 0.523 | no |
Powered result: the nowcast does NOT validate. Across two independent markets at n≈45, the mood-augmented model produces small, consistently positive, but non-significant improvements on the pre-specified primary outcome. The generalisation bar (France and ≥1 held-out market, each DM-significant) is not met: France itself does not clear, so the bar fails irrespective of Japan.
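
The logic of the generalisation bar, and why Japan cannot rescue it, can be written out explicitly. This sketch assumes "DM-significant" means p < 0.05, a threshold this memo does not state; the function name and the Japan p-value are hypothetical.

```python
def generalisation_bar(dm_p, home="France", alpha=0.05):
    """True only if the home market and at least one held-out market are significant."""
    home_ok = dm_p[home] < alpha
    held_out_ok = any(p < alpha for market, p in dm_p.items() if market != home)
    return home_ok and held_out_ok


# DM p-values from the powered table, one dict per COVID variant.
control_2020 = {"France": 0.496, "Germany": 0.763}
excl_2020 = {"France": 0.332, "Germany": 0.523}

# Hypothetical: even a strongly significant Japan result would not change the outcome.
with_japan = {**control_2020, "Japan": 0.001}

for label, dm_p in [("2020-control", control_2020),
                    ("excl-2020", excl_2020),
                    ("2020-control + hypothetical Japan", with_japan)]:
    print(label, "->", "met" if generalisation_bar(dm_p) else "not met")
```

Because the bar is a conjunction that includes France, no held-out market can make it pass while France's DM p stays above the threshold.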
Supplementary — headline-CCI analog, all three markets (NOT the verdict; per A2.1 the verdict is savings-intent). Japan was acquired (OECD amplitude-adjusted consumer confidence, via OECD SDMX — Japan has no savings-intent sub-index, so it can only run the headline analog, and with a thinner baseline: no JP retail/unemployment locally, so AR + EUR/USD only). Run on headline CCI at n≈45:
| Market | n | music effect on headline-CCI RMSE | DM p | reading |
|---|---|---|---|---|
| France | 45 | −24.8% (worse) | 0.036 | significant degradation |
| Germany | 47 | −10.0% (worse) | 0.255 | ns |
| Japan | 47 | −20.5% (worse) | 0.031 | significant — but thin baseline |
This is the decisive finding. The 23.2% headline-CCI RMSE reduction seen at n=18 did not merely evaporate under power: it reversed into a statistically significant degradation (France p=0.036). Meanwhile the primary outcome (savings-intent) showed a small, non-significant improvement. A signal that helps one CCI sub-index a little and significantly hurts the headline is the textbook signature of no robust, directionally stable signal, which is exactly the noise profile that a striking small-n secondary point estimate was warned to be. The discipline call (don't chase the −23%) is vindicated as strongly as the data could vindicate it.
Verdict, powered: on the pre-committed primary
(savings-intent) the nowcast is a non-confirmation (small, consistent,
non-significant gains, bar not met). On the headline analog, music
significantly degrades the forecast. Together: no
predictive claim is licensed, and the predictive thread is closed for
this cycle. Honest characterisation: the mood signal does not
clear a usable bar when properly powered, and on the broader outcome it
actively hurts. Not a stark "music is noise" disproof on the primary,
but decisively not grounds for prediction. (Outputs:
data/phase2_test_a_powered_results.json; Japan via OECD
SDMX DSD_STES@DF_CLI, German macro via Eurostat.)
Companion: 2026-04-28-phase1-bridge-test-results.md
(Phase 1), 2026-06-13-phase2-nowcast-preregistration.md
(the locked design this executes).