Cadence — Phase 2 results

Phase 2 Results — Nowcast (Test A) + Brand/Category (Test B) · France

Run 13 June 2026 against the pre-registration locked the same day (2026-06-13-phase2-nowcast-preregistration.md). France arm only — see §"What this is and is not." Signed: analysis executed by Claude (Opus 4.8) on local data; scripts analysis/phase2_test_a.py, analysis/phase2_test_b.py; raw outputs data/phase2_test_a_results.json, data/phase2_test_b_results.json, data/phase2_test_b_grid.csv.


Headline (the careful words)


Integrity note — a self-caught bug that would have faked the Test B null

Read this before the results, because it is the reason to trust them. The first walk-forward implementation contained a silent failure: predict() returned a pandas Series indexed by date, so indexing it with [0] raised KeyError, and an over-broad except clause caught that error and made both the baseline and the music model fall back to predicting mean(training y) at every step. The consequence was a byte-identical fake null: all six categories showed base RMSE == full RMSE to five decimal places.

A fake null still kills the predictive claim, and "we tried, it didn't work" is the easy story. It was caught anyway, because the nulls were identical to 5 dp and that is implausible for two genuinely different models — i.e. by distrusting a result that was too tidy in the direction of the prior. The bug was fixed (np.asarray(b.predict(...))[0]) and the test re-run. Every Test B number in this memo is post-fix. The fix changed the result from "identical, fake" to "music degrades the forecast in 4/6 categories, real." Any future walk-forward must be checked the same way: if results are suspiciously identical, suspect a swallowed exception, not a clean null.
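The "too tidy" check described above can be automated. The following is a minimal sketch, assuming per-category results are held as (base RMSE, full RMSE) pairs; the function name `suspiciously_identical` and the `pre_fix` values are hypothetical and are not taken from analysis/phase2_test_b.py.

```python
def suspiciously_identical(rmse_pairs, decimals=5):
    """Return True when every (base, full) RMSE pair matches to `decimals` places.

    Two genuinely different models agreeing this closely in every category is
    implausible, so a True result should prompt a hunt for a swallowed exception.
    """
    if not rmse_pairs:
        return False
    return all(round(base, decimals) == round(full, decimals)
               for base, full in rmse_pairs.values())


# Hypothetical pre-fix output: both models silently fell back to mean(training y).
pre_fix = {"Grocery": (1.23456, 1.23456), "Luxury": (0.65432, 0.65432)}
# Post-fix values for two categories, copied from the Arm 2 results.
post_fix = {"Grocery": (1.0997, 1.1043), "Luxury": (0.6841, 0.6859)}

print(suspiciously_identical(pre_fix))   # True
print(suspiciously_identical(post_fix))  # False
```

A guard like this works best alongside a narrow except clause that catches only the errors the fallback is meant to handle, so that an unexpected KeyError surfaces instead of being masked.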


What this is and is not

Test B used the France slice (audio features in data/france_chart_raw.parquet, 165,668 of 179,569 chart-rows featured; the ~14k gap is the post-Nov-2024 freeze, visible in the bytes exactly where §1 of the pre-reg said). Test B failed in France on its merits and more markets won't rescue a near-zero, wrong-signed OOS coefficient — it was not re-run on DE/JP.

Test A was first run France-only (n=18), then — once Athena proved reachable the same day — powered across France + Germany (n≈45 each), which does engage the locked generalisation bar. So:


Test B — Brand / Category Backtest (FAIL, both arms)

Locked primary spec: features = tempo + mode_major; single lag = 4 weeks; transform = first difference of the z-scored series; pooled category panel (6 categories × 205 weeks, n=1,194), category fixed effects, week-of-year dummies, AR(4) + linear trend + macro controls (CCI/CAC40/EUR-USD), Newey-West HAC (4 lags).

Arm 1 — significance:

| Locked feature-lag | β | p (uncorrected) | Bonferroni p (×32) | survives? |
|---|---|---|---|---|
| tempo_l4 | −0.0024 | 0.954 | 1.000 | no |
| mode_major_l4 | +0.097 | 0.012 | 0.382 | no |

Joint Wald F=6.33, p=0.042 (marginal, uncorrected); incremental R²=+0.0037. 0/2 locked feature-lags survive Bonferroni; 0/28 exploratory grid cells survive. Several cells are nominally significant uncorrected (tempo·lag8 p=0.006, danceability·lag1 p=0.008, valence·lag1 p=0.010, mode_major·lag4 p=0.012) — this is the pre-reg's "real-but-over-corrected vs genuinely absent" diagnostic: a whiff of in-sample association at the feature level (consistent with Phase 1's contemporaneous correlations) that does not survive multiple-comparison correction. Sign-consistency across lags: fails. Arm 1 FAIL.
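The Bonferroni column is simple arithmetic: multiply the uncorrected p by the number of tests and cap the result at 1. The sketch below assumes the ×32 multiplier from the table header and a 0.05 threshold (the threshold is an assumption; it is not restated in this memo). It uses the rounded p-values from the table, so the second result differs in the third decimal from the reported 0.382, which was presumably computed from the unrounded p.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value: p times the number of tests, capped at 1."""
    return min(1.0, p * n_tests)


N_TESTS = 32  # multiplier from the table header
ALPHA = 0.05  # assumed significance threshold

# Rounded uncorrected p-values for the two locked feature-lags.
locked = {"tempo_l4": 0.954, "mode_major_l4": 0.012}

for name, p in locked.items():
    adj = bonferroni(p, N_TESTS)
    print(f"{name}: adjusted p = {adj:.3f}, survives = {adj < ALPHA}")
```

Under these assumptions neither locked feature-lag survives, matching the table.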

Arm 2 — forecast value (per-category expanding-window walk-forward, OOS = final 52 weeks):

| Category | RMSE base | RMSE +music | Δ |
|---|---|---|---|
| Grocery | 1.0997 | 1.1043 | −0.42% |
| Luxury | 0.6841 | 0.6859 | −0.27% |
| Automotive | 0.7522 | 0.7602 | −1.07% |
| Finance | 0.7752 | 0.7795 | −0.55% |
| Leisure | 0.8517 | 0.8372 | +1.70% |
| Tech | 0.7476 | 0.7432 | +0.59% |

0/6 categories reach the +3% bar; music degrades the forecast in 4 of 6 and helps marginally in 2. This mirrors Phase 1's finding that adding music degraded the CCI forecast. Arm 2 FAIL.
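The Δ column and the bar check can be recomputed from the RMSE pairs. This sketch assumes Δ = (base − full) / base × 100, a definition that reproduces the reported values from the rounded RMSEs; positive means music reduced the error.

```python
# category: (RMSE base, RMSE +music), copied from the Arm 2 results
RMSE = {
    "Grocery": (1.0997, 1.1043),
    "Luxury": (0.6841, 0.6859),
    "Automotive": (0.7522, 0.7602),
    "Finance": (0.7752, 0.7795),
    "Leisure": (0.8517, 0.8372),
    "Tech": (0.7476, 0.7432),
}
BAR_PCT = 3.0  # the +3% improvement bar


def improvement_pct(base, full):
    """Percentage RMSE reduction relative to baseline; negative means music made it worse."""
    return 100.0 * (base - full) / base


deltas = {cat: improvement_pct(b, f) for cat, (b, f) in RMSE.items()}
n_pass = sum(d >= BAR_PCT for d in deltas.values())
n_worse = sum(d < 0 for d in deltas.values())
print(f"{n_pass}/6 reach the bar; music degrades {n_worse}/6")
# prints "0/6 reach the bar; music degrades 4/6"
```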

These numbers are post-fix. See the Integrity note above — the first implementation produced a fake byte-identical null via a swallowed KeyError; the real result is that music degrades the OOS forecast in 4/6 categories.


Test A — Nowcast (INCONCLUSIVE at n=18)

Estimate month M's CCI at day-15 (data through day-14), before the ~day-20 flash. Baseline = AR(2) on CCI + public predictors available by day-15 (CAC40 MTD, EUR/USD MTD, retail t-2, unemployment t-2, inflation t-1). Full = ElasticNetCV on baseline + 8 month-to-date mood features + Wikipedia national attention + GDELT national tone. Walk-forward final 18 months (2023-06 → 2024-11), n=18.
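The walk-forward evaluation can be pictured as a sequence of expanding training windows, each followed by a single held-out month. This is a minimal sketch of that splitting logic only, not the project's implementation; the function name and the 60-observation series length are hypothetical.

```python
def expanding_window_splits(n_obs, n_test):
    """Yield (train_indices, test_index) pairs for the final `n_test` observations.

    Each step trains on everything strictly before the test point, so the
    training window grows by one observation per step and never sees the future.
    """
    if not 0 < n_test < n_obs:
        raise ValueError("n_test must be between 1 and n_obs - 1")
    for t in range(n_obs - n_test, n_obs):
        yield list(range(t)), t


# Hypothetical series of 60 monthly observations with an 18-month evaluation window.
splits = list(expanding_window_splits(n_obs=60, n_test=18))
print(len(splits))        # 18
print(len(splits[0][0]))  # 42 training points for the first test month
print(splits[-1][1])      # 59, the index of the last observation
```

Fitting the baseline and the full model on the same training indices at every step is what makes their per-step errors directly comparable.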

| Target | RMSE baseline | RMSE full (ENet) | RMSE full (RF) | point Δ | DM stat | DM p |
|---|---|---|---|---|---|---|
| CCI savings-intent (primary) | 2.831 | 2.692 | 2.677 | −4.9% | +0.50 | 0.62 |
| CCI headline (secondary) | 2.628 | 2.019 | 2.088 | −23.2% | +1.52 | 0.15 |

The signal-augmented model beats the public baseline on point estimate for both targets (in this table a negative Δ means lower RMSE than baseline), and directional accuracy is 0.67 / 0.61. But the Diebold-Mariano test is not significant for either target. Pre-committed reading: inconclusive, meaning "could not confirm at this n," not disproof.
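For readers unfamiliar with the Diebold-Mariano comparison, here is a minimal sketch for one-step-ahead forecasts under squared-error loss. Several choices are assumptions rather than the memo's method: the sign convention (positive favours the full model), the sample-variance estimator, and the normal-approximation p-value (small-sample variants use a different reference distribution and will give somewhat different p-values at n=18). The error series are made up for illustration.

```python
import math
from statistics import mean, variance


def diebold_mariano(err_base, err_full):
    """One-step-ahead DM statistic under squared-error loss.

    Positive values mean the baseline has the larger average loss, i.e. the
    full model forecasts better. Returns (statistic, two-sided normal p-value).
    """
    if len(err_base) != len(err_full) or len(err_base) < 2:
        raise ValueError("need two equal-length error series with at least 2 points")
    # Loss differential per forecast step.
    d = [eb ** 2 - ef ** 2 for eb, ef in zip(err_base, err_full)]
    n = len(d)
    # Sample variance (n - 1 denominator); no autocovariance terms for horizon 1.
    se = math.sqrt(variance(d) / n)
    if se == 0:
        raise ValueError("loss differential has zero variance")
    stat = mean(d) / se
    p_two_sided = math.erfc(abs(stat) / math.sqrt(2))
    return stat, p_two_sided


# Hypothetical forecast errors, for illustration only.
base_errors = [1.2, -0.8, 2.1, -1.5, 0.9, -2.0]
full_errors = [1.0, -0.9, 1.7, -1.6, 0.7, -1.8]
stat, p = diebold_mariano(base_errors, full_errors)
print(f"DM = {stat:+.2f}, p = {p:.2f}")
```

The point of the test is visible in the structure: a lower average error is not enough, because the statistic scales the mean loss difference by its own variability across forecast steps.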

Do not over-weight the −23%. It is a secondary outcome (headline CCI) at p=0.15, while the pre-specified primary outcome — savings-intent, the one Phase 1 actually validated contemporaneously — came in at −4.9% and p=0.62, i.e. nothing. A large point estimate at p=0.15 on n=18 is exactly the profile that excites a hopeful reader and then evaporates as n grows; the honest base rate for "striking point gain, non-significant, small n, secondary outcome" becoming a real effect is low. The thread is worth resolving, not because a pass is expected, but because resolution (either way) has value. Verdict outcome is pre-committed to savings-intent (primary). A powered re-run that moves only headline CCI while savings-intent stays null is not "the nowcast validated" — it is "one secondary macro series may carry a weak signal," and must be reported as such.


Decision-matrix placement (per pre-reg §5)

Test A inconclusive (not a clean fail) / Test B fail. The honest cell is between "Fail | Fail" (descriptive only, confirmed) and the open verdict that the nowcast question is unresolved for lack of power, not closed. Practically:


What changes the power problem (new, 13 Jun): DE/JP in Athena + France to 2019

Per Alex: Germany and Japan chart data 2019–2025 are already in Athena (same 5 tables); France extended back to 2019. This is decisive for the only live predictive thread:

This does not reopen Test B — brand/category prediction failed in France on its merits (music degraded the OOS forecast), and more history won't rescue a signal that's near-zero and wrong-signed out-of-sample. It reopens Test A only, and only as "worth properly powering," not "expected to pass."


Powered re-run — RUN same day (France + Germany, n≈45 each)

After amendment A2 was locked, the Athena data proved directly reachable (charts_poc_{fr,de,jp}_v1, audio features, 2019-01→2026-03). All three chart sets were pulled; German macro was collected from Eurostat (ei_bsco_m, country DE); the powered nowcast ran the same day. Sequence preserved: amendment locked → data pulled → test run. Japan's CCI is not Eurostat (Cabinet Office/ESRI) and is not yet collected — but it cannot change the verdict (see below).

Verdict outcome = savings-intent (primary), per A2.1. Both COVID variants reported, per A2.3.

| Market | n | RMSE base | RMSE full | point Δ | DM p | clears? |
|---|---|---|---|---|---|---|
| France · 2020-control | 45 | 2.790 | 2.621 | +6.1% | 0.496 | no |
| France · excl-2020 | 33 | 2.814 | 2.537 | +9.9% | 0.332 | no |
| Germany · 2020-control | 47 | 2.528 | 2.487 | +1.7% | 0.763 | no |
| Germany · excl-2020 | 35 | 2.579 | 2.485 | +3.6% | 0.523 | no |

Powered result: the nowcast does NOT validate. Across two independent markets at n≈45, the mood-augmented model produces small, consistently positive but non-significant improvements on the pre-specified primary outcome. The generalisation bar (France and ≥1 held-out market, each DM-significant) is not met: France itself does not clear, so the bar fails irrespective of Japan.
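The logic of the generalisation bar, and why Japan cannot rescue it, can be written out directly. This is a sketch under one assumption: a 0.05 significance threshold, which this memo does not restate. The Japan p-value in the second call is hypothetical and is there only to show the conjunction.

```python
ALPHA = 0.05  # assumed significance threshold; not restated in this memo

# DM p-values on savings-intent from the powered results (2020-control variant).
dm_p = {"France": 0.496, "Germany": 0.763}


def generalisation_bar_met(p_values, home="France", alpha=ALPHA):
    """True only if the home market and at least one other market are each DM-significant."""
    home_ok = p_values[home] < alpha
    held_out_ok = any(p < alpha for market, p in p_values.items() if market != home)
    return home_ok and held_out_ok


print(generalisation_bar_met(dm_p))  # False
# France failing is sufficient: even a hypothetical, highly significant
# Japanese result would not change the outcome.
print(generalisation_bar_met({**dm_p, "Japan": 0.001}))  # False
```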

Supplementary: headline-CCI analog, all three markets (NOT the verdict; per A2.1 the verdict is savings-intent). A Japanese series was acquired via OECD SDMX (OECD amplitude-adjusted consumer confidence). Japan has no savings-intent sub-index, so it can only run the headline analog, and with a thinner baseline: no JP retail or unemployment series is available locally, so the baseline is AR + EUR/USD only. Run on headline CCI at n≈45:

| Market | n | music effect on headline-CCI RMSE | DM p | Significance |
|---|---|---|---|---|
| France | 45 | −24.8% (worse) | 0.036 | significant degradation |
| Germany | 47 | −10.0% (worse) | 0.255 | ns |
| Japan | 47 | −20.5% (worse) | 0.031 | significant, but thin baseline |

This is the decisive finding. The headline-CCI −23% point estimate at n=18 (a 23% RMSE reduction, i.e. an apparent improvement) did not merely evaporate under power; it reversed into a statistically significant degradation (France p=0.036). Meanwhile the primary outcome (savings-intent) showed a small, non-significant improvement. A signal that helps one CCI sub-index a little and significantly hurts the headline is the textbook signature of no robust, directionally stable signal: exactly the noise profile that the n=18 section warned a striking small-n secondary point estimate would turn out to be. The discipline call (don't chase the −23%) is vindicated as strongly as the data could vindicate it.

Verdict, powered: on the pre-committed primary (savings-intent) the nowcast is a non-confirmation (small, consistent, non-significant gains, bar not met). On the headline analog, music significantly degrades the forecast. Together: no predictive claim is licensed, and the predictive thread is closed for this cycle. Honest characterisation: the mood signal does not clear a usable bar when properly powered, and on the broader outcome it actively hurts. Not a stark "music is noise" disproof on the primary, but decisively not grounds for prediction. (Outputs: data/phase2_test_a_powered_results.json; Japan via OECD SDMX DSD_STES@DF_CLI, German macro via Eurostat.)


  1. Do not build Essentia. Do not license demand panels. Do not pivot to a finance factor-feed. Test B failed and the powered Test A did not validate — there is no predictive product to deploy, so the extractor would be an engine for a car not chosen.
  2. The powered re-run is DONE (France + Germany) — and it resolved to a non-confirmation. The live predictive thread is now closed for this cycle. Optional only: collect Japan's Cabinet Office/ESRI CCI to make it a clean three-market published result, but it cannot change the verdict. Re-open the predictive question only if a fundamentally better signal or the in-house extractor's live features change the inputs — not by re-running the same test.
  3. Ship the descriptive product now, priced and worded as attention intelligence with receipts — including this memo and the powered re-run. There are now three published clean results (Phase-1 forecast fail, Phase-2 brand/category fail, Phase-2 powered nowcast non-confirmation). That track record — testing your own favourite claim to destruction and publishing the wreckage — is the product's credibility.

Companion: 2026-04-28-phase1-bridge-test-results.md (Phase 1), 2026-06-13-phase2-nowcast-preregistration.md (the locked design this executes).