Cadence — Phase 2 results

Phase 2 Results — Nowcast (Test A) + Brand/Category (Test B) · France

Run 13 June 2026 against the pre-registration locked the same day (2026-06-13-phase2-nowcast-preregistration.md). France arm only — see §"What this is and is not." Signed: analysis executed by Claude (Opus 4.8) on local data; scripts analysis/phase2_test_a.py, analysis/phase2_test_b.py; raw outputs data/phase2_test_a_results.json, data/phase2_test_b_results.json, data/phase2_test_b_grid.csv.

Headline (the careful words)

Test B (brand/category — the saleable product test): DOES NOT PASS in France. Both arms fail. No predictive/forecasting claim is licensed. The league tables stay descriptive share-of-attention, exactly as the pre-reg's "if it fails" cell specifies.
Test A (nowcast, the signal-validation test): does NOT validate, now POWERED. France-only n=18 was inconclusive (RMSE change vs baseline, negative = lower error: savings-intent −4.9%, p=0.62; headline CCI −23.2%, p=0.15). The Athena data proved reachable the same day, so the powered re-run ran immediately: France + Germany, n≈45 each. The powered point estimates on the primary outcome are consistently favourable (RMSE improvements of +1.7% to +9.9%, positive = lower error), but none clears Diebold-Mariano significance, and the generalisation bar (France and ≥1 held-out, each significant) is not met, because France itself does not clear. Powering up did not rescue it. See the powered section below.
The generalisation bar was tested and not met. With France + Germany both run, this is no longer "untested" — it is a properly-powered non-confirmation. Japan (charts pulled; CCI pending a non-Eurostat source) cannot change the verdict, since France's non-clearance already fails the bar.

Integrity note — a self-caught bug that would have faked the Test B null

Read this before the results, because it is the reason to trust them. The first walk-forward implementation contained a silent failure: predict() returned a pandas Series indexed by date, so indexing it with [0] raised KeyError, and an over-broad except clause caught the error and made both the baseline and the music model fall back to predicting mean(training y) at every step. The consequence was a byte-identical fake null: all six categories showed base RMSE == full RMSE to five decimal places.

A fake null still kills the predictive claim, and "we tried, it didn't work" is the easy story. It was caught anyway, because the nulls were identical to 5 dp and that is implausible for two genuinely different models — i.e. by distrusting a result that was too tidy in the direction of the prior. The bug was fixed (np.asarray(b.predict(...))[0]) and the test re-run. Every Test B number in this memo is post-fix. The fix changed the result from "identical, fake" to "music degrades the forecast in 4/6 categories, real." Any future walk-forward must be checked the same way: if results are suspiciously identical, suspect a swallowed exception, not a clean null.
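The failure mode and the "too tidy" check described above can be sketched in a few lines. This is an illustration only, not the project's script: the toy model, the `DateKeyedPrediction` stand-in (a dict keyed by a date string, which raises KeyError on key 0 much as the date-indexed Series did), and the helper `suspiciously_identical` are all hypothetical names introduced here.

```python
from math import sqrt
from statistics import mean


def rmse(errors):
    return sqrt(mean(e * e for e in errors))


class DateKeyedPrediction(dict):
    """Stand-in for a date-indexed prediction: key 0 does not exist."""


def predict(train, slope):
    # Hypothetical toy model: last observed value plus a fixed slope.
    return DateKeyedPrediction({"2024-01-01": train[-1] + slope})


def walk_forward(y, slope, buggy):
    errors = []
    for t in range(4, len(y)):
        train = y[:t]
        pred = predict(train, slope)
        if buggy:
            try:
                yhat = pred[0]        # KeyError: the key is a date, not 0
            except Exception:         # over-broad: swallows the KeyError
                yhat = mean(train)    # silent fallback, identical for every model
        else:
            yhat = list(pred.values())[0]  # positional access, in the spirit of np.asarray(...)[0]
        errors.append(y[t] - yhat)
    return rmse(errors)


def suspiciously_identical(rmse_a, rmse_b, decimals=5):
    """Two genuinely different models should not agree to 5 dp."""
    return round(rmse_a, decimals) == round(rmse_b, decimals)


y = [1.0, 1.4, 1.1, 1.9, 2.2, 2.0, 2.9, 3.1, 3.0, 3.8]
buggy = (walk_forward(y, 0.0, True), walk_forward(y, 0.3, True))
fixed = (walk_forward(y, 0.0, False), walk_forward(y, 0.3, False))
```

With the swallowed exception, both "models" collapse to the same training-mean forecast and the check fires; with positional access they diverge. A narrower `except` (or none at all) would have surfaced the KeyError on the first step.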

What this is and is not

Test B used the France slice (audio features in data/france_chart_raw.parquet, 165,668 of 179,569 chart-rows featured; the ~14k gap is the post-Nov-2024 freeze, visible in the bytes exactly where §1 of the pre-reg said). Test B failed in France on its merits and more markets won't rescue a near-zero, wrong-signed OOS coefficient — it was not re-run on DE/JP.

Test A was first run France-only (n=18), then — once Athena proved reachable the same day — powered across France + Germany (n≈45 each), which does engage the locked generalisation bar. So:

The word "validated" is reserved for a result that clears the generalisation bar (France and ≥1 held-out, each DM-significant). The powered run tested that bar and did not meet it. Nothing here earns "validated." (We have over-read three times before — "Measured"→"market share," "predicts attention"→"predicts demand," "passed France"→"validated." Not a fourth: the careful word for the powered Test A is "non-confirmation," not "validated" and not "disproved.")

Test B — Brand / Category Backtest (FAIL, both arms)

Locked primary spec: features = tempo + mode_major; single lag = 4 weeks; transform = first difference of the z-scored series; pooled category panel (6 categories × 205 weeks, n=1,194), category fixed effects, week-of-year dummies, AR(4) + linear trend + macro controls (CCI/CAC40/EUR-USD), Newey-West HAC (4 lags).

Arm 1 (significance):

| Locked feature-lag | β | p (uncorrected) | Bonferroni p (×32) | survives? |
|---|---|---|---|---|
| tempo_l4 | −0.0024 | 0.954 | 1.000 | no |
| mode_major_l4 | +0.097 | 0.012 | 0.382 | no |

Joint Wald χ²(2)=6.33, p=0.042 (marginal, uncorrected); incremental R²=+0.0037. 0/2 locked feature-lags survive Bonferroni; 0/28 exploratory grid cells survive. Several cells are nominally significant uncorrected (tempo·lag8 p=0.006, danceability·lag1 p=0.008, valence·lag1 p=0.010, mode_major·lag4 p=0.012). This is the pre-reg's "real-but-over-corrected vs genuinely absent" diagnostic: a whiff of in-sample association at the feature level (consistent with Phase 1's contemporaneous correlations) that does not survive multiple-comparison correction. Sign-consistency across lags: fails. Arm 1 FAIL.
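As a quick check on the correction arithmetic, the sketch below applies a ×32 Bonferroni multiplier to the p-values quoted above. The 0.05 threshold is an assumption (the memo does not restate the alpha here), and because the inputs are the rounded p-values, the adjusted figure can differ from the table in the third decimal.

```python
M = 32        # multiplier from the table header
ALPHA = 0.05  # assumed significance level, not restated in this memo


def bonferroni(p, m=M):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * m)


locked = {"tempo_l4": 0.954, "mode_major_l4": 0.012}
nominal_grid = {
    "tempo_l8": 0.006,
    "danceability_l1": 0.008,
    "valence_l1": 0.010,
    "mode_major_l4": 0.012,
}

adjusted = {name: bonferroni(p) for name, p in locked.items()}
survivors = [name for name, p in adjusted.items() if p < ALPHA]

# Equivalent view: compare raw p-values with the per-test threshold.
per_test_threshold = ALPHA / M
grid_survivors = [n for n, p in nominal_grid.items() if p < per_test_threshold]
```

Even the smallest nominal p-value (0.006) sits well above the per-test threshold of 0.05/32 ≈ 0.0016, which is why nothing survives.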

Arm 2 (forecast value; per-category expanding-window walk-forward, OOS = final 52 weeks):

| Category | RMSE base | RMSE +music | Δ (positive = music lowers RMSE) |
|---|---|---|---|
| Grocery | 1.0997 | 1.1043 | −0.42% |
| Luxury | 0.6841 | 0.6859 | −0.27% |
| Automotive | 0.7522 | 0.7602 | −1.07% |
| Finance | 0.7752 | 0.7795 | −0.55% |
| Leisure | 0.8517 | 0.8372 | +1.70% |
| Tech | 0.7476 | 0.7432 | +0.59% |

0/6 categories reach the +3% bar; music degrades the forecast in 4 of 6 and helps marginally in 2. This mirrors Phase 1's finding that adding music degraded the CCI forecast. Arm 2 FAIL.
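The Arm 2 tally can be recomputed from the table's RMSE columns. The sketch assumes Δ is defined as (base − full) / base, which reproduces the table's signs; since it starts from the rounded RMSEs, individual values may differ from the table in the second decimal.

```python
BAR_PCT = 3.0  # the +3% forecast-value bar

rmse_rows = {
    # category: (RMSE base, RMSE +music), copied from the Arm 2 table
    "Grocery": (1.0997, 1.1043),
    "Luxury": (0.6841, 0.6859),
    "Automotive": (0.7522, 0.7602),
    "Finance": (0.7752, 0.7795),
    "Leisure": (0.8517, 0.8372),
    "Tech": (0.7476, 0.7432),
}


def improvement_pct(base, full):
    """Percentage RMSE reduction; negative means music made the forecast worse."""
    return 100.0 * (base - full) / base


deltas = {cat: improvement_pct(b, f) for cat, (b, f) in rmse_rows.items()}
clearing = [cat for cat, d in deltas.items() if d >= BAR_PCT]
degraded = sorted(cat for cat, d in deltas.items() if d < 0)
helped = sorted(cat for cat, d in deltas.items() if d > 0)
```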

These numbers are post-fix. See the Integrity note above — the first implementation produced a fake byte-identical null via a swallowed KeyError; the real result is that music degrades the OOS forecast in 4/6 categories.

Test A — Nowcast (INCONCLUSIVE at n=18)

Estimate month M's CCI at day-15 (data through day-14), before the ~day-20 flash. Baseline = AR(2) on CCI + public predictors available by day-15 (CAC40 MTD, EUR/USD MTD, retail t-2, unemployment t-2, inflation t-1). Full = ElasticNetCV on baseline + 8 month-to-date mood features + Wikipedia national attention + GDELT national tone. Walk-forward final 18 months (2023-06 → 2024-11), n=18.
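The walk-forward window can be made concrete with a small split generator. This is a sketch under assumptions: it takes the France sample as starting in 2021-01 (the start date given later in this memo), uses an expanding training window, and ignores the day-15 publication-lag handling, which lives in the feature construction rather than in the split. The function names are illustrative.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    y, m = start
    out = []
    while (y, m) <= end:
        out.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out


def expanding_splits(months, n_test):
    """Yield (train_months, target_month); train is everything strictly before the target."""
    first_test = len(months) - n_test
    for i in range(first_test, len(months)):
        yield months[:i], months[i]


months = month_range((2021, 1), (2024, 11))
splits = list(expanding_splits(months, n_test=18))
first_target = splits[0][1]
last_target = splits[-1][1]
```

Each target month is forecast from a model refit on all earlier months only, so no target ever appears in its own training set.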

Target	RMSE baseline	RMSE full (ENet)	RMSE full (RF)	point Δ (RMSE change, ENet vs baseline; negative = better)	DM stat	DM p
CCI savings-intent (primary)	2.831	2.692	2.677	−4.9%	+0.50	0.62
CCI headline (secondary)	2.628	2.019	2.088	−23.2%	+1.52	0.15

The signal-augmented model beats the public baseline on point estimate for both targets, and directional accuracy is 0.67 / 0.61. But the Diebold-Mariano test is not significant for either target. Pre-committed reading: inconclusive, meaning "could not confirm at this n," not disproof.
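For readers unfamiliar with the test, a minimal Diebold-Mariano sketch for one-step forecasts under squared-error loss follows. It is not the project's implementation: it uses the asymptotic normal reference distribution and no autocovariance terms, both assumptions, so its p-values need not match the table exactly if the script applied a small-sample correction.

```python
from math import erfc, sqrt
from statistics import mean, variance


def two_sided_p(z):
    """Two-sided p-value under a standard normal reference."""
    return erfc(abs(z) / sqrt(2))


def diebold_mariano(e_base, e_full):
    """DM statistic on squared-error loss; positive means the full model has lower loss."""
    d = [b * b - f * f for b, f in zip(e_base, e_full)]  # loss differential
    n = len(d)
    stat = mean(d) / sqrt(variance(d) / n)  # one-step horizon: no autocovariance terms
    return stat, two_sided_p(stat)


# Hypothetical forecast errors, for illustration only.
e_base = [1.0, -2.0, 1.5, -0.5, 2.0, -1.0]
e_full = [0.5, -1.0, 1.0, -0.5, 1.0, -1.5]
stat, p = diebold_mariano(e_base, e_full)
```

Under the normal reference, a statistic of 0.50 maps to a two-sided p of about 0.62, in line with the primary row above.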

Do not over-weight the −23%. It is a secondary outcome (headline CCI) at p=0.15, while the pre-specified primary outcome — savings-intent, the one Phase 1 actually validated contemporaneously — came in at −4.9% and p=0.62, i.e. nothing. A large point estimate at p=0.15 on n=18 is exactly the profile that excites a hopeful reader and then evaporates as n grows; the honest base rate for "striking point gain, non-significant, small n, secondary outcome" becoming a real effect is low. The thread is worth resolving, not because a pass is expected, but because resolution (either way) has value. Verdict outcome is pre-committed to savings-intent (primary). A powered re-run that moves only headline CCI while savings-intent stays null is not "the nowcast validated" — it is "one secondary macro series may carry a weak signal," and must be reported as such.

Decision-matrix placement (per pre-reg §5)

Test A inconclusive (not a clean fail) / Test B fail. The honest cell is between "Fail | Fail" (descriptive only, confirmed) and the open verdict that the nowcast question is unresolved for lack of power, not closed. Practically:

The honest product today is descriptive attention intelligence. "Where French consumer attention is moving, every number receipted, validated against what we couldn't claim." No predictive SKU. This is the convergence flagged two rounds ago — the higher-ceiling brand play and the honest descriptive read have turned out to be the same product. That is a finding, not a defeat.
The one predictive door not closed: the headline-CCI nowcast (RMSE 23% below baseline on point estimate at n=18, underpowered). It earns a properly powered re-run, not a claim.

What changes the power problem (new, 13 Jun): DE/JP in Athena + France to 2019

Per Alex: Germany and Japan chart data 2019–2025 are already in Athena (same 5 tables); France extended back to 2019. This is decisive for the only live predictive thread:

n roughly triples. France alone extends from 2021-01 to 2019-01 → Test A walk-forward n goes from 18 toward ~40+; pooling France+DE+JP raises it further. The pre-reg's central limitation — underpowered DM at n=18 — is the thing this fixes.
Generalisation becomes buildable. The locked bar (France + ≥1 held-out market) is no longer permanently gated; it is gated on an Athena pull (engineering), not a licence (five-to-six figures). That is a different, smaller blocker.
Caveat: 2019 extension pulls in the 2020 COVID structural break (large simultaneous mood + macro dislocation) — include a regime control; don't let one episode carry the result. And the audio-feature freeze (end Nov-2024) is unchanged — extension adds history at the start, not currency at the end.
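The two ways of handling the 2020 episode (a regime dummy versus dropping the year) can be sketched as below. The evaluation window is hypothetical, chosen only so that it contains all of 2020; the field name `covid_2020` and both helpers are illustrative, not the project's.

```python
def month_range(start, end):
    """Inclusive list of (year, month) tuples from start to end."""
    y, m = start
    out = []
    while (y, m) <= end:
        out.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return out


def with_regime_dummy(months, year=2020):
    """Keep every month and flag the break year so a model can absorb it."""
    return [{"month": ym, "covid_2020": int(ym[0] == year)} for ym in months]


def excluding_year(months, year=2020):
    """Drop the break year entirely."""
    return [ym for ym in months if ym[0] != year]


# Hypothetical 45-month evaluation window that includes all of 2020.
window = month_range((2020, 1), (2023, 9))
control_variant = with_regime_dummy(window)
excl_variant = excluding_year(window)
```

Dropping a full calendar year removes 12 evaluation months, which is the same gap seen between the 2020-control and excl-2020 sample sizes reported in the powered section.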

This does not reopen Test B — brand/category prediction failed in France on its merits (music degraded the OOS forecast), and more history won't rescue a signal that's near-zero and wrong-signed out-of-sample. It reopens Test A only, and only as "worth properly powering," not "expected to pass."

Powered re-run — RUN same day (France + Germany, n≈45 each)

After amendment A2 was locked, the Athena data proved directly reachable (charts_poc_{fr,de,jp}_v1, audio features, 2019-01→2026-03). All three chart sets were pulled; German macro was collected from Eurostat (ei_bsco_m, country DE); the powered nowcast ran the same day. Sequence preserved: amendment locked → data pulled → test run. Japan's CCI is not Eurostat (Cabinet Office/ESRI) and is not yet collected — but it cannot change the verdict (see below).

Verdict outcome = savings-intent (primary), per A2.1. Both COVID variants reported, per A2.3.

Market	n	RMSE base	RMSE full	point Δ (RMSE improvement; positive = better)	DM p	clears?
France · 2020-control	45	2.790	2.621	+6.1%	0.496	no
France · excl-2020	33	2.814	2.537	+9.9%	0.332	no
Germany · 2020-control	47	2.528	2.487	+1.7%	0.763	no
Germany · excl-2020	35	2.579	2.485	+3.6%	0.523	no

Powered result: the nowcast does NOT validate. Across two independent markets at n≈45, the mood-augmented model produces small, consistent, positive but non-significant improvements on the pre-specified primary outcome. The generalisation bar (France and ≥1 held-out market, each DM-significant) is not met: France itself does not clear, so the bar fails irrespective of Japan.
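The bar logic is simple enough to state as code. The sketch below encodes "France and at least one held-out market, each DM-significant" using the RMSE and p-values from the powered table; the 0.05 threshold and the requirement that the full model also has lower RMSE are assumptions about how "clears" is defined, and the Japan row is a hypothetical best case added only to show that it cannot change the outcome.

```python
ALPHA = 0.05  # assumed significance level

# variant -> market -> (RMSE base, RMSE full, DM p), from the powered table
powered = {
    "2020-control": {"France": (2.790, 2.621, 0.496), "Germany": (2.528, 2.487, 0.763)},
    "excl-2020": {"France": (2.814, 2.537, 0.332), "Germany": (2.579, 2.485, 0.523)},
}


def clears(base, full, p, alpha=ALPHA):
    """A market clears only if the full model is better and the DM test is significant."""
    return full < base and p < alpha


def bar_met(markets, home="France"):
    """Home market and at least one held-out market must each clear."""
    home_ok = clears(*markets[home])
    held_out_ok = any(clears(*row) for name, row in markets.items() if name != home)
    return home_ok and held_out_ok


verdicts = {variant: bar_met(markets) for variant, markets in powered.items()}

# Hypothetical: even a strongly significant Japan result leaves the bar unmet.
with_japan = dict(powered["2020-control"], Japan=(1.0, 0.5, 0.001))
japan_changes_verdict = bar_met(with_japan)
```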

Supplementary: headline-CCI analog, all three markets (NOT the verdict; per A2.1 the verdict is savings-intent). A Japanese headline series was acquired for this purpose (OECD amplitude-adjusted consumer confidence, via OECD SDMX). Japan has no savings-intent sub-index, so it can only run the headline analog, and with a thinner baseline: no JP retail/unemployment locally, so AR + EUR/USD only. Run on headline CCI at n≈45:

Market	n	music effect on headline-CCI RMSE	DM p	reading
France	45	−24.8% (worse)	0.036	significant degradation
Germany	47	−10.0% (worse)	0.255	ns
Japan	47	−20.5% (worse)	0.031	significant — but thin baseline

This is the decisive finding. The headline-CCI "improvement" at n=18 (RMSE 23% below baseline, reported above as −23.2%) did not merely evaporate under power; it reversed into a statistically significant degradation (France: RMSE 24.8% worse, p=0.036). Meanwhile the primary outcome (savings-intent) showed a small, non-significant improvement. A signal that helps one CCI sub-index a little and significantly hurts the headline is the textbook signature of no robust, directionally stable signal: exactly the noise profile that a striking small-n secondary point estimate was warned to be. The discipline call (don't chase the −23%) is vindicated as strongly as the data could vindicate it.

Verdict, powered: on the pre-committed primary (savings-intent) the nowcast is a non-confirmation (small, consistent, non-significant gains, bar not met). On the headline analog, music significantly degrades the forecast. Together: no predictive claim is licensed, and the predictive thread is closed for this cycle. Honest characterisation: the mood signal does not clear a usable bar when properly powered, and on the broader outcome it actively hurts. Not a stark "music is noise" disproof on the primary, but decisively not grounds for prediction. (Outputs: data/phase2_test_a_powered_results.json; Japan via OECD SDMX DSD_STES@DF_CLI, German macro via Eurostat.)

Recommended next step (unchanged in spirit, sharpened by the data)

Do not build Essentia. Do not license demand panels. Do not pivot to a finance factor-feed. Test B failed and the powered Test A did not validate — there is no predictive product to deploy, so the extractor would be an engine for a car not chosen.
The powered re-run is DONE (France + Germany) — and it resolved to a non-confirmation. The live predictive thread is now closed for this cycle. Optional only: collect Japan's Cabinet Office/ESRI CCI to make it a clean three-market published result, but it cannot change the verdict. Re-open the predictive question only if a fundamentally better signal or the in-house extractor's live features change the inputs — not by re-running the same test.
Ship the descriptive product now, priced and worded as attention intelligence with receipts — including this memo and the powered re-run. There are now three published clean results (Phase-1 forecast fail, Phase-2 brand/category fail, Phase-2 powered nowcast non-confirmation). That track record — testing your own favourite claim to destruction and publishing the wreckage — is the product's credibility.

Companion: 2026-04-28-phase1-bridge-test-results.md (Phase 1), 2026-06-13-phase2-nowcast-preregistration.md (the locked design this executes).