KWL · Data Book Vol. I · Jun 2026 · Internal

§ The asset register

What we have, exactly.

Ten live data sources feed the Cadence pipeline — one music corpus, two more music charts, a screen layer, three attention/discourse layers, a macro layer and a weather control. This book is the cold inventory: what each source is, how we get it, its status, and where it reaches — then the per-source taxonomy, the insights each drives, the new sources worth adding, and a hard look at how to monetise the whole. Every figure here is read from the live tables, not estimated.

13+

Live sources

Across 8 signal families. Spotify via NPILABS Athena; the rest via the KARLA harness, SearchApi and public APIs. The Trade family is now live via in-house auction data (Barnebys / Salle).

~160k

Rows in the local store

Across the SQLite databases — plus the 8-year, 150M-song Spotify corpus in Athena. Grew this quarter with search-intent (8k), GDELT themes (38k) and the brand panel.

2019–26

History depth

Seven years of daily/weekly/monthly non-music signal for France; 8+ years of worldwide charts in Spotify.

Collection update · 15 June 2026 — the stack, now complete-for-France

Since the brand expansion, the gaps have closed and three new layers are live:

Search-intent: solved. Google Trends is now collected reliably via SearchApi (paid, keyed): all 36 brands, 8,028 weekly rows. The datacenter-IP throttle is gone; this is the production-runner path we flagged.

GDELT: deepened. Brand and national tone are complete (23 series, 40k rows), plus a new thematic-weather layer: GDELT GKG theme coverage-volume (15/19 themes, 38k daily rows: inflation salience, safety, jobs, strikes…), so we now read not just how positive the news is but what it is about.

Genre: added, commercial-safe. Netflix titles are tagged with genre and origin via Wikidata (CC0), covering ~89% of viewing-weeks.

Brand activity via the TikTok Ads Library and auction signals via our in-house Barnebys/Salle data round out the stack.

01

The inventory

source · how · status · coverage

The complete live stack. Status reflects the data as it sits today: green = current and deep, amber = working but shallow or stale, with the reason named. Coverage is the honest count — most non-music signal is France-only today (the France PoC scope), and that is the single biggest expansion lever.

| Source | Signal family | How we get it | Cadence | Depth | Rows | Geographic coverage | Status |
|---|---|---|---|---|---|---|---|
| Spotify: Top 50 + Viral 50 + audio features | Mood (music) | NPILABS Athena france_poc_v1.combined (licensed) | Daily | 8+ yrs | 150M songs | Worldwide (70+ markets in corpus; France in active analysis) | Live · features frozen Nov 2024 |
| Apple Music: Top 100 songs | Mood (music) | Apple RSS endpoint | Daily | From Apr 2026 | 3,400 | 34 markets (EU-20 + Americas + APAC + MEA) | Live · snapshot-shallow |
| Amazon Music: Retail digital bestsellers | Commerce (music) | Amazon storefront scrape | Daily | From Apr 2026 | 120 | 5 markets (US, GB, DE, FR, JP) | Beta · artist enrich pending |
| Netflix: Top 10 (Films & TV) | Narrative (screen) | Netflix Tudum public TSV | Weekly | 2021–26 | 45,000 | 9 markets (BR, DE, ES, FR, GB, IT, JP, KR, US) | Live |
| Google Trends: Brand search index | Intent | SearchApi (paid, keyed) | Weekly | 2021–25 | 8,028 | France (36 brands, anchored) | Live · throttle solved via SearchApi |
| Wikipedia: Pageviews per article | Attention | Wikimedia REST API | Daily | 2019–25 | 67,017 | France (fr.wikipedia, 36 brands + culture) | Live · brand set collected 13 Jun |
| GDELT: News tone + GKG themes | Discourse | GDELT 2.0 Doc API (tone + timelinevol) | Daily | 2019–25 | 79,000+ | France (tone: 23 series; themes: 15) | Live · tone complete + thematic weather |
| TikTok Ads Library: Brand ad activity | Brand activity | SearchApi (TikTok Ads) | On-demand | rolling | | France (validated; per-brand) | Validated |
| Auctions: Hammer prices / lots | Trade | In-house (Barnebys / Salle) | Ongoing | multi-year | large | Global houses (Christie's, Sotheby's, Phillips…) | In-house · luxury tier |
| Eurostat: CCI + sub-indices, retail, jobs, inflation | Macro | Eurostat dissemination API | Monthly | 2019–26 | 516 | France (EU-wide capable) | Live |
| Yahoo Finance: CAC 40, EUR/USD | Trade (markets) | yfinance library | Daily | 2019–25 | 3,613 | France / EU | Live |
| Open-Meteo: Temp, rain, sun, wind | Control | Open-Meteo archive API | Daily | 2019–25 | 15,342 | France (Paris proxy) | Live · control only |

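
To make the Wikipedia row concrete, here is a minimal sketch of how a per-article request to the public Wikimedia pageviews REST endpoint can be built. The function name `pageviews_url`, its defaults and the example title are illustrative assumptions, not the harness's actual fetcher; only the endpoint path layout comes from the public API.

```python
from datetime import date
from urllib.parse import quote

# Base path of the public Wikimedia per-article pageviews endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(article: str, start: date, end: date,
                  project: str = "fr.wikipedia",
                  access: str = "all-access",
                  agent: str = "user",
                  granularity: str = "daily") -> str:
    """Build the request URL for one article's pageviews (hypothetical helper)."""
    # Titles use underscores for spaces and must be percent-encoded.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([
        BASE, project, access, agent, title, granularity,
        start.strftime("%Y%m%d"), end.strftime("%Y%m%d"),
    ])

# Illustrative title only; the real brand list is not reproduced here.
url = pageviews_url("Tour Eiffel", date(2019, 1, 1), date(2019, 1, 31))
```

Building the URL separately from the network call keeps the request shape easy to unit-test; the real fetch would add a descriptive User-Agent header and write each returned day as one long-format row.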
02

The eight signal families

why the stack is shaped this way

The sources aren't a random scrape pile — they map to eight ways a market reveals itself. The product's whole claim is triangulation: any one family is noise, but where families agree (or one diverges) is the read. This is the frame the catalogue is organised around.
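
One way to picture the triangulation idea is the sketch below: standardise each family's latest reading against its own history, then see which families move together and which one breaks away. The z-score, the threshold of 1.0 and the toy numbers are all assumptions made for illustration; the source does not specify how agreement or divergence is actually scored.

```python
from statistics import mean, pstdev

def zscore_latest(series):
    """Standardise the latest value against the series' own history."""
    mu, sigma = mean(series), pstdev(series)
    return 0.0 if sigma == 0 else (series[-1] - mu) / sigma

def triangulate(families, threshold=1.0):
    """Label each family's latest move, then split into majority and divergers."""
    z = {name: zscore_latest(vals) for name, vals in families.items()}
    direction = {n: (1 if v >= threshold else -1 if v <= -threshold else 0)
                 for n, v in z.items()}
    ups = [n for n, d in direction.items() if d == 1]
    downs = [n for n, d in direction.items() if d == -1]
    majority, minority = (ups, downs) if len(ups) >= len(downs) else (downs, ups)
    return {"agree": sorted(majority), "diverge": sorted(minority), "z": z}

# Toy series, not real readings: two families rise, one falls.
read = triangulate({
    "mood": [0, 0, 0, 0, 3],
    "intent": [1, 1, 1, 1, 4],
    "discourse": [5, 5, 5, 5, 2],
})
```

In this toy case mood and intent agree while discourse diverges, which is the "felt versus reported" split the families are meant to expose.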

Mood — what they play

Spotify (anchor) · Apple Music. Audio features (tempo, key, valence) are the validated mood proxy. The hardest signal to fake and the one no incumbent owns.

Narrative — what they watch

Netflix Top 10, now genre- & origin-tagged via Wikidata (CC0). Films vs TV, comfort vs prestige, local vs international — what stories a market is choosing.

Attention — what they look up

Wikipedia pageviews. Where curiosity concentrates — artists, brands, events — in near-real time.

Intent — what they search & buy

Google Trends (now reliable via SearchApi) · Amazon Music. Brand search-intent and purchase-side commerce — the closest layer to demand.

Discourse — what's said

GDELT tone + thematic weather (GKG themes — what coverage is about, and how it's shifting) · Google News headlines. Separates felt mood from reported mood.

Brand activity — what brands push

TikTok Ads Library. The supply side: who's advertising, how hard, with what creative — paired against demand-side attention & search.

Trade — what they value & exchange

Now live: auction signals via our in-house Barnebys / Salle data (Christie's, Sotheby's, Phillips) for the elite tier, plus CAC 40 / EUR-USD context. The hero's "trade" verb, kept.

Macro & control — the context

Eurostat · OECD (ground truth + commercial weather) · Open-Meteo (confound control, never published). What anchors and validates everything else.

All six verbs now have data behind them

The brand promise is "We read markets by what they play, watch, read, trade, buy and search" — six verbs. The live stack now delivers all six: play (Spotify/Apple), watch (Netflix + Wikidata genre), read (Wikipedia / GDELT tone & themes / Google News), trade (auction signals via Barnebys/Salle + markets context), buy (Amazon), search (Google Trends via SearchApi). The remaining build-out on Trade is promoting markets from control to signal (sector rotation + brand equity for listed names). See the catalogue and roadmap.

03

How it arrives

the KARLA harness

Nine of the ten sources run through the KARLA scraping harness as discrete fetchers (kwl_*.py), each writing to a long-format, country-partitioned, source-attributed SQLite table. Spotify is the exception — licensed via NPILABS and queried from AWS Athena. The contract that holds them together: every row carries a source and fetched_at, so every figure in a report can name its origin and date. That discipline is the receipts product.
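
The contract described above can be sketched as a single long-format SQLite table. The table name `signal`, the column names other than `source` and `fetched_at`, and the helper `write_rows` are hypothetical; the actual kwl_*.py schemas may differ.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical long-format layout: one row per (source, country, series, date).
SCHEMA = """
CREATE TABLE IF NOT EXISTS signal (
    source     TEXT NOT NULL,   -- origin of the row, e.g. 'eurostat'
    country    TEXT NOT NULL,   -- partition key, e.g. 'FR'
    series     TEXT NOT NULL,   -- which metric within the source
    obs_date   TEXT NOT NULL,   -- ISO date the value refers to
    value      REAL,
    fetched_at TEXT NOT NULL,   -- UTC timestamp of collection
    PRIMARY KEY (source, country, series, obs_date)
)
"""

def write_rows(conn, source, country, rows):
    """Insert (series, obs_date, value) tuples, stamping source and fetched_at."""
    fetched_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    conn.executemany(
        "INSERT OR REPLACE INTO signal VALUES (?, ?, ?, ?, ?, ?)",
        [(source, country, s, d, v, fetched_at) for s, d, v in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Placeholder value, not a real reading.
write_rows(conn, "eurostat", "FR", [("cci", "2026-05-01", 1.0)])
```

Because the stamp is applied inside the writer rather than by each fetcher, no row can reach the store without an origin and a collection time.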

| Mechanism | Sources | Reliability | Note |
|---|---|---|---|
| Official API | Eurostat, Wikipedia, GDELT, Open-Meteo, Yahoo Finance | High | Stable contracts, no parser rot; the backbone |
| RSS / public feed | Apple Music, Netflix Tudum | High | Decade-stable endpoints; Apple serves current-only (no backfill) |
| Library wrapper | Google Trends (pytrends) | Medium | Rate-limited; exponential backoff in place |
| HTML scrape | Amazon Music | Medium | Brittle to layout change; artist enrichment pending |
| Licensed warehouse | Spotify (NPILABS Athena) | High | The deep asset; audio-features endpoint frozen by Spotify Nov 2024 |
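
The "exponential backoff in place" note for rate-limited sources can be illustrated with a generic retry wrapper. This is a sketch under assumptions: `with_backoff`, its defaults and the jitter factor are invented for illustration and are not taken from the harness.

```python
import random
import time

def with_backoff(fetch, max_tries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so parallel fetchers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo with a stand-in fetcher that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = with_backoff(flaky, sleep=waits.append)
```

Passing `sleep` as a parameter lets the demo run instantly; a production fetcher would leave the default and would likely catch only the specific rate-limit error rather than every exception.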

The two honest caveats, up front

1. France-centric. Seven of ten sources are France-only today. The schemas are country-partitioned from day one, so widening is config not rebuild — but the multi-market story is a roadmap item, not a current fact.

2. The pipeline must run. The deep asset is accumulated history — charts are ephemeral and cannot be backfilled once missed. Every day the fetchers run, the moat deepens; the cron is the most strategic line of code in the company. (See Monetisation for why this is the real asset.)
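
Since missed chart days cannot be recovered, a simple guard is to check the collected dates for holes after each run. The sketch below assumes snapshot dates are available as a list; `missing_days` and the sample dates are hypothetical.

```python
from datetime import date, timedelta

def missing_days(collected, start, end):
    """Return every date in [start, end] with no snapshot on record."""
    have = set(collected)
    span = (end - start).days + 1
    return [d for d in (start + timedelta(days=i) for i in range(span))
            if d not in have]

# Illustrative dates only: 3 April and 5 April were never fetched.
collected = [date(2026, 4, 1), date(2026, 4, 2), date(2026, 4, 4)]
gaps = missing_days(collected, date(2026, 4, 1), date(2026, 4, 5))
```

A check like this cannot restore a lost day, but it turns a silent cron failure into a visible one while the next snapshot can still be caught.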