examples/notebooks/05_intra_run.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 05 \u2014 Intra-run dynamics\n",
    "\n",
    "Within-run signals from lap-level splits: cardiac drift, cadence/stride, route-controlled pace, HR-zone distribution.\n",
    "\n",
    "**Data note.** Activities synced through the Garmin live API (`sync.py`) carry no FIT file, so sections 1\u20133 run on `activity_splits` (per-mile laps, ~2 000 rows). Activities ingested from the official zip export via `ingest_export.py` have a row in `activity_fit_files` and per-second data; sections 1b and 4 use it where available. The lap-level analyses would upgrade to per-second resolution by changing only the loader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.insert(0, '..')\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from analysis import (\n",
    "    open_conn, load_splits, decoupling,\n",
    "    assign_hr_zone, cluster_routes, haversine_km,\n",
    ")\n",
    "\n",
    "conn = open_conn()\n",
    "splits = load_splits(conn)  # running only by default\n",
    "print(f'{len(splits):,} splits across {splits.activity_id.nunique():,} runs, {splits.start_time_local.min().date()} \u2192 {splits.start_time_local.max().date()}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Cardiac drift (Pa:Hr decoupling)\n",
    "\n",
    "Within a single run, divide the laps into a first half and a second half. For each half compute the duration-weighted ratio `speed / HR`, essentially distance covered per heartbeat and a widely used aerobic-efficiency index. Decoupling is how much that ratio falls from the first half to the second.\n",
    "\n",
    "$$\\text{decoupling}\\;\\% = \\left(\\frac{(\\text{speed}/\\text{HR})_{1st}}{(\\text{speed}/\\text{HR})_{2nd}} - 1\\right) \\times 100$$\n",
    "\n",
    "Friel's rule of thumb for steady aerobic runs:\n",
    "- **< 5 %**  \u2014 aerobically developed\n",
    "- **5\u201310 %** \u2014 moderate drift; sustainable\n",
    "- **> 10 %** \u2014 significant drift; pace was unsustainable or it was a hot/hard day\n",
    "\n",
    "Negative values mean you ran *more efficiently* in the second half (a negative split or conservative opener)."
   ]
  },
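  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the formula above on a hand-made six-lap run. It is an illustration, not the `decoupling` helper from `analysis`: the function name `lap_decoupling_pct`, the column names `speed_mps` and `duration_s`, and the reading of \"duration-weighted\" as a weighted mean of speed divided by a weighted mean of HR are all assumptions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def lap_decoupling_pct(laps):\n",
    "    # Hypothetical helper: first-half vs second-half speed/HR for one run\n",
    "    half = len(laps) // 2\n",
    "\n",
    "    def efficiency(part):\n",
    "        w = part['duration_s']\n",
    "        return np.average(part['speed_mps'], weights=w) / np.average(part['avg_hr'], weights=w)\n",
    "\n",
    "    return (efficiency(laps.iloc[:half]) / efficiency(laps.iloc[half:]) - 1) * 100\n",
    "\n",
    "# Toy run: constant speed while HR climbs from 140 to 150 bpm in the second half\n",
    "toy = pd.DataFrame({\n",
    "    'speed_mps':  [3.0] * 6,\n",
    "    'avg_hr':     [140, 140, 140, 150, 150, 150],\n",
    "    'duration_s': [500] * 6,\n",
    "})\n",
    "print(round(lap_decoupling_pct(toy), 2))  # (3/140) / (3/150) - 1 = 7.14 %"
   ]
  },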
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dec = decoupling(splits, min_splits=6)\n",
    "print(f'{len(dec)} runs with \u2265 6 splits')\n",
    "dec['decoupling_pct'].describe().round(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))\n",
    "\n",
    "ax = axes[0]\n",
    "ax.hist(dec['decoupling_pct'], bins=30, edgecolor='white')\n",
    "for x, lab, color in [(5, 'good <5%', '#2a9d8f'), (10, 'caution 10%', '#e76f51')]:\n",
    "    ax.axvline(x, color=color, ls='--', lw=1.2, label=lab)\n",
    "ax.axvline(0, color='gray', lw=0.8)\n",
    "ax.set_xlabel('decoupling (%)'); ax.set_ylabel('runs')\n",
    "ax.set_title(f'Pa:Hr decoupling distribution (n={len(dec)})')\n",
    "ax.legend()\n",
    "\n",
    "ax = axes[1]\n",
    "sc = ax.scatter(dec['start_time_local'], dec['decoupling_pct'],\n",
    "                c=dec['distance_km'], cmap='viridis', alpha=0.75, s=28)\n",
    "ax.axhline(5, color='#2a9d8f', ls='--', lw=1)\n",
    "ax.axhline(10, color='#e76f51', ls='--', lw=1)\n",
    "ax.axhline(0, color='gray', lw=0.6)\n",
    "ax.set_ylabel('decoupling (%)'); ax.set_title('decoupling over time (color = distance km)')\n",
    "plt.colorbar(sc, ax=ax, label='distance (km)')\n",
    "fig.autofmt_xdate()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Aerobic runs only \u2014 the clean view\n",
    "\n",
    "Decoupling is only interpretable on **steady aerobic** efforts. Filter to runs \u2265 8 km with avg HR < 165 (below the Z4 boundary and well under the LTHR of 182 listed in \u00a74) and look at the trend. Less drift over time = better aerobic conditioning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aerobic = dec[(dec['distance_km'] >= 8) & (dec['avg_hr'] < 165)].copy()\n",
    "aerobic['quarter'] = aerobic['start_time_local'].dt.to_period('Q').dt.to_timestamp()\n",
    "print(f'{len(aerobic)} aerobic runs')\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(11, 4.5))\n",
    "ax.scatter(aerobic['start_time_local'], aerobic['decoupling_pct'],\n",
    "           c=aerobic['avg_hr'], cmap='magma_r', s=40, alpha=0.85)\n",
    "\n",
    "# rolling median (smooth trend)\n",
    "aerobic_sorted = aerobic.sort_values('start_time_local')\n",
    "rolling = aerobic_sorted.set_index('start_time_local')['decoupling_pct'].rolling('120D', min_periods=5).median()\n",
    "ax.plot(rolling.index, rolling.values, color='black', lw=2, label='120-day rolling median')\n",
    "\n",
    "ax.axhline(5, color='#2a9d8f', ls='--', lw=1)\n",
    "ax.axhline(10, color='#e76f51', ls='--', lw=1)\n",
    "ax.set_ylabel('decoupling (%)')\n",
    "ax.set_title('Cardiac drift on aerobic runs (\u22658 km, avg HR < 165)')\n",
    "ax.legend(loc='upper right')\n",
    "fig.autofmt_xdate()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Year-over-year aerobic decoupling summary\n",
    "aerobic.groupby('year').agg(\n",
    "    n=('decoupling_pct', 'size'),\n",
    "    median_drift=('decoupling_pct', 'median'),\n",
    "    mean_drift=('decoupling_pct', 'mean'),\n",
    "    pct_under_5=('decoupling_pct', lambda s: (s < 5).mean() * 100),\n",
    ").round(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1b. Per-second decoupling \u2014 race-day deep dive\n",
    "\n",
    "Lap-level decoupling (above) is coarse. With FIT files linked (since the takeout-export ingest), we can read the per-second `heart_rate` and `enhanced_speed` directly and compute Friel's decoupling without the noise from aid-station stops and lap rounding.\n",
    "\n",
    "**Method:**\n",
    "1. Drop the first 5 min (warmup) and last 2 min (cooldown / finish sprint).\n",
    "2. Drop records with speed < 0.5 m/s so that aid-station pauses don't drag the mean down.\n",
    "3. Slice the moving time into equal-time chunks (halves or quartiles).\n",
    "4. For each chunk: `efficiency = mean(speed) / mean(HR)`.\n",
    "5. `decoupling % = (eff_first / eff_chunk \u2212 1) \u00d7 100` \u2014 positive = drift.\n",
    "\n",
    "Friel's rule: < 5% on a steady aerobic run = aerobically developed; > 10% = unsustainable pacing or fueling deficit. Race-day numbers are expected to be higher than training (you push the back half), but *how much* higher matters."
   ]
  },
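  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The five steps above can be sketched on synthetic data as follows. This is an illustration only, not the source of `fit_decoupling` in `analysis`: the name `sketch_fit_decoupling`, its parameters, and the assumption of one record per second (so equal row counts stand in for equal moving time) are all hypothetical. The column names `elapsed_s`, `speed_mps` and `heart_rate` follow the records frames used later in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def sketch_fit_decoupling(rec, segments=2, warmup_s=300, cooldown_s=120, min_speed=0.5):\n",
    "    # Hypothetical re-implementation of steps 1-5, not the analysis.fit_decoupling source\n",
    "    t_end = rec['elapsed_s'].max()\n",
    "    keep = rec[(rec['elapsed_s'] >= warmup_s)\n",
    "               & (rec['elapsed_s'] <= t_end - cooldown_s)\n",
    "               & (rec['speed_mps'] >= min_speed)]\n",
    "    # Equal-time chunks of moving time; assumes 1 Hz records, so equal row counts\n",
    "    chunk = np.arange(len(keep)) * segments // len(keep) + 1\n",
    "    grouped = keep.groupby(chunk)\n",
    "    eff = grouped['speed_mps'].mean() / grouped['heart_rate'].mean()\n",
    "    # Segment 1 is the baseline (0 %); positive values in later segments mean drift\n",
    "    return ((eff.iloc[0] / eff - 1) * 100).rename('decoupling_pct')\n",
    "\n",
    "# Toy 2-hour run at 1 Hz: steady 3 m/s, one 5-minute stop, HR drifting from 140 to 150 bpm\n",
    "t = np.arange(7200)\n",
    "toy = pd.DataFrame({\n",
    "    'elapsed_s': t,\n",
    "    'speed_mps': np.where((t >= 3000) & (t < 3300), 0.0, 3.0),\n",
    "    'heart_rate': 140 + 10 * t / 7200,\n",
    "})\n",
    "sketch_fit_decoupling(toy, segments=4).round(2)"
   ]
  },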
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from analysis import load_fit_records, fit_decoupling, fit_rolling_efficiency\n",
    "\n",
    "race_meta = pd.read_sql('''\n",
    "    SELECT a.activity_id, a.start_time_local, a.distance_m/1000 AS km, a.avg_hr\n",
    "    FROM activities a JOIN activity_fit_files f USING(activity_id)\n",
    "    WHERE a.distance_m >= 45000 AND a.distance_m <= 60000\n",
    "      AND a.activity_type='running'\n",
    "    ORDER BY a.start_time_local\n",
    "''', conn, parse_dates=['start_time_local'])\n",
    "print(f'{len(race_meta)} prior 50K-class races with FIT linked:')\n",
    "race_meta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load each race's FIT once and cache the records frames\n",
    "race_records = {}\n",
    "for _, r in race_meta.iterrows():\n",
    "    aid = int(r['activity_id'])\n",
    "    race_records[aid] = load_fit_records(conn, aid)\n",
    "    print(f\"  {r['start_time_local'].date()}  aid={aid}  records={len(race_records[aid]):,}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Halves and quartiles \u2014 when does the drift start?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "halves = []\n",
    "quarts = []\n",
    "for _, r in race_meta.iterrows():\n",
    "    aid = int(r['activity_id'])\n",
    "    h = fit_decoupling(race_records[aid], segments=2)\n",
    "    h.insert(0, 'race', r['start_time_local'].date())\n",
    "    halves.append(h)\n",
    "    q = fit_decoupling(race_records[aid], segments=4)\n",
    "    q.insert(0, 'race', r['start_time_local'].date())\n",
    "    quarts.append(q)\n",
    "halves_df = pd.concat(halves, ignore_index=True)\n",
    "quarts_df = pd.concat(quarts, ignore_index=True)\n",
    "\n",
    "print('Per-race half-by-half decoupling:')\n",
    "(halves_df.pivot(index='race', columns='segment', values='decoupling_pct')\n",
    "          .round(1).rename(columns={1:'Q1+Q2', 2:'Q3+Q4'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Quartile decoupling \u2014 where the drift actually starts:')\n",
    "(quarts_df.pivot(index='race', columns='segment', values='decoupling_pct')\n",
    "          .round(1).rename(columns={1:'Q1',2:'Q2',3:'Q3',4:'Q4'}))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualise quartile decoupling \u2014 bars per race, grouped by quartile\n",
    "fig, ax = plt.subplots(figsize=(11, 4.5))\n",
    "races = sorted(quarts_df['race'].unique())\n",
    "x = np.arange(len(races))\n",
    "w = 0.2\n",
    "colors = ['#2a9d8f', '#e9c46a', '#f4a261', '#e76f51']  # cool \u2192 hot\n",
    "for i in range(1, 5):\n",
    "    vals = [quarts_df[(quarts_df.race == r) & (quarts_df.segment == i)]['decoupling_pct'].iloc[0]\n",
    "            for r in races]\n",
    "    ax.bar(x + (i - 2.5) * w, vals, w, color=colors[i-1], label=f'Q{i}')\n",
    "ax.axhline(0, color='black', lw=0.5)\n",
    "ax.axhline(10, color='gray', ls='--', lw=1, label='Friel \"unsustainable\" threshold')\n",
    "ax.set_xticks(x)\n",
    "ax.set_xticklabels([str(r) for r in races])\n",
    "ax.set_ylabel('decoupling (%)')\n",
    "ax.set_title('Per-second decoupling by race quartile \u2014 the wall lands in Q3 every time')\n",
    "ax.legend(loc='upper left', ncol=5)\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Rolling efficiency curves \u2014 when does the wheels-come-off moment hit?\n",
    "\n",
    "5-minute rolling speed/HR over elapsed time. Flat = pacing matches HR. Falling curve = decoupling in progress. The y-axis is the same physical quantity Friel's method aggregates, just plotted continuously."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(len(race_meta), 1, figsize=(13, 2.5 * len(race_meta)),\n",
    "                          sharex=True, squeeze=False)\n",
    "axes = axes.flatten()\n",
    "for ax, (_, r) in zip(axes, race_meta.iterrows()):\n",
    "    aid = int(r['activity_id'])\n",
    "    rolled = fit_rolling_efficiency(race_records[aid], window_s=300)\n",
    "    valid = rolled.dropna(subset=['rolling_efficiency'])\n",
    "    ax.plot(valid['elapsed_min'], valid['rolling_efficiency'], color='#264653', lw=1.5)\n",
    "    # Normalise against the first 30 minutes' mean to show % drop\n",
    "    base = valid.loc[valid['elapsed_min'] < 30, 'rolling_efficiency'].mean()\n",
    "    if base and base > 0:\n",
    "        ax2 = ax.twinx()\n",
    "        ax2.plot(valid['elapsed_min'], (valid['rolling_efficiency'] / base - 1) * 100,\n",
    "                 color='#e76f51', lw=1, alpha=0.6)\n",
    "        ax2.set_ylabel('% vs first 30 min', color='#e76f51', fontsize=9)\n",
    "        ax2.axhline(0, color='#e76f51', ls=':', lw=0.6, alpha=0.5)\n",
    "    ax.set_ylabel('speed / HR')\n",
    "    ax.set_title(f\"{r['start_time_local'].date()}  \u2014  {r['km']:.1f} km, avg HR {r['avg_hr']:.0f}\",\n",
    "                 fontsize=10)\n",
    "axes[-1].set_xlabel('elapsed minutes')\n",
    "fig.suptitle('Rolling efficiency through each race (5-min window)', y=1.01)\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### HR and pace traces, side by side\n",
    "\n",
    "Same data, separated: HR (left axis, red) and pace (right axis, dark, inverted so faster is up), both as 5-minute rolling means. The interesting moments are where the curves *diverge*: HR climbing while pace stays flat (drift), or HR steady while pace falls (just tired legs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(len(race_meta), 1, figsize=(13, 2.8 * len(race_meta)),\n",
    "                          sharex=True, squeeze=False)\n",
    "axes = axes.flatten()\n",
    "for ax, (_, r) in zip(axes, race_meta.iterrows()):\n",
    "    aid = int(r['activity_id'])\n",
    "    rec = race_records[aid].dropna(subset=['heart_rate','speed_mps','elapsed_s'])\n",
    "    rec = rec[rec['speed_mps'] > 0.5]\n",
    "    em = rec['elapsed_s'] / 60\n",
    "    rolled_hr = rec['heart_rate'].rolling(300, min_periods=30).mean()\n",
    "    rolled_pace = (1 / rec['speed_mps']) * 1000 / 60\n",
    "    rolled_pace = rolled_pace.rolling(300, min_periods=30).mean()\n",
    "    ax.plot(em, rolled_hr, color='#9b2226', lw=1.4, label='HR (5-min avg)')\n",
    "    ax.set_ylabel('HR (bpm)', color='#9b2226')\n",
    "    ax.tick_params(axis='y', labelcolor='#9b2226')\n",
    "    ax2 = ax.twinx()\n",
    "    ax2.plot(em, rolled_pace, color='#264653', lw=1.4, label='pace (5-min avg)')\n",
    "    ax2.set_ylabel('pace (min/km)', color='#264653')\n",
    "    ax2.tick_params(axis='y', labelcolor='#264653')\n",
    "    ax2.invert_yaxis()  # faster = up\n",
    "    ax.set_title(f\"{r['start_time_local'].date()}  \u2014  {r['km']:.1f} km\", fontsize=10)\n",
    "axes[-1].set_xlabel('elapsed minutes')\n",
    "fig.suptitle('HR (red) and pace (dark) \u2014 divergence = decoupling', y=1.01)\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Per-second vs per-mile decoupling \u2014 sanity check\n",
    "\n",
    "How does the FIT-derived number compare to the lap-level decoupling we computed in \u00a71? The per-second method excludes stopped time and lap rounding, so it should come out **lower** than the per-mile number for the same race, while the qualitative ranking of races should agree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pull the per-mile (\u00a71) value for each race and compare to per-second\n",
    "lap_dec = dec.set_index('activity_id')['decoupling_pct']\n",
    "rows = []\n",
    "for _, r in race_meta.iterrows():\n",
    "    aid = int(r['activity_id'])\n",
    "    ps = halves_df[(halves_df.race == r['start_time_local'].date()) & (halves_df.segment == 2)]['decoupling_pct'].iloc[0]\n",
    "    lap = lap_dec.get(aid, float('nan'))\n",
    "    rows.append({'race': r['start_time_local'].date(), 'km': r['km'],\n",
    "                 'per_mile_decoupling_pct': round(lap, 1),\n",
    "                 'per_second_decoupling_pct': round(ps, 1),\n",
    "                 'delta': round(lap - ps, 1) if not pd.isna(lap) else None})\n",
    "pd.DataFrame(rows)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What this means for the 50-mile\n",
    "\n",
    "The per-second view localises the drift: in every prior race the wheels come off around the **4-hour mark** (between Q2 and Q3). For the 50-mile that's roughly halfway through the race \u2014 exactly when fueling errors stop being recoverable.\n",
    "\n",
    "Three concrete implications:\n",
    "\n",
    "1. **Front-load fueling.** A common rule of thumb is that stored glycogen covers roughly 90 min of running, after which performance falls off without external carbs. Q1 (the easy quarter) shouldn't be a fueling holiday: eat at every aid station, every hour, from the start.\n",
    "2. **Recalibrate pace by HR, not by feel.** The rolling-efficiency plots show HR rising while pace falls. Setting an HR ceiling (e.g. Z2 top = 143 bpm for the long run, slightly higher for race) and *enforcing it* would flatten the Q3 collapse.\n",
    "3. **What success looks like on Sept 12.** A 50-mile race executed cleanly should look like the *first half* of these 50K curves repeated twice. If the Q3 wall reappears around hour 4\u20135, treat it as a planned aid-station break to top up calories before continuing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Cadence and stride length\n",
    "\n",
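    "At a given pace, more experienced runners tend to run with a **higher cadence and shorter stride** (speed = cadence \u00d7 stride length, so at a fixed pace one rises as the other falls). Watching cadence-vs-pace and stride-vs-pace by year shows whether form is shifting independently of fitness.\n",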
    "\n",
    "Garmin's `averageRunCadence` per split is already **both-legs** steps-per-minute (typical running range 150\u2013185). `strideLength` is in cm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "form = splits.dropna(subset=['averageRunCadence', 'strideLength', 'pace_min_per_km']).copy()\n",
    "form = form[form['averageRunCadence'] > 0].copy()  # zeros = walking/standing intervals\n",
    "form['cadence_spm'] = form['averageRunCadence']  # already both-legs SPM\n",
    "form['stride_m'] = form['strideLength'] / 100\n",
    "form = form[(form['cadence_spm'] >= 140) & (form['cadence_spm'] <= 200)]  # drop walks/junk\n",
    "print(f'{len(form):,} clean form splits, {form.activity_id.nunique()} runs')\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 5), sharex=True)\n",
    "years = sorted(form['year'].unique())\n",
    "cmap = plt.cm.viridis(np.linspace(0, 0.9, len(years)))\n",
    "\n",
    "for c, y in zip(cmap, years):\n",
    "    d = form[form['year'] == y]\n",
    "    axes[0].scatter(d['pace_min_per_km'], d['cadence_spm'], s=8, alpha=0.35, color=c, label=str(y))\n",
    "    axes[1].scatter(d['pace_min_per_km'], d['stride_m'], s=8, alpha=0.35, color=c, label=str(y))\n",
    "\n",
    "axes[0].set_ylabel('cadence (steps/min, both legs)')\n",
    "axes[1].set_ylabel('stride length (m)')\n",
    "for ax in axes:\n",
    "    ax.set_xlabel('pace (min/km)')\n",
    "axes[0].invert_xaxis()  # x-axis is shared (sharex=True): invert once, faster paces to the right\n",
    "axes[0].legend(title='year', loc='lower left', fontsize=8)\n",
    "axes[0].set_title('Cadence vs pace')\n",
    "axes[1].set_title('Stride length vs pace')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Form at a controlled pace\n",
    "\n",
    "Bin splits into a narrow easy-pace band (5:30\u20136:30 min/km) and look at cadence / stride / vertical metrics year-over-year. Holding pace constant strips out the obvious \"faster = higher cadence\" effect and isolates technique drift."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "band = form[(form['pace_min_per_km'] >= 5.5) & (form['pace_min_per_km'] <= 6.5)].copy()\n",
    "vert = splits.dropna(subset=['verticalOscillation', 'verticalRatio', 'groundContactTime'])\n",
    "vert = vert[(vert['pace_min_per_km'] >= 5.5) & (vert['pace_min_per_km'] <= 6.5)]\n",
    "\n",
    "summary = band.groupby('year').agg(\n",
    "    n_splits=('cadence_spm', 'size'),\n",
    "    cadence_med=('cadence_spm', 'median'),\n",
    "    stride_med=('stride_m', 'median'),\n",
    ")\n",
    "vsum = vert.groupby('year').agg(\n",
    "    vert_osc_cm=('verticalOscillation', 'median'),\n",
    "    vert_ratio=('verticalRatio', 'median'),\n",
    "    gct_ms=('groundContactTime', 'median'),\n",
    ")\n",
    "summary = summary.join(vsum).round(2)\n",
    "summary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n",
    "summary['cadence_med'].plot(kind='bar', ax=axes[0], color='#264653')\n",
    "axes[0].set_ylabel('cadence (spm)'); axes[0].set_title('Cadence at easy pace')\n",
    "axes[0].set_ylim(summary['cadence_med'].min() - 3, summary['cadence_med'].max() + 3)\n",
    "\n",
    "summary['stride_med'].plot(kind='bar', ax=axes[1], color='#2a9d8f')\n",
    "axes[1].set_ylabel('stride length (m)'); axes[1].set_title('Stride at easy pace')\n",
    "axes[1].set_ylim(summary['stride_med'].min() - 0.05, summary['stride_med'].max() + 0.05)\n",
    "\n",
    "if summary['vert_osc_cm'].notna().any():\n",
    "    summary['vert_osc_cm'].plot(kind='bar', ax=axes[2], color='#e76f51')\n",
    "    axes[2].set_ylabel('vertical oscillation (cm)'); axes[2].set_title('Vertical bounce at easy pace')\n",
    "    axes[2].set_ylim(summary['vert_osc_cm'].min() - 0.5, summary['vert_osc_cm'].max() + 0.5)\n",
    "else:\n",
    "    axes[2].text(0.5, 0.5, 'no vertical-osc data', ha='center', va='center', transform=axes[2].transAxes)\n",
    "\n",
    "for ax in axes:\n",
    "    ax.set_xlabel('year')\n",
    "fig.suptitle('Form metrics in the 5:30\u20136:30 min/km band, year over year')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Route clustering \u2014 pace controlled for terrain\n",
    "\n",
    "Raw pace year-over-year mixes terrain, weather, intent. Cluster runs by their **start coordinates** (greedy haversine, 250 m radius) and you get \"my usual routes.\" Within a cluster the route is roughly the same, so pace differences are mostly fitness, not geography."
   ]
  },
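  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, a greedy start-point clustering could look like the sketch below. It is not the `cluster_routes` or `haversine_km` code from `analysis`: the names `haversine_km_sketch` and `greedy_clusters_sketch`, the first-match rule, and the relabelling of one-member clusters to -1 are assumptions chosen to mirror the description above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def haversine_km_sketch(lat1, lon1, lat2, lon2):\n",
    "    # Great-circle distance in km, using a mean Earth radius of 6371 km\n",
    "    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))\n",
    "    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2\n",
    "    return 2 * 6371.0 * np.arcsin(np.sqrt(a))\n",
    "\n",
    "def greedy_clusters_sketch(lats, lons, radius_km=0.25):\n",
    "    # Each run joins the first cluster whose seed point lies within radius_km,\n",
    "    # otherwise it seeds a new cluster\n",
    "    seeds, labels = [], []\n",
    "    for lat, lon in zip(lats, lons):\n",
    "        for k, (slat, slon) in enumerate(seeds):\n",
    "            if haversine_km_sketch(lat, lon, slat, slon) <= radius_km:\n",
    "                labels.append(k)\n",
    "                break\n",
    "        else:\n",
    "            seeds.append((lat, lon))\n",
    "            labels.append(len(seeds) - 1)\n",
    "    labels = np.array(labels, dtype=int)\n",
    "    # Clusters with a single member are relabelled -1 (singletons)\n",
    "    counts = np.bincount(labels)\n",
    "    labels[counts[labels] == 1] = -1\n",
    "    return labels\n",
    "\n",
    "# Two starts about 110 m apart share a cluster; the third, one degree north, is a singleton\n",
    "print(greedy_clusters_sketch([40.000, 40.001, 41.000], [-75.0, -75.0, -75.0]))"
   ]
  },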
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "starts = (splits.dropna(subset=['startLatitude', 'startLongitude'])\n",
    "                  .groupby('activity_id')\n",
    "                  .agg(lat=('startLatitude', 'first'),\n",
    "                       lon=('startLongitude', 'first'),\n",
    "                       start_time=('start_time_local', 'first'),\n",
    "                       distance_km=('distance_m', lambda s: s.sum() / 1000),\n",
    "                       avg_hr=('avg_hr', 'mean'),\n",
    "                       avg_pace=('pace_min_per_km', 'mean')))\n",
    "starts['cluster'] = cluster_routes(starts['lat'].values, starts['lon'].values, radius_km=0.25)\n",
    "print(f'{len(starts)} runs with start coords; {(starts.cluster >= 0).sum()} clustered, {(starts.cluster == -1).sum()} singletons')\n",
    "\n",
    "top = (starts[starts.cluster >= 0]\n",
    "       .groupby('cluster')\n",
    "       .agg(n=('cluster', 'size'),\n",
    "            lat=('lat', 'median'),\n",
    "            lon=('lon', 'median'),\n",
    "            first=('start_time', 'min'),\n",
    "            last=('start_time', 'max'),\n",
    "            med_dist=('distance_km', 'median'))\n",
    "       .sort_values('n', ascending=False))\n",
    "top.head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(9, 7))\n",
    "\n",
    "# unclustered points faint\n",
    "unc = starts[starts.cluster == -1]\n",
    "ax.scatter(unc['lon'], unc['lat'], color='lightgray', s=12, alpha=0.6, label='singletons')\n",
    "\n",
    "# top-10 clusters colored\n",
    "top10 = top.head(10).index.tolist()\n",
    "cmap = plt.cm.tab10(np.linspace(0, 1, len(top10)))\n",
    "for c, cl in zip(cmap, top10):\n",
    "    pts = starts[starts.cluster == cl]\n",
    "    ax.scatter(pts['lon'], pts['lat'], color=c, s=40, alpha=0.8,\n",
    "               label=f'route #{cl} (n={len(pts)})')\n",
    "\n",
    "ax.set_xlabel('longitude'); ax.set_ylabel('latitude')\n",
    "ax.set_title('Run start points \u2014 top 10 recurring routes')\n",
    "ax.legend(loc='best', fontsize=8)\n",
    "ax.set_aspect('equal', adjustable='datalim')\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pace progression within each frequent route\n",
    "\n",
    "Now restrict to clusters with \u22655 runs and plot pace over time per route. Slopes here are much cleaner than the global pace trend because terrain is held constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "frequent = top[top['n'] >= 5].index.tolist()\n",
    "print(f'{len(frequent)} routes with \u22655 runs')\n",
    "\n",
    "if frequent:\n",
    "    n_cols = min(3, len(frequent))\n",
    "    n_rows = (len(frequent) + n_cols - 1) // n_cols\n",
    "    fig, axes = plt.subplots(n_rows, n_cols, figsize=(5 * n_cols, 3.2 * n_rows),\n",
    "                              squeeze=False, sharey=True)\n",
    "    for ax, cl in zip(axes.flat, frequent):\n",
    "        d = (starts[starts.cluster == cl]\n",
    "             .dropna(subset=['avg_pace'])\n",
    "             .sort_values('start_time'))\n",
    "        ax.scatter(d['start_time'], d['avg_pace'], s=30, alpha=0.75, color='#264653')\n",
    "        if len(d) >= 3:\n",
    "            # rolling median by date\n",
    "            rd = d.set_index('start_time')['avg_pace'].rolling('120D', min_periods=2).median()\n",
    "            ax.plot(rd.index, rd.values, color='#e76f51', lw=1.5)\n",
    "        if not ax.yaxis_inverted():  # y-axis is shared (sharey=True): invert only once\n",
    "            ax.invert_yaxis()  # faster up\n",
    "        ax.set_title(f'route #{cl} \u2014 n={len(d)}, ~{top.loc[cl, \"med_dist\"]:.1f} km')\n",
    "        ax.set_ylabel('pace (min/km)')\n",
    "    # blank unused axes\n",
    "    for ax in axes.flat[len(frequent):]:\n",
    "        ax.set_visible(False)\n",
    "    fig.suptitle('Per-route pace over time (terrain held roughly constant)')\n",
    "    fig.tight_layout()\n",
    "else:\n",
    "    print('No clusters with \u22655 runs.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. HR-zone time-in-zone (Garmin-configured zones, per-second when possible)\n",
    "\n",
    "**Zones come from Garmin's `heartRateZones.json` (training method: HR_MAX),** not estimated from observed HR. Lactate-threshold HR sits at 182 inside Z4.\n",
    "\n",
    "| Zone | range (bpm) | feel | role |\n",
    "|------|-------------|------|------|\n",
    "| Z1   | 102\u2013122    | walk / recovery | active rest |\n",
    "| Z2   | 123\u2013143    | conversational | **long-run target** |\n",
    "| Z3   | 144\u2013164    | tempo | the \"junk-miles middle\" |\n",
    "| Z4   | 165\u2013185    | threshold (LTHR=182) | hard sustained |\n",
    "| Z5   | 186\u2013209    | VO\u2082 max | intervals |\n",
    "\n",
    "For each activity, time-in-zone comes from `activity_time_in_zone` (precomputed by `compute_time_in_zone.py`):\n",
    "- **`source='fit'`** \u2014 per-second HR from the FIT file. Each record's `dt` (typically 1 s) goes into whichever zone its HR falls in. Accurate even when laps span zone boundaries.\n",
    "- **`source='lap'`** \u2014 fallback for activities without a linked FIT. The whole lap's duration is assigned to whichever zone the *lap's average* HR sits in. Smears across boundaries, biases toward middle zones.\n",
    "\n",
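    "**Polarized-training rule of thumb (Seiler):** elite endurance athletes do roughly 80% of their training at low intensity and roughly 20% at high intensity, with little in between. Mapped onto these five zones that is approximately Z1+Z2 versus Z4+Z5, with Z3 kept small."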
   ]
  },
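  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A toy lap shows why the two sources can disagree. The sketch below is illustrative only and does not reproduce `compute_time_in_zone.py`: `ZONE_EDGES` and `zone_of` are hypothetical names, the lower bounds are copied from the zone table above, and any HR above 209 is simply counted as Z5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "ZONE_EDGES = [102, 123, 144, 165, 186]  # lower bounds of Z1..Z5 (bpm), from the table above\n",
    "\n",
    "def zone_of(hr):\n",
    "    # 0 = below Z1, 1..5 = Z1..Z5\n",
    "    return int(np.searchsorted(ZONE_EDGES, hr, side='right'))\n",
    "\n",
    "# Toy 10-minute lap at 1 Hz: 5 min at 140 bpm (Z2), then 5 min at 150 bpm (Z3)\n",
    "hr_per_second = np.array([140] * 300 + [150] * 300)\n",
    "\n",
    "per_second = np.bincount([zone_of(h) for h in hr_per_second], minlength=6)\n",
    "lap_zone = zone_of(hr_per_second.mean())  # the lap average is 145 bpm\n",
    "\n",
    "print('per-second seconds in Z2, Z3:', per_second[2], per_second[3])  # 300 300\n",
    "print(f'lap method puts all 600 s in Z{lap_zone}')  # Z3"
   ]
  },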
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from analysis import HR_ZONES_USER\n",
    "\n",
    "tiz = pd.read_sql('''\n",
    "    SELECT t.activity_id, t.z1_s, t.z2_s, t.z3_s, t.z4_s, t.z5_s, t.total_s, t.source,\n",
    "           a.start_time_local, a.activity_type, a.distance_m, a.duration_s\n",
    "    FROM activity_time_in_zone t\n",
    "    JOIN activities a USING(activity_id)\n",
    "    WHERE a.activity_type IN ('running','trail_running')\n",
    "''', conn, parse_dates=['start_time_local'])\n",
    "tiz['week']  = tiz['start_time_local'].dt.to_period('W-SUN').dt.start_time  # weeks run Monday to Sunday ('W-MON' would start on Tuesday)\n",
    "tiz['year']  = tiz['start_time_local'].dt.year\n",
    "\n",
    "print(f'{len(tiz)} activities with cached time-in-zone')\n",
    "print(' source breakdown:')\n",
    "print(tiz['source'].value_counts().to_string())\n",
    "print(f' fit coverage: {(tiz.source==\"fit\").mean()*100:.0f}% of running activities')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sanity check: FIT vs lap method on the same race\n",
    "\n",
    "On the same activity, how different are the two estimates? Take the 2025-09-20 race (8 hours, 28k FIT records) and compute both, then compare. The lap method should over-weight whichever zone the typical lap average falls in (here, Z3) and under-count time spent in adjacent zones because boundary-crossing laps get rounded to one zone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from analysis import load_fit_records, time_in_zone_from_fit, time_in_zone_from_splits\n",
    "\n",
    "race_aid = race_meta.iloc[-1]['activity_id']  # most recent 50K (2025-09-20)\n",
    "race_date = race_meta.iloc[-1]['start_time_local'].date()\n",
    "\n",
    "fit_tiz = time_in_zone_from_fit(race_records[int(race_aid)])\n",
    "race_splits = pd.read_sql(\n",
    "    'SELECT avg_hr, duration_s FROM activity_splits WHERE activity_id = ?',\n",
    "    conn, params=[int(race_aid)]\n",
    ")\n",
    "lap_tiz = time_in_zone_from_splits(race_splits)\n",
    "\n",
    "compare = pd.DataFrame({\n",
    "    'FIT (per-sec) min': {k: round(v / 60, 1) for k, v in fit_tiz.items()},\n",
    "    'lap (avg-HR) min': {k: round(v / 60, 1) for k, v in lap_tiz.items()},\n",
    "}).reindex(['Z1','Z2','Z3','Z4','Z5']).fillna(0)\n",
    "compare.loc['total'] = compare.sum()\n",
    "compare['delta (min)'] = (compare['FIT (per-sec) min'] - compare['lap (avg-HR) min']).round(1)\n",
    "print(f'Race {race_date} \u2014 FIT vs lap time-in-zone:')\n",
    "compare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Weekly time-in-zone (hours) using whichever method was available per activity\n",
    "wk_cols = ['z1_s','z2_s','z3_s','z4_s','z5_s']\n",
    "weekly = tiz.groupby('week')[wk_cols].sum() / 3600\n",
    "weekly.columns = ['Z1','Z2','Z3','Z4','Z5']\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(14, 4.8))\n",
    "colors = ['#264653', '#2a9d8f', '#e9c46a', '#f4a261', '#e76f51']\n",
    "ax.stackplot(weekly.index, weekly.T.values, labels=weekly.columns, colors=colors, alpha=0.92)\n",
    "ax.set_ylabel('hours / week')\n",
    "ax.set_title('Weekly running time by HR zone (Garmin-configured zones)')\n",
    "ax.legend(loc='upper left', ncol=5, fontsize=9)\n",
    "fig.autofmt_xdate()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Polarized index over time\n",
    "\n",
    "Collapse to three buckets:\n",
    "- **easy** = Z1 + Z2 (HR \u2264 143)\n",
    "- **moderate \"junk\"** = Z3 (144\u2013164)\n",
    "- **hard** = Z4 + Z5 (HR \u2265 165, threshold and up)\n",
    "\n",
    "Polarized: high easy, low moderate, modest hard. Pyramidal: high easy, modest moderate, low hard. Threshold-heavy: lots of moderate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "buckets = pd.DataFrame({\n",
    "    'easy':     weekly[['Z1','Z2']].sum(axis=1),\n",
    "    'moderate': weekly['Z3'],\n",
    "    'hard':     weekly[['Z4','Z5']].sum(axis=1),\n",
    "})\n",
    "totals = buckets.sum(axis=1)\n",
    "pct = buckets.div(totals.replace(0, np.nan), axis=0) * 100\n",
    "\n",
    "fig, axes = plt.subplots(2, 1, figsize=(14, 7), sharex=True)\n",
    "\n",
    "ax = axes[0]\n",
    "ax.stackplot(pct.index, pct[['easy','moderate','hard']].T.values,\n",
    "             labels=['easy (Z1+Z2)', 'moderate (Z3)', 'hard (Z4+Z5)'],\n",
    "             colors=['#2a9d8f', '#e9c46a', '#e76f51'], alpha=0.9)\n",
    "ax.axhline(80, color='black', ls='--', lw=0.8, label='80% easy target')\n",
    "ax.set_ylabel('% of weekly time'); ax.set_ylim(0, 100)\n",
    "ax.set_title('Polarized-training split (weekly)')\n",
    "ax.legend(loc='lower left', ncol=4, fontsize=9)\n",
    "\n",
    "ax = axes[1]\n",
    "rolling_pct = pct.rolling(4, min_periods=2).mean()\n",
    "for col, c in [('easy','#2a9d8f'), ('moderate','#e9c46a'), ('hard','#e76f51')]:\n",
    "    ax.plot(rolling_pct.index, rolling_pct[col], color=c, lw=2, label=col)\n",
    "ax.axhline(80, color='#2a9d8f', ls='--', lw=0.8)\n",
    "ax.axhline(20, color='#e76f51', ls='--', lw=0.8)\n",
    "ax.set_ylabel('% (4-week rolling mean)')\n",
    "ax.legend(loc='best', fontsize=9)\n",
    "fig.autofmt_xdate()\n",
    "fig.tight_layout()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Yearly summary + FIT-method coverage\n",
    "\n",
    "`fit_coverage_%` is the share of each year's activities that had a linked FIT file (and therefore got per-second zones). 2026's lower coverage reflects activities that synced via the live API but aren't in the takeout dump."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "yearly_hours = (tiz.groupby('year')[wk_cols].sum() / 3600).round(1)\n",
    "yearly_hours.columns = ['Z1','Z2','Z3','Z4','Z5']\n",
    "yearly_pct = yearly_hours.div(yearly_hours.sum(axis=1), axis=0) * 100\n",
    "\n",
    "out = pd.concat({'hours': yearly_hours, '%': yearly_pct.round(1)}, axis=1)\n",
    "out['easy_pct'] = (yearly_pct[['Z1','Z2']].sum(axis=1)).round(1)\n",
    "out['hard_pct'] = (yearly_pct[['Z4','Z5']].sum(axis=1)).round(1)\n",
    "fit_coverage = tiz.groupby('year').apply(\n",
    "    lambda g: (g['source']=='fit').mean() * 100, include_groups=False\n",
    ").round(0)\n",
    "out['fit_coverage_%'] = fit_coverage\n",
    "out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Race-build vs base-period zone distribution\n",
    "\n",
    "Compare training in the 12 weeks before each prior 50K race with all other training (labelled `base`, pooled across years). A serious build should shift time into Z2 (long aerobic) and Z4 (threshold/tempo) and away from Z3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "race_dates = race_meta['start_time_local']\n",
    "tiz['phase'] = 'base'\n",
    "for rd in race_dates:\n",
    "    build_start = rd - pd.Timedelta(weeks=12)\n",
    "    mask = (tiz['start_time_local'] >= build_start) & (tiz['start_time_local'] < rd)\n",
    "    tiz.loc[mask, 'phase'] = f'build {rd.year}'\n",
    "\n",
    "phase_hours = (tiz.groupby('phase')[wk_cols].sum() / 3600).round(1)\n",
    "phase_hours.columns = ['Z1','Z2','Z3','Z4','Z5']\n",
    "phase_pct = (phase_hours.div(phase_hours.sum(axis=1), axis=0) * 100).round(1)\n",
    "phase_pct['easy_Z1+Z2'] = (phase_pct['Z1'] + phase_pct['Z2']).round(1)\n",
    "phase_pct['junk_Z3']    = phase_pct['Z3']\n",
    "phase_pct['hard_Z4+Z5'] = (phase_pct['Z4'] + phase_pct['Z5']).round(1)\n",
    "phase_pct[['easy_Z1+Z2','junk_Z3','hard_Z4+Z5']].sort_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's left now that FIT files are ingested\n",
    "\n",
    "All four sections now use per-second FIT data where a FIT file is linked (349 of 378 activities, 92%). The remaining lap-only activities are mostly old multi-sport / triathlon legs for which no FIT file was uploaded. Useful follow-ups:\n",
    "\n",
    "- **Cadence stability** \u2014 plot cadence over elapsed time within a long run; quantify the drop in the final 15 %.\n",
    "- **GPS polylines for route clustering** \u2014 current \u00a73 uses start coordinates only; with full FIT GPS tracks, match routes by Hausdorff distance (more accurate than start-only).\n",
    "- **Decoupling vs fueling protocol** \u2014 once the user logs even informal fueling notes for a few long runs, regress decoupling against carb intake."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}