openrun/README.md

# openrun — endurance running analytics

A local-first, open-source endurance-training analysis pipeline. Garmin data in (live API or official export), SQLite on disk, pandas notebooks out. Banister CTL/ATL/TSB, Pa:HR decoupling (per-mile and per-second from FIT), Garmin-configured HR-zone time-in-zone, route clustering, race-plan projection with forward CTL/ATL/TSB.

This README focuses on what's in the data, what gets computed from it, and how each derived metric is defined. The package name `openrun` is provisional; the local repo can be renamed without code changes.

```
openrun/
├── pyproject.toml         # packaged as `openrun` (src layout, hatchling)
├── openrun.toml           # user config (HR zones, weight, LTHR, race calendar)
├── src/openrun/
│   ├── __init__.py        # re-exports the common API
│   ├── config.py          # UserProfile, HRZones, BanisterParams, TOML loader
│   ├── db.py              # SQLite schema + connection helper
│   ├── model.py           # loaders + derived metrics (was analysis.py)
│   └── ingest/
│       ├── auth.py        # Garmin OAuth login
│       ├── garmin_api.py  # Path B: incremental sync via garth
│       ├── garmin_export.py # Path A: parse Connect/Takeout JSON + index FITs
│       ├── fit_linker.py  # match Takeout FITs to activities by session.start_time
│       └── time_in_zone.py # cache per-activity HR-zone breakdowns
├── examples/notebooks/
│   ├── 01_overview.ipynb     # data coverage, recent activity
│   ├── 02_running.ipynb      # volume, pace, PMC (Banister), PRs
│   ├── 03_recovery.ipynb     # sleep, HRV, RHR, body battery
│   ├── 04_efficiency.ipynb   # m/beat trends year over year
│   ├── 05_intra_run.ipynb    # decoupling, cadence, routes, HR zones
│   └── 06_race_plan.ipynb    # periodised plan + tracker + projected PMC
├── data/garmin.db            # SQLite (created on first run)
└── {auth,sync,ingest_export,link_fit_files,compute_time_in_zone,db,analysis}.py
                              # thin top-level shims for back-compat
```

Most callers want:

```python
from openrun import open_conn, load_activities, banister, daily_training_load_series
conn = open_conn()
runs = load_activities(conn, type='running')
pmc  = banister(daily_training_load_series(conn))
```

User-specific values (HR zones, LTHR, weight, race calendar) live in `openrun.toml` — config discovery walks up from cwd looking for it. See [openrun.toml](openrun.toml) for the schema.
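
The walk-up discovery can be pictured with a minimal sketch. `find_config` is a hypothetical helper name used for illustration only; the package's real loader in `config.py` may be structured differently.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, name: str = "openrun.toml") -> Optional[Path]:
    """Return the nearest `name` in `start` or any parent directory, else None."""
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the root.
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_config(Path.cwd())` from a notebook directory would pick up a config file that sits at the repo root.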

CLI entry points (after `uv sync`):

```bash
openrun-auth               # OAuth login (Path B)
openrun-sync [--full]      # incremental Garmin Connect sync
openrun-ingest <export>    # parse a Connect/Takeout export
openrun-link-fit <export>  # match Takeout FITs to activities by content
openrun-time-in-zone       # populate the activity_time_in_zone cache
```

Top-level shims (`uv run sync.py`, `uv run ingest_export.py`, etc.) still work — they re-export `main()` from the corresponding `openrun.ingest.*` module.

---

## 1 — Source data

### Two ways to get data into the DB

| Path | Script | What you get | When to use |
|---|---|---|---|
| **A. Official data export** (recommended) | `ingest_export.py` | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| **B. Live API sync** | `sync.py` (after `auth.py`) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |

The two paths overlap intentionally: the export is the source of truth, the live API fills the gap between exports. The upsert logic only overwrites the raw-JSON column and `fetched_at` on conflict, so Path B values aren't clobbered by Path A re-ingests.
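
A sketch of that conflict behaviour, using a deliberately simplified, hypothetical `activities` table (the real schema is in `db.py`) and SQLite's `ON CONFLICT ... DO UPDATE` clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down table; the real one has many more parsed columns.
conn.execute(
    """CREATE TABLE activities (
           activity_id INTEGER PRIMARY KEY,
           distance_m  REAL,
           raw         TEXT,
           fetched_at  TEXT
       )"""
)

UPSERT = """
INSERT INTO activities (activity_id, distance_m, raw, fetched_at)
VALUES (:activity_id, :distance_m, :raw, :fetched_at)
ON CONFLICT(activity_id) DO UPDATE SET
    raw        = excluded.raw,
    fetched_at = excluded.fetched_at
"""

# The first write sets every column.
conn.execute(UPSERT, {"activity_id": 1, "distance_m": 10000.0,
                      "raw": '{"src": "api"}', "fetched_at": "2025-01-01"})
# A later write for the same activity_id only refreshes raw and fetched_at.
conn.execute(UPSERT, {"activity_id": 1, "distance_m": 9999.0,
                      "raw": '{"src": "export"}', "fetched_at": "2025-02-01"})

row = conn.execute("SELECT distance_m, raw, fetched_at FROM activities").fetchone()
```

After the second write, `distance_m` still holds the first value while `raw` and `fetched_at` carry the newer payload.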

### Garmin's export formats

Garmin actually has *two* export formats, both confusingly called "the export":

1. **Garmin Connect data export** (`connect.zip`) — request via Account Management → Export Your Data.  Filenames are `<activity_id>_<activity_name>.fit`, distances/durations in SI units (m, s). `ingest_export.py` was built for this.
2. **Garmin Takeout dump** — UUID-named folder (`<uuid>_N/`) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under `DI_CONNECT/`. **Crucially different conventions:**
   - FIT filenames: `<email>_<upload_id>.fit` — upload IDs ≠ activity IDs
   - `summarizedActivities` uses scaled-integer units: distance in **cm**, duration in **ms**, elevation in **cm**, speed in **m/s ÷ 10**
   - `ingest_export.py` now detects and converts these; `link_fit_files.py` links FITs by content (parses `session.start_time` and matches activities by ±60 s) since the filename-based approach doesn't work

For a Takeout dump, the full sequence is:

```bash
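# 1. Extract the FIT archives in place (one unzip call per part: unzip treats
#    any extra arguments after the first archive as member names to extract)
mkdir -p <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
for part in <export>/DI_CONNECT/DI-Connect-Uploaded-Files/UploadedFiles_0-_Part*.zip; do
  unzip -d <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit "$part"
done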

# 2. Ingest JSON + index FITs that have parseable activity IDs
uv run ingest_export.py <export>/

# 3. Link Takeout FITs to activities by start-time match
uv run link_fit_files.py <export>/

# 4. (Optional) precompute time-in-zone cache
uv run compute_time_in_zone.py
```
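
The start-time match in step 3 can be sketched as a nearest-neighbour lookup with a ±60 s tolerance. This is an illustration under assumptions, not the code in `fit_linker.py`: `link_by_start_time` and its dictionary inputs are hypothetical, and reading `session.start_time` out of each FIT is left to whichever FIT parser is in use.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def link_by_start_time(fit_starts, activity_starts, tolerance_s=60):
    """Map each FIT path to the activity whose start time is nearest, within tolerance.

    fit_starts:      {fit_path: datetime}    session.start_time of each FIT
    activity_starts: {activity_id: datetime} start time of each activity row
    """
    ordered = sorted(activity_starts.items(), key=lambda kv: kv[1])
    times = [t for _, t in ordered]
    links = {}
    for path, start in fit_starts.items():
        i = bisect_left(times, start)
        # Only the neighbours either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(times[j] - start))
        if abs(times[best] - start) <= timedelta(seconds=tolerance_s):
            links[path] = ordered[best][0]
    return links
```

FITs with no activity inside the tolerance window are simply left unlinked.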

### Garmin's units & quirks worth knowing

| Field | Live API | Takeout summarizedActivities | Convert by |
|---|---|---|---|
| distance | m | cm | × 0.01 |
| duration / movingDuration | s | ms | × 0.001 |
| elevationGain / Loss | m | cm | × 0.01 |
| averageSpeed / maxSpeed | m/s | m/s ÷ 10 | × 10 |
| HR (avg, max) | bpm | bpm | — |
| calories | kcal | kcal | — |
| training_effect, vo2_max | scalar | scalar | — |
| cadence (`averageRunCadence`) | both-legs SPM | both-legs SPM | — (do **not** double) |
| strideLength | cm | cm | × 0.01 for m |
| FIT `position_lat/long` | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT `enhanced_speed` | m/s | m/s | — |

The cadence trap bit me: `averageRunCadence` on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+ which trips any sane filter, then nothing makes it through.
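
The conversions in the table can be sketched as below. `takeout_to_si` and `semicircles_to_degrees` are hypothetical helper names, and the field keys are taken from the table rather than from a verified export, so treat them as assumptions.

```python
# Scale factors from the table above: Takeout scaled integers -> SI units.
TAKEOUT_SCALE = {
    "distance": 0.01,         # cm -> m
    "duration": 0.001,        # ms -> s
    "movingDuration": 0.001,  # ms -> s
    "elevationGain": 0.01,    # cm -> m
    "elevationLoss": 0.01,    # cm -> m
    "averageSpeed": 10.0,     # (m/s / 10) -> m/s
    "maxSpeed": 10.0,         # (m/s / 10) -> m/s
}


def takeout_to_si(summary: dict) -> dict:
    """Return a copy of a summarizedActivities record with scaled fields in SI units."""
    out = dict(summary)
    for field, factor in TAKEOUT_SCALE.items():
        if out.get(field) is not None:
            out[field] = out[field] * factor
    # Cadence, HR and calories are deliberately left untouched.
    return out


def semicircles_to_degrees(semicircles: int) -> float:
    """FIT positions are signed 32-bit semicircles: 2**31 semicircles = 180 degrees."""
    return semicircles * (180.0 / 2**31)
```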

---

## 2 — Schema

`db.py:SCHEMA` is the source of truth; what follows is the gist.

| Table | Grain | Origin | Notes |
|---|---|---|---|
| `activities` | one row per activity | both paths | `raw` JSON has every Garmin field; the parsed columns are a curated subset |
| `activity_splits` | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). `raw` has cadence, stride, vert osc, GPS bounds |
| `activity_fit_files` | one row per linked FIT | Path A | `fit_path` is **relative** to whichever export root was passed to `link_fit_files.py` — keep the export folder in place |
| `activity_time_in_zone` | one row per activity | precomputed | Cached by `compute_time_in_zone.py`; `source` is `'fit'` or `'lap'` |
| `daily_steps` / `daily_sleep` / `daily_stress` / `daily_hrv` / `daily_body_battery` / `daily_intensity_minutes` / `daily_resting_hr` | one row per calendar date | both paths | Each has a `raw` JSON column with the full source payload |
| `sync_state` | key-value | both paths | Records last-sync / last-ingest timestamps |

The `raw` column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from `raw` rather than re-ingesting:

```python
from analysis import expand_raw, load_activities, open_conn

conn = open_conn()
acts = load_activities(conn, type='running')
fields = expand_raw(acts)  # pd.json_normalize'd companion frame
```

---

## 3 — Metrics & how they're calculated

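All loaders and helpers live in `src/openrun/model.py` (formerly `analysis.py`); the top-level [analysis.py](analysis.py) shim keeps the `analysis.*` names used below working. The key derivations: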

### Pace

```
pace_min_per_km = (moving_duration_s / 60) / (distance_m / 1000)
```

`moving_duration_s` excludes paused time. Activities with implausible paces (sub-3 min/km or over 30 min/km — covers world-record marathon pace on the fast side and walks-with-stops on the slow side) get masked to NaN in `load_activities` rather than dropped, so the row count is preserved.
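
A per-activity sketch of the formula and the mask in plain Python. The function name and the `fastest` / `slowest` parameters are illustrative; `load_activities` applies the same idea column-wise.

```python
import math


def pace_min_per_km(moving_duration_s, distance_m, fastest=3.0, slowest=30.0):
    """Pace in min/km, or NaN when inputs are missing or the pace is implausible."""
    if not moving_duration_s or not distance_m:
        return math.nan
    pace = (moving_duration_s / 60) / (distance_m / 1000)
    # Mask rather than drop, so the caller keeps one value per activity.
    return pace if fastest <= pace <= slowest else math.nan
```

A 10 km run with 50 minutes of moving time gives 5.0 min/km; the same distance logged in 10 minutes comes back as NaN.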

### Acute:Chronic Workload Ratio (ACWR)

Daily training-load sum, then:

```
acute   = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR    = acute / chronic
```

The commonly cited sweet spot is roughly 0.8–1.3; a ratio above 1.5 signals an aggressive ramp and elevated injury risk. Implemented inline in [02_running.ipynb](notebooks/02_running.ipynb).
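
A sketch of the same calculation over a gap-free list of daily loads, with rest days as 0. The notebook works on a pandas Series; this standalone `acwr` function is a hypothetical illustration.

```python
def acwr(daily_load, acute_days=7, chronic_days=28):
    """ACWR per day; None until a full chronic window exists or when chronic load is 0."""
    weeks = chronic_days / acute_days  # 4 for the default 7/28 windows
    out = []
    for i in range(len(daily_load)):
        if i + 1 < chronic_days:
            out.append(None)
            continue
        acute = sum(daily_load[i + 1 - acute_days: i + 1])
        chronic = sum(daily_load[i + 1 - chronic_days: i + 1]) / weeks
        out.append(acute / chronic if chronic > 0 else None)
    return out
```

Four steady weeks at 50 load/day give a ratio of 1.0. Three weeks at 50 followed by one week at 100 give 700 / 437.5 = 1.6, which is past the 1.5 warning line.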

### Aerobic efficiency — "m/beat"

```
m_per_beat = distance_m / (duration_s * avg_hr / 60)
```

Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in [04_efficiency.ipynb](notebooks/04_efficiency.ipynb).
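
As a worked example of the formula (the function name is illustrative):

```python
def m_per_beat(distance_m, duration_s, avg_hr):
    """Metres covered per heartbeat: distance divided by total beats."""
    total_beats = duration_s / 60 * avg_hr  # minutes x beats per minute
    return distance_m / total_beats
```

A 10 km run in 50 minutes at an average of 150 bpm is 7,500 beats, so about 1.33 m/beat.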

### Decoupling (Pa:HR drift)

Friel's method, in two resolutions. *Positive* = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). *Negative* = improved (negative split).

**Per-mile (lap level)** — `analysis.decoupling(splits, min_splits=6)`:

```
For each activity with ≥ min_splits laps:
    halves   = first half / second half by lap count
    eff_half = duration-weighted mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (eff_first / eff_second − 1) × 100
```

Cheap, but smears across aid-station stops and rounds at lap boundaries.
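
A sketch of the lap-level calculation. `lap_decoupling` is a hypothetical stand-in for `analysis.decoupling`, taking laps as plain tuples instead of a DataFrame. It weights both speed and HR by lap duration, which is an assumption; the pseudocode only states that speed is duration-weighted.

```python
def lap_decoupling(laps, min_splits=6):
    """Pa:HR decoupling (%) from laps given as (duration_s, speed_mps, heart_rate)."""
    if len(laps) < min_splits:
        return None

    def efficiency(half):
        total_s = sum(d for d, _, _ in half)
        speed = sum(d * s for d, s, _ in half) / total_s  # duration-weighted speed
        hr = sum(d * h for d, _, h in half) / total_s     # duration-weighted HR
        return speed / hr

    mid = len(laps) // 2  # split by lap count, not by distance or time
    return (efficiency(laps[:mid]) / efficiency(laps[mid:]) - 1) * 100
```

Six equal laps at a constant 3.0 m/s, with HR rising from 150 to 160 bpm between halves, come out at about +6.7%.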

**Per-second (FIT)** — `analysis.fit_decoupling(records, segments=N, warmup_min=5, cooldown_min=2, min_speed_mps=0.5)`:

```
records = per-second FIT messages
drop first warmup_min and last cooldown_min
drop records where speed_mps < min_speed_mps     # ignore stopped time
slice remaining moving time into N equal-time chunks
for each chunk:
    efficiency = mean(speed_mps) / mean(heart_rate)
decoupling_pct = (efficiency[0] / efficiency[-1] − 1) × 100     # first chunk vs last chunk
```

Friel's published thresholds (interpret on **steady aerobic** efforts only):
- **< 5%** — aerobically developed
- **5–10%** — sustainable
- **> 10%** — pacing unsustainable for the distance, OR fueling shortfall, OR heat

In this dataset the per-second number is consistently **7–15 percentage points lower** than the per-mile number for the same run — the gap is aid-station noise.
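
A sketch of the per-second variant, assuming one record per second given as `(elapsed_s, speed_mps, heart_rate)` tuples. `fit_decoupling_sketch` is a hypothetical name, and comparing the first chunk with the last is an assumption about how the N chunks are reduced to one number.

```python
def fit_decoupling_sketch(records, segments=2, warmup_min=5, cooldown_min=2,
                          min_speed_mps=0.5):
    """Decoupling (%) from 1 Hz records sorted by elapsed time."""
    if not records:
        return None
    start = records[0][0] + warmup_min * 60
    end = records[-1][0] - cooldown_min * 60
    # Keep only moving records with a usable HR inside the trimmed window.
    moving = [(s, hr) for t, s, hr in records
              if start <= t <= end and s >= min_speed_mps and hr]
    if len(moving) < segments:
        return None
    size = len(moving) // segments  # equal moving-time chunks; remainder is dropped
    effs = []
    for k in range(segments):
        chunk = moving[k * size:(k + 1) * size]
        mean_speed = sum(s for s, _ in chunk) / len(chunk)
        mean_hr = sum(hr for _, hr in chunk) / len(chunk)
        effs.append(mean_speed / mean_hr)
    return (effs[0] / effs[-1] - 1) * 100
```

Because stopped records are removed before chunking, a ten-minute aid-station stop shortens the series rather than dragging down one chunk's mean speed.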

### HR zone time-in-zone

**Zone boundaries come from Garmin's `heartRateZones.json`** (training method `HR_MAX`), not estimated from observed HR. The constants for this user live in `analysis.HR_ZONES_USER`:

```
Z1: 102–122   (recovery)
Z2: 123–143   (easy aerobic — long-run target)
Z3: 144–164   (tempo / "junk-miles middle")
Z4: 165–185   (threshold — LTHR=182 sits inside Z4)
Z5: 186–209   (VO₂ max)
```

Time-in-zone has two implementations; `compute_time_in_zone.py` picks the better one per activity and caches the result in `activity_time_in_zone`.

**From FIT (`source='fit'`)** — `analysis.time_in_zone_from_fit(records)`:

```
records = per-second FIT messages
for each consecutive record pair:
    dt = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone
```

**From splits (`source='lap'`)** — fallback when no FIT is linked:

```
for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone
```

The FIT version is the right number; the lap version is **biased toward the middle zone**: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
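
Both methods side by side, as a sketch. The zone bounds are copied from the table above; the function names and tuple-based inputs are illustrative rather than the signatures in `analysis.py`.

```python
ZONES = {"Z1": (102, 122), "Z2": (123, 143), "Z3": (144, 164),
         "Z4": (165, 185), "Z5": (186, 209)}


def bucket(hr):
    """Zone name for an integer HR, or None when below Z1 or above Z5."""
    for name, (lo, hi) in ZONES.items():
        if lo <= hr <= hi:
            return name
    return None


def time_in_zone_from_records(records, max_gap_s=30):
    """records: (elapsed_s, heart_rate) sorted by time. Returns seconds per zone."""
    totals = {name: 0.0 for name in ZONES}
    for (t_prev, _), (t, hr) in zip(records, records[1:]):
        dt = min(max(t - t_prev, 0), max_gap_s)  # clip so pauses don't flood a zone
        zone = bucket(hr)
        if zone is not None:
            totals[zone] += dt
    return totals


def time_in_zone_from_laps(laps):
    """laps: (duration_s, avg_hr). Each lap lands wholly in a single zone."""
    totals = {name: 0.0 for name in ZONES}
    for duration_s, avg_hr in laps:
        zone = bucket(round(avg_hr))
        if zone is not None:
            totals[zone] += duration_s
    return totals
```

Ten minutes of HR alternating each second between 140 and 168 bpm puts 300 s each in Z2 and Z4 with the record method. A single lap averaging 154 bpm puts all 600 s in Z3, which is the middle-zone bias described above.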

### Polarized split

A three-bucket simplification used in [05_intra_run.ipynb §4](notebooks/05_intra_run.ipynb):

```
easy     = Z1 + Z2     (HR ≤ 143)
moderate = Z3          (144–164, the "junk-miles middle")
hard     = Z4 + Z5     (≥ 165, threshold and up)
```

Seiler's polarized target is ~80% easy, with the remaining ~20% mostly hard and < 10% moderate. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training is the opposite: most time in moderate.
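
Collapsing five-zone totals into the three buckets can be sketched as follows. `polarized_split` is a hypothetical helper, and it takes fractions of in-zone time only, ignoring time below Z1.

```python
def polarized_split(zone_seconds):
    """Collapse seconds per zone (Z1..Z5) into easy / moderate / hard fractions."""
    easy = zone_seconds.get("Z1", 0) + zone_seconds.get("Z2", 0)
    moderate = zone_seconds.get("Z3", 0)
    hard = zone_seconds.get("Z4", 0) + zone_seconds.get("Z5", 0)
    total = easy + moderate + hard
    if total == 0:
        return None
    return {"easy": easy / total, "moderate": moderate / total, "hard": hard / total}
```

For example, 3,600 s in Z1 and Z2, 300 s in Z3 and 600 s in Z4 and Z5 is an 80% / 7% / 13% split.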

### Cadence & stride

Both come directly from Garmin's lap data (`averageRunCadence`, `strideLength`) and from FIT records (`cadence`, `step_length`). Two gotchas:

1. **Cadence is already both-legs SPM** for this user's exports — do not double. Sanity range for running: 140–200 spm.
2. **Stride length is in cm** in lap data (`strideLength: 78` → 78 cm → 0.78 m) and in **mm** in FIT records (`step_length: 1041` → 104 cm). Divide accordingly.

The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.

### GPS route clustering

`analysis.cluster_routes(lats, lons, radius_km=0.25)`:

```
Greedy haversine clustering:
  For each unassigned point i:
    distances  = haversine(point_i, all still-unassigned points)
    neighbours = points within radius_km
    if len(neighbours) >= 2:
        assign all to next_label
        next_label += 1
```

Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says way more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to `sklearn.cluster.DBSCAN(metric='haversine')` for thousands.
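
A pure-Python sketch of the greedy pass. `cluster_starts` is a hypothetical stand-in for `analysis.cluster_routes`. It assumes that the seed point counts towards the two-point minimum and that unclustered points are labelled -1.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_starts(lats, lons, radius_km=0.25, min_size=2):
    """Greedy clustering of start points; one label per point, -1 = unclustered."""
    labels = [-1] * len(lats)
    next_label = 0
    for i in range(len(lats)):
        if labels[i] != -1:
            continue
        # Unassigned points (including i itself) within radius_km of point i.
        members = [j for j in range(len(lats))
                   if labels[j] == -1
                   and haversine_km(lats[i], lons[i], lats[j], lons[j]) <= radius_km]
        if len(members) >= min_size:
            for j in members:
                labels[j] = next_label
            next_label += 1
    return labels
```

The pass is O(n²) in the number of starts, which is why it only suits a few hundred points.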

### Banister fitness / fatigue / form (Performance Management Chart)

The standard endurance-training lens (TrainingPeaks PMC), implemented in `analysis.banister(daily_load, ctl_tau=42, atl_tau=7)`:

```
CTL_today  = CTL_yesterday · exp(−1/τ_CTL) + load_today · (1 − exp(−1/τ_CTL))   τ_CTL = 42d
ATL_today  = ATL_yesterday · exp(−1/τ_ATL) + load_today · (1 − exp(−1/τ_ATL))   τ_ATL =  7d
TSB_today  = CTL_yesterday − ATL_yesterday          # yesterday's values (TP convention)
```

Helper: `daily_training_load_series(conn, activity_types=('running','trail_running'))` returns the per-day summed `training_load` Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
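
The recurrences translate almost line for line into code. This sketch works on a plain list of daily loads rather than the pandas Series that `analysis.banister` takes. Starting both averages at zero is an assumption about initialisation.

```python
import math


def banister_sketch(daily_load, ctl_tau=42, atl_tau=7):
    """Return (ctl, atl, tsb) per day from gap-free daily loads (rest days = 0)."""
    k_ctl = math.exp(-1 / ctl_tau)
    k_atl = math.exp(-1 / atl_tau)
    ctl = atl = 0.0  # assumption: no prior fitness or fatigue
    out = []
    for load in daily_load:
        tsb = ctl - atl  # yesterday's CTL and ATL, before today's load is applied
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        out.append((ctl, atl, tsb))
    return out
```

Under a constant daily load both averages converge to that load and TSB settles near zero. Appending a week of zeros then makes ATL fall faster than CTL, so TSB turns positive.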

TSB interpretation (used in both notebooks 02 and 06):

| TSB | meaning |
|---|---|
| < −30 | severely fatigued — injury risk |
| −10 to −30 | productive overload — heart of a build |
| −10 to 0 | balanced building |
| 0 to +10 | sharpening |
| **+10 to +25** | **fresh / peaked — race-day target** |
| > +25 | detrained (taper too long) |

[02_running.ipynb](notebooks/02_running.ipynb) plots the historical PMC; [06_race_plan.ipynb](notebooks/06_race_plan.ipynb) projects it forward through race day by converting plan-week km into estimated daily loads (separate `RACE_DAY_TL_PER_KM ≈ 7` for the race itself vs `~11` for training runs — race-day TL/km is empirically lower because ultras spread load over more time).

---

## 4 — Notebook tour

Each notebook bootstraps the same way:

```python
import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness, ...
conn = open_conn()
```

| # | Notebook | Covers |
|---|---|---|
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 02 | Running | Weekly volume + rolling, pace-vs-HR scatter, aerobic-efficiency trend, ACWR, PRs by distance |
| 03 | Recovery | Sleep composition, after-run vs after-rest HRV/RHR, correlations between training and recovery, weekly km vs recovery indicators |
| 04 | Efficiency | YoY m/beat by distance bucket, HR/pace split, polynomial curve fits, easy-runs-only headline view |
| 05 | Intra-run | (1) per-mile decoupling, (1b) per-second decoupling on race FITs, (2) cadence & stride at controlled pace, (3) GPS route clustering, (4) HR-zone time-in-zone (FIT-based) + polarized split |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |

Notebooks 05 and 06 are generated from the `_build_05.py` / `_build_06.py` build scripts. Edit those, then run `uv run python notebooks/_build_NN.py` to regenerate the `.ipynb`. That is easier than editing 30 KB of JSON.

---

## 5 — Setup

```bash
uv sync                                  # creates .venv from pyproject.toml
```

Kernel: in VSCode, pick the project's `.venv` (top-right of any notebook). If it's not discoverable:

```bash
uv run python -m ipykernel install --user --name garmin --display-name "garmin (.venv)"
```

### Path A — official export

1. Go to <https://www.garmin.com/account/datamanagement/exportdata> and request the export. Garmin emails the link within 24–72 h.
2. Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
3. Ingest:

```bash
uv run ingest_export.py path/to/connect.zip
# or, if already unzipped:
uv run ingest_export.py path/to/<export>/
```

Add `--dry-run` to preview without writing. The data is upserted on `activity_id` / `calendar_date`, so it's safe to re-ingest a refreshed export later.

For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.

### Path B — live API sync

Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.

```bash
uv run auth.py            # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full     # 365-day backfill
uv run sync.py            # incremental thereafter (last 14 days)

uv run sync.py --days 90          # custom backfill window
uv run sync.py --no-details       # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities  # wellness only
```

`garth` is deprecated upstream (<https://github.com/matin/garth/discussions/222>). If the live sync stops working, fall back to Path A.