A local-first, open-source endurance-training analysis pipeline. Garmin data in (live API or official export), SQLite on disk, pandas notebooks out. Banister CTL/ATL/TSB, Pa:HR decoupling (per-mile and per-second from FIT), Garmin-configured HR-zone time-in-zone, route clustering, race-plan projection with forward CTL/ATL/TSB.
This README focuses on what's in the data, what gets computed from it, and how each derived metric is defined. The package name `openrun` is provisional; the local repo can be renamed without code changes.
```python
from openrun import open_conn, load_activities, banister, daily_training_load_series

conn = open_conn()
runs = load_activities(conn, type='running')
pmc = banister(daily_training_load_series(conn))
```

User-specific values (HR zones, LTHR, weight, race calendar) live in `openrun.toml`; config discovery walks up from the current working directory looking for it. See [openrun.toml](openrun.toml) for the schema.

```bash
openrun-ingest <export>      # parse a Connect/Takeout export
openrun-link-fit <export>    # match Takeout FITs to activities by content
openrun-time-in-zone         # populate the activity_time_in_zone cache
```
Top-level shims (`uv run sync.py`, `uv run ingest_export.py`, etc.) still work — they re-export `main()` from the corresponding `openrun.ingest.*` module.
| Path | Script | What you get | When to use |
|---|---|---|---|
| **A. Official data export** (recommended) | `ingest_export.py` | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| **B. Live API sync** | `sync.py` (after `auth.py`) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |
The two paths overlap intentionally: the export is the source of truth, the live API fills the gap between exports. The upsert logic only overwrites the raw-JSON column and `fetched_at` on conflict, so Path B values aren't clobbered by Path A re-ingests.
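The conflict rule can be sketched with SQLite's `ON CONFLICT ... DO UPDATE`. This is a minimal illustration rather than the project's actual statement: the cut-down `activities` table and the `upsert_activity` helper are assumptions, and the real schema lives in `db.py:SCHEMA`.

```python
import sqlite3

# Hypothetical, cut-down table; the real definition is in db.py:SCHEMA.
DDL = """
CREATE TABLE activities (
    activity_id INTEGER PRIMARY KEY,
    distance_m  REAL,
    raw         TEXT,
    fetched_at  TEXT
)
"""

# On a key collision, only raw and fetched_at are refreshed;
# every other parsed column keeps the value it already had.
UPSERT = """
INSERT INTO activities (activity_id, distance_m, raw, fetched_at)
VALUES (:activity_id, :distance_m, :raw, :fetched_at)
ON CONFLICT(activity_id) DO UPDATE SET
    raw        = excluded.raw,
    fetched_at = excluded.fetched_at
"""


def upsert_activity(conn, row):
    """Insert a new activity, or refresh only raw/fetched_at on an existing one."""
    conn.execute(UPSERT, row)


demo = sqlite3.connect(":memory:")
demo.execute(DDL)
upsert_activity(demo, {"activity_id": 1, "distance_m": 10000.0,
                       "raw": '{"src": "api"}', "fetched_at": "2025-01-01"})
upsert_activity(demo, {"activity_id": 1, "distance_m": 9999.0,
                       "raw": '{"src": "export"}', "fetched_at": "2025-02-01"})
stored = demo.execute("SELECT distance_m, raw, fetched_at FROM activities").fetchone()
```

After the second call, `stored` still carries the first `distance_m`, while `raw` and `fetched_at` reflect the later write.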
### Garmin's export formats
Garmin actually has *two* export formats, both confusingly called "the export":
1. **Garmin Connect data export** (`connect.zip`): request via Account Management → Export Your Data. Filenames are `<activity_id>_<activity_name>.fit`; distances and durations are in SI units (m, s). `ingest_export.py` was built for this.
2. **Garmin Takeout dump**: a UUID-named folder (`<uuid>_N/`) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under `DI_CONNECT/`. **Crucially different conventions:**
   - FIT filenames: `<email>_<upload_id>.fit` (upload IDs ≠ activity IDs)
   - `summarizedActivities` uses scaled-integer units: distance in **cm**, duration in **ms**, elevation in **cm**, speed in **m/s ÷ 10**
   - `ingest_export.py` now detects and converts these; `link_fit_files.py` links FITs by content (it parses `session.start_time` and matches activities within ±60 s), since the filename-based approach doesn't work
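Content-based linking boils down to a nearest-start-time match. A minimal sketch, assuming the FIT's `session.start_time` has already been parsed into a timezone-aware `datetime`; `match_fit_to_activity` and the sample IDs are hypothetical and not the code in `link_fit_files.py`.

```python
from datetime import datetime, timedelta, timezone


def match_fit_to_activity(fit_start, activities, tolerance_s=60):
    """Return the activity_id whose start is nearest fit_start, or None.

    activities: iterable of (activity_id, start_time) pairs with
    timezone-aware datetimes. Matches farther than tolerance_s are rejected.
    """
    best_id, best_gap = None, timedelta(seconds=tolerance_s)
    for activity_id, start in activities:
        gap = abs(start - fit_start)
        if gap <= best_gap:
            best_id, best_gap = activity_id, gap
    return best_id


utc = timezone.utc
activities = [
    (111, datetime(2024, 5, 4, 13, 0, 5, tzinfo=utc)),
    (222, datetime(2024, 5, 5, 7, 30, 0, tzinfo=utc)),
]
fit_start = datetime(2024, 5, 4, 13, 0, 0, tzinfo=utc)

matched = match_fit_to_activity(fit_start, activities)
unmatched = match_fit_to_activity(fit_start + timedelta(hours=3), activities)
```

Picking the nearest candidate rather than the first one inside the window avoids mislinking two activities started within a minute of each other.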
| Field | Connect export | Takeout dump | Conversion |
|---|---|---|---|
| FIT `position_lat/long` | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT `enhanced_speed` | m/s | m/s | n/a |
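The conversions are simple enough to sketch in a few lines. The helper names and the `distance` / `duration` / `elevationGain` keys below are assumptions for illustration; check the actual `summarizedActivities` payload before relying on them. Speed is left out because its scaling deserves a check against real data.

```python
SEMICIRCLE_TO_DEG = 180 / 2**31


def semicircles_to_degrees(value):
    """Convert a FIT position_lat / position_long value to decimal degrees."""
    return value * SEMICIRCLE_TO_DEG


def takeout_summary_to_si(summary):
    """Convert Takeout's scaled integers to SI: cm -> m and ms -> s.

    Key names are assumed, not taken from the export specification.
    """
    return {
        "distance_m": summary["distance"] / 100,
        "duration_s": summary["duration"] / 1000,
        "elevation_gain_m": summary["elevationGain"] / 100,
    }


quarter_turn = semicircles_to_degrees(2**30)
converted = takeout_summary_to_si(
    {"distance": 1_000_000, "duration": 3_000_000, "elevationGain": 25_000}
)
```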
The cadence trap bit me: `averageRunCadence` on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+ which trips any sane filter, then nothing makes it through.
---
## 2 — Schema
`db.py:SCHEMA` is the source of truth; what follows is the gist.
| Table | Grain | Origin | Notes |
|---|---|---|---|
| `activities` | one row per activity | both paths | `raw` JSON has every Garmin field; the parsed columns are a curated subset |
| `activity_splits` | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). `raw` has cadence, stride, vert osc, GPS bounds |
| `activity_fit_files` | one row per linked FIT | Path A | `fit_path` is **relative** to whichever export root was passed to `link_fit_files.py` — keep the export folder in place |
| `activity_time_in_zone` | one row per activity | precomputed | Cached by `compute_time_in_zone.py`; `source` is `'fit'` or `'lap'` |
| `daily_steps` / `daily_sleep` / `daily_stress` / `daily_hrv` / `daily_body_battery` / `daily_intensity_minutes` / `daily_resting_hr` | one row per calendar date | both paths | Each has a `raw` JSON column with the full source payload |
| `sync_state` | key-value | both paths | Records last-sync / last-ingest timestamps |
The `raw` column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from `raw` rather than re-ingesting.
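A small helper makes that unpacking repeatable. This is a sketch under assumptions: `raw_field` is a hypothetical function, and the `deepSleepSeconds` key and the two-column `daily_sleep` table are invented for the demo. Table and column names are interpolated into the SQL, so pass only trusted identifiers.

```python
import json
import sqlite3


def raw_field(conn, table, key_column, key, *path):
    """Walk `path` into one row's raw JSON; return None if any step is missing."""
    row = conn.execute(
        f"SELECT raw FROM {table} WHERE {key_column} = ?", (key,)
    ).fetchone()
    if row is None or row[0] is None:
        return None
    node = json.loads(row[0])
    for part in path:
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE daily_sleep (calendar_date TEXT PRIMARY KEY, raw TEXT)")
demo.execute(
    "INSERT INTO daily_sleep VALUES (?, ?)",
    ("2025-01-01", json.dumps({"stages": {"deepSleepSeconds": 4200}})),
)

deep = raw_field(demo, "daily_sleep", "calendar_date", "2025-01-01",
                 "stages", "deepSleepSeconds")
missing = raw_field(demo, "daily_sleep", "calendar_date", "2025-01-01",
                    "stages", "remSleepSeconds")
```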
`moving_duration_s` excludes paused time. Activities with implausible paces (sub-3 min/km or over 30 min/km — covers world-record marathon pace on the fast side and walks-with-stops on the slow side) get masked to NaN in `load_activities` rather than dropped, so the row count is preserved.
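The masking rule, as a dependency-free sketch. The function name and constants are assumptions, and `load_activities` itself presumably does the same thing on a pandas column.

```python
import math

MIN_PACE_MIN_PER_KM = 3.0    # anything faster is treated as a recording glitch
MAX_PACE_MIN_PER_KM = 30.0   # anything slower is treated as a walk with stops


def pace_min_per_km(distance_m, moving_duration_s):
    """Pace in min/km, or NaN when inputs are missing or the pace is implausible."""
    if not distance_m or not moving_duration_s:
        return math.nan
    pace = (moving_duration_s / 60) / (distance_m / 1000)
    if pace < MIN_PACE_MIN_PER_KM or pace > MAX_PACE_MIN_PER_KM:
        return math.nan
    return pace


# (distance_m, moving_duration_s): a normal run, a too-fast glitch, a too-slow walk
rows = [(10_000, 3_000), (10_000, 1_500), (1_000, 2_400)]
paces = [pace_min_per_km(d, t) for d, t in rows]
```

`paces` has one entry per input row, so downstream counts stay aligned with the activity table even when a pace is masked.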
### Acute:Chronic Workload Ratio (ACWR)
Daily training-load sum, then:
```
acute   = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR    = acute / chronic
```
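The same definition in runnable form. This is a sketch, assuming both windows include the `as_of` day and that days without an entry count as zero load; the `acwr` helper is hypothetical.

```python
from datetime import date, timedelta


def acwr(daily_load, as_of):
    """Acute:chronic workload ratio for one day.

    daily_load: dict mapping date -> summed training load for that day.
    """
    def window_sum(days):
        return sum(daily_load.get(as_of - timedelta(days=i), 0.0) for i in range(days))

    acute = window_sum(7)
    chronic = window_sum(28) / 4   # weekly-equivalent chronic load
    return acute / chronic if chronic else float("nan")


today = date(2025, 3, 31)
steady = {today - timedelta(days=i): 50.0 for i in range(28)}
ramped = dict(steady)
for i in range(7):                 # double the load in the most recent week
    ramped[today - timedelta(days=i)] = 100.0

steady_ratio = acwr(steady, today)
ramped_ratio = acwr(ramped, today)
```

A perfectly steady block gives a ratio of 1.0; doubling the final week pushes it to 1.6.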
Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in [04_efficiency.ipynb](notebooks/04_efficiency.ipynb).
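Read literally, "metres per heartbeat" is distance divided by total beats. The sketch below assumes total beats can be approximated as average HR times moving minutes; the notebook's exact definition may differ, and `metres_per_beat` is a hypothetical name.

```python
def metres_per_beat(distance_m, moving_duration_s, avg_hr):
    """Distance covered per heartbeat, estimating beats as avg_hr * minutes."""
    total_beats = avg_hr * moving_duration_s / 60
    return distance_m / total_beats


# Same 10 km in 50 minutes, at two different average heart rates.
at_150 = metres_per_beat(10_000, 3_000, 150)
at_140 = metres_per_beat(10_000, 3_000, 140)
```

Holding pace constant, the lower average HR yields the higher value, which is the direction the metric is meant to reward.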
### Decoupling (Pa:HR drift)
Friel's method, in two resolutions. *Positive* = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). *Negative* = improved (negative split).
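In its common formulation, the method compares an efficiency factor (speed divided by HR) between the first and second halves of the run. The sketch below assumes evenly spaced samples, which can be per-second FIT records or per-mile splits of similar duration; the function names are hypothetical.

```python
def efficiency_factor(samples):
    """Mean speed divided by mean HR over (speed_m_s, heart_rate) samples."""
    speeds = [speed for speed, _ in samples]
    rates = [hr for _, hr in samples]
    return (sum(speeds) / len(speeds)) / (sum(rates) / len(rates))


def decoupling_pct(samples):
    """Percent drop in efficiency factor from the first half to the second.

    Positive means pace per unit of HR fell off in the second half.
    Needs at least two samples.
    """
    mid = len(samples) // 2
    first = efficiency_factor(samples[:mid])
    second = efficiency_factor(samples[mid:])
    return (first - second) / first * 100


# Constant 3.0 m/s, with HR drifting from 150 to 160 in the second half.
drifting = [(3.0, 150)] * 100 + [(3.0, 160)] * 100
steady = [(3.0, 150)] * 200

drift = decoupling_pct(drifting)
flat = decoupling_pct(steady)
```

Here `drift` comes out at 6.25%, which would land in the 5–10% band of the thresholds.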
Friel's published thresholds (interpret on **steady aerobic** efforts only):
- **< 5%** — aerobically developed
- **5–10%** — sustainable
- **> 10%** — pacing unsustainable for the distance, OR fueling shortfall, OR heat
In this dataset the per-second number is consistently **7–15 percentage points lower** than the per-mile number for the same run — the gap is aid-station noise.
### HR zone time-in-zone
**Zone boundaries come from Garmin's `heartRateZones.json`** (training method `HR_MAX`), not estimated from observed HR. The constants for this user live in `analysis.HR_ZONES_USER`:
```text
Z1: 102–122 (recovery)
Z2: 123–143 (easy aerobic — long-run target)
Z3: 144–164 (tempo / "junk-miles middle")
Z4: 165–185 (threshold — LTHR=182 sits inside Z4)
Z5: 186–209 (VO₂ max)
```
Time-in-zone has two implementations; `compute_time_in_zone.py` picks the better one per activity and caches the result in `activity_time_in_zone`.
**From FIT (`source='fit'`)** — `analysis.time_in_zone_from_fit(records)`:
```
records = per-second FIT messages
for each consecutive record pair:
    dt   = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone
```
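A runnable version of that loop, as a sketch rather than the code in `analysis.time_in_zone_from_fit`. It assumes records are `(elapsed_seconds, heart_rate)` tuples, that "this record" means the later record of each pair, and that HR below the Z1 floor is not counted.

```python
# Zone floors in bpm, highest first, taken from the zone table above.
ZONE_FLOORS = [(186, "Z5"), (165, "Z4"), (144, "Z3"), (123, "Z2"), (102, "Z1")]


def bucket(hr):
    """Map a heart rate to its zone name, or None below Z1."""
    for floor, name in ZONE_FLOORS:
        if hr >= floor:
            return name
    return None


def time_in_zone_from_records(records, max_gap_s=30):
    """Accumulate seconds per zone from time-ordered (elapsed_s, hr) records."""
    totals = {name: 0.0 for _, name in ZONE_FLOORS}
    for (t_prev, _), (t_cur, hr) in zip(records, records[1:]):
        # Clip the gap so a paused recording cannot dump hours into one zone.
        dt = min(max(t_cur - t_prev, 0), max_gap_s)
        zone = bucket(hr) if hr is not None else None
        if zone:
            totals[zone] += dt
    return totals


# The jump from t=2 to t=600 simulates a pause and is clipped to 30 s.
records = [(0, 120), (1, 125), (2, 150), (600, 150), (601, 170)]
totals = time_in_zone_from_records(records)
```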
**From splits (`source='lap'`)** — fallback when no FIT is linked:
```
for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone
```
The FIT version is the right number; the lap version is **biased toward the middle zone**: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
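The bias is easy to reproduce on synthetic data. In this sketch (all names and numbers are illustrative, not drawn from the dataset) a single 600 s lap spends equal time at 135, 150 and 165 bpm, so its average sits in Z3 even though two thirds of the lap was elsewhere.

```python
ZONE_FLOORS = [(186, "Z5"), (165, "Z4"), (144, "Z3"), (123, "Z2"), (102, "Z1")]


def bucket(hr):
    """Map a heart rate to its zone name, or None below Z1."""
    for floor, name in ZONE_FLOORS:
        if hr >= floor:
            return name
    return None


def tally(durations_and_hr):
    """Sum seconds per zone from (duration_s, heart_rate) pairs."""
    totals = {}
    for duration_s, hr in durations_and_hr:
        zone = bucket(hr)
        if zone:
            totals[zone] = totals.get(zone, 0) + duration_s
    return totals


# One 600 s lap: 200 s each at 135, 150 and 165 bpm (time-weighted mean 150).
segments = [(200, 135), (200, 150), (200, 165)]
lap_avg_hr = sum(d * hr for d, hr in segments) / sum(d for d, _ in segments)

by_lap = tally([(600, lap_avg_hr)])   # lap method: the whole lap lands in Z3
by_segment = tally(segments)          # finer resolution: split across Z2/Z3/Z4
```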
### Polarized split
A three-bucket simplification used in [05_intra_run.ipynb §4](notebooks/05_intra_run.ipynb):
```
easy = Z1 + Z2 (HR ≤ 143)
moderate = Z3 (144–164, the "junk-miles middle")
hard = Z4 + Z5 (≥ 165, threshold and up)
```
Seiler's polarized target is roughly 80% easy, with most of the remaining 20% hard and under 10% in the moderate zone. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training inverts the polarized pattern, putting most time in the moderate zone.
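Collapsing the five zones into the three buckets is a short function. A sketch, assuming time-in-zone arrives as a dict of seconds keyed `Z1`..`Z5`; `polarized_split` is a hypothetical helper.

```python
def polarized_split(seconds_by_zone):
    """Return the easy / moderate / hard share of total zone time, in percent."""
    easy = seconds_by_zone.get("Z1", 0) + seconds_by_zone.get("Z2", 0)
    moderate = seconds_by_zone.get("Z3", 0)
    hard = seconds_by_zone.get("Z4", 0) + seconds_by_zone.get("Z5", 0)
    total = easy + moderate + hard
    if total == 0:
        return {"easy": 0.0, "moderate": 0.0, "hard": 0.0}
    return {
        "easy": 100 * easy / total,
        "moderate": 100 * moderate / total,
        "hard": 100 * hard / total,
    }


week = {"Z1": 600, "Z2": 4200, "Z3": 300, "Z4": 600, "Z5": 300}
split = polarized_split(week)
```

For this sample week the split is 80 / 5 / 15, which would count as polarized.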
### Cadence & stride
Both come directly from Garmin's lap data (`averageRunCadence`, `strideLength`) and from FIT records (`cadence`, `step_length`). Two gotchas:
1. **Cadence is already both-legs SPM** for this user's exports, so do not double it. Sanity range for running: 140–200 spm.
2. **Stride length is in cm** in lap data (`strideLength: 78` → 78 cm → 0.78 m) and in **mm** in FIT records (`step_length: 1041` → 104.1 cm). Divide accordingly.
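Both gotchas fit in three small helpers. The function names are hypothetical; the divisors and the 140–200 spm range come from the list above.

```python
def lap_stride_m(stride_length_cm):
    """Lap-level strideLength is in centimetres."""
    return stride_length_cm / 100


def fit_step_m(step_length_mm):
    """FIT record step_length is in millimetres."""
    return step_length_mm / 1000


def plausible_run_cadence(spm):
    """True when a both-legs cadence falls in the running sanity range."""
    return 140 <= spm <= 200


lap_stride = lap_stride_m(78)
fit_step = fit_step_m(1041)
ok = plausible_run_cadence(170)
doubled = plausible_run_cadence(170 * 2)   # what mistaken doubling produces
```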
The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.
Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says way more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to `sklearn.cluster.DBSCAN(metric='haversine')` for thousands.
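One simple way to group runs by start point is a greedy pass over haversine distances. This is an assumption for illustration and not necessarily what the notebook does; the 150 m radius, `haversine_m` and `cluster_starts` are all hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_starts(starts, radius_m=150):
    """Label each (lat, lon) with the first cluster centre within radius_m."""
    centres, labels = [], []
    for lat, lon in starts:
        for idx, (clat, clon) in enumerate(centres):
            if haversine_m(lat, lon, clat, clon) <= radius_m:
                labels.append(idx)
                break
        else:
            # No existing centre is close enough: this start seeds a new cluster.
            centres.append((lat, lon))
            labels.append(len(centres) - 1)
    return labels


starts = [(47.0, 8.0), (47.0001, 8.0001), (47.1, 8.1)]
labels = cluster_starts(starts)
one_degree = haversine_m(0.0, 0.0, 0.0, 1.0)
```

Each start is compared against every existing centre, so the cost grows with runs times clusters, which is why a library implementation becomes attractive at larger scale.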
### Banister fitness / fatigue / form (Performance Management Chart)
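The standard endurance-training lens (TrainingPeaks PMC), implemented in `analysis.banister(daily_load, ctl_tau=42, atl_tau=7)`. CTL (fitness) and ATL (fatigue) are exponentially weighted averages of daily load with 42-day and 7-day time constants; TSB (form) is CTL minus ATL.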
Helper: `daily_training_load_series(conn, activity_types=('running','trail_running'))` returns the per-day summed `training_load` Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
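The two EWMAs can be sketched without pandas. This is not the body of `analysis.banister`: the smoothing factor `1 - exp(-1/tau)`, the zero starting values, and the same-day TSB are assumptions, and other PMC implementations use `1/tau` or the previous day's values instead.

```python
import math


def banister_sketch(daily_load, ctl_tau=42, atl_tau=7):
    """Return one {'ctl', 'atl', 'tsb'} dict per day of load (0 on rest days)."""
    ctl_alpha = 1 - math.exp(-1 / ctl_tau)
    atl_alpha = 1 - math.exp(-1 / atl_tau)
    ctl = atl = 0.0
    series = []
    for load in daily_load:
        ctl += ctl_alpha * (load - ctl)   # slow-moving fitness
        atl += atl_alpha * (load - atl)   # fast-moving fatigue
        series.append({"ctl": ctl, "atl": atl, "tsb": ctl - atl})
    return series


# 60 days of steady training followed by a 7-day rest.
series = banister_sketch([100.0] * 60 + [0.0] * 7)
before_rest, after_rest = series[59], series[-1]
```

Across the rest week both averages decay, but ATL falls much further than CTL, so TSB swings from negative to positive.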
TSB interpretation (used in both notebooks 02 and 06):
| TSB | meaning |
|---|---|
| < −30 | severely fatigued — injury risk |
| −10 to −30 | productive overload — heart of a build |
[02_running.ipynb](notebooks/02_running.ipynb) plots the historical PMC. [06_race_plan.ipynb](notebooks/06_race_plan.ipynb) projects it forward through race day by converting plan-week km into estimated daily loads. It uses a separate `RACE_DAY_TL_PER_KM ≈ 7` for the race itself versus `~11` for training runs, because race-day TL/km is empirically lower: ultras spread load over more time.
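The km-to-load conversion might look like the sketch below. It assumes training km are spread evenly across the week and that a race, when present, falls on the last day and is not included in the week's training km. `TRAINING_TL_PER_KM` and `week_to_daily_loads` are hypothetical names; only the two per-km values come from the text.

```python
TRAINING_TL_PER_KM = 11.0   # hypothetical constant name
RACE_DAY_TL_PER_KM = 7.0


def week_to_daily_loads(training_km, race_km=None):
    """Return seven estimated daily loads for one plan week."""
    training_days = 6 if race_km else 7
    per_day = training_km / training_days * TRAINING_TL_PER_KM
    loads = [per_day] * training_days
    if race_km:
        loads.append(race_km * RACE_DAY_TL_PER_KM)
    return loads


build_week = week_to_daily_loads(70)
race_week = week_to_daily_loads(30, race_km=50)
```

The resulting daily loads can be appended to the historical series and run through the same EWMAs to project CTL, ATL and TSB forward.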
---
## 4 — Notebook tour
Each notebook bootstraps the same way:
```python
import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness  # plus any other helpers the notebook uses
conn = open_conn()
```
| # | Notebook | Covers |
|---|---|---|
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |
Notebooks 05 and 06 are generated from the `_build_05.py` / `_build_06.py` build scripts. Edit those, then run `uv run python notebooks/_build_NN.py` to regenerate the `.ipynb`. Easier than editing 30 KB of JSON.
---
## 5 — Setup
```bash
uv sync # creates .venv from pyproject.toml
```
Kernel: in VSCode, pick the project's `.venv` (top-right of any notebook). If it's not discoverable:

### Path A: official data export
1. Go to <https://www.garmin.com/account/datamanagement/exportdata> and request the export. Garmin emails the link within 24–72 h.
2. Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
3. Ingest:
```bash
uv run ingest_export.py path/to/connect.zip
# or, if already unzipped:
uv run ingest_export.py path/to/<export>/
```
Add `--dry-run` to preview without writing. The data is upserted on `activity_id` / `calendar_date`, so it's safe to re-ingest a refreshed export later.
For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.
### Path B — live API sync
Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.
```bash
uv run auth.py # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full # 365-day backfill
uv run sync.py # incremental thereafter (last 14 days)
uv run sync.py --days 90 # custom backfill window
uv run sync.py --no-details # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities # wellness only
```
`garth` is deprecated upstream (<https://github.com/matin/garth/discussions/222>). If the live sync stops working, fall back to Path A.