openrun/README.md
2026-05-19 08:34:22 -04:00

# openrun — endurance running analytics
A local-first, open-source endurance-training analysis pipeline. Garmin data in (live API or official export), SQLite on disk, pandas notebooks out. Banister CTL/ATL/TSB, Pa:HR decoupling (per-mile and per-second from FIT), Garmin-configured HR-zone time-in-zone, route clustering, race-plan projection with forward CTL/ATL/TSB.
This README focuses on what's in the data, what gets computed from it, and how each derived metric is defined. The package name `openrun` is provisional; the local repo can be renamed without code changes.
```
openrun/
├── pyproject.toml              # packaged as `openrun` (src layout, hatchling)
├── openrun.toml                # user config (HR zones, weight, LTHR, race calendar)
├── src/openrun/
│   ├── __init__.py             # re-exports the common API
│   ├── config.py               # UserProfile, HRZones, BanisterParams, TOML loader
│   ├── db.py                   # SQLite schema + connection helper
│   ├── model.py                # loaders + derived metrics (was analysis.py)
│   └── ingest/
│       ├── auth.py             # Garmin OAuth login
│       ├── garmin_api.py       # Path B: incremental sync via garth
│       ├── garmin_export.py    # Path A: parse Connect/Takeout JSON + index FITs
│       ├── fit_linker.py       # match Takeout FITs to activities by session.start_time
│       └── time_in_zone.py     # cache per-activity HR-zone breakdowns
├── examples/notebooks/
│   ├── 01_overview.ipynb       # data coverage, recent activity
│   ├── 02_running.ipynb        # volume, pace, PMC (Banister), PRs
│   ├── 03_recovery.ipynb       # sleep, HRV, RHR, body battery
│   ├── 04_efficiency.ipynb     # m/beat trends year over year
│   ├── 05_intra_run.ipynb      # decoupling, cadence, routes, HR zones
│   └── 06_race_plan.ipynb      # periodised plan + tracker + projected PMC
├── data/garmin.db              # SQLite (created on first run)
└── {auth,sync,ingest_export,link_fit_files,compute_time_in_zone,db,analysis}.py
                                # thin top-level shims for back-compat
```
Most callers want:
```python
from openrun import open_conn, load_activities, banister, daily_training_load_series
conn = open_conn()
runs = load_activities(conn, type='running')
pmc = banister(daily_training_load_series(conn))
```
User-specific values (HR zones, LTHR, weight, race calendar) live in `openrun.toml` — config discovery walks up from cwd looking for it. See [openrun.toml](openrun.toml) for the schema.
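A minimal sketch of that walk-up discovery, assuming nothing about the real loader in `config.py` beyond "nearest `openrun.toml` wins". The function name `find_config` is hypothetical, not part of the package API:

```python
from pathlib import Path
from typing import Optional


def find_config(start: Optional[Path] = None, name: str = "openrun.toml") -> Optional[Path]:
    """Return the nearest `name` in `start` or any of its parents, else None."""
    here = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for directory in (here, *here.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_config()` from inside `examples/notebooks/` would, under this sketch, find the `openrun.toml` at the repo root.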
CLI entry points (after `uv sync`):
```bash
openrun-auth # OAuth login (Path B)
openrun-sync [--full] # incremental Garmin Connect sync
openrun-ingest <export> # parse a Connect/Takeout export
openrun-link-fit <export> # match Takeout FITs to activities by content
openrun-time-in-zone # populate the activity_time_in_zone cache
```
Top-level shims (`uv run sync.py`, `uv run ingest_export.py`, etc.) still work — they re-export `main()` from the corresponding `openrun.ingest.*` module.
---
## 1 — Source data
### Two ways to get data into the DB
| Path | Script | What you get | When to use |
|---|---|---|---|
| **A. Official data export** (recommended) | `ingest_export.py` | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| **B. Live API sync** | `sync.py` (after `auth.py`) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |
The two paths overlap intentionally: the export is the source of truth, the live API fills the gap between exports. The upsert logic only overwrites the raw-JSON column and `fetched_at` on conflict, so Path B values aren't clobbered by Path A re-ingests.
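The conflict behaviour described above can be illustrated with SQLite's `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24+). This is a sketch against a cut-down, hypothetical `activities` table; the real column list lives in `db.py:SCHEMA` and the real statement may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reduced schema, for illustration only.
conn.execute(
    "CREATE TABLE activities ("
    " activity_id INTEGER PRIMARY KEY, avg_hr REAL, raw TEXT, fetched_at TEXT)"
)

UPSERT = """
INSERT INTO activities (activity_id, avg_hr, raw, fetched_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(activity_id) DO UPDATE SET
    raw = excluded.raw,
    fetched_at = excluded.fetched_at
"""

# First write (say, from the live API) populates every column.
conn.execute(UPSERT, (1, 152.0, '{"src": "api"}', "2025-01-01T00:00:00"))
# A later re-ingest of the same activity refreshes only raw and fetched_at.
conn.execute(UPSERT, (1, None, '{"src": "export"}', "2025-02-01T00:00:00"))

row = conn.execute("SELECT avg_hr, raw, fetched_at FROM activities").fetchone()
```

After both writes, `avg_hr` still holds the value from the first insert, while `raw` and `fetched_at` reflect the second.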
### Garmin's export formats
Garmin actually has *two* export formats, both confusingly called "the export":
1. **Garmin Connect data export** (`connect.zip`) — request via Account Management → Export Your Data. Filenames are `<activity_id>_<activity_name>.fit`, distances/durations in SI units (m, s). `ingest_export.py` was built for this.
2. **Garmin Takeout dump** — UUID-named folder (`<uuid>_N/`) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under `DI_CONNECT/`. **Crucially different conventions:**
- FIT filenames: `<email>_<upload_id>.fit` — upload IDs ≠ activity IDs
- `summarizedActivities` uses scaled-integer units: distance in **cm**, duration in **ms**, elevation in **cm**, speed in **m/s ÷ 10**
- `ingest_export.py` now detects and converts these; `link_fit_files.py` links FITs by content (parses `session.start_time` and matches activities by ±60 s) since the filename-based approach doesn't work
For a Takeout dump, the full sequence is:
```bash
# 1. Extract the FIT archives in place
mkdir -p <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
unzip -d <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit \
<export>/DI_CONNECT/DI-Connect-Uploaded-Files/UploadedFiles_0-_Part*.zip
# 2. Ingest JSON + index FITs that have parseable activity IDs
uv run ingest_export.py <export>/
# 3. Link Takeout FITs to activities by start-time match
uv run link_fit_files.py <export>/
# 4. (Optional) precompute time-in-zone cache
uv run compute_time_in_zone.py
```
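Step 3 links by content: each FIT's `session.start_time` is matched to an activity start within ±60 s. One way to sketch that matching is `pandas.merge_asof` with a tolerance. The frames, column names, and file names below are made up for illustration, and the real `link_fit_files.py` may work differently:

```python
import pandas as pd

# Hypothetical inputs: activity start times from the DB, session starts parsed from FITs.
activities = pd.DataFrame({
    "activity_id": [101, 102],
    "start_time": pd.to_datetime(["2025-03-01 07:00:05", "2025-03-02 18:30:00"]),
})
fits = pd.DataFrame({
    "fit_path": ["a.fit", "b.fit", "c.fit"],
    "session_start": pd.to_datetime(
        ["2025-03-01 07:00:00", "2025-03-02 18:45:00", "2025-03-02 18:30:40"]
    ),
})

# merge_asof needs both sides sorted on their keys; unmatched FITs get NaN.
linked = pd.merge_asof(
    fits.sort_values("session_start"),
    activities.sort_values("start_time"),
    left_on="session_start",
    right_on="start_time",
    direction="nearest",
    tolerance=pd.Timedelta(seconds=60),
)
matched = linked.dropna(subset=["activity_id"])
```

Here `b.fit` starts 15 minutes from the nearest activity and stays unlinked. The sketch does not stop two FITs from matching the same activity, which a real linker would need to handle.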
### Garmin's units & quirks worth knowing
| Field | Live API | Takeout summarizedActivities | Convert by |
|---|---|---|---|
| distance | m | cm | × 0.01 |
| duration / movingDuration | s | ms | × 0.001 |
| elevationGain / Loss | m | cm | × 0.01 |
| averageSpeed / maxSpeed | m/s | m/s ÷ 10 | × 10 |
| HR (avg, max) | bpm | bpm | — |
| calories | kcal | kcal | — |
| training_effect, vo2_max | scalar | scalar | — |
| cadence (`averageRunCadence`) | both-legs SPM | both-legs SPM | — (do **not** double) |
| strideLength | cm | cm | × 0.01 for m |
| FIT `position_lat/long` | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT `enhanced_speed` | m/s | m/s | — |
The cadence trap bit me: `averageRunCadence` on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+ which trips any sane filter, then nothing makes it through.
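As a sketch, the conversions in the table above could be applied like this. The helper names are hypothetical and the field keys follow the table, so check them against the actual `summarizedActivities` payload before relying on them:

```python
# Multiply a Takeout value by its factor to get the live-API (SI) unit.
TAKEOUT_SCALE = {
    "distance": 0.01,         # cm -> m
    "duration": 0.001,        # ms -> s
    "movingDuration": 0.001,  # ms -> s
    "elevationGain": 0.01,    # cm -> m
    "elevationLoss": 0.01,    # cm -> m
    "averageSpeed": 10.0,     # (m/s ÷ 10) -> m/s
    "maxSpeed": 10.0,         # (m/s ÷ 10) -> m/s
}


def takeout_to_si(summary: dict) -> dict:
    """Return a copy with the scaled-integer fields converted; other keys pass through."""
    out = dict(summary)
    for key, factor in TAKEOUT_SCALE.items():
        if out.get(key) is not None:
            out[key] = out[key] * factor
    return out


def semicircles_to_degrees(semicircles: int) -> float:
    """Convert a FIT position_lat / position_long value to degrees."""
    return semicircles * (180.0 / 2**31)


example = takeout_to_si(
    {"distance": 1_000_000, "duration": 3_600_000, "averageRunCadence": 168}
)
```

Cadence is deliberately absent from the scale map, so it passes through untouched.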
---
## 2 — Schema
`db.py:SCHEMA` is the source of truth; what follows is the gist.
| Table | Grain | Origin | Notes |
|---|---|---|---|
| `activities` | one row per activity | both paths | `raw` JSON has every Garmin field; the parsed columns are a curated subset |
| `activity_splits` | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). `raw` has cadence, stride, vert osc, GPS bounds |
| `activity_fit_files` | one row per linked FIT | Path A | `fit_path` is **relative** to whichever export root was passed to `link_fit_files.py` — keep the export folder in place |
| `activity_time_in_zone` | one row per activity | precomputed | Cached by `compute_time_in_zone.py`; `source` is `'fit'` or `'lap'` |
| `daily_steps` / `daily_sleep` / `daily_stress` / `daily_hrv` / `daily_body_battery` / `daily_intensity_minutes` / `daily_resting_hr` | one row per calendar date | both paths | Each has a `raw` JSON column with the full source payload |
| `sync_state` | key-value | both paths | Records last-sync / last-ingest timestamps |
The `raw` column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from `raw` rather than re-ingesting:
```python
from analysis import expand_raw, load_activities, open_conn
acts = load_activities(open_conn(), type='running')
fields = expand_raw(acts) # pd.json_normalize'd companion frame
```
---
## 3 — Metrics & how they're calculated
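All loaders and helpers live in `src/openrun/model.py` (formerly `analysis.py`); the top-level [analysis.py](analysis.py) shim keeps the `analysis.*` names used below importable. The key derivations: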
### Pace
```
pace_min_per_km = (moving_duration_s / 60) / (distance_m / 1000)
```
`moving_duration_s` excludes paused time. Activities with implausible paces (sub-3 min/km or over 30 min/km — covers world-record marathon pace on the fast side and walks-with-stops on the slow side) get masked to NaN in `load_activities` rather than dropped, so the row count is preserved.
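A pandas sketch of the formula and the NaN masking, on made-up rows. The column names mirror the formula above, but the real `load_activities` may differ in detail:

```python
import pandas as pd

runs = pd.DataFrame({
    "distance_m": [10000.0, 5000.0, 1000.0, 0.0],
    "moving_duration_s": [3000.0, 600.0, 2400.0, 100.0],
})

# Treat zero distance as missing so the division yields NaN rather than inf.
km = runs["distance_m"].where(runs["distance_m"] > 0) / 1000
pace = (runs["moving_duration_s"] / 60) / km

# Keep only plausible paces (3 to 30 min/km); everything else becomes NaN.
runs["pace_min_per_km"] = pace.where(pace.between(3, 30))
```

Only the first row (5.0 min/km) survives; the 2 min/km and 40 min/km rows and the zero-distance row are masked, and the frame still has four rows.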
### Acute:Chronic Workload Ratio (ACWR)
Daily training-load sum, then:
```
acute = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR = acute / chronic
```
Sweet spot ~0.8–1.3, > 1.5 = aggressive ramp / injury risk. Implemented inline in [02_running.ipynb](notebooks/02_running.ipynb).
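The same calculation as a rolling-window sketch on synthetic data. It assumes both windows include the current day, which the notebook may or may not do:

```python
import pandas as pd

# Synthetic daily load: four steady weeks at 100/day, then one week at 150/day.
days = pd.date_range("2025-01-01", periods=35, freq="D")
load = pd.Series(100.0, index=days)
load.iloc[-7:] = 150.0

acute = load.rolling(7).sum()         # last 7 days
chronic = load.rolling(28).sum() / 4  # last 28 days as a weekly equivalent
acwr = acute / chronic
```

On day 28 the ratio is 1.0 (steady state); after the ramp week it is 1050 / 787.5 ≈ 1.33, just above the sweet spot. The first 27 days are NaN because the 28-day window is not yet full.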
### Aerobic efficiency — "m/beat"
```
m_per_beat = distance_m / (duration_s * avg_hr / 60)
```
Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in [04_efficiency.ipynb](notebooks/04_efficiency.ipynb).
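A worked instance of the formula as a small function (the name is illustrative):

```python
def m_per_beat(distance_m: float, duration_s: float, avg_hr: float) -> float:
    """Metres covered per heartbeat: distance divided by total beats."""
    total_beats = duration_s * avg_hr / 60  # minutes elapsed x beats per minute
    return distance_m / total_beats


# 10 km in 50 minutes at an average of 150 bpm -> 7500 beats.
value = m_per_beat(10_000, 3_000, 150)
```

That run comes out at about 1.33 m/beat; the same distance and time at 140 bpm would give about 1.43.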
### Decoupling (Pa:HR drift)
Friel's method, in two resolutions. *Positive* = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). *Negative* = improved (negative split).
**Per-mile (lap level)**, `analysis.decoupling(splits, min_splits=6)`:
```
For each activity with ≥ min_splits laps:
    halves = first half / second half by lap count
    eff_half = duration-weighted mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (eff_first / eff_second - 1) × 100
```
Cheap, but smears across aid-station stops and rounds at lap boundaries.
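A sketch of the lap-level calculation for a single activity. It assumes both speed and heart rate are duration-weighted (the pseudocode above is ambiguous on the HR side), and the function and column names are illustrative rather than the real `analysis.decoupling` signature:

```python
import pandas as pd


def lap_decoupling_sketch(laps: pd.DataFrame, min_splits: int = 6) -> float:
    """Percent Pa:HR decoupling between the first and second half of one activity's laps."""
    if len(laps) < min_splits:
        return float("nan")
    mid = len(laps) // 2

    def efficiency(half: pd.DataFrame) -> float:
        weights = half["duration_s"]
        speed = (half["speed_mps"] * weights).sum() / weights.sum()
        hr = (half["heart_rate"] * weights).sum() / weights.sum()
        return speed / hr

    return (efficiency(laps.iloc[:mid]) / efficiency(laps.iloc[mid:]) - 1) * 100


# Six equal laps at constant speed; HR drifts from 150 to 160 bpm in the second half.
laps = pd.DataFrame({
    "duration_s": [300.0] * 6,
    "speed_mps": [3.0] * 6,
    "heart_rate": [150.0] * 3 + [160.0] * 3,
})
drift = lap_decoupling_sketch(laps)
```

With speed held constant, the result reduces to 160/150 - 1, about 6.7%.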
**Per-second (FIT)**, `analysis.fit_decoupling(records, segments=N, warmup_min=5, cooldown_min=2, min_speed_mps=0.5)`:
```
records = per-second FIT messages
drop first warmup_min and last cooldown_min
drop records where speed_mps < min_speed_mps   # ignore stopped time
slice remaining moving time into N equal-time chunks
for each chunk i:
    efficiency[i] = mean(speed_mps) / mean(heart_rate)
    decoupling_pct[i] = (efficiency[0] / efficiency[i] - 1) × 100
```
Friel's published thresholds (interpret on **steady aerobic** efforts only):
- **< 5%**: aerobically developed
- **5–10%**: sustainable
- **> 10%**: pacing unsustainable for the distance, OR fueling shortfall, OR heat
In this dataset the per-second number is consistently **7–15 percentage points lower** than the per-mile number for the same run; the gap is aid-station noise.
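The per-second variant could be sketched as follows. It assumes 1 Hz records, so equal row counts stand in for equal moving time, and the column names `elapsed_s`, `speed_mps`, and `heart_rate` are assumptions. The real `analysis.fit_decoupling` may slice differently:

```python
import numpy as np
import pandas as pd


def fit_decoupling_sketch(records, segments=2, warmup_min=5, cooldown_min=2,
                          min_speed_mps=0.5):
    """Percent decoupling of each equal-time chunk relative to the first chunk."""
    t = records["elapsed_s"]
    in_window = (t >= warmup_min * 60) & (t <= t.max() - cooldown_min * 60)
    moving = records[in_window & (records["speed_mps"] >= min_speed_mps)]
    if len(moving) < segments:
        return pd.Series(dtype=float)
    # Assign each remaining row to one of `segments` equally sized chunks.
    chunk = np.arange(len(moving)) * segments // len(moving)
    grouped = moving.groupby(chunk)
    efficiency = grouped["speed_mps"].mean() / grouped["heart_rate"].mean()
    return (efficiency.iloc[0] / efficiency - 1) * 100


# Synthetic one-hour run: constant speed, HR steps from 150 to 156 bpm at the
# midpoint of the retained window (t = 1890 s).
t = np.arange(3600)
records = pd.DataFrame({
    "elapsed_s": t,
    "speed_mps": 3.0,
    "heart_rate": np.where(t < 1890, 150.0, 156.0),
})
decoupling = fit_decoupling_sketch(records)
```

The first chunk is 0% by construction; the second comes out at 156/150 - 1 = 4%, inside the "aerobically developed" band.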
### HR zone time-in-zone
**Zone boundaries come from Garmin's `heartRateZones.json`** (training method `HR_MAX`), not estimated from observed HR. The constants for this user live in `analysis.HR_ZONES_USER`:
```python
Z1: 102–122   (recovery)
Z2: 123–143   (easy aerobic, long-run target)
Z3: 144–164   (tempo / "junk-miles middle")
Z4: 165–185   (threshold; LTHR=182 sits inside Z4)
Z5: 186–209   (VO₂ max)
```
Time-in-zone has two implementations; `compute_time_in_zone.py` picks the better one per activity and caches the result in `activity_time_in_zone`.
**From FIT (`source='fit'`)**, `analysis.time_in_zone_from_fit(records)`:
```
records = per-second FIT messages
for each consecutive record pair:
    dt = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone
```
**From splits (`source='lap'`)** — fallback when no FIT is linked:
```
for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone
```
The FIT version is the right number; the lap version is **biased toward the middle zone**: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
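A NumPy sketch of the FIT-based accumulation, using the zone lower bounds listed above. It credits each time delta to the first record of the pair, which is an assumption, and the function name is illustrative:

```python
import numpy as np

# Lower bpm bound of Z1..Z5, taken from the zone table above.
ZONE_LOWER_BOUNDS = [102, 123, 144, 165, 186]


def time_in_zone_sketch(elapsed_s, heart_rate, max_gap_s=30.0):
    """Seconds per zone, with each inter-record gap clipped to [0, max_gap_s]."""
    elapsed_s = np.asarray(elapsed_s, dtype=float)
    heart_rate = np.asarray(heart_rate, dtype=float)
    dt = np.clip(np.diff(elapsed_s), 0.0, max_gap_s)
    # digitize gives 0 for HR below Z1 and 1..5 for Z1..Z5 (anything >= 186 is Z5).
    zone = np.digitize(heart_rate[:-1], ZONE_LOWER_BOUNDS)
    totals = np.bincount(zone, weights=dt, minlength=6)
    return {("below_Z1" if z == 0 else f"Z{z}"): float(totals[z]) for z in range(6)}


# A 600 s recording pause between t=3 and t=603 is clipped to 30 s.
tiz = time_in_zone_sketch(
    elapsed_s=[0, 1, 2, 3, 603, 604],
    heart_rate=[130, 130, 150, 150, 170, 170],
)
```

Without the clip, the pause would have added ten minutes to Z3 on its own.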
### Polarized split
A three-bucket simplification used in [05_intra_run.ipynb §4](notebooks/05_intra_run.ipynb):
```
easy = Z1 + Z2 (HR ≤ 143)
moderate = Z3 (144–164, the "junk-miles middle")
hard = Z4 + Z5 (≥ 165, threshold and up)
```
Seiler's polarized target is roughly 80% easy, with the remaining ~20% mostly hard and under 10% moderate. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training is the opposite: most time in moderate.
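Collapsing per-zone seconds into the three buckets is a few lines. This sketch assumes a plain dict keyed `Z1`..`Z5`; the notebook's actual data shape may differ:

```python
def polarized_split(zone_seconds: dict) -> dict:
    """Fractions of total time in the easy / moderate / hard buckets."""
    easy = zone_seconds.get("Z1", 0) + zone_seconds.get("Z2", 0)
    moderate = zone_seconds.get("Z3", 0)
    hard = zone_seconds.get("Z4", 0) + zone_seconds.get("Z5", 0)
    total = easy + moderate + hard
    if total == 0:
        return {"easy": 0.0, "moderate": 0.0, "hard": 0.0}
    return {"easy": easy / total, "moderate": moderate / total, "hard": hard / total}


# Made-up week: 100 minutes in total.
split = polarized_split({"Z1": 600, "Z2": 4200, "Z3": 300, "Z4": 600, "Z5": 300})
```

This example lands at 80% easy, 5% moderate, 15% hard.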
### Cadence & stride
Both come directly from Garmin's lap data (`averageRunCadence`, `strideLength`) and from FIT records (`cadence`, `step_length`). Two gotchas:
1. **Cadence is already both-legs SPM** for this user's exports; do not double. Sanity range for running: 140–200 spm.
2. **Stride length is in cm** in lap data (`strideLength: 78` → 78 cm → 0.78 m) and in **mm** in FIT records (`step_length: 1041` → 104 cm). Divide accordingly.
The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.
### GPS route clustering
`analysis.cluster_routes(lats, lons, radius_km=0.25)`:
```
Greedy haversine clustering:
    For each unassigned point i:
        distances = haversine(point_i, all_other_points)
        neighbours = points within radius_km
        if len(neighbours) >= 2:
            assign all to next_label
            next_label += 1
```
Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says way more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to `sklearn.cluster.DBSCAN(metric='haversine')` for thousands.
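A self-contained sketch of the greedy pass. Two details are assumptions: the neighbour count includes the seed point itself (so a cluster needs at least one other start nearby), and unclustered points are labelled `-1`. The real `analysis.cluster_routes` may differ on both:

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, arrays broadcast."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def cluster_routes_sketch(lats, lons, radius_km=0.25):
    """Label start points greedily; -1 marks points with no neighbour in range."""
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    labels = np.full(len(lats), -1)
    next_label = 0
    for i in range(len(lats)):
        if labels[i] != -1:
            continue
        dist = haversine_km(lats[i], lons[i], lats, lons)
        # Only still-unassigned points (including i itself) can join the new cluster.
        neighbours = np.flatnonzero((dist <= radius_km) & (labels == -1))
        if len(neighbours) >= 2:
            labels[neighbours] = next_label
            next_label += 1
    return labels


# Two pairs of nearby starts plus one isolated start.
labels = cluster_routes_sketch(
    lats=[40.0000, 40.0010, 40.1000, 40.1005, 41.0000],
    lons=[-75.0, -75.0, -75.0, -75.0, -75.0],
)
```

Because assignment is greedy, results depend on point order. That is acceptable for a few hundred starts, and one more reason to prefer DBSCAN at scale.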
### Banister fitness / fatigue / form (Performance Management Chart)
The standard endurance-training lens (TrainingPeaks PMC), implemented in `analysis.banister(daily_load, ctl_tau=42, atl_tau=7)`:
```
CTL_today = CTL_yesterday · exp(-1/τ_CTL) + load_today · (1 - exp(-1/τ_CTL))    τ_CTL = 42d
ATL_today = ATL_yesterday · exp(-1/τ_ATL) + load_today · (1 - exp(-1/τ_ATL))    τ_ATL = 7d
TSB_today = CTL_yesterday - ATL_yesterday                                        # yesterday's values (TP convention)
```
Helper: `daily_training_load_series(conn, activity_types=('running','trail_running'))` returns the per-day summed `training_load` Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
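The recursion above is short enough to sketch directly. This version seeds CTL and ATL at 0, which is an assumption; the real `analysis.banister` may initialise differently, and the function and column names are illustrative:

```python
import numpy as np
import pandas as pd


def banister_sketch(daily_load: pd.Series, ctl_tau: float = 42.0,
                    atl_tau: float = 7.0) -> pd.DataFrame:
    """Daily CTL / ATL / TSB from a per-day training-load Series."""
    k_ctl = np.exp(-1.0 / ctl_tau)
    k_atl = np.exp(-1.0 / atl_tau)
    ctl = atl = 0.0
    rows = []
    for load in daily_load.fillna(0.0):
        tsb = ctl - atl  # uses yesterday's CTL and ATL (TP convention)
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        rows.append((ctl, atl, tsb))
    return pd.DataFrame(rows, index=daily_load.index, columns=["ctl", "atl", "tsb"])


# 60 days at load 100, followed by a 14-day rest block.
days = pd.date_range("2025-01-01", periods=74, freq="D")
load = pd.Series([100.0] * 60 + [0.0] * 14, index=days)
pmc = banister_sketch(load)
```

During the build, ATL outruns CTL and TSB is negative; during the rest block ATL decays faster than CTL, so TSB climbs back above zero.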
TSB interpretation (used in both notebooks 02 and 06):
| TSB | meaning |
|---|---|
| < -30 | severely fatigued, injury risk |
| -10 to -30 | productive overload, heart of a build |
| -10 to 0 | balanced building |
| 0 to +10 | sharpening |
| **+10 to +25** | **fresh / peaked: race-day target** |
| > +25 | detrained (taper too long) |
[02_running.ipynb](notebooks/02_running.ipynb) plots the historical PMC; [06_race_plan.ipynb](notebooks/06_race_plan.ipynb) projects it forward through race day by converting plan-week km into estimated daily loads (separate `RACE_DAY_TL_PER_KM ≈ 7` for the race itself vs `~11` for training runs — race-day TL/km is empirically lower because ultras spread load over more time).
---
## 4 — Notebook tour
Each notebook bootstraps the same way:
```python
import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness, ...
conn = open_conn()
```
| # | Notebook | Covers |
|---|---|---|
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 02 | Running | Weekly volume + rolling, pace-vs-HR scatter, aerobic-efficiency trend, ACWR, PRs by distance |
| 03 | Recovery | Sleep composition, after-run vs after-rest HRV/RHR, correlations between training and recovery, weekly km vs recovery indicators |
| 04 | Efficiency | YoY m/beat by distance bucket, HR/pace split, polynomial curve fits, easy-runs-only headline view |
| 05 | Intra-run | (1) per-mile decoupling, (1b) per-second decoupling on race FITs, (2) cadence & stride at controlled pace, (3) GPS route clustering, (4) HR-zone time-in-zone (FIT-based) + polarized split |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |
Notebooks 05 and 06 are generated from `_build_05.py` / `_build_06.py` build scripts. Edit those, then `uv run python notebooks/_build_NN.py` regenerates the .ipynb. Easier than editing 30 KB of JSON.
---
## 5 — Setup
```bash
uv sync # creates .venv from pyproject.toml
```
Kernel: in VSCode, pick the project's `.venv` (top-right of any notebook). If it's not discoverable:
```bash
uv run python -m ipykernel install --user --name garmin --display-name "garmin (.venv)"
```
### Path A — official export
1. Go to <https://www.garmin.com/account/datamanagement/exportdata> and request the export. Garmin emails the link within 24–72 h.
2. Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
3. Ingest:
```bash
uv run ingest_export.py path/to/connect.zip
# or, if already unzipped:
uv run ingest_export.py path/to/<export>/
```
Add `--dry-run` to preview without writing. The data is upserted on `activity_id` / `calendar_date`, so it's safe to re-ingest a refreshed export later.
For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.
### Path B — live API sync
Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.
```bash
uv run auth.py # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full # 365-day backfill
uv run sync.py # incremental thereafter (last 14 days)
uv run sync.py --days 90 # custom backfill window
uv run sync.py --no-details # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities # wellness only
```
`garth` is deprecated upstream (<https://github.com/matin/garth/discussions/222>). If the live sync stops working, fall back to Path A.