openrun — endurance running analytics
A local-first, open-source endurance-training analysis pipeline. Garmin data in (live API or official export), SQLite on disk, pandas notebooks out. Banister CTL/ATL/TSB, Pa:HR decoupling (per-mile and per-second from FIT), Garmin-configured HR-zone time-in-zone, route clustering, race-plan projection with forward CTL/ATL/TSB.
This README focuses on what's in the data, what gets computed from it, and how each derived metric is defined. The package name openrun is provisional; the local repo can be renamed without code changes.
openrun/
├── pyproject.toml # packaged as `openrun` (src layout, hatchling)
├── openrun.toml # user config (HR zones, weight, LTHR, race calendar)
├── src/openrun/
│ ├── __init__.py # re-exports the common API
│ ├── config.py # UserProfile, HRZones, BanisterParams, TOML loader
│ ├── db.py # SQLite schema + connection helper
│ ├── model.py # loaders + derived metrics (was analysis.py)
│ └── ingest/
│ ├── auth.py # Garmin OAuth login
│ ├── garmin_api.py # Path B: incremental sync via garth
│ ├── garmin_export.py # Path A: parse Connect/Takeout JSON + index FITs
│ ├── fit_linker.py # match Takeout FITs to activities by session.start_time
│ └── time_in_zone.py # cache per-activity HR-zone breakdowns
├── examples/notebooks/
│ ├── 01_overview.ipynb # data coverage, recent activity
│ ├── 02_running.ipynb # volume, pace, PMC (Banister), PRs
│ ├── 03_recovery.ipynb # sleep, HRV, RHR, body battery
│ ├── 04_efficiency.ipynb # m/beat trends year over year
│ ├── 05_intra_run.ipynb # decoupling, cadence, routes, HR zones
│ └── 06_race_plan.ipynb # periodised plan + tracker + projected PMC
├── data/garmin.db # SQLite (created on first run)
└── {auth,sync,ingest_export,link_fit_files,compute_time_in_zone,db,analysis}.py
# thin top-level shims for back-compat
Most callers want:
from openrun import open_conn, load_activities, banister, daily_training_load_series
conn = open_conn()
runs = load_activities(conn, type='running')
pmc = banister(daily_training_load_series(conn))
User-specific values (HR zones, LTHR, weight, race calendar) live in openrun.toml — config discovery walks up from cwd looking for it. See openrun.toml for the schema.
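The "walks up from cwd" discovery can be sketched as below. This is a minimal illustration, not the loader in config.py: the function name find_config and its signature are assumptions made for this example.

```python
from pathlib import Path
from typing import Optional


def find_config(start: Path, name: str = "openrun.toml") -> Optional[Path]:
    """Return the nearest `name` in `start` or one of its parents, else None.

    Hypothetical helper: the real loader lives in config.py and may differ.
    """
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None


# Typical call: start the search from the current working directory.
config_path = find_config(Path.cwd())
```

Because the search stops at the first match, a notebook launched from examples/notebooks/ would pick up the same openrun.toml as a script run from the repo root.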
CLI entry points (after uv sync):
openrun-auth # OAuth login (Path B)
openrun-sync [--full] # incremental Garmin Connect sync
openrun-ingest <export> # parse a Connect/Takeout export
openrun-link-fit <export> # match Takeout FITs to activities by content
openrun-time-in-zone # populate the activity_time_in_zone cache
Top-level shims (uv run sync.py, uv run ingest_export.py, etc.) still work — they re-export main() from the corresponding openrun.ingest.* module.
1 — Source data
Two ways to get data into the DB
| Path | Script | What you get | When to use |
|---|---|---|---|
| A. Official data export (recommended) | ingest_export.py | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| B. Live API sync | sync.py (after auth.py) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |
The two paths overlap intentionally: the export is the source of truth, the live API fills the gap between exports. The upsert logic only overwrites the raw-JSON column and fetched_at on conflict, so Path B values aren't clobbered by Path A re-ingests.
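A minimal sketch of that conflict rule, using an in-memory database. The three-column activities table here is a simplified stand-in for the real schema in db.py, and upsert_activity is a hypothetical helper name; only the raw and fetched_at columns come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE activities (
        activity_id INTEGER PRIMARY KEY,
        distance_m  REAL,   -- parsed column (simplified stand-in for the real schema)
        raw         TEXT,   -- full upstream JSON
        fetched_at  TEXT
    )
    """
)

# On conflict, only raw and fetched_at are refreshed; parsed columns keep their values.
# ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.
UPSERT = """
    INSERT INTO activities (activity_id, distance_m, raw, fetched_at)
    VALUES (:activity_id, :distance_m, :raw, :fetched_at)
    ON CONFLICT(activity_id) DO UPDATE SET
        raw        = excluded.raw,
        fetched_at = excluded.fetched_at
"""


def upsert_activity(conn: sqlite3.Connection, row: dict) -> None:
    conn.execute(UPSERT, row)


# Path B writes first, then a Path A re-ingest hits the same activity_id.
upsert_activity(conn, {"activity_id": 1, "distance_m": 10012.4,
                       "raw": '{"src": "api"}', "fetched_at": "2025-01-01T00:00:00Z"})
upsert_activity(conn, {"activity_id": 1, "distance_m": 10000.0,
                       "raw": '{"src": "export"}', "fetched_at": "2025-02-01T00:00:00Z"})

row = conn.execute(
    "SELECT distance_m, raw, fetched_at FROM activities WHERE activity_id = 1"
).fetchone()
```

After the second call, the parsed distance_m still holds the value written first, while raw and fetched_at reflect the re-ingest.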
Garmin's export formats
Garmin actually has two export formats, both confusingly called "the export":
- Garmin Connect data export (connect.zip): request via Account Management → Export Your Data. Filenames are <activity_id>_<activity_name>.fit, distances/durations in SI units (m, s). ingest_export.py was built for this.
- Garmin Takeout dump: UUID-named folder (<uuid>_N/) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under DI_CONNECT/. Crucially different conventions:
  - FIT filenames: <email>_<upload_id>.fit (upload IDs ≠ activity IDs)
  - summarizedActivities uses scaled-integer units: distance in cm, duration in ms, elevation in cm, speed in m/s ÷ 10
  - ingest_export.py now detects and converts these; link_fit_files.py links FITs by content (parses session.start_time and matches activities by ±60 s) since the filename-based approach doesn't work
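The ±60 s start-time match can be sketched as follows. This is an illustration under stated assumptions rather than the code in fit_linker.py: link_by_start_time is a hypothetical name, and the inputs are assumed to be already-parsed timezone-aware datetimes.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def link_by_start_time(fit_starts, activity_starts, tolerance=timedelta(seconds=60)):
    """Map each FIT path to the activity with the nearest start time, within tolerance.

    fit_starts:      {fit_path: session start datetime}
    activity_starts: {activity_id: activity start datetime}
    """
    ordered = sorted((ts, aid) for aid, ts in activity_starts.items())
    times = [ts for ts, _ in ordered]
    links = {}
    for path, ts in fit_starts.items():
        i = bisect_left(times, ts)
        # The nearest start is either just before or just after the insertion point.
        candidates = ordered[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        best_ts, best_id = min(candidates, key=lambda c: abs(c[0] - ts))
        if abs(best_ts - ts) <= tolerance:
            links[path] = best_id
    return links


utc = timezone.utc
activities = {
    101: datetime(2024, 5, 4, 7, 0, 0, tzinfo=utc),
    102: datetime(2024, 5, 5, 6, 30, 0, tzinfo=utc),
}
fits = {
    "upload_a.fit": datetime(2024, 5, 4, 7, 0, 42, tzinfo=utc),  # 42 s after activity 101
    "upload_b.fit": datetime(2024, 5, 5, 6, 35, 0, tzinfo=utc),  # 5 min off, stays unlinked
}
links = link_by_start_time(fits, activities)
```

Leaving out-of-tolerance files unlinked keeps a FIT from an unrelated domain in the dump from attaching to the wrong run.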
For a Takeout dump, the full sequence is:
# 1. Extract the FIT archives in place
mkdir -p <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
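# unzip takes one archive per call; a shell-expanded glob would treat the extra zips as member names
for z in <export>/DI_CONNECT/DI-Connect-Uploaded-Files/UploadedFiles_0-_Part*.zip; do
  unzip -d <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit "$z"
done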
# 2. Ingest JSON + index FITs that have parseable activity IDs
uv run ingest_export.py <export>/
# 3. Link Takeout FITs to activities by start-time match
uv run link_fit_files.py <export>/
# 4. (Optional) precompute time-in-zone cache
uv run compute_time_in_zone.py
Garmin's units & quirks worth knowing
| Field | Live API | Takeout summarizedActivities | Convert by |
|---|---|---|---|
| distance | m | cm | × 0.01 |
| duration / movingDuration | s | ms | × 0.001 |
| elevationGain / Loss | m | cm | × 0.01 |
| averageSpeed / maxSpeed | m/s | m/s ÷ 10 | × 10 |
| HR (avg, max) | bpm | bpm | none |
| calories | kcal | kcal | none |
| training_effect, vo2_max | scalar | scalar | none |
| cadence (averageRunCadence) | both-legs SPM | both-legs SPM | none (do not double) |
| strideLength | cm | cm | × 0.01 for m |
| FIT position_lat/long | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT enhanced_speed | m/s | m/s | none |
The cadence trap bit me: averageRunCadence on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+ which trips any sane filter, then nothing makes it through.
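The conversions in the table can be sketched as a pair of small helpers. The function names are hypothetical, and the key elevationLoss is an assumption expanded from the "elevationGain / Loss" row; this is not the detection logic in ingest_export.py.

```python
# Multipliers that take Takeout summarizedActivities values to SI units.
TAKEOUT_FACTORS = {
    "distance": 0.01,         # cm -> m
    "duration": 0.001,        # ms -> s
    "movingDuration": 0.001,  # ms -> s
    "elevationGain": 0.01,    # cm -> m
    "elevationLoss": 0.01,    # cm -> m (key name assumed)
    "averageSpeed": 10.0,     # (m/s ÷ 10) -> m/s
    "maxSpeed": 10.0,         # (m/s ÷ 10) -> m/s
}


def takeout_to_si(summary: dict) -> dict:
    """Return a copy of a Takeout activity summary with scaled fields in SI units."""
    out = dict(summary)
    for key, factor in TAKEOUT_FACTORS.items():
        if out.get(key) is not None:
            out[key] = out[key] * factor
    return out


def semicircles_to_degrees(semicircles: int) -> float:
    """FIT position_lat / position_long are stored as semicircles."""
    return semicircles * (180 / 2**31)


si = takeout_to_si({"distance": 1_000_000, "duration": 3_000_000,
                    "averageSpeed": 0.333, "averageRunCadence": 172})
```

Cadence is deliberately absent from the factor table, so a both-legs value such as 172 spm passes through unchanged.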
2 — Schema
db.py:SCHEMA is the source of truth; what follows is the gist.
| Table | Grain | Origin | Notes |
|---|---|---|---|
| activities | one row per activity | both paths | raw JSON has every Garmin field; the parsed columns are a curated subset |
| activity_splits | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). raw has cadence, stride, vert osc, GPS bounds |
| activity_fit_files | one row per linked FIT | Path A | fit_path is relative to whichever export root was passed to link_fit_files.py; keep the export folder in place |
| activity_time_in_zone | one row per activity | precomputed | Cached by compute_time_in_zone.py; source is 'fit' or 'lap' |
| daily_steps / daily_sleep / daily_stress / daily_hrv / daily_body_battery / daily_intensity_minutes / daily_resting_hr | one row per calendar date | both paths | Each has a raw JSON column with the full source payload |
| sync_state | key-value | both paths | Records last-sync / last-ingest timestamps |
The raw column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from raw rather than re-ingesting:
from analysis import expand_raw, load_activities, open_conn
acts = load_activities(open_conn(), type='running')
fields = expand_raw(acts) # pd.json_normalize'd companion frame
3 — Metrics & how they're calculated
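All loaders and helpers live in src/openrun/model.py (formerly analysis.py, which remains as a top-level shim, so the analysis.* names below still resolve). The key derivations: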
Pace
pace_min_per_km = (moving_duration_s / 60) / (distance_m / 1000)
moving_duration_s excludes paused time. Activities with implausible paces (faster than 3 min/km or slower than 30 min/km) get masked to NaN in load_activities rather than dropped, so the row count is preserved. The fast bound sits just slower than world-record marathon pace (about 2:52 min/km); the slow bound leaves room for walks with stops.
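A small sketch of the pace formula with the plausibility mask applied. The function below is illustrative only (load_activities works on whole DataFrames); treating the 3 and 30 min/km bounds as inclusive is an assumption.

```python
import math


def pace_min_per_km(moving_duration_s, distance_m, fastest=3.0, slowest=30.0):
    """Pace in min/km, or NaN when inputs are missing or the pace is implausible."""
    if not moving_duration_s or not distance_m:
        return math.nan
    pace = (moving_duration_s / 60) / (distance_m / 1000)
    # Mask rather than drop: the caller keeps one value per activity.
    return pace if fastest <= pace <= slowest else math.nan


paces = [
    pace_min_per_km(3000, 10_000),  # 50 min for 10 km
    pace_min_per_km(1200, 10_000),  # 2 min/km, masked as implausible
    pace_min_per_km(3000, 0),       # zero distance, masked
]
```

The output list has the same length as the input, which is the point of masking instead of filtering.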
Acute:Chronic Workload Ratio (ACWR)
Daily training-load sum, then:
acute = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR = acute / chronic
Sweet spot ~0.8–1.3, > 1.5 = aggressive ramp / injury risk. Implemented inline in 02_running.ipynb.
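The ratio can be sketched in plain Python as below. This is not the notebook's inline code: the function name acwr and the choice to return None until a full 28-day window exists are assumptions for the example.

```python
def acwr(daily_load, acute_days=7, chronic_days=28):
    """ACWR per day from a list of daily training-load sums (rest days as 0).

    Returns None for days without a full chronic window or with zero chronic load.
    """
    weeks_in_chronic = chronic_days / acute_days  # 4 for the 7/28 defaults
    out = []
    for i in range(len(daily_load)):
        if i + 1 < chronic_days:
            out.append(None)
            continue
        acute = sum(daily_load[i + 1 - acute_days: i + 1])
        chronic = sum(daily_load[i + 1 - chronic_days: i + 1]) / weeks_in_chronic
        out.append(acute / chronic if chronic else None)
    return out


steady = acwr([100] * 28)             # constant load for four weeks
ramp = acwr([100] * 21 + [200] * 7)   # load doubles in the final week
```

With constant load the last value is 1.0; doubling the final week gives 1400 / 875 = 1.6, which lands in the aggressive-ramp band.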
Aerobic efficiency — "m/beat"
m_per_beat = distance_m / (duration_s * avg_hr / 60)
Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in 04_efficiency.ipynb.
Decoupling (Pa:HR drift)
Friel's method, in two resolutions. Positive = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). Negative = improved (negative split).
Per-mile (lap level) — analysis.decoupling(splits, min_splits=6):
For each activity with ≥ min_splits laps:
    halves = first half / second half by lap count
    eff_half = duration-weighted mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (eff_first / eff_second − 1) × 100
Cheap, but smears across aid-station stops and rounds at lap boundaries.
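A runnable sketch of the lap-level calculation for a single activity. It reads the pseudocode as weighting speed by lap duration and taking a plain mean of lap heart rates, and it puts the middle lap of an odd count into the second half; both readings are assumptions, and lap_decoupling is a hypothetical name rather than analysis.decoupling itself.

```python
def lap_decoupling(laps, min_splits=6):
    """Decoupling % for one activity.

    laps: ordered list of (duration_s, speed_mps, heart_rate) tuples.
    Returns None when there are fewer than min_splits laps.
    """
    if len(laps) < min_splits:
        return None
    mid = len(laps) // 2  # an odd middle lap falls into the second half

    def efficiency(half):
        total_s = sum(d for d, _, _ in half)
        speed = sum(d * s for d, s, _ in half) / total_s  # duration-weighted mean speed
        hr = sum(h for _, _, h in half) / len(half)        # plain mean of lap HR
        return speed / hr

    return (efficiency(laps[:mid]) / efficiency(laps[mid:]) - 1) * 100


# Same pace throughout, HR drifts from 140 to 154 bpm in the second half.
drift = lap_decoupling([(300, 3.0, 140)] * 3 + [(300, 3.0, 154)] * 3)
```

Holding pace constant while HR rises 10% yields a decoupling of about 10%, the upper edge of the "sustainable" band.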
Per-second (FIT) — analysis.fit_decoupling(records, segments=N, warmup_min=5, cooldown_min=2, min_speed_mps=0.5):
records = per-second FIT messages
drop first warmup_min and last cooldown_min
drop records where speed_mps < min_speed_mps # ignore stopped time
slice remaining moving time into N equal-time chunks
for each chunk:
    efficiency = mean(speed_mps) / mean(heart_rate)
decoupling_pct = (efficiency[0] / efficiency[i] − 1) × 100
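The per-second steps can be sketched as follows. The sketch assumes records arrive at roughly 1 Hz (so equal record counts stand in for equal time), that every record carries a heart rate, and a default of two segments; it is an illustration, not the body of analysis.fit_decoupling.

```python
def fit_decoupling(records, segments=2, warmup_min=5, cooldown_min=2, min_speed_mps=0.5):
    """Decoupling % of each chunk relative to the first chunk.

    records: ordered list of (elapsed_s, speed_mps, heart_rate), about one per second.
    """
    if not records:
        return []
    end_s = records[-1][0]
    moving = [
        (speed, hr)
        for t, speed, hr in records
        if warmup_min * 60 <= t <= end_s - cooldown_min * 60  # trim warm-up and cool-down
        and speed >= min_speed_mps                             # ignore stopped time
    ]
    size = len(moving) // segments
    if size == 0:
        return []
    efficiency = []
    for k in range(segments):
        chunk = moving[k * size:(k + 1) * size]
        mean_speed = sum(s for s, _ in chunk) / size
        mean_hr = sum(h for _, h in chunk) / size
        efficiency.append(mean_speed / mean_hr)
    return [(efficiency[0] / e - 1) * 100 for e in efficiency]


# Synthetic hour: steady 3 m/s, a 60 s aid-station stop, HR steps from 140 to 154.
records = [
    (t, 0.0 if 1000 <= t < 1060 else 3.0, 140 if t < 1920 else 154)
    for t in range(3600)
]
drift_by_chunk = fit_decoupling(records)
```

The stop is dropped before chunking, so it does not drag down the first chunk's mean speed the way it would in a lap average.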
Friel's published thresholds (interpret on steady aerobic efforts only):
- < 5% — aerobically developed
- 5–10% — sustainable
- > 10% — pacing unsustainable for the distance, OR fueling shortfall, OR heat
In this dataset the per-second number is consistently 7–15 percentage points lower than the per-mile number for the same run — the gap is aid-station noise.
HR zone time-in-zone
Zone boundaries come from Garmin's heartRateZones.json (training method HR_MAX), not estimated from observed HR. The constants for this user live in analysis.HR_ZONES_USER:
Z1: 102–122 (recovery)
Z2: 123–143 (easy aerobic — long-run target)
Z3: 144–164 (tempo / "junk-miles middle")
Z4: 165–185 (threshold — LTHR=182 sits inside Z4)
Z5: 186–209 (VO₂ max)
Time-in-zone has two implementations; compute_time_in_zone.py picks the better one per activity and caches the result in activity_time_in_zone.
From FIT (source='fit') — analysis.time_in_zone_from_fit(records):
records = per-second FIT messages
for each consecutive record pair:
    dt = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone
From splits (source='lap') — fallback when no FIT is linked:
for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone
The FIT version is the right number; the lap version is biased toward the middle zone: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
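Both methods, and the middle-zone bias, can be reproduced on a toy lap. The zone floors are the values listed above; the helper names, the choice to attribute each interval to the later record's HR, and counting anything above 209 bpm as Z5 are assumptions of this sketch.

```python
ZONE_FLOORS = [("Z1", 102), ("Z2", 123), ("Z3", 144), ("Z4", 165), ("Z5", 186)]


def bucket(hr):
    """Zone name for a heart rate, or None below Z1 (above 209 counts as Z5 here)."""
    zone = None
    for name, floor in ZONE_FLOORS:
        if hr >= floor:
            zone = name
    return zone


def time_in_zone_from_fit(records, max_gap_s=30):
    """records: ordered list of (elapsed_s, heart_rate). Returns seconds per zone."""
    totals = {name: 0.0 for name, _ in ZONE_FLOORS}
    for (t0, _), (t1, hr) in zip(records, records[1:]):
        dt = min(max(t1 - t0, 0), max_gap_s)  # clip so a paused recording adds at most 30 s
        zone = bucket(hr)
        if zone is not None:
            totals[zone] += dt
    return totals


def time_in_zone_from_laps(laps):
    """laps: list of (duration_s, avg_hr). The whole lap lands in one zone."""
    totals = {name: 0.0 for name, _ in ZONE_FLOORS}
    for duration_s, avg_hr in laps:
        zone = bucket(avg_hr)
        if zone is not None:
            totals[zone] += duration_s
    return totals


# One 10-minute lap: 5 min at 140 bpm (Z2), then 5 min at 168 bpm (Z4); average 154 (Z3).
records = [(t, 140 if t <= 300 else 168) for t in range(601)]
from_fit = time_in_zone_from_fit(records)
from_lap = time_in_zone_from_laps([(600, 154)])
```

The FIT method splits the lap evenly between Z2 and Z4, while the lap method books all ten minutes to Z3 even though no second was actually spent there.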
Polarized split
A three-bucket simplification used in 05_intra_run.ipynb §4:
easy = Z1 + Z2 (HR ≤ 143)
moderate = Z3 (144–164, the "junk-miles middle")
hard = Z4 + Z5 (≥ 165, threshold and up)
Seiler's polarized target is ~80% easy, < 10% moderate, and the remaining ~10–20% hard. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training is the opposite: most time in moderate.
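Collapsing five zones into the three buckets is a few lines. The function name and the dict-of-seconds input shape are assumptions for this sketch, not the notebook's code.

```python
def polarized_split(seconds_by_zone):
    """Return (easy, moderate, hard) as percentages of total time in Z1 to Z5."""
    easy = seconds_by_zone.get("Z1", 0) + seconds_by_zone.get("Z2", 0)
    moderate = seconds_by_zone.get("Z3", 0)
    hard = seconds_by_zone.get("Z4", 0) + seconds_by_zone.get("Z5", 0)
    total = easy + moderate + hard
    if total == 0:
        return (0.0, 0.0, 0.0)
    return (100 * easy / total, 100 * moderate / total, 100 * hard / total)


# A 75-minute session: mostly Z2, a short threshold block, little time in Z3.
easy, moderate, hard = polarized_split(
    {"Z1": 600, "Z2": 3000, "Z3": 300, "Z4": 500, "Z5": 100}
)
```

Feeding it the weekly or monthly sum of activity_time_in_zone rows, rather than a single session, gives the distribution the targets refer to.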
Cadence & stride
Both come directly from Garmin's lap data (averageRunCadence, strideLength) and from FIT records (cadence, step_length). Two gotchas:
- Cadence is already both-legs SPM for this user's exports — do not double. Sanity range for running: 140–200 spm.
- Stride length is in cm in lap data (strideLength: 78 → 78 cm → 0.78 m) and in mm in FIT records (step_length: 1041 → 104 cm). Divide accordingly.
The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.
GPS route clustering
analysis.cluster_routes(lats, lons, radius_km=0.25):
Greedy haversine clustering:
For each unassigned point i:
    distances = haversine(point_i, all_other_points)
    neighbours = points within radius_km
    if len(neighbours) >= 2:
        assign all to next_label
        next_label += 1
Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says way more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to sklearn.cluster.DBSCAN(metric='haversine') for thousands.
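A self-contained sketch of the greedy pass. Several details are assumptions where the pseudocode is silent: neighbours exclude the point itself and only count still-unassigned points, the seed point joins its own cluster, and unclustered points get the label -1. It is not the body of analysis.cluster_routes.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r_km = 6371.0088  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r_km * math.asin(math.sqrt(a))


def cluster_routes(lats, lons, radius_km=0.25):
    """Greedy clustering of start points; returns one label per point (-1 = no cluster)."""
    labels = [-1] * len(lats)
    next_label = 0
    for i in range(len(lats)):
        if labels[i] != -1:
            continue
        neighbours = [
            j for j in range(len(lats))
            if j != i and labels[j] == -1
            and haversine_km(lats[i], lons[i], lats[j], lons[j]) <= radius_km
        ]
        if len(neighbours) >= 2:
            for j in (i, *neighbours):
                labels[j] = next_label
            next_label += 1
    return labels


# Two trailheads with three starts each, plus one one-off start far away.
lats = [47.6000, 47.6005, 47.6002, 47.7000, 47.7003, 47.7001, 48.0000]
lons = [-122.3300, -122.3302, -122.3298, -122.4000, -122.4001, -122.3998, -123.0000]
labels = cluster_routes(lats, lons)
```

Each pass compares one point against all others, so the cost grows quadratically with the number of starts, which is why DBSCAN becomes the better tool at larger scale (its haversine metric expects coordinates in radians).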
Banister fitness / fatigue / form (Performance Management Chart)
The standard endurance-training lens (TrainingPeaks PMC), implemented in analysis.banister(daily_load, ctl_tau=42, atl_tau=7):
CTL_today = CTL_yesterday · exp(−1/τ_CTL) + load_today · (1 − exp(−1/τ_CTL)) τ_CTL = 42d
ATL_today = ATL_yesterday · exp(−1/τ_ATL) + load_today · (1 − exp(−1/τ_ATL)) τ_ATL = 7d
TSB_today = CTL_yesterday − ATL_yesterday # yesterday's values (TP convention)
Helper: daily_training_load_series(conn, activity_types=('running','trail_running')) returns the per-day summed training_load Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
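The three recurrences translate directly into a loop. This sketch takes a plain list of daily loads and seeds CTL and ATL at 0, which is an assumption; analysis.banister operates on a pandas Series and may initialise differently.

```python
import math


def banister(daily_load, ctl_tau=42, atl_tau=7):
    """Return a list of (ctl, atl, tsb) tuples, one per day of daily_load."""
    k_ctl = math.exp(-1 / ctl_tau)
    k_atl = math.exp(-1 / atl_tau)
    ctl = atl = 0.0  # assumed starting state
    out = []
    for load in daily_load:
        tsb = ctl - atl  # uses yesterday's CTL and ATL
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        out.append((ctl, atl, tsb))
    return out


# 400 days at a steady load of 100, then a week of rest days filled with 0.
pmc = banister([100.0] * 400 + [0.0] * 7)
ctl_before, atl_before, _ = pmc[399]
ctl_after, atl_after, tsb_after = pmc[-1]
```

After the rest week both averages have decayed, ATL much more than CTL, so TSB ends clearly positive, matching the behaviour described for zero-filled rest days.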
TSB interpretation (used in both notebooks 02 and 06):
| TSB | meaning |
|---|---|
| < −30 | severely fatigued — injury risk |
| −10 to −30 | productive overload — heart of a build |
| −10 to 0 | balanced building |
| 0 to +10 | sharpening |
| +10 to +25 | fresh / peaked — race-day target |
| > +25 | detrained (taper too long) |
02_running.ipynb plots the historical PMC; 06_race_plan.ipynb projects it forward through race day by converting plan-week km into estimated daily loads (separate RACE_DAY_TL_PER_KM ≈ 7 for the race itself vs ~11 for training runs — race-day TL/km is empirically lower because ultras spread load over more time).
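A sketch of the forward projection under simplifying assumptions: each plan week's km is spread evenly over seven days, the training factor is taken as 11 TL/km, and the projection continues from a given CTL/ATL pair. Only RACE_DAY_TL_PER_KM is a name from the notebook; TRAINING_TL_PER_KM, plan_to_daily_loads and project_pmc are hypothetical.

```python
import math

TRAINING_TL_PER_KM = 11.0  # "~11 for training runs" (constant name assumed)
RACE_DAY_TL_PER_KM = 7.0   # lower load per km on race day


def plan_to_daily_loads(weekly_km, race_km=None):
    """Spread each plan week evenly over 7 days; optionally append one race day."""
    loads = []
    for km in weekly_km:
        loads.extend([km * TRAINING_TL_PER_KM / 7] * 7)
    if race_km is not None:
        loads.append(race_km * RACE_DAY_TL_PER_KM)
    return loads


def project_pmc(ctl, atl, daily_loads, ctl_tau=42, atl_tau=7):
    """Continue the Banister recurrences from today's CTL/ATL over planned loads."""
    k_ctl = math.exp(-1 / ctl_tau)
    k_atl = math.exp(-1 / atl_tau)
    out = []
    for load in daily_loads:
        tsb = ctl - atl  # yesterday's values, as in the historical PMC
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        out.append((ctl, atl, tsb))
    return out


# One 70 km build week, a 30 km taper week, then a 50 km race.
loads = plan_to_daily_loads([70, 30], race_km=50)
projection = project_pmc(ctl=60.0, atl=80.0, daily_loads=loads)
race_day_tsb = projection[-1][2]
```

Reading TSB off the final row shows whether the taper lands in the race-day band from the table above; a real plan would place its long runs and rest days rather than spreading the week evenly.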
4 — Notebook tour
Each notebook bootstraps the same way:
import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness, ...
conn = open_conn()
| # | Notebook | Covers |
|---|---|---|
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 02 | Running | Weekly volume + rolling, pace-vs-HR scatter, aerobic-efficiency trend, ACWR, PRs by distance |
| 03 | Recovery | Sleep composition, after-run vs after-rest HRV/RHR, correlations between training and recovery, weekly km vs recovery indicators |
| 04 | Efficiency | YoY m/beat by distance bucket, HR/pace split, polynomial curve fits, easy-runs-only headline view |
| 05 | Intra-run | (1) per-mile decoupling, (1b) per-second decoupling on race FITs, (2) cadence & stride at controlled pace, (3) GPS route clustering, (4) HR-zone time-in-zone (FIT-based) + polarized split |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |
Notebooks 05 and 06 are generated from the _build_05.py / _build_06.py build scripts. Edit those, then run uv run python notebooks/_build_NN.py to regenerate the .ipynb; that is easier than editing 30 KB of JSON.
5 — Setup
uv sync # creates .venv from pyproject.toml
Kernel: in VSCode, pick the project's .venv (top-right of any notebook). If it's not discoverable:
uv run python -m ipykernel install --user --name garmin --display-name "garmin (.venv)"
Path A — official export
- Go to https://www.garmin.com/account/datamanagement/exportdata and request the export. Garmin emails the link within 24–72 h.
- Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
- Ingest:
uv run ingest_export.py path/to/connect.zip
# or, if already unzipped:
uv run ingest_export.py path/to/<export>/
Add --dry-run to preview without writing. The data is upserted on activity_id / calendar_date, so it's safe to re-ingest a refreshed export later.
For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.
Path B — live API sync
Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.
uv run auth.py # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full # 365-day backfill
uv run sync.py # incremental thereafter (last 14 days)
uv run sync.py --days 90 # custom backfill window
uv run sync.py --no-details # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities # wellness only
garth is deprecated upstream (https://github.com/matin/garth/discussions/222). If the live sync stops working, fall back to Path A.