openrun/README.md
2026-05-19 08:34:22 -04:00


openrun — endurance running analytics

A local-first, open-source endurance-training analysis pipeline. Garmin data in (live API or official export), SQLite on disk, pandas notebooks out. Banister CTL/ATL/TSB, Pa:HR decoupling (per-mile and per-second from FIT), Garmin-configured HR-zone time-in-zone, route clustering, race-plan projection with forward CTL/ATL/TSB.

This README focuses on what's in the data, what gets computed from it, and how each derived metric is defined. The package name openrun is provisional; the local repo can be renamed without code changes.

openrun/
├── pyproject.toml         # packaged as `openrun` (src layout, hatchling)
├── openrun.toml           # user config (HR zones, weight, LTHR, race calendar)
├── src/openrun/
│   ├── __init__.py        # re-exports the common API
│   ├── config.py          # UserProfile, HRZones, BanisterParams, TOML loader
│   ├── db.py              # SQLite schema + connection helper
│   ├── model.py           # loaders + derived metrics (was analysis.py)
│   └── ingest/
│       ├── auth.py        # Garmin OAuth login
│       ├── garmin_api.py  # Path B: incremental sync via garth
│       ├── garmin_export.py # Path A: parse Connect/Takeout JSON + index FITs
│       ├── fit_linker.py  # match Takeout FITs to activities by session.start_time
│       └── time_in_zone.py # cache per-activity HR-zone breakdowns
├── examples/notebooks/
│   ├── 01_overview.ipynb     # data coverage, recent activity
│   ├── 02_running.ipynb      # volume, pace, PMC (Banister), PRs
│   ├── 03_recovery.ipynb     # sleep, HRV, RHR, body battery
│   ├── 04_efficiency.ipynb   # m/beat trends year over year
│   ├── 05_intra_run.ipynb    # decoupling, cadence, routes, HR zones
│   └── 06_race_plan.ipynb    # periodised plan + tracker + projected PMC
├── data/garmin.db            # SQLite (created on first run)
└── {auth,sync,ingest_export,link_fit_files,compute_time_in_zone,db,analysis}.py
                              # thin top-level shims for back-compat

Most callers want:

from openrun import open_conn, load_activities, banister, daily_training_load_series
conn = open_conn()
runs = load_activities(conn, type='running')
pmc  = banister(daily_training_load_series(conn))

User-specific values (HR zones, LTHR, weight, race calendar) live in openrun.toml — config discovery walks up from cwd looking for it. See openrun.toml for the schema.

CLI entry points (after uv sync):

openrun-auth               # OAuth login (Path B)
openrun-sync [--full]      # incremental Garmin Connect sync
openrun-ingest <export>    # parse a Connect/Takeout export
openrun-link-fit <export>  # match Takeout FITs to activities by content
openrun-time-in-zone       # populate the activity_time_in_zone cache

Top-level shims (uv run sync.py, uv run ingest_export.py, etc.) still work — they re-export main() from the corresponding openrun.ingest.* module.


1 — Source data

Two ways to get data into the DB

| Path | Script | What you get | When to use |
| --- | --- | --- | --- |
| A. Official data export (recommended) | ingest_export.py | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| B. Live API sync | sync.py (after auth.py) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |

The two paths overlap intentionally: the export is the source of truth, the live API fills the gap between exports. The upsert logic only overwrites the raw-JSON column and fetched_at on conflict, so Path B values aren't clobbered by Path A re-ingests.
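A minimal sketch of that conflict rule using the standard-library sqlite3 module. The table here is a hypothetical, trimmed-down stand-in (the real schema lives in db.py:SCHEMA), and the column names other than raw and fetched_at are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, trimmed-down table; not the project's actual schema.
conn.execute(
    "CREATE TABLE activities ("
    " activity_id INTEGER PRIMARY KEY,"
    " distance_m REAL, raw TEXT, fetched_at TEXT)"
)

# On a key collision, refresh only raw and fetched_at; leave parsed columns alone.
UPSERT = """
INSERT INTO activities (activity_id, distance_m, raw, fetched_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(activity_id) DO UPDATE SET
    raw = excluded.raw,
    fetched_at = excluded.fetched_at
"""

# First write (say, from the live API) sets every column.
conn.execute(UPSERT, (1, 10000.0, '{"src": "api"}', "2025-01-01"))
# A later re-ingest of the same activity touches only raw and fetched_at.
conn.execute(UPSERT, (1, 9999.0, '{"src": "export"}', "2025-02-01"))

row = conn.execute(
    "SELECT distance_m, raw, fetched_at FROM activities WHERE activity_id = 1"
).fetchone()
print(row)
```

The parsed distance_m keeps the value from the first write, while raw and fetched_at reflect the second.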

Garmin's export formats

Garmin actually has two export formats, both confusingly called "the export":

  1. Garmin Connect data export (connect.zip) — request via Account Management → Export Your Data. Filenames are <activity_id>_<activity_name>.fit, distances/durations in SI units (m, s). ingest_export.py was built for this.
  2. Garmin Takeout dump — UUID-named folder (<uuid>_N/) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under DI_CONNECT/. Crucially different conventions:
    • FIT filenames: <email>_<upload_id>.fit — upload IDs ≠ activity IDs
    • summarizedActivities uses scaled-integer units: distance in cm, duration in ms, elevation in cm, speed in m/s ÷ 10
    • ingest_export.py now detects and converts these; link_fit_files.py links FITs by content (parses session.start_time and matches activities by ±60 s) since the filename-based approach doesn't work
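The ±60 s start-time match can be sketched as a nearest-neighbour lookup over sorted activity start times. This is an illustration only, not the code in link_fit_files.py: link_by_start_time, its arguments, and the sample filenames are hypothetical, and it assumes the session.start_time values have already been parsed out of the FIT files.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def link_by_start_time(fit_starts, activity_starts, tolerance_s=60):
    """Map each FIT path to the activity with the nearest start time,
    provided the gap is at most tolerance_s seconds."""
    acts = sorted(activity_starts.items(), key=lambda kv: kv[1])  # (id, start)
    times = [t for _, t in acts]
    links = {}
    for path, start in fit_starts.items():
        i = bisect_left(times, start)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(acts)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(times[j] - start))
        if abs(times[best] - start) <= timedelta(seconds=tolerance_s):
            links[path] = acts[best][0]
    return links


activities = {
    101: datetime(2024, 5, 1, 6, 0, 0),
    102: datetime(2024, 5, 2, 6, 30, 0),
}
fits = {
    "user_555.fit": datetime(2024, 5, 1, 6, 0, 20),  # 20 s after activity 101
    "user_556.fit": datetime(2024, 5, 3, 7, 0, 0),   # no activity nearby
}
links = link_by_start_time(fits, activities)
print(links)
```

FITs with no activity inside the tolerance window are simply left unlinked.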

For a Takeout dump, the full sequence is:

# 1. Extract the FIT archives in place
mkdir -p <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
unzip -d <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit \
   <export>/DI_CONNECT/DI-Connect-Uploaded-Files/UploadedFiles_0-_Part*.zip

# 2. Ingest JSON + index FITs that have parseable activity IDs
uv run ingest_export.py <export>/

# 3. Link Takeout FITs to activities by start-time match
uv run link_fit_files.py <export>/

# 4. (Optional) precompute time-in-zone cache
uv run compute_time_in_zone.py

Garmin's units & quirks worth knowing

| Field | Live API | Takeout summarizedActivities | Convert by |
| --- | --- | --- | --- |
| distance | m | cm | × 0.01 |
| duration / movingDuration | s | ms | × 0.001 |
| elevationGain / Loss | m | cm | × 0.01 |
| averageSpeed / maxSpeed | m/s | m/s ÷ 10 | × 10 |
| HR (avg, max) | bpm | bpm | none |
| calories | kcal | kcal | none |
| training_effect, vo2_max | scalar | scalar | none |
| cadence (averageRunCadence) | both-legs SPM | both-legs SPM | none (do not double) |
| strideLength | cm | cm | × 0.01 for m |
| FIT position_lat/long | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT enhanced_speed | m/s | m/s | none |

The cadence trap bit me: averageRunCadence on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+, which trips any sane filter, so no rows survive.
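The conversions in the table can be sketched as two small helpers. The input keys are the Garmin field names from the table; the function names and output keys are hypothetical, and this is not the detection logic in ingest_export.py.

```python
def takeout_to_si(rec):
    """Convert one Takeout summarizedActivities record to SI units."""
    return {
        "distance_m": rec["distance"] * 0.01,             # cm -> m
        "duration_s": rec["duration"] * 0.001,            # ms -> s
        "elevation_gain_m": rec["elevationGain"] * 0.01,  # cm -> m
        "avg_speed_mps": rec["averageSpeed"] * 10,        # (m/s ÷ 10) -> m/s
        "cadence_spm": rec["averageRunCadence"],          # already both-legs: no doubling
    }


def semicircles_to_degrees(semicircles):
    """FIT position_lat / position_long are stored as signed 32-bit semicircles."""
    return semicircles * (180 / 2**31)


si = takeout_to_si({
    "distance": 1_000_000,      # 10 km in cm
    "duration": 3_000_000,      # 50 min in ms
    "elevationGain": 12_000,    # 120 m in cm
    "averageSpeed": 0.3333,     # 3.333 m/s stored as m/s ÷ 10
    "averageRunCadence": 172,
})
print(si)
print(semicircles_to_degrees(2**30))  # a quarter of the full range
```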


2 — Schema

db.py:SCHEMA is the source of truth; what follows is the gist.

| Table | Grain | Origin | Notes |
| --- | --- | --- | --- |
| activities | one row per activity | both paths | raw JSON has every Garmin field; the parsed columns are a curated subset |
| activity_splits | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). raw has cadence, stride, vert osc, GPS bounds |
| activity_fit_files | one row per linked FIT | Path A | fit_path is relative to whichever export root was passed to link_fit_files.py, so keep the export folder in place |
| activity_time_in_zone | one row per activity | precomputed | Cached by compute_time_in_zone.py; source is 'fit' or 'lap' |
| daily_steps / daily_sleep / daily_stress / daily_hrv / daily_body_battery / daily_intensity_minutes / daily_resting_hr | one row per calendar date | both paths | Each has a raw JSON column with the full source payload |
| sync_state | key-value | both paths | Records last-sync / last-ingest timestamps |

The raw column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from raw rather than re-ingesting:

from analysis import expand_raw, load_activities, open_conn
acts = load_activities(open_conn(), type='running')
fields = expand_raw(acts)  # pd.json_normalize'd companion frame

3 — Metrics & how they're calculated

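All loaders and helpers live in src/openrun/model.py (formerly analysis.py); the analysis.* names below refer to the top-level analysis.py back-compat shim. The key derivations: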

Pace

pace_min_per_km = (moving_duration_s / 60) / (distance_m / 1000)

moving_duration_s excludes paused time. Activities with implausible paces (faster than 3 min/km, which is close to world-record marathon pace, or slower than 30 min/km, which still allows for walks with stops) get masked to NaN in load_activities rather than dropped, so the row count is preserved.
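A minimal sketch of the pace formula with the plausibility mask, in plain Python rather than pandas. The helper name and the inclusive treatment of the 3 and 30 min/km bounds are assumptions; load_activities may handle the edges differently.

```python
import math


def pace_min_per_km(moving_duration_s, distance_m, fastest=3.0, slowest=30.0):
    """Pace in min/km, or NaN when the value falls outside the plausible band."""
    if not distance_m or distance_m <= 0:
        return math.nan
    pace = (moving_duration_s / 60) / (distance_m / 1000)
    # Mask instead of dropping, so the caller keeps one value per activity.
    return pace if fastest <= pace <= slowest else math.nan


print(pace_min_per_km(3000, 10000))  # 50 min for 10 km
print(pace_min_per_km(600, 10000))   # 1 min/km is implausible, so NaN
```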

Acute:Chronic Workload Ratio (ACWR)

Daily training-load sum, then:

acute   = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR    = acute / chronic

Sweet spot ~0.8–1.3; > 1.5 = aggressive ramp / injury risk. Implemented inline in 02_running.ipynb.
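The ratio can be sketched over a plain list of daily loads. This is an illustration, not the notebook cell: the acwr name is hypothetical, and it assumes rest days are already present as zeros so that list positions correspond to calendar days.

```python
def acwr(daily_load):
    """daily_load: per-day training-load sums, oldest first, rest days as 0.
    Returns one ratio per day; None until 28 days of history exist."""
    out = []
    for i in range(len(daily_load)):
        if i < 27:
            out.append(None)  # not enough history for the 28-day window
            continue
        acute = sum(daily_load[i - 6 : i + 1])         # last 7 days
        chronic = sum(daily_load[i - 27 : i + 1]) / 4  # last 28 days, per week
        out.append(acute / chronic if chronic > 0 else None)
    return out


steady = [50] * 28             # constant load
ramp = [50] * 21 + [100] * 7   # load doubled in the final week
print(acwr(steady)[-1], acwr(ramp)[-1])
```

A constant load gives a ratio of 1; doubling the final week pushes it past the 1.5 warning level.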

Aerobic efficiency — "m/beat"

m_per_beat = distance_m / (duration_s * avg_hr / 60)

Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in 04_efficiency.ipynb.
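As a worked example of the formula, a 10 km run in 50 minutes at an average of 150 bpm covers about 1.33 m per heartbeat. The helper below is a hypothetical illustration of that arithmetic.

```python
def m_per_beat(distance_m, duration_s, avg_hr):
    """Metres covered per heartbeat over the whole activity."""
    beats = duration_s * avg_hr / 60  # avg_hr is beats per minute
    return distance_m / beats


value = m_per_beat(10_000, 3_000, 150)  # 10 km, 50 min, 150 bpm -> 7500 beats
print(round(value, 3))
```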

Decoupling (Pa:HR drift)

Friel's method, in two resolutions. Positive = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). Negative = improved (negative split).

Per-mile (lap level), analysis.decoupling(splits, min_splits=6):

For each activity with ≥ min_splits laps:
    halves   = first half / second half by lap count
    eff_half = duration-weighted mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (eff_first / eff_second − 1) × 100

Cheap, but smears across aid-station stops and rounds at lap boundaries.
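A sketch of the lap-level calculation on plain tuples. The function name and input shape are hypothetical, and weighting both speed and heart rate by lap duration is an assumption; the pseudocode above only says the efficiency is duration-weighted.

```python
def lap_decoupling(laps, min_splits=6):
    """laps: ordered list of (duration_s, speed_mps, heart_rate).
    Returns decoupling in percent, or None if there are too few laps."""
    if len(laps) < min_splits:
        return None
    mid = len(laps) // 2  # split by lap count, not by time

    def efficiency(half):
        total = sum(d for d, _, _ in half)
        speed = sum(d * s for d, s, _ in half) / total
        hr = sum(d * h for d, _, h in half) / total
        return speed / hr

    return (efficiency(laps[:mid]) / efficiency(laps[mid:]) - 1) * 100


# Same speed throughout, HR drifting from 150 to 160 in the second half.
laps = [(300, 3.0, 150)] * 3 + [(300, 3.0, 160)] * 3
drift = lap_decoupling(laps)
print(round(drift, 2))
```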

Per-second (FIT), analysis.fit_decoupling(records, segments=N, warmup_min=5, cooldown_min=2, min_speed_mps=0.5):

records = per-second FIT messages
drop first warmup_min and last cooldown_min
drop records where speed_mps < min_speed_mps     # ignore stopped time
slice remaining moving time into N equal-time chunks
for each chunk:
    efficiency = mean(speed_mps) / mean(heart_rate)
decoupling_pct = (efficiency[0] / efficiency[i] − 1) × 100     # for each chunk i

Friel's published thresholds (interpret on steady aerobic efforts only):

  • < 5%: aerobically developed
  • 5–10%: sustainable
  • > 10%: pacing unsustainable for the distance, OR fueling shortfall, OR heat

In this dataset the per-second number is consistently 7–15 percentage points lower than the per-mile number for the same run; the gap is aid-station noise.
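The per-second steps can be sketched as follows. This is not the project's fit_decoupling: the function name and tuple layout are hypothetical, and it assumes exactly one record per second, so equal record counts stand in for equal moving time.

```python
def per_second_decoupling(records, segments=2, warmup_min=5,
                          cooldown_min=2, min_speed_mps=0.5):
    """records: list of (elapsed_s, speed_mps, heart_rate), one per second.
    Returns the decoupling of every chunk relative to the first, in percent."""
    if not records:
        return []
    end = records[-1][0]
    # Trim warm-up and cool-down, then drop stopped time.
    kept = [
        (speed, hr)
        for t, speed, hr in records
        if warmup_min * 60 <= t <= end - cooldown_min * 60
        and speed >= min_speed_mps
    ]
    if len(kept) < segments:
        return []
    size = len(kept) // segments
    eff = []
    for k in range(segments):
        chunk = kept[k * size : (k + 1) * size]
        mean_speed = sum(s for s, _ in chunk) / len(chunk)
        mean_hr = sum(h for _, h in chunk) / len(chunk)
        eff.append(mean_speed / mean_hr)
    return [(eff[0] / e - 1) * 100 for e in eff]


# One hour at constant speed; HR steps from 150 to 165 at t = 1890 s,
# the midpoint of the trimmed window (300 s to 3479 s).
records = [(t, 3.0, 150 if t < 1890 else 165) for t in range(3600)]
result = per_second_decoupling(records)
print([round(r, 2) for r in result])
```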

HR zone time-in-zone

Zone boundaries come from Garmin's heartRateZones.json (training method HR_MAX), not estimated from observed HR. The constants for this user live in analysis.HR_ZONES_USER:

Z1: 102–122   (recovery)
Z2: 123–143   (easy aerobic, long-run target)
Z3: 144–164   (tempo / "junk-miles middle")
Z4: 165–185   (threshold; LTHR=182 sits inside Z4)
Z5: 186–209   (VO₂ max)

Time-in-zone has two implementations; compute_time_in_zone.py picks the better one per activity and caches the result in activity_time_in_zone.

From FIT (source='fit'), analysis.time_in_zone_from_fit(records):

records = per-second FIT messages
for each consecutive record pair:
    dt = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone

From splits (source='lap') — fallback when no FIT is linked:

for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone

The FIT version is the right number; the lap version is biased toward the middle zone: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
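A sketch of the FIT-based accumulation with the 30 s gap clip, using the zone boundaries listed above. The names are hypothetical, and charging each interval to the earlier record's heart rate is an assumption about which record the pseudocode means by "this record".

```python
# (zone, low bpm, high bpm), inclusive on both ends.
ZONES = [(1, 102, 122), (2, 123, 143), (3, 144, 164), (4, 165, 185), (5, 186, 209)]


def bucket(hr):
    """Return the zone number for a heart rate, or None outside Z1..Z5."""
    for zone, lo, hi in ZONES:
        if lo <= hr <= hi:
            return zone
    return None


def time_in_zone(records, max_gap_s=30):
    """records: list of (elapsed_s, heart_rate), ordered by time.
    Returns seconds accumulated per zone."""
    totals = {zone: 0.0 for zone, _, _ in ZONES}
    for (t0, hr0), (t1, _) in zip(records, records[1:]):
        # Clip so a paused recording cannot dump hours into one zone.
        dt = min(max(t1 - t0, 0), max_gap_s)
        zone = bucket(hr0)
        if zone is not None:
            totals[zone] += dt
    return totals


# Two seconds in Z2, then a long pause while in Z3, then one more Z3 second.
sample = [(0, 130), (1, 130), (2, 150), (1000, 150), (1001, 170)]
totals = time_in_zone(sample)
print(totals)
```

The 998 s pause contributes only the clipped 30 s to Z3.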

Polarized split

A three-bucket simplification used in 05_intra_run.ipynb §4:

easy     = Z1 + Z2     (HR ≤ 143)
moderate = Z3          (144–164, the "junk-miles middle")
hard     = Z4 + Z5     (≥ 165, threshold and up)

Seiler's polarized target is roughly 80% easy, with the remaining ~20% mostly hard and moderate kept under 10%. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training is the opposite: most time in moderate.
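The three-bucket collapse is simple arithmetic over per-zone seconds. The helper below is a hypothetical sketch that takes a zone-number-to-seconds mapping and returns percentages.

```python
def polarized_split(zone_seconds):
    """zone_seconds: dict mapping zone number (1..5) to seconds in that zone.
    Returns the easy / moderate / hard shares in percent."""
    easy = zone_seconds.get(1, 0) + zone_seconds.get(2, 0)
    moderate = zone_seconds.get(3, 0)
    hard = zone_seconds.get(4, 0) + zone_seconds.get(5, 0)
    total = easy + moderate + hard
    if total == 0:
        return {"easy": 0.0, "moderate": 0.0, "hard": 0.0}
    return {
        "easy": 100 * easy / total,
        "moderate": 100 * moderate / total,
        "hard": 100 * hard / total,
    }


split = polarized_split({1: 600, 2: 6600, 3: 900, 4: 600, 5: 300})
print(split)
```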

Cadence & stride

Both come directly from Garmin's lap data (averageRunCadence, strideLength) and from FIT records (cadence, step_length). Two gotchas:

  1. Cadence is already both-legs SPM for this user's exports; do not double it. Sanity range for running: 140–200 spm.
  2. Stride length is in cm in lap data (strideLength: 78 → 78 cm → 0.78 m) and in mm in FIT records (step_length: 1041 → 104 cm). Divide accordingly.

The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.

GPS route clustering

analysis.cluster_routes(lats, lons, radius_km=0.25):

Greedy haversine clustering:
  For each unassigned point i:
    distances = haversine(point_i, all_other_points)
    neighbours = points within radius_km
    if len(neighbours) >= 2:
        assign all to next_label
    next_label += 1

Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says way more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to sklearn.cluster.DBSCAN(metric='haversine') for thousands.
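A standard-library sketch of the greedy approach, applied to activity start points. It is not the project's cluster_routes: the function names are hypothetical, unclustered points get the label -1, and the label counter advances only when a cluster is actually formed, which is a choice made here for the illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_starts(lats, lons, radius_km=0.25):
    """Greedy clustering: each unassigned point claims every unassigned
    point within radius_km (itself included) if that makes at least two."""
    labels = [-1] * len(lats)
    next_label = 0
    for i in range(len(lats)):
        if labels[i] != -1:
            continue
        neighbours = [
            j for j in range(len(lats))
            if labels[j] == -1
            and haversine_km(lats[i], lons[i], lats[j], lons[j]) <= radius_km
        ]
        if len(neighbours) >= 2:
            for j in neighbours:
                labels[j] = next_label
            next_label += 1
    return labels


# Two starts about 110 m apart, plus one roughly 111 km away.
labels = cluster_starts([40.0, 40.001, 41.0], [-75.0, -75.0, -75.0])
print(labels)
```

The pairwise distance scan is quadratic in the number of points, which matches the note that this is adequate only for a few hundred starts.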

Banister fitness / fatigue / form (Performance Management Chart)

The standard endurance-training lens (TrainingPeaks PMC), implemented in analysis.banister(daily_load, ctl_tau=42, atl_tau=7):

CTL_today  = CTL_yesterday · exp(−1/τ_CTL) + load_today · (1 − exp(−1/τ_CTL))   τ_CTL = 42d
ATL_today  = ATL_yesterday · exp(−1/τ_ATL) + load_today · (1 − exp(−1/τ_ATL))   τ_ATL =  7d
TSB_today  = CTL_yesterday − ATL_yesterday          # yesterday's values (TP convention)

Helper: daily_training_load_series(conn, activity_types=('running','trail_running')) returns the per-day summed training_load Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
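The recurrences can be sketched over a plain list of daily loads. This is an illustration rather than the project's banister: the function name is hypothetical, and seeding CTL and ATL at zero is an assumption.

```python
import math


def banister_pmc(daily_load, ctl_tau=42, atl_tau=7):
    """daily_load: per-day load, oldest first, rest days as 0.
    Returns a list of (ctl, atl, tsb) tuples, one per day."""
    k_ctl = math.exp(-1 / ctl_tau)
    k_atl = math.exp(-1 / atl_tau)
    ctl = atl = 0.0  # assumed starting point
    rows = []
    for load in daily_load:
        tsb = ctl - atl  # uses yesterday's CTL and ATL
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        rows.append((ctl, atl, tsb))
    return rows


# 400 days at a constant load of 100, then a week of rest.
rows = banister_pmc([100] * 400 + [0] * 7)
ctl_steady, atl_steady, _ = rows[399]
ctl_rest, atl_rest, tsb_rest = rows[-1]
print(round(ctl_steady, 1), round(atl_steady, 1))
print(round(ctl_rest, 1), round(atl_rest, 1), round(tsb_rest, 1))
```

Under a constant load both averages converge to that load; during the rest week ATL falls faster than CTL, so TSB turns positive.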

TSB interpretation (used in both notebooks 02 and 06):

| TSB | Meaning |
| --- | --- |
| < −30 | severely fatigued; injury risk |
| −10 to −30 | productive overload; heart of a build |
| −10 to 0 | balanced building |
| 0 to +10 | sharpening |
| +10 to +25 | fresh / peaked; race-day target |
| > +25 | detrained (taper too long) |

02_running.ipynb plots the historical PMC; 06_race_plan.ipynb projects it forward through race day by converting plan-week km into estimated daily loads (separate RACE_DAY_TL_PER_KM ≈ 7 for the race itself vs ~11 for training runs — race-day TL/km is empirically lower because ultras spread load over more time).


4 — Notebook tour

Each notebook bootstraps the same way:

import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness, ...
conn = open_conn()

| # | Notebook | Covers |
| --- | --- | --- |
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 02 | Running | Weekly volume + rolling, pace-vs-HR scatter, aerobic-efficiency trend, ACWR, PRs by distance |
| 03 | Recovery | Sleep composition, after-run vs after-rest HRV/RHR, correlations between training and recovery, weekly km vs recovery indicators |
| 04 | Efficiency | YoY m/beat by distance bucket, HR/pace split, polynomial curve fits, easy-runs-only headline view |
| 05 | Intra-run | (1) per-mile decoupling, (1b) per-second decoupling on race FITs, (2) cadence & stride at controlled pace, (3) GPS route clustering, (4) HR-zone time-in-zone (FIT-based) + polarized split |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |

Notebooks 05 and 06 are generated from the _build_05.py / _build_06.py build scripts. Edit those, then run uv run python notebooks/_build_NN.py to regenerate the .ipynb. That is easier than editing 30 KB of JSON.


5 — Setup

uv sync                                  # creates .venv from pyproject.toml

Kernel: in VSCode, pick the project's .venv (top-right of any notebook). If it's not discoverable:

uv run python -m ipykernel install --user --name garmin --display-name "garmin (.venv)"

Path A — official export

  1. Go to https://www.garmin.com/account/datamanagement/exportdata and request the export. Garmin emails the link within 24–72 h.
  2. Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
  3. Ingest:
uv run ingest_export.py path/to/connect.zip
# or, if already unzipped:
uv run ingest_export.py path/to/<export>/

Add --dry-run to preview without writing. The data is upserted on activity_id / calendar_date, so it's safe to re-ingest a refreshed export later.

For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.

Path B — live API sync

Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.

uv run auth.py            # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full     # 365-day backfill
uv run sync.py            # incremental thereafter (last 14 days)

uv run sync.py --days 90          # custom backfill window
uv run sync.py --no-details       # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities  # wellness only

garth is deprecated upstream (https://github.com/matin/garth/discussions/222). If the live sync stops working, fall back to Path A.