# Garmin → local SQLite

A small, self-contained pipeline that pulls Garmin Connect data into a local SQLite database and a stack of pandas notebooks. The aim is a *reference* you can run against your own data — not a polished product — so this README focuses on what's in the data, what gets computed from it, and exactly how each derived metric is defined.

```
garmin/
├── auth.py                    # Path B: OAuth login for live API
├── sync.py                    # Path B: incremental sync via garth
├── ingest_export.py           # Path A: parse Garmin Connect data export
├── link_fit_files.py          # match Takeout FIT files to activities by start_time
├── compute_time_in_zone.py    # cache per-activity HR-zone breakdowns
├── analysis.py                # shared loaders + derived metrics
├── db.py                      # schema + connection helpers
├── data/garmin.db             # SQLite (created on first run)
└── notebooks/
    ├── 01_overview.ipynb      # data coverage, recent activity
    ├── 02_running.ipynb       # volume, pace, ACWR, PRs
    ├── 03_recovery.ipynb      # sleep, HRV, RHR, body battery
    ├── 04_efficiency.ipynb    # m/beat trends year over year
    ├── 05_intra_run.ipynb     # decoupling, cadence, routes, HR zones
    └── 06_race_plan.ipynb     # build a periodised plan + adherence tracker
```

---

## 1 — Source data

### Two ways to get data into the DB

| Path | Script | What you get | When to use |
|---|---|---|---|
| **A. Official data export** (recommended) | `ingest_export.py` | Complete history, FIT files, multi-year wellness | First load; periodic refreshes |
| **B. Live API sync** | `sync.py` (after `auth.py`) | Last N days, lap-level splits, no FIT files | Incremental top-ups between exports |

The two paths overlap intentionally: the export is the source of truth, and the live API fills the gap between exports. On conflict, the upsert overwrites only the raw-JSON column and `fetched_at`, so Path B values aren't clobbered by Path A re-ingests.
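
A minimal sketch of that conflict rule, using an in-memory SQLite database. The trimmed-down table and its column names (`distance_m`, `raw`, `fetched_at`) are assumptions for illustration; `db.py:SCHEMA` remains the real definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the real `activities` table.
conn.execute(
    "CREATE TABLE activities ("
    " activity_id INTEGER PRIMARY KEY,"
    " distance_m REAL, raw TEXT, fetched_at TEXT)"
)

# On conflict, only `raw` and `fetched_at` are refreshed.
UPSERT = """
INSERT INTO activities (activity_id, distance_m, raw, fetched_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(activity_id) DO UPDATE SET
    raw = excluded.raw,
    fetched_at = excluded.fetched_at
"""

# First write, e.g. from the live API.
conn.execute(UPSERT, (1, 10000.0, '{"src": "api"}', "2025-01-01"))
# Later re-ingest of the same activity from an export.
conn.execute(UPSERT, (1, 9990.0, '{"src": "export"}', "2025-02-01"))

row = conn.execute(
    "SELECT distance_m, raw, fetched_at FROM activities WHERE activity_id = 1"
).fetchone()
print(row)  # the parsed column keeps its first value; raw and fetched_at are new
```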

### Garmin's export formats

Garmin actually has *two* export formats, both confusingly called "the export":

1. **Garmin Connect data export** (`connect.zip`) — request via Account Management → Export Your Data. Filenames are `<activity_id>_<activity_name>.fit`, distances/durations in SI units (m, s). `ingest_export.py` was built for this.
2. **Garmin Takeout dump** — UUID-named folder (`<uuid>_N/`) with many unrelated domains (aviation, Tacx, InReach…). The running data lives under `DI_CONNECT/`. **Crucially different conventions:**
   - FIT filenames: `<email>_<upload_id>.fit` — upload IDs ≠ activity IDs
   - `summarizedActivities` uses scaled-integer units: distance in **cm**, duration in **ms**, elevation in **cm**, speed in **m/s ÷ 10**
   - `ingest_export.py` now detects and converts these; `link_fit_files.py` links FITs by content (parses `session.start_time` and matches activities by ±60 s) since the filename-based approach doesn't work

For a Takeout dump, the full sequence is:

```bash
# 1. Extract the FIT archives in place (one unzip call per part, because
#    unzip treats extra file arguments as members of the first archive)
mkdir -p <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
for z in <export>/DI_CONNECT/DI-Connect-Uploaded-Files/UploadedFiles_0-_Part*.zip; do
  unzip "$z" -d <export>/DI_CONNECT/DI-Connect-Uploaded-Files/fit
done

# 2. Ingest JSON + index FITs that have parseable activity IDs
uv run ingest_export.py <export>/

# 3. Link Takeout FITs to activities by start-time match
uv run link_fit_files.py <export>/

# 4. (Optional) precompute time-in-zone cache
uv run compute_time_in_zone.py
```
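
Step 3 matches each FIT file's `session.start_time` to an activity start within ±60 s. The nearest-match idea can be sketched with the standard library alone; the function name, the tuple inputs, and the closest-wins rule are assumptions for illustration, not the code in `link_fit_files.py`.

```python
from bisect import bisect_left
from datetime import datetime, timedelta

def match_fit_to_activity(fit_start, activities, tolerance_s=60):
    """Return the activity_id whose start is closest to fit_start, or None.

    `activities` is a list of (start_time, activity_id) sorted by start_time.
    """
    starts = [s for s, _ in activities]
    i = bisect_left(starts, fit_start)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = activities[max(i - 1, 0): i + 1]
    best = min(candidates, key=lambda a: abs(a[0] - fit_start), default=None)
    if best is None or abs(best[0] - fit_start) > timedelta(seconds=tolerance_s):
        return None
    return best[1]

activities = [
    (datetime(2024, 5, 1, 6, 30, 0), 111),
    (datetime(2024, 5, 2, 18, 0, 5), 222),
]
print(match_fit_to_activity(datetime(2024, 5, 2, 18, 0, 40), activities))  # 222
print(match_fit_to_activity(datetime(2024, 5, 3, 7, 0, 0), activities))    # None
```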

### Garmin's units & quirks worth knowing

| Field | Live API | Takeout summarizedActivities | Convert by |
|---|---|---|---|
| distance | m | cm | × 0.01 |
| duration / movingDuration | s | ms | × 0.001 |
| elevationGain / Loss | m | cm | × 0.01 |
| averageSpeed / maxSpeed | m/s | m/s ÷ 10 | × 10 |
| HR (avg, max) | bpm | bpm | — |
| calories | kcal | kcal | — |
| training_effect, vo2_max | scalar | scalar | — |
| cadence (`averageRunCadence`) | both-legs SPM | both-legs SPM | — (do **not** double) |
| strideLength | cm | cm | × 0.01 for m |
| FIT `position_lat/long` | semicircles | semicircles | × (180 / 2³¹) for degrees |
| FIT `enhanced_speed` | m/s | m/s | — |

The cadence trap bit me: `averageRunCadence` on this account's exports is already both-legs (~150–180 spm). Doubling it as if it were single-leg yields 300+ spm, which fails any sane range filter, so no rows make it through.
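
The conversions in the table can be sketched as a small normaliser. The input keys follow the field names listed above, but both helper functions are hypothetical; `ingest_export.py` may structure this differently.

```python
SEMICIRCLE_TO_DEG = 180 / 2**31

# Multipliers that turn Takeout's scaled integers into SI units.
TAKEOUT_SCALE = {
    "distance": 0.01,         # cm -> m
    "duration": 0.001,        # ms -> s
    "movingDuration": 0.001,  # ms -> s
    "elevationGain": 0.01,    # cm -> m
    "elevationLoss": 0.01,    # cm -> m
    "averageSpeed": 10,       # (m/s / 10) -> m/s
    "maxSpeed": 10,           # (m/s / 10) -> m/s
}

def takeout_to_si(summary: dict) -> dict:
    """Scale known numeric fields; pass everything else (HR, cadence) through."""
    return {
        key: value * TAKEOUT_SCALE.get(key, 1)
        if isinstance(value, (int, float)) else value
        for key, value in summary.items()
    }

def semicircles_to_degrees(value: int) -> float:
    """Convert a FIT position_lat/position_long value to degrees."""
    return value * SEMICIRCLE_TO_DEG

row = takeout_to_si({"distance": 1000000, "duration": 3600000,
                     "averageSpeed": 0.2778, "averageRunCadence": 168})
print(row["distance"], row["duration"], row["averageRunCadence"])
print(semicircles_to_degrees(2**30))  # 90.0
```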

---

## 2 — Schema

`db.py:SCHEMA` is the source of truth; what follows is the gist.

| Table | Grain | Origin | Notes |
|---|---|---|---|
| `activities` | one row per activity | both paths | `raw` JSON has every Garmin field; the parsed columns are a curated subset |
| `activity_splits` | one row per lap | Path B only | Garmin defaults to auto-laps (~1 km or 1 mi per split). `raw` has cadence, stride, vert osc, GPS bounds |
| `activity_fit_files` | one row per linked FIT | Path A | `fit_path` is **relative** to whichever export root was passed to `link_fit_files.py` — keep the export folder in place |
| `activity_time_in_zone` | one row per activity | precomputed | Cached by `compute_time_in_zone.py`; `source` is `'fit'` or `'lap'` |
| `daily_steps` / `daily_sleep` / `daily_stress` / `daily_hrv` / `daily_body_battery` / `daily_intensity_minutes` / `daily_resting_hr` | one row per calendar date | both paths | Each has a `raw` JSON column with the full source payload |
| `sync_state` | key-value | both paths | Records last-sync / last-ingest timestamps |

The `raw` column on every table is the full upstream JSON. If you need a field the schema doesn't surface (sleep stage breakdowns, GPS bounding box, swimming-specific metrics, sport sub-types), unpack it from `raw` rather than re-ingesting:

```python
from analysis import expand_raw, load_activities, open_conn

acts = load_activities(open_conn(), type='running')
fields = expand_raw(acts)   # pd.json_normalize'd companion frame
```

---

## 3 — Metrics & how they're calculated

All loaders and helpers live in [analysis.py](analysis.py). The key derivations:

### Pace

```
pace_min_per_km = (moving_duration_s / 60) / (distance_m / 1000)
```

`moving_duration_s` excludes paused time. Activities with implausible paces (faster than 3 min/km or slower than 30 min/km) are masked to NaN in `load_activities` rather than dropped, so the row count is preserved. The fast bound sits near world-record marathon pace; the slow bound allows for walks with stops.
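
A scalar sketch of the formula and the plausibility mask. The function name and the choice to return NaN for zero or missing inputs are assumptions; the real `load_activities` presumably works on whole DataFrame columns.

```python
import math

MIN_PACE, MAX_PACE = 3.0, 30.0  # plausible range in min/km

def pace_min_per_km(moving_duration_s: float, distance_m: float) -> float:
    """Pace in min/km, or NaN when the inputs or the result are implausible."""
    if not distance_m or not moving_duration_s:
        return math.nan
    pace = (moving_duration_s / 60) / (distance_m / 1000)
    return pace if MIN_PACE <= pace <= MAX_PACE else math.nan

print(pace_min_per_km(3000, 10000))  # 5.0
print(pace_min_per_km(1500, 10000))  # nan (2.5 min/km is below the fast bound)
```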

### Acute:Chronic Workload Ratio (ACWR)

Sum `training_load` per day, then:

```
acute   = sum(training_load) over the last 7 days
chronic = sum(training_load) over the last 28 days, divided by 4 for week-equivalence
ACWR    = acute / chronic
```

The sweet spot is roughly 0.8–1.3; above 1.5 signals an aggressive ramp and elevated injury risk. Implemented inline in [02_running.ipynb](notebooks/02_running.ipynb).
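
As a worked example of the ratio, here is a plain-Python sketch over a list of daily loads (oldest first). The notebook computes this inline with pandas, so the function and its handling of short histories are illustrative assumptions.

```python
def acwr(daily_load):
    """ACWR for the last day of a daily training-load series, or None."""
    if len(daily_load) < 28:
        return None                      # not enough history for the chronic window
    acute = sum(daily_load[-7:])
    chronic = sum(daily_load[-28:]) / 4  # weekly-equivalent chronic load
    return acute / chronic if chronic else None

steady = [50.0] * 28
ramped = [50.0] * 21 + [100.0] * 7       # load doubles in the final week
print(acwr(steady))  # 1.0
print(acwr(ramped))  # 1.6
```

Doubling the load for one week after three steady weeks already lands above the 1.5 mark.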

### Aerobic efficiency — "m/beat"

```
m_per_beat = distance_m / (duration_s * avg_hr / 60)
```

Higher = more metres covered per heartbeat = fitter aerobic system at the run's effort. Best compared YoY within a tight distance bucket so terrain and intent don't dominate. Used as the headline view in [04_efficiency.ipynb](notebooks/04_efficiency.ipynb).
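
A quick worked example of the arithmetic, with a hypothetical helper name: 10 km in 50 minutes at an average of 150 bpm is 7,500 heartbeats.

```python
def m_per_beat(distance_m: float, duration_s: float, avg_hr: float) -> float:
    """Metres covered per heartbeat over the whole activity."""
    total_beats = duration_s * avg_hr / 60  # avg_hr is in beats per minute
    return distance_m / total_beats

print(round(m_per_beat(10_000, 3_000, 150), 3))  # 1.333
```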

### Decoupling (Pa:Hr drift)

Friel's method, in two resolutions. *Positive* = pace/HR fell off in the second half (cardiac drift, possibly fueling deficit). *Negative* = improved (negative split).

**Per-mile (lap level)** — `analysis.decoupling(splits, min_splits=6)`:

```
For each activity with ≥ min_splits laps:
    halves         = first half / second half by lap count
    eff_half       = duration-weighted mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (eff_first / eff_second − 1) × 100
```

Cheap, but smears across aid-station stops and rounds at lap boundaries.
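
The lap-level calculation can be sketched for a single activity as below. The tuple layout, the unweighted HR mean, and putting the odd lap in the second half are assumptions; `analysis.decoupling` may differ in those details.

```python
from statistics import mean

def lap_decoupling(laps, min_splits=6):
    """laps: list of (duration_s, speed_mps, heart_rate) in run order.

    Returns decoupling in percent, or None if there are too few laps.
    """
    if len(laps) < min_splits:
        return None

    def efficiency(half):
        total_s = sum(d for d, _, _ in half)
        speed = sum(d * v for d, v, _ in half) / total_s  # duration-weighted
        return speed / mean(hr for _, _, hr in half)

    mid = len(laps) // 2
    return (efficiency(laps[:mid]) / efficiency(laps[mid:]) - 1) * 100

# Same pace throughout, but HR drifts from 140 to 154 in the second half.
laps = [(300, 3.0, 140)] * 3 + [(300, 3.0, 154)] * 3
print(round(lap_decoupling(laps), 1))  # 10.0
```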

**Per-second (FIT)** — `analysis.fit_decoupling(records, segments=N, warmup_min=5, cooldown_min=2, min_speed_mps=0.5)`:

```
records = per-second FIT messages
drop first warmup_min and last cooldown_min
drop records where speed_mps < min_speed_mps   # ignore stopped time
slice remaining moving time into N equal-time chunks
for each chunk:
    efficiency     = mean(speed_mps) / mean(heart_rate)
    decoupling_pct = (efficiency[0] / efficiency[i] − 1) × 100
```
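
A rough stdlib version of those steps, assuming one record per second given as `(elapsed_s, speed_mps, heart_rate)` tuples. This is a sketch of the idea, not the body of `analysis.fit_decoupling`; leftover records that don't fill a whole chunk are simply dropped here.

```python
from statistics import mean

def fit_decoupling(records, segments=2, warmup_min=5, cooldown_min=2,
                   min_speed_mps=0.5):
    """Return one decoupling percentage per segment, relative to the first."""
    if not records:
        return []
    end = records[-1][0]
    kept = [r for r in records
            if warmup_min * 60 <= r[0] <= end - cooldown_min * 60
            and r[1] >= min_speed_mps]   # drop stopped time
    size = len(kept) // segments         # equal-time chunks at 1 Hz
    if size == 0:
        return []
    chunks = [kept[i * size:(i + 1) * size] for i in range(segments)]
    eff = [mean(r[1] for r in c) / mean(r[2] for r in c) for c in chunks]
    return [(eff[0] / e - 1) * 100 for e in eff]

# Synthetic 60-minute run: constant speed, HR steps from 140 to 147 at half-way.
records = [(t, 3.0, 140 if t < 1800 else 147) for t in range(3600)]
result = fit_decoupling(records, warmup_min=0, cooldown_min=0)
print([round(x, 1) for x in result])  # [0.0, 5.0]
```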

Friel's published thresholds (interpret on **steady aerobic** efforts only):

- **< 5%** — aerobically developed
- **5–10%** — sustainable
- **> 10%** — pacing unsustainable for the distance, OR fueling shortfall, OR heat

In this dataset the per-second number is consistently **7–15 percentage points lower** than the per-mile number for the same run — the gap is aid-station noise.

### HR zone time-in-zone

**Zone boundaries come from Garmin's `heartRateZones.json`** (training method `HR_MAX`), not estimated from observed HR. The constants for this user live in `analysis.HR_ZONES_USER`:

```
Z1: 102–122   (recovery)
Z2: 123–143   (easy aerobic — long-run target)
Z3: 144–164   (tempo / "junk-miles middle")
Z4: 165–185   (threshold — LTHR=182 sits inside Z4)
Z5: 186–209   (VO₂ max)
```

Time-in-zone has two implementations; `compute_time_in_zone.py` picks the better one per activity and caches the result in `activity_time_in_zone`.

**From FIT (`source='fit'`)** — `analysis.time_in_zone_from_fit(records)`:

```
records = per-second FIT messages
for each consecutive record pair:
    dt   = elapsed_seconds delta (clipped to [0, 30] so paused recording doesn't dump hours into one zone)
    zone = bucket(this record's heart_rate)
    accumulate dt into zone
```

**From splits (`source='lap'`)** — fallback when no FIT is linked:

```
for each split:
    zone = bucket(split's avg_hr)
    accumulate split.duration_s into that single zone
```

The FIT version is the right number; the lap version is **biased toward the middle zone**: when a lap's HR average is in Z3 but the actual HR oscillated into Z2 and Z4, the lap method dumps the whole lap into Z3. On a sample 8-hour race in this dataset, the lap method over-counted Z3 by 67 minutes and under-counted Z2 and Z4 by ~32 minutes each.
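
The middle-zone bias is easy to reproduce on synthetic data. In this sketch the zone table mirrors the boundaries above, while the function names, tuple layouts, and the choice to bucket each interval by its earlier record are assumptions rather than the code in `analysis.py`.

```python
ZONES = [("Z1", 102, 122), ("Z2", 123, 143), ("Z3", 144, 164),
         ("Z4", 165, 185), ("Z5", 186, 209)]

def bucket(hr):
    """Zone name for an integer HR, or None when outside Z1-Z5."""
    for name, lo, hi in ZONES:
        if lo <= hr <= hi:
            return name
    return None

def tiz_from_records(records, max_gap_s=30):
    """records: list of (elapsed_s, heart_rate). Returns seconds per zone."""
    out = {}
    for (t0, hr), (t1, _) in zip(records, records[1:]):
        dt = min(max(t1 - t0, 0), max_gap_s)  # clip pauses to max_gap_s
        zone = bucket(hr)
        if zone:
            out[zone] = out.get(zone, 0) + dt
    return out

def tiz_from_laps(laps):
    """laps: list of (duration_s, avg_hr). Each whole lap lands in one zone."""
    out = {}
    for duration_s, avg_hr in laps:
        zone = bucket(round(avg_hr))
        if zone:
            out[zone] = out.get(zone, 0) + duration_s
    return out

# One 10-minute lap alternating each minute between 140 bpm (Z2) and 168 bpm (Z4).
records = [(t, 140 if (t // 60) % 2 == 0 else 168) for t in range(601)]
print(tiz_from_records(records))    # {'Z2': 300, 'Z4': 300}
print(tiz_from_laps([(600, 154)]))  # {'Z3': 600}
```

The lap average of 154 bpm falls in Z3 even though no second of the lap was actually spent there.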

### Polarized split

A three-bucket simplification used in [05_intra_run.ipynb §4](notebooks/05_intra_run.ipynb):

```
easy     = Z1 + Z2   (HR ≤ 143)
moderate = Z3        (144–164, the "junk-miles middle")
hard     = Z4 + Z5   (≥ 165, threshold and up)
```

Seiler's polarized target is roughly 80% easy, under 10% moderate, and the remaining 10–20% hard. Pyramidal training is high easy, modest moderate, low hard. Threshold-heavy training, by contrast, puts most of its time in moderate.
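
Collapsing cached zone seconds into the three buckets is a few lines. The helper below is a hypothetical sketch that assumes a dict keyed by zone name; the notebook's own layout may differ.

```python
def polarized_split(zone_seconds):
    """Collapse Z1-Z5 seconds into easy / moderate / hard percentages."""
    groups = {"easy": ("Z1", "Z2"), "moderate": ("Z3",), "hard": ("Z4", "Z5")}
    total = sum(zone_seconds.get(z, 0) for zs in groups.values() for z in zs)
    if total == 0:
        return {name: 0.0 for name in groups}
    return {
        name: round(100 * sum(zone_seconds.get(z, 0) for z in zs) / total, 1)
        for name, zs in groups.items()
    }

week = {"Z1": 3600, "Z2": 10800, "Z3": 1800, "Z4": 1440, "Z5": 360}
print(polarized_split(week))  # {'easy': 80.0, 'moderate': 10.0, 'hard': 10.0}
```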

### Cadence & stride

Both come directly from Garmin's lap data (`averageRunCadence`, `strideLength`) and from FIT records (`cadence`, `step_length`). Two gotchas:

1. **Cadence is already both-legs SPM** for this user's exports — do not double. Sanity range for running: 140–200 spm.
2. **Stride length is in cm** in lap data (`strideLength: 78` → 78 cm → 0.78 m) and in **mm** in FIT records (`step_length: 1041` → 104 cm). Divide accordingly.

The notebook restricts form analysis to splits with cadence > 0 (drops walks/standing intervals) and to a controlled pace band (5:30–6:30 min/km) when comparing YoY so fitness changes don't contaminate the form signal.

### GPS route clustering

`analysis.cluster_routes(lats, lons, radius_km=0.25)`:

```
Greedy haversine clustering:
  For each unassigned point i:
      distances  = haversine(point_i, all_other_points)
      neighbours = points within radius_km
      if len(neighbours) >= 2:
          assign all to next_label
          next_label += 1
```

Per-route pace trends control for terrain — the same loop run in 2023 vs 2025 says far more about fitness than raw monthly pace. Adequate for a few hundred starts; switch to `sklearn.cluster.DBSCAN(metric='haversine')` (which expects coordinates in radians) for thousands.
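
A self-contained sketch of the greedy approach. Whether the neighbour count includes the point itself is not stated above, so this version assumes it does (two nearby starts are enough to form a cluster) and labels unclustered points `-1`; it reuses the `cluster_routes` name purely for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def cluster_routes(lats, lons, radius_km=0.25):
    """Greedy clustering of start points; returns one label per point."""
    labels = [-1] * len(lats)
    next_label = 0
    for i in range(len(lats)):
        if labels[i] != -1:
            continue
        # Unassigned points (including i itself) within radius of point i.
        near = [j for j in range(len(lats))
                if labels[j] == -1
                and haversine_km(lats[i], lons[i], lats[j], lons[j]) <= radius_km]
        if len(near) >= 2:
            for j in near:
                labels[j] = next_label
            next_label += 1
    return labels

lats = [47.6000, 47.6005, 47.6010, 47.7000]
lons = [-122.3300, -122.3301, -122.3299, -122.3300]
print(cluster_routes(lats, lons))  # [0, 0, 0, -1]
```

Each pass compares one point against all others, so the cost grows quadratically with the number of starts.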

### Banister fitness / fatigue / form (Performance Management Chart)

The standard endurance-training lens (TrainingPeaks PMC), implemented in `analysis.banister(daily_load, ctl_tau=42, atl_tau=7)`:

```
CTL_today = CTL_yesterday · exp(−1/τ_CTL) + load_today · (1 − exp(−1/τ_CTL))    τ_CTL = 42d
ATL_today = ATL_yesterday · exp(−1/τ_ATL) + load_today · (1 − exp(−1/τ_ATL))    τ_ATL = 7d
TSB_today = CTL_yesterday − ATL_yesterday    # yesterday's values (TP convention)
```

Helper: `daily_training_load_series(conn, activity_types=('running','trail_running'))` returns the per-day summed `training_load` Series that feeds it. Rest days are filled with 0 — both EWMAs still update, ATL drops faster than CTL, TSB rises.
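
The recurrences translate directly into a loop. This sketch works on a plain list instead of a pandas Series and assumes a cold start at CTL = ATL = 0, which may not match how `analysis.banister` seeds its first day.

```python
from math import exp

def banister(daily_load, ctl_tau=42, atl_tau=7):
    """Return a list of (ctl, atl, tsb) tuples, one per day, oldest first."""
    k_ctl, k_atl = exp(-1 / ctl_tau), exp(-1 / atl_tau)
    ctl = atl = 0.0      # assumed cold start with no prior load
    out = []
    for load in daily_load:
        tsb = ctl - atl  # yesterday's CTL minus yesterday's ATL
        ctl = ctl * k_ctl + load * (1 - k_ctl)
        atl = atl * k_atl + load * (1 - k_atl)
        out.append((ctl, atl, tsb))
    return out

# Four weeks at a constant daily load of 100, then one week of rest.
series = banister([100.0] * 28 + [0.0] * 7)
for day in (27, 34):
    ctl, atl, tsb = series[day]
    print(day, round(ctl, 1), round(atl, 1), round(tsb, 1))
```

At the end of the loading block TSB is strongly negative; after the rest week ATL has decayed much further than CTL and TSB has climbed back to around zero.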

TSB interpretation (used in both notebooks 02 and 06):

| TSB | Meaning |
|---|---|
| < −30 | severely fatigued — injury risk |
| −30 to −10 | productive overload — heart of a build |
| −10 to 0 | balanced building |
| 0 to +10 | sharpening |
| **+10 to +25** | **fresh / peaked — race-day target** |
| > +25 | detrained (taper too long) |

[02_running.ipynb](notebooks/02_running.ipynb) plots the historical PMC; [06_race_plan.ipynb](notebooks/06_race_plan.ipynb) projects it forward through race day by converting plan-week km into estimated daily loads (separate `RACE_DAY_TL_PER_KM ≈ 7` for the race itself vs `~11` for training runs — race-day TL/km is empirically lower because ultras spread load over more time).

---

## 4 — Notebook tour

Each notebook bootstraps the same way:

```python
import sys; sys.path.insert(0, '..')
from analysis import open_conn, load_activities, load_wellness  # plus any other helpers the notebook uses
conn = open_conn()
```

| # | Notebook | Covers |
|---|---|---|
| 01 | Overview | Data coverage by table, recent activities, monthly mileage YoY, wellness completeness |
| 02 | Running | Weekly volume + rolling, pace-vs-HR scatter, aerobic-efficiency trend, ACWR, PRs by distance |
| 03 | Recovery | Sleep composition, after-run vs after-rest HRV/RHR, correlations between training and recovery, weekly km vs recovery indicators |
| 04 | Efficiency | YoY m/beat by distance bucket, HR/pace split, polynomial curve fits, easy-runs-only headline view |
| 05 | Intra-run | (1) per-mile decoupling, (1b) per-second decoupling on race FITs, (2) cadence & stride at controlled pace, (3) GPS route clustering, (4) HR-zone time-in-zone (FIT-based) + polarized split |
| 06 | Race plan | Periodised 17-week plan around 30K → 50K → 50-mile, with auto-tracking against actual recorded volume |

Notebooks 05 and 06 are generated from `_build_05.py` / `_build_06.py` build scripts. Edit those, then run `uv run python notebooks/_build_NN.py` to regenerate the `.ipynb`. Easier than editing 30 KB of JSON.

---

## 5 — Setup

```bash
uv sync   # creates .venv from pyproject.toml
```

Kernel: in VSCode, pick the project's `.venv` (top-right of any notebook). If it's not discoverable:

```bash
uv run python -m ipykernel install --user --name garmin --display-name "garmin (.venv)"
```

### Path A — official export

1. Go to <https://www.garmin.com/account/datamanagement/exportdata> and request the export. Garmin emails the link within 24–72 h.
2. Download the zip (several hundred MB to a few GB) or, for the Takeout dump, the UUID-named folder.
3. Ingest:

   ```bash
   uv run ingest_export.py path/to/connect.zip
   # or, if already unzipped:
   uv run ingest_export.py path/to/<export>/
   ```

Add `--dry-run` to preview without writing. The data is upserted on `activity_id` / `calendar_date`, so it's safe to re-ingest a refreshed export later.

For Takeout dumps specifically, follow the 4-step extract+ingest+link+precompute sequence in §1.

### Path B — live API sync

Garmin's SSO endpoint is rate-limited; you may hit 429s. If so, wait 30–60 min, switch networks, or use Path A.

```bash
uv run auth.py                     # prompts for email / password / MFA, saves OAuth to .secrets/
uv run sync.py --full              # 365-day backfill
uv run sync.py                     # incremental thereafter (last 14 days)

uv run sync.py --days 90           # custom backfill window
uv run sync.py --no-details        # skip per-activity detail/splits (much faster)
uv run sync.py --skip-activities   # wellness only
```

`garth` is deprecated upstream (<https://github.com/matin/garth/discussions/222>). If the live sync stops working, fall back to Path A.