From 7688fc1dd38eb6d9984ea134876332a5ac7c1034 Mon Sep 17 00:00:00 2001
From: noise
Date: Tue, 28 Apr 2026 18:10:00 -0400
Subject: [PATCH] Resilience pass: bulletproof recovery + FWS / fault polling + EG4 discovery refresh

---
 LVX6048/2026-04-28.md                        | 170 ++++++++++++++++++
 LVX6048/bin/lvx-resolve-links                |  45 +++--
 LVX6048/config/powermon/powermon.yaml        |  14 ++
 LVX6048/config/powermon/powermon2.yaml       |  14 ++
 .../systemd/system/lvx-resolve-links.timer   |  15 ++
 LVX6048/etc/systemd/system/powermon.service  |   5 +
 .../powermon.service.d/10-resolver.conf      |   6 +-
 LVX6048/etc/systemd/system/powermon2.service |   2 +
 .../powermon2.service.d/10-resolver.conf     |   6 +-
 LVX6048/install.sh                           |   3 +-
 LVX6048/powermon-patches/pi18.py             |   3 +
 11 files changed, 267 insertions(+), 16 deletions(-)
 create mode 100644 LVX6048/2026-04-28.md
 create mode 100644 LVX6048/etc/systemd/system/lvx-resolve-links.timer

diff --git a/LVX6048/2026-04-28.md b/LVX6048/2026-04-28.md
new file mode 100644
index 0000000..357ec4d
--- /dev/null
+++ b/LVX6048/2026-04-28.md
@@ -0,0 +1,170 @@
+# 2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh
+
+Mostly resilience work today, prompted by waking up to find both powermon
+services inactive and no PV / battery data flowing from the inverters since
+~06:11. Root-caused, recovered, and hardened the failure mode so it can't
+strand the stack again. While in there: closed two related polling gaps
+(faults weren't being captured at all, and EG4 discovery configs had
+fallen out of the broker — InfluxDB exporter had stopped seeing pack data).
+
+## What happened
+
+1. **Morning incident: powermon offline since 06:11.** udev fired the
+   resolver during what looks like a sunrise cold-start window — both
+   inverters were briefly unresponsive on USB-HID (`/dev/hidraw0`
+   returned no PI18 reply; `/dev/hidraw1` returned 22 zero bytes for the
+   serial query). Resolver hit its 10 s internal retry, gave up, exited
+   with code 2. Powermon services had `Requires=lvx-resolve-links` so
+   systemd refused to start them, and there was no retry path — stack
+   stayed dead until manual intervention ~10 h later.
+
+2. **Recovery.** Re-ran the resolver manually; both inverters responded
+   normally on the second attempt. Started powermon services; PV
+   voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the
+   usual telemetry flowed back within seconds.
+
+3. **Bulletproofing the resolver path.** Four coordinated changes so a
+   transient morning blip can't latch the whole stack offline:
+
+   a. **Resolver → always-exit-0.** Best-effort policy: write whatever
+      symlinks it can resolve, log warnings for the missing ones,
+      always return success. Only-relink-on-change semantics so a
+      no-op timer-driven run doesn't churn the symlinks. When symlinks
+      *do* move (post-power-cycle), resolver auto-restarts powermon
+      so its open hidraw fds re-bind.
+
+   b. **`Wants=` instead of `Requires=`** on both powermon services.
+      Resolver failure no longer propagates as "permanently failed"
+      to dependent units. Powermon's own `Restart=always` handles the
+      retry loop until `/dev/lvx6048-{1,2}` appear.
+
+   c. **Periodic systemd timer** for the resolver
+      (`lvx-resolve-links.timer`) — `OnBootSec=2min`,
+      `OnUnitActiveSec=5min`. Belt-and-suspenders against any
+      hot-plug case where udev fired too early or not at all
+      (stuck-USB scenario like this morning's).
+
+   d. **`StartLimitIntervalSec=0`** on both powermon services — disables
+      systemd's default "stop trying after 5 fails in 10 s" rate-limiter,
+      so the retry loop continues indefinitely until success.
+
+   Worst-case data gap is now ~5 min (one timer cycle + a couple of polls)
+   vs. the previous "until a human notices and runs systemctl restart".
+
+4. **FWS fault polling added to powermon.** A separate gap surfaced
+   during the morning's troubleshooting: front-panel showed a "32"
+   indicator, and we discovered our `powermon.yaml` was polling
+   `GS / MOD / PIRI / ET` only — no `FWS`. Faults / warnings weren't
+   reaching MQTT at all. Added FWS (every 30 s) to both inverter
+   configs.
+
+5. **Defensive FWS fault-code decoder.** Same shape as the MOD `06`/`07`
+   crashes from earlier in the week — an unknown fault code would
+   raise `KeyError` and crash powermon. Pre-populated the OPTION dict with
+   placeholder labels for codes 00–99, then layered the known names on
+   top. Future unknown codes show as `"Fault 32"` rather than crashing.
+   Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to
+   the shim later — same pattern.
+
+6. **About the LCD's "32".** Direct FWS queries to both units returned
+   `fault_code=No fault` while `pv_loss_warning=enabled`. The LCD was
+   most likely showing the PV-loss warning code (no F/W prefix on this
+   firmware's display), correlating with the cold-start window. Not a
+   real fault — but worth re-confirming the LCD format next time so we
+   know whether 32-with-prefix is a thing on this hardware.
+
+7. **EG4 lifepower discovery configs vanished.** Independent finding:
+   when checking why `lvx6048_*_fault_code` wasn't appearing in HA, I
+   noticed zero retained `homeassistant/sensor/lifepower4_*/config`
+   topics on the broker. HA had cached entity registrations and was
+   showing live state, but the canonical discovery configs were gone —
+   likely cleared by an earlier `eg4-purge-orphans` invocation or
+   broker retention loss. InfluxDB exporter relies on those configs to
+   know what entities to publish, so pack voltages / SoCs had stopped
+   reaching InfluxDB.
+
+   Fix: `systemctl restart eg4-battery.service`. Daemon re-publishes
+   all 207 retained discovery configs at startup (3 packs × ~69
+   entities). HA refreshes its registry, InfluxDB resumes capturing.
+
+## Net effect
+
+| Layer | Before | After |
+|---|---|---|
+| Morning-blip recovery | manual `systemctl restart` (~10 h gap) | automatic, worst case ~5 min gap |
+| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
+| Resolver runs | only on udev hot-plug | + every 5 min via timer |
+| Powermon ↔ resolver coupling | `Requires=` (failure cascades) | `Wants=` (independent retry) |
+| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (`StartLimitIntervalSec=0`) |
+| Fault-code → MQTT | not polled | FWS every 30 s |
+| Unknown fault code | `KeyError` crash | shows as `"Fault NN"` |
+| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
+
+## Files touched
+
+```
+M  LVX6048/bin/lvx-resolve-links  (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
+M  LVX6048/config/powermon/powermon.yaml  (added FWS poll @ 30s)
+M  LVX6048/config/powermon/powermon2.yaml  (same)
+M  LVX6048/etc/systemd/system/powermon.service  (StartLimitIntervalSec=0)
+M  LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf  (Requires= -> Wants=)
+M  LVX6048/etc/systemd/system/powermon2.service  (same)
+M  LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf  (same)
+M  LVX6048/install.sh  (installs the timer)
+M  LVX6048/powermon-patches/pi18.py  (defensive fault-code decoder, 00-99)
+A  LVX6048/etc/systemd/system/lvx-resolve-links.timer  (every 5 min)
+A  LVX6048/2026-04-28.md  (this file)
+```
+
+## How to verify
+
+```bash
+# Timer running?
+systemctl list-timers lvx-resolve-links.timer
+
+# Resolver idempotent? (no powermon bounce on no-change)
+sudo systemctl restart lvx-resolve-links.service
+journalctl -u lvx-resolve-links.service -n 8 --no-pager
+# Expect: "(unchanged)" lines, no "restarting powermon" line.
+
+# Wants= confirmed?
+systemctl show powermon.service -p Wants -p Requires
+# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
+# include lvx-resolve-links.
+
+# Synthetic morning-blip test (optional, disruptive):
+sudo rm /dev/lvx6048-{1,2}
+sudo systemctl restart powermon.service
+# Expect: powermon enters Restart loop; within 5 min the timer fires the
+# resolver, recreates symlinks, powermon's next retry succeeds.
+
+# Fault polling alive?
+mosquitto_sub -h -u mqtt -P -W 35 -v \
+  -t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
+  -t 'homeassistant/sensor/lvx6048_2_fault_code/state'
+# Expect: "No fault" published every 30 s on each unit.
+```
+
+## Loose ends / next pass
+
+- **FWS warning booleans aren't getting individual HA entities.** The 16
+  `enabled/disabled` flags (PV loss, over-temp, fan locked, …) are
+  decoded by powermon's screen formatter but the hass formatter filters
+  them because they lack `unit_of_measurement` / `device_class` /
+  `reading_type` metadata. Worth a separate pass to expose them as
+  `binary_sensor` entities so HA gets a real "PV loss" indicator
+  alongside `fault_code`.
+- **LCD code format confirmation.** Next time you can look at the LCD,
+  check whether warning codes display as bare "32", "W32", or
+  something else — informs whether to add a friendly mapping for
+  warning codes in the docs.
+- **Faster timer cadence?** Currently 5 min (worst-case ~5 min gap).
+  Could drop to 1–2 min if morning blips become more frequent.
+  Trade-off: more steady-state resolver runs (each costs one PI18 ID
+  query per inverter, harmless but visible in logs).
+- **eg4-battery discovery-config retention.** The configs DID get
+  purged at some point — not certain what triggered it. Next
+  investigation: confirm whether `eg4-purge-orphans` is over-eager,
+  whether the broker is dropping retains, or whether a daemon restart
+  is needed. Worth a `mosquitto_sub` watch on the broker for retain
+  changes.
diff --git a/LVX6048/bin/lvx-resolve-links b/LVX6048/bin/lvx-resolve-links
index 02a4512..b90aded 100755
--- a/LVX6048/bin/lvx-resolve-links
+++ b/LVX6048/bin/lvx-resolve-links
@@ -89,25 +89,44 @@ async def main() -> int:
             break
         await asyncio.sleep(0.5)
 
-    if not sn_to_path:
-        print("no LVX6048 devices found on /dev/hidraw*", file=sys.stderr)
-        return 1
-
-    missing = []
+    # Always-best-effort policy: exit 0 even if no expected serials were
+    # found, so a transient inverter blip (e.g. the unit being mid-cold-start
+    # at sunrise) doesn't permanently latch dependent services into a failed
+    # state via systemd `Requires=`. The periodic timer + powermon's own
+    # Restart=always handles convergence once the inverter recovers.
+    changed = False
     for sn, link in LINK_FOR_SERIAL.items():
         if sn in sn_to_path:
-            _relink(link, sn_to_path[sn])
-            print(f"symlink {link} -> {os.path.basename(sn_to_path[sn])}")
+            target = sn_to_path[sn]
+            current = os.readlink(link) if os.path.islink(link) else None
+            current_full = os.path.join(os.path.dirname(link), current) if current else None
+            if current_full != target:
+                _relink(link, target)
+                print(f"symlink {link} -> {os.path.basename(target)}")
+                changed = True
+            else:
+                print(f"symlink {link} -> {os.path.basename(target)} (unchanged)")
         else:
-            missing.append((link, sn))
-            try:
-                if os.path.islink(link):
+            if os.path.islink(link):
+                try:
                     os.unlink(link)
-            except FileNotFoundError:
-                pass
+                    changed = True
+                except FileNotFoundError:
+                    pass
             print(f"WARNING: {link} serial {sn} not found on any /dev/hidraw*")
 
-    return 0 if not missing else 2
+    # If symlinks actually moved (e.g. hidraw indices flipped post-power-cycle),
+    # bounce powermon so its open hidraw fds re-bind to the right physical unit.
+    # Idempotent runs (no symlink change) leave powermon alone.
+    if changed:
+        print("symlinks changed — restarting powermon services")
+        import subprocess
+        subprocess.Popen(
+            ["/bin/systemctl", "--no-block", "restart",
+             "powermon.service", "powermon2.service"]
+        )
+
+    return 0
 
 
 if __name__ == "__main__":
diff --git a/LVX6048/config/powermon/powermon.yaml b/LVX6048/config/powermon/powermon.yaml
index f05d82f..ddc4788 100644
--- a/LVX6048/config/powermon/powermon.yaml
+++ b/LVX6048/config/powermon/powermon.yaml
@@ -66,3 +66,17 @@ commands:
           type: hass
           discovery_prefix: homeassistant
           entity_id_prefix: lvx6048_1
+
+  # Fault + warning bits (2-digit fault_code + 16 boolean warning flags).
+  # Polled every 30 s — fast enough to be useful in HA, slow enough not to
+  # flood the bus during normal "No fault" steady state.
+  - command: FWS
+    trigger:
+      every: 30
+    outputs:
+      - type: mqtt
+        topic: powermon/lvx6048_1/FWS
+        format:
+          type: hass
+          discovery_prefix: homeassistant
+          entity_id_prefix: lvx6048_1
diff --git a/LVX6048/config/powermon/powermon2.yaml b/LVX6048/config/powermon/powermon2.yaml
index b2d56ea..6510416 100644
--- a/LVX6048/config/powermon/powermon2.yaml
+++ b/LVX6048/config/powermon/powermon2.yaml
@@ -63,3 +63,17 @@ commands:
           type: hass
           discovery_prefix: homeassistant
           entity_id_prefix: lvx6048_2
+
+  # Fault + warning bits (2-digit fault_code + 16 boolean warning flags).
+  # Polled every 30 s — fast enough to be useful in HA, slow enough not to
+  # flood the bus during normal "No fault" steady state.
+  - command: FWS
+    trigger:
+      every: 30
+    outputs:
+      - type: mqtt
+        topic: powermon/lvx6048_2/FWS
+        format:
+          type: hass
+          discovery_prefix: homeassistant
+          entity_id_prefix: lvx6048_2
diff --git a/LVX6048/etc/systemd/system/lvx-resolve-links.timer b/LVX6048/etc/systemd/system/lvx-resolve-links.timer
new file mode 100644
index 0000000..05d01d4
--- /dev/null
+++ b/LVX6048/etc/systemd/system/lvx-resolve-links.timer
@@ -0,0 +1,15 @@
+[Unit]
+Description=Periodic re-resolve of LVX6048 hidraw symlinks (transient-failure backstop)
+# Belt-and-suspenders against the udev RUN+= path: if the inverter is still
+# enumerating at hot-plug time and the resolver finds no PI18-responsive
+# device, this timer keeps trying every few minutes. The resolver itself is
+# idempotent — runs that don't change anything are no-ops.
+
+[Timer]
+OnBootSec=2min
+OnUnitActiveSec=5min
+AccuracySec=15s
+Unit=lvx-resolve-links.service
+
+[Install]
+WantedBy=timers.target
diff --git a/LVX6048/etc/systemd/system/powermon.service b/LVX6048/etc/systemd/system/powermon.service
index 46cda8a..058c59a 100644
--- a/LVX6048/etc/systemd/system/powermon.service
+++ b/LVX6048/etc/systemd/system/powermon.service
@@ -10,6 +10,11 @@ Group=dialout
 ExecStart=/home/noise/.local/bin/powermon -C /home/noise/.config/powermon/powermon.yaml
 Restart=always
 RestartSec=10
+# Disable systemd's default rate-limiter: when the inverter is briefly
+# unreachable (sunrise cold-start / USB hiccup), we want to keep retrying
+# every 10s indefinitely — not give up after 5 failures in 10s like the
+# default StartLimitBurst=5 / StartLimitIntervalSec=10s would.
+StartLimitIntervalSec=0
 
 [Install]
 WantedBy=multi-user.target
diff --git a/LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf b/LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf
index 1882c55..e2ec33f 100644
--- a/LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf
+++ b/LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf
@@ -1,3 +1,7 @@
 [Unit]
+# Wants= instead of Requires=: a failed/inactive resolver should NOT latch
+# powermon into a permanently-failed state. powermon's own Restart=always
+# handles the retry loop until /dev/lvx6048-1 reappears (e.g. after the
+# resolve-links periodic timer next runs successfully).
 After=lvx-resolve-links.service
-Requires=lvx-resolve-links.service
+Wants=lvx-resolve-links.service
diff --git a/LVX6048/etc/systemd/system/powermon2.service b/LVX6048/etc/systemd/system/powermon2.service
index c690a94..e702490 100644
--- a/LVX6048/etc/systemd/system/powermon2.service
+++ b/LVX6048/etc/systemd/system/powermon2.service
@@ -10,6 +10,8 @@ Group=dialout
 ExecStart=/home/noise/.local/bin/powermon -C /home/noise/.config/powermon/powermon2.yaml
 Restart=always
 RestartSec=10
+# Disable systemd's default rate-limiter (see powermon.service for rationale).
+StartLimitIntervalSec=0
 
 [Install]
 WantedBy=multi-user.target
diff --git a/LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf b/LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf
index 1882c55..65802f0 100644
--- a/LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf
+++ b/LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf
@@ -1,3 +1,7 @@
 [Unit]
+# Wants= instead of Requires=: a failed/inactive resolver should NOT latch
+# powermon into a permanently-failed state. powermon's own Restart=always
+# handles the retry loop until /dev/lvx6048-2 reappears (e.g. after the
+# resolve-links periodic timer next runs successfully).
 After=lvx-resolve-links.service
-Requires=lvx-resolve-links.service
+Wants=lvx-resolve-links.service
diff --git a/LVX6048/install.sh b/LVX6048/install.sh
index 617f892..27ced53 100755
--- a/LVX6048/install.sh
+++ b/LVX6048/install.sh
@@ -61,6 +61,7 @@ sudo install -m 755 "$TMP_RESOLVER" /usr/local/sbin/lvx-resolve-links
 # --- 5. systemd units + drop-ins -------------------------------------------
 msg "Installing systemd units"
 sudo install -m 644 "${BASE}/etc/systemd/system/lvx-resolve-links.service" /etc/systemd/system/
+sudo install -m 644 "${BASE}/etc/systemd/system/lvx-resolve-links.timer" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/lvx-control.service" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/powermon.service" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/powermon2.service" /etc/systemd/system/
@@ -94,7 +95,7 @@ done
 
 # --- 7. enable services ----------------------------------------------------
 msg "Enabling services"
-sudo systemctl enable --now lvx-resolve-links.service
+sudo systemctl enable --now lvx-resolve-links.service lvx-resolve-links.timer
 # Only auto-start powermon services if credentials look real (no placeholder present)
 if grep -q '' "${HOME}/.config/powermon/powermon.yaml" 2>/dev/null; then
     cat <