2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh
Mostly resilience work today, prompted by waking up to find both powermon services inactive and no PV / battery data flowing from the inverters since ~06:11. Root-caused, recovered, and hardened the failure mode so it can't strand the stack again. While in there: closed two related polling gaps (faults weren't being captured at all, and EG4 discovery configs had fallen out of the broker — InfluxDB exporter had stopped seeing pack data).
What happened
- Morning incident: powermon offline since 06:11. udev fired the resolver during what looks like a sunrise cold-start window, when both inverters were briefly unresponsive on USB-HID (`/dev/hidraw0` returned no PI18 reply; `/dev/hidraw1` returned 22 zero bytes for the serial query). Resolver hit its 10 s internal retry, gave up, exited with code 2. Powermon services had `Requires=lvx-resolve-links` so systemd refused to start them, and there was no retry path: the stack stayed dead until manual intervention ~10 h later.
- Recovery. Re-ran the resolver manually; both inverters responded normally on the second attempt. Started powermon services; PV voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the usual telemetry flowed back within seconds.
- Bulletproofing the resolver path. Four coordinated changes so a transient morning blip can't latch the whole stack offline:

  a. Resolver → always-exit-0. Best-effort policy: write whatever symlinks it can resolve, log warnings for the missing ones, always return success. Only-relink-on-change semantics so a no-op timer-driven run doesn't churn the symlinks. When symlinks do move (post-power-cycle), resolver auto-restarts powermon so its open hidraw fds re-bind.

  b. `Wants=` instead of `Requires=` on both powermon services. Resolver failure no longer propagates as "permanently failed" to dependent units. Powermon's own `Restart=always` handles the retry loop until `/dev/lvx6048-{1,2}` appear.

  c. Periodic systemd timer for the resolver (`lvx-resolve-links.timer`): `OnBootSec=2min`, `OnUnitActiveSec=5min`. Belt-and-suspenders against any hot-plug case where udev didn't fire (stuck-USB scenario like this morning's).

  d. `StartLimitIntervalSec=0` on both powermon services. This disables systemd's default "stop trying after 5 fails in 10 s" rate-limiter, so the retry loop continues indefinitely until success.

  Worst-case data gap is now ~2 min (timer cycle + couple polls) vs. the previous "until a human notices and runs `systemctl restart`".
- FWS fault polling added to powermon. A separate gap surfaced during the morning's troubleshooting: front-panel showed a "32" indicator, and we discovered our `powermon.yaml` was polling `GS / MOD / PIRI / ET` only, with no `FWS`. Faults / warnings weren't reaching MQTT at all. Added FWS (every 30 s) to both inverter configs.
- Defensive FWS fault-code decoder. Same shape as the MOD `06/07` crashes from earlier in the week: an unknown fault code would `KeyError` and crash powermon. Pre-populated the OPTION dict with placeholder labels for codes 00–99, then layered the known names on top. Future unknown codes show as `"Fault 32"` rather than crashing. Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to the shim later (same pattern).
- About the LCD's "32". Direct FWS queries to both units returned `fault_code=No fault` while `pv_loss_warning=enabled`. The LCD was most likely showing the PV-loss warning code (no F/W prefix on this firmware's display), correlating with the cold-start window. Not a real fault, but worth re-confirming the LCD format next time so we know whether 32-with-prefix is a thing on this hardware.
- EG4 lifepower discovery configs vanished. Independent finding: when checking why `lvx6048_*_fault_code` wasn't appearing in HA, I noticed zero retained `homeassistant/sensor/lifepower4_*/config` topics on the broker. HA had cached entity registrations and was showing live state, but the canonical discovery configs were gone, likely cleared by an earlier `eg4-purge-orphans` invocation or broker retention loss. InfluxDB exporter relies on those configs to know what entities to publish, so pack voltages / SoCs had stopped reaching InfluxDB.

  Fix: `systemctl restart eg4-battery.service`. Daemon re-publishes all 207 retained discovery configs at startup (3 packs × ~69 entities). HA refreshes its registry, InfluxDB resumes capturing.
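The placeholder-then-overlay idea behind the defensive fault-code decoder can be sketched in a few lines. This is a minimal illustration, not the actual `pi18.py` patch: the names `KNOWN_FAULTS`, `build_fault_options`, and `decode_fault` are hypothetical, the two-digit string keys are an assumption about how the OPTION dict is keyed, and the `"00"` → `"No fault"` entry is an assumed mapping standing in for the full table of known names.

```python
# Hypothetical stand-in for the table of known fault names in pi18.py.
# Only the "No fault" label appears in this log; its key "00" is assumed.
KNOWN_FAULTS = {
    "00": "No fault",
}


def build_fault_options(known):
    """Return a dict covering codes 00-99, with known names layered on top."""
    # Step 1: placeholder label for every possible two-digit code.
    options = {f"{code:02d}": f"Fault {code:02d}" for code in range(100)}
    # Step 2: known names overwrite the placeholders where they exist.
    options.update(known)
    return options


FAULT_OPTIONS = build_fault_options(KNOWN_FAULTS)


def decode_fault(raw):
    """Plain dict lookup; cannot raise KeyError for any code in 00-99."""
    return FAULT_OPTIONS[raw]


print(decode_fault("00"))
print(decode_fault("32"))
```

Because the lookup stays a plain dict access, the decoder itself needs no special-case branch; only the table construction changes.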
Net effect
| Layer | Before | After |
|---|---|---|
| Morning-blip recovery | manual systemctl restart (~10 h gap) | automatic ~2 min gap |
| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
| Resolver runs | only on udev hot-plug | + every 5 min via timer |
| Powermon ↔ resolver coupling | Requires= (failure cascades) | Wants= (independent retry) |
| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (StartLimitIntervalSec=0) |
| Fault-code → MQTT | not polled | FWS every 30 s |
| Unknown fault code | KeyError crash | shows as "Fault NN" |
| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
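
For reference, the systemd rows of the table correspond to unit settings along the following lines. This is a sketch assembled from the directives named in this log, not a copy of the installed files: the `Description=` text and the `[Install]` section are assumptions, any `After=` ordering or `RestartSec=` value in the real units is not shown, and `powermon2.service` is assumed to mirror `powermon.service`.

```ini
# lvx-resolve-links.timer
[Unit]
Description=Periodic re-run of lvx-resolve-links (illustrative text)

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target

# powermon.service.d/10-resolver.conf
[Unit]
Wants=lvx-resolve-links.service

# powermon.service (relevant lines only)
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always
```

Note that `StartLimitIntervalSec=` belongs in `[Unit]`, while `Restart=` belongs in `[Service]`.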
Files touched
M LVX6048/bin/lvx-resolve-links (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
M LVX6048/config/powermon/powermon.yaml (added FWS poll @ 30s)
M LVX6048/config/powermon/powermon2.yaml (same)
M LVX6048/etc/systemd/system/powermon.service (StartLimitIntervalSec=0)
M LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf (Requires= -> Wants=)
M LVX6048/etc/systemd/system/powermon2.service (same)
M LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf (same)
M LVX6048/install.sh (installs the timer)
M LVX6048/powermon-patches/pi18.py (defensive fault-code decoder, 00-99)
A LVX6048/etc/systemd/system/lvx-resolve-links.timer (every 5 min)
A LVX6048/2026-04-28.md (this file)
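
The relink-on-change and best-effort behaviour noted for `lvx-resolve-links` could look roughly like the sketch below. It is an illustration under assumptions, not the script itself: the function names, the `found` mapping (link name to resolved device path, or `None` when an inverter did not answer), and the log wording are all hypothetical. The real resolver also restarts powermon when links move, which is only indicated here by the return value.

```python
import os
import sys
from pathlib import Path
from typing import Dict, Optional


def relink_if_changed(link: Path, target: str) -> bool:
    """Point `link` at `target`; return True only if the link actually moved."""
    try:
        if os.readlink(link) == target:
            return False  # already correct, so leave it untouched
    except OSError:
        pass  # link is missing or not a symlink: create it below
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear a stale temp link from an interrupted run
    os.symlink(target, tmp)
    os.replace(tmp, link)  # atomic rename over any existing link
    return True


def resolve_links(found: Dict[str, Optional[str]], link_dir: Path) -> bool:
    """Best-effort pass: relink what resolved, warn about the rest.

    Returns True if any link moved, i.e. the caller should restart powermon.
    """
    moved = False
    for name, target in found.items():
        if target is None:
            print(f"warning: {name}: not resolved, leaving as-is", file=sys.stderr)
            continue
        changed = relink_if_changed(link_dir / name, target)
        print(f"{name} -> {target} ({'updated' if changed else 'unchanged'})")
        moved = moved or changed
    return moved
```

A caller would restart the powermon services only when `resolve_links` returns `True`, and then exit 0 regardless of how many links resolved.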
How to verify
# Timer running?
systemctl list-timers lvx-resolve-links.timer
# Resolver idempotent? (no powermon bounce on no-change)
sudo systemctl restart lvx-resolve-links.service
journalctl -u lvx-resolve-links.service -n 8 --no-pager
# Expect: "(unchanged)" lines, no "restarting powermon" line.
# Wants= confirmed?
systemctl show powermon.service -p Wants -p Requires
# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
# include lvx-resolve-links.
# Synthetic morning-blip test (optional, disruptive):
sudo rm /dev/lvx6048-{1,2}
sudo systemctl restart powermon.service
# Expect: powermon enters Restart loop; within 5 min the timer fires the
# resolver, recreates symlinks, powermon's next retry succeeds.
# Fault polling alive
mosquitto_sub -h <broker> -u mqtt -P <pass> -W 35 -v \
-t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
-t 'homeassistant/sensor/lvx6048_2_fault_code/state'
# Expect: "No fault" published every 30 s on each unit.
Loose ends / next pass
- FWS warning booleans aren't getting individual HA entities. The 16 `enabled/disabled` flags (PV loss, over-temp, fan locked, …) are decoded by powermon's screen formatter but the hass formatter filters them because they lack `unit_of_measurement/device_class/reading_type` metadata. Worth a separate pass to expose them as `binary_sensor` entities so HA gets a real "PV loss" indicator alongside `fault_code`.
- LCD code format confirmation. Next time you can look at the LCD, check whether warning codes display as bare "32", "W32", or something else; that informs whether to add a friendly mapping for warning codes in the docs.
- Faster timer cadence? Currently 5 min (worst-case ~2 min gap). Could drop to 1–2 min if morning blips become more frequent. Trade-off: more steady-state resolver runs (each costs one PI18 ID query per inverter, harmless but visible in logs).
- eg4-battery discovery-config retention. The configs DID get purged at some point; not certain what triggered it. Next investigation: confirm whether `eg4-purge-orphans` is over-eager, whether the broker is dropping retains, or whether a daemon restart is needed. Worth a `mosquitto_sub` watch on the broker for retain changes.
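
For the first loose end, a Home Assistant MQTT discovery payload for one warning flag might be built as sketched below. This is a starting point under assumptions rather than a design decision: `binary_sensor_config` is a hypothetical helper, the state-topic layout is modelled on the existing `homeassistant/sensor/lvx6048_*_fault_code/state` topics, and `device_class: problem` is one plausible choice among several.

```python
import json


def binary_sensor_config(unit, flag, label):
    """Return (config_topic, json_payload) for one FWS warning flag."""
    object_id = f"{unit}_{flag}"
    base = f"homeassistant/binary_sensor/{object_id}"
    payload = {
        "name": label,
        "unique_id": object_id,
        "state_topic": f"{base}/state",  # assumed layout, mirrors the sensor topics
        "payload_on": "enabled",         # FWS flags decode to enabled/disabled
        "payload_off": "disabled",
        "device_class": "problem",       # assumption: "on" means something is wrong
    }
    return f"{base}/config", json.dumps(payload)


topic, payload = binary_sensor_config("lvx6048_1", "pv_loss_warning", "PV loss warning")
print(topic)
print(payload)
```

The config message would need to be published with the retain flag set, so it survives in the same way as the other discovery configs discussed above.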