2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh

Mostly resilience work today, prompted by waking up to find both powermon services inactive and no PV / battery data flowing from the inverters since ~06:11. Root-caused, recovered, and hardened the failure mode so it can't strand the stack again. While in there: closed two related polling gaps (faults weren't being captured at all, and EG4 discovery configs had fallen out of the broker — InfluxDB exporter had stopped seeing pack data).

What happened

  1. Morning incident: powermon offline since 06:11. udev fired the resolver during what looks like a sunrise cold-start window — both inverters were briefly unresponsive on USB-HID (/dev/hidraw0 returned no PI18 reply; /dev/hidraw1 returned 22 zero bytes for the serial query). Resolver hit its 10 s internal retry, gave up, exited with code 2. Powermon services had Requires=lvx-resolve-links so systemd refused to start them, and there was no retry path — stack stayed dead until manual intervention ~10 h later.

  2. Recovery. Re-ran the resolver manually; both inverters responded normally on the second attempt. Started powermon services; PV voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the usual telemetry flowed back within seconds.

  3. Bulletproofing the resolver path. Four coordinated changes so a transient morning blip can't latch the whole stack offline:

    a. Resolver → always-exit-0. Best-effort policy: write whatever symlinks it can resolve, log warnings for the missing ones, always return success. Only-relink-on-change semantics so a no-op timer-driven run doesn't churn the symlinks. When symlinks do move (post-power-cycle), resolver auto-restarts powermon so its open hidraw fds re-bind.

    b. Wants= instead of Requires= on both powermon services. Resolver failure no longer propagates as "permanently failed" to dependent units. Powermon's own Restart=always handles the retry loop until /dev/lvx6048-{1,2} appear.

    c. Periodic systemd timer for the resolver (lvx-resolve-links.timer) — OnBootSec=2min, OnUnitActiveSec=5min. Belt-and-suspenders against any hot-plug case where udev didn't fire (stuck-USB scenario like this morning's).

    d. StartLimitIntervalSec=0 on both powermon services — disables systemd's default "stop trying after 5 fails in 10 s" rate-limiter, so the retry loop continues indefinitely until success.

    Worst-case data gap is now ~2 min (timer cycle + couple polls) vs. the previous "until a human notices and runs systemctl restart".
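
The relink-on-change half of (a) can be sketched like this (a hypothetical simplification — the real `lvx-resolve-links` also resolves serial numbers via the PI18 ID query, omitted here; `relink` and `CHANGED` are illustrative names):

```bash
#!/usr/bin/env bash
# Best-effort relink sketch: never fail the unit, never churn an
# already-correct symlink, flag when a symlink actually moved.
relink() {                      # relink <resolved-device-path> <symlink-path>
    local target=$1 link=$2
    if [ -z "$target" ]; then
        echo "WARN: could not resolve $link; leaving as-is" >&2
        return 0                # best-effort: a missing device is not fatal
    fi
    if [ "$(readlink "$link" 2>/dev/null)" = "$target" ]; then
        echo "$link -> $target (unchanged)"
        return 0                # no-op timer run: don't churn the symlink
    fi
    ln -sfn "$target" "$link"
    echo "$link -> $target (relinked)"
    CHANGED=1                   # per (a): caller restarts powermon when set
}
# Per the policy above, the real script would restart powermon only when
# CHANGED=1, then unconditionally `exit 0` so systemd never marks it failed.
```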

  4. FWS fault polling added to powermon. A separate gap surfaced during the morning's troubleshooting: front-panel showed a "32" indicator, and we discovered our powermon.yaml was polling GS / MOD / PIRI / ET only — no FWS. Faults / warnings weren't reaching MQTT at all. Added FWS (every 30 s) to both inverter configs.
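
The added poll looks roughly like this (shape assumed to match the existing GS/MOD/PIRI/ET entries; field names may differ across powermon versions, so treat this as a sketch, not the exact powermon.yaml syntax):

```yaml
# powermon.yaml fragment (hypothetical shape): poll FWS alongside GS/MOD/PIRI/ET
commands:
  - command: FWS        # PI18 fault/warning status query
    trigger:
      every: 30         # seconds
    outputs:
      - type: mqtt      # publishes fault_code + warning flags to the broker
```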

  5. Defensive FWS fault-code decoder. Same shape as the MOD 06/07 crashes from earlier in the week — an unknown fault code would KeyError and crash powermon. Pre-populated the OPTION dict with placeholder labels for codes 00-99, then layered the known names on top. Future unknown codes show as "Fault 32" rather than crashing. Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to the shim later — same pattern.
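
The decoder pattern can be sketched as follows (label table abridged and names illustrative — the real dict is powermon's OPTION table in powermon-patches/pi18.py):

```python
# Defensive decoder sketch: pre-seed every two-digit code with a placeholder
# label, then overlay the known names, so an unseen code maps to "Fault NN"
# instead of raising KeyError and crashing the service.
KNOWN_FAULTS = {          # abridged subset for illustration only
    "00": "No fault",
    "01": "Fan locked",
}

# Placeholders for all of 00-99, with known names layered on top.
FAULT_LABELS = {f"{n:02d}": f"Fault {n:02d}" for n in range(100)}
FAULT_LABELS.update(KNOWN_FAULTS)

def decode_fault(code: str) -> str:
    # .get() fallback is belt-and-suspenders for codes outside 00-99.
    return FAULT_LABELS.get(code, f"Fault {code}")
```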

  6. About the LCD's "32". Direct FWS queries to both units returned fault_code=No fault while pv_loss_warning=enabled. The LCD was most likely showing the PV-loss warning code (no F/W prefix on this firmware's display), correlating with the cold-start window. Not a real fault — but worth re-confirming the LCD format next time so we know whether 32-with-prefix is a thing on this hardware.

  7. EG4 lifepower discovery configs vanished. Independent finding: when checking why lvx6048_*_fault_code wasn't appearing in HA, I noticed zero retained homeassistant/sensor/lifepower4_*/config topics on the broker. HA had cached entity registrations and was showing live state, but the canonical discovery configs were gone — likely cleared by an earlier eg4-purge-orphans invocation or broker retention loss. InfluxDB exporter relies on those configs to know what entities to publish, so pack voltages / SoCs had stopped reaching InfluxDB.

    Fix: systemctl restart eg4-battery.service. Daemon re-publishes all 207 retained discovery configs at startup (3 packs × 69 entities). HA refreshes its registry, InfluxDB resumes capturing.

Net effect

| Layer | Before | After |
|---|---|---|
| Morning-blip recovery | manual `systemctl restart` (~10 h gap) | automatic, ~2 min gap |
| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
| Resolver runs | only on udev hot-plug | udev hot-plug + every 5 min via timer |
| Powermon ↔ resolver coupling | `Requires=` (failure cascades) | `Wants=` (independent retry) |
| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (`StartLimitIntervalSec=0`) |
| Fault-code → MQTT | not polled | FWS every 30 s |
| Unknown fault code | KeyError crash | shows as "Fault NN" |
| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
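
The unit-file side of those changes might look like this (unit names from Files touched; exact option placement is an assumption — on recent systemd, `StartLimitIntervalSec=` belongs in `[Unit]`):

```ini
# powermon.service.d/10-resolver.conf — soft dependency instead of Requires=
[Unit]
Wants=lvx-resolve-links.service
After=lvx-resolve-links.service
StartLimitIntervalSec=0

[Service]
Restart=always

# lvx-resolve-links.timer — periodic belt-and-suspenders resolver run
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
WantedBy=timers.target
```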

Files touched

```
M  LVX6048/bin/lvx-resolve-links                                    (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
M  LVX6048/config/powermon/powermon.yaml                            (added FWS poll @ 30s)
M  LVX6048/config/powermon/powermon2.yaml                           (same)
M  LVX6048/etc/systemd/system/powermon.service                      (StartLimitIntervalSec=0)
M  LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf   (Requires= -> Wants=)
M  LVX6048/etc/systemd/system/powermon2.service                     (same)
M  LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf  (same)
M  LVX6048/install.sh                                               (installs the timer)
M  LVX6048/powermon-patches/pi18.py                                 (defensive fault-code decoder, 00-99)
A  LVX6048/etc/systemd/system/lvx-resolve-links.timer               (every 5 min)
A  LVX6048/2026-04-28.md                                            (this file)
```

How to verify

```bash
# Timer running?
systemctl list-timers lvx-resolve-links.timer

# Resolver idempotent? (no powermon bounce on no-change)
sudo systemctl restart lvx-resolve-links.service
journalctl -u lvx-resolve-links.service -n 8 --no-pager
# Expect: "(unchanged)" lines, no "restarting powermon" line.

# Wants= confirmed?
systemctl show powermon.service -p Wants -p Requires
# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
# include lvx-resolve-links.

# Synthetic morning-blip test (optional, disruptive):
sudo rm /dev/lvx6048-{1,2}
sudo systemctl restart powermon.service
# Expect: powermon enters Restart loop; within 5 min the timer fires the
# resolver, recreates symlinks, powermon's next retry succeeds.

# Fault polling alive
mosquitto_sub -h <broker> -u mqtt -P <pass> -W 35 -v \
    -t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
    -t 'homeassistant/sensor/lvx6048_2_fault_code/state'
# Expect: "No fault" published every 30 s on each unit.
```

Loose ends / next pass

  • FWS warning booleans aren't getting individual HA entities. The 16 enabled/disabled flags (PV loss, over-temp, fan locked, …) are decoded by powermon's screen formatter but the hass formatter filters them because they lack unit_of_measurement / device_class / reading_type metadata. Worth a separate pass to expose them as binary_sensor entities so HA gets a real "PV loss" indicator alongside fault_code.
  • LCD code format confirmation. Next time you can look at the LCD, check whether warning codes display as bare "32", "W32", or something else — informs whether to add a friendly mapping for warning codes in the docs.
  • Faster timer cadence? Currently 5 min (worst-case ~2 min gap). Could drop to 1-2 min if morning blips become more frequent. Trade-off: more steady-state resolver runs (each costs one PI18 ID query per inverter, harmless but visible in logs).
  • eg4-battery discovery-config retention. The configs DID get purged at some point — not certain what triggered it. Next investigation: confirm whether eg4-purge-orphans is over-eager, whether the broker is dropping retains, or whether a daemon restart is needed. Worth a mosquitto_sub watch on the broker for retain changes.
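
For the first loose end above, a retained HA MQTT-discovery payload for one of the warning flags might look like this (topic, names, and unique_id are hypothetical; the payload keys follow Home Assistant's MQTT binary_sensor discovery schema):

```json
{
  "name": "LVX6048-1 PV loss",
  "state_topic": "lvx6048/1/FWS/pv_loss_warning",
  "payload_on": "enabled",
  "payload_off": "disabled",
  "device_class": "problem",
  "unique_id": "lvx6048_1_pv_loss"
}
```

Published retained to a config topic such as homeassistant/binary_sensor/lvx6048_1_pv_loss/config, this would give HA a real "PV loss" problem indicator without touching the hass formatter's metadata filter.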