# 2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh
Mostly resilience work today, prompted by waking up to find both powermon
services inactive and no PV / battery data flowing from the inverters since
~06:11. Root-caused, recovered, and hardened the failure mode so it can't
strand the stack again. While in there: closed two related polling gaps
(faults weren't being captured at all, and EG4 discovery configs had
fallen out of the broker — InfluxDB exporter had stopped seeing pack data).
## What happened
1. **Morning incident: powermon offline since 06:11.** udev fired the
resolver during what looks like a sunrise cold-start window — both
inverters were briefly unresponsive on USB-HID (`/dev/hidraw0`
returned no PI18 reply; `/dev/hidraw1` returned 22 zero bytes for the
serial query). Resolver hit its 10 s internal retry, gave up, exited
with code 2. Powermon services had `Requires=lvx-resolve-links` so
systemd refused to start them, and there was no retry path — stack
stayed dead until manual intervention ~10 h later.
2. **Recovery.** Re-ran the resolver manually; both inverters responded
normally on the second attempt. Started powermon services; PV
voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the
usual telemetry flowed back within seconds.
3. **Bulletproofing the resolver path.** Four coordinated changes so a
transient morning blip can't latch the whole stack offline:
a. **Resolver → always-exit-0.** Best-effort policy: write whatever
symlinks it can resolve, log warnings for the missing ones,
always return success. Only-relink-on-change semantics so a
no-op timer-driven run doesn't churn the symlinks. When symlinks
*do* move (post-power-cycle), resolver auto-restarts powermon
so its open hidraw fds re-bind.
b. **`Wants=` instead of `Requires=`** on both powermon services.
Resolver failure no longer propagates as "permanently failed"
to dependent units. Powermon's own `Restart=always` handles the
retry loop until `/dev/lvx6048-{1,2}` appear.
c. **Periodic systemd timer** for the resolver
(`lvx-resolve-links.timer`) — `OnBootSec=2min`,
`OnUnitActiveSec=5min`. Belt-and-suspenders against any
hot-plug case where udev didn't fire (stuck-USB scenario like
this morning's).
d. **`StartLimitIntervalSec=0`** on both powermon services — disables
systemd's default "stop trying after 5 fails in 10 s" rate-limiter,
so the retry loop continues indefinitely until success.
Worst-case data gap is now ~2 min (timer cycle + couple polls) vs.
the previous "until a human notices and runs systemctl restart".
4. **FWS fault polling added to powermon.** A separate gap surfaced
during the morning's troubleshooting: front-panel showed a "32"
indicator, and we discovered our `powermon.yaml` was polling
`GS / MOD / PIRI / ET` only — no `FWS`. Faults / warnings weren't
reaching MQTT at all. Added FWS (every 30 s) to both inverter
configs.
5. **Defensive FWS fault-code decoder.** Same shape as the MOD `06`/`07`
crashes from earlier in the week — an unknown fault code would
`KeyError` and crash powermon. Pre-populated the OPTION dict with
placeholder labels for codes 00-99, then layered the known names on
top. Future unknown codes show as `"Fault 32"` rather than crashing.
Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to
the shim later — same pattern.
6. **About the LCD's "32".** Direct FWS queries to both units returned
`fault_code=No fault` while `pv_loss_warning=enabled`. The LCD was
most likely showing the PV-loss warning code (no F/W prefix on this
firmware's display), correlating with the cold-start window. Not a
real fault — but worth re-confirming the LCD format next time so we
know whether 32-with-prefix is a thing on this hardware.
7. **EG4 lifepower discovery configs vanished.** Independent finding:
when checking why `lvx6048_*_fault_code` wasn't appearing in HA, I
noticed zero retained `homeassistant/sensor/lifepower4_*/config`
topics on the broker. HA had cached entity registrations and was
showing live state, but the canonical discovery configs were gone —
likely cleared by an earlier `eg4-purge-orphans` invocation or
broker retention loss. InfluxDB exporter relies on those configs to
know what entities to publish, so pack voltages / SoCs had stopped
reaching InfluxDB.
Fix: `systemctl restart eg4-battery.service`. Daemon re-publishes
all 207 retained discovery configs at startup (3 packs × ~69
entities). HA refreshes its registry, InfluxDB resumes capturing.
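
The defensive decoder from item 5 can be sketched as follows. This is a minimal illustration, not the actual patch in `pi18.py`: the names `KNOWN_FAULTS`, `build_fault_options`, and `decode_fault` are hypothetical, and the mapping of code `00` to "No fault" is an assumption based on the reading reported above.

```python
# Hypothetical names throughout; the real table lives in powermon's pi18.py.
KNOWN_FAULTS = {
    "00": "No fault",  # assumption: code 00 is the "No fault" reading
}


def build_fault_options(known: dict) -> dict:
    """Return a lookup that covers every two-digit code 00-99."""
    # Placeholders first, so every code has some label...
    options = {f"{code:02d}": f"Fault {code:02d}" for code in range(100)}
    # ...then the known names override the placeholders.
    options.update(known)
    return options


FAULT_OPTIONS = build_fault_options(KNOWN_FAULTS)


def decode_fault(raw: str) -> str:
    """Decode a raw fault field without ever raising KeyError."""
    code = raw.strip().zfill(2)
    # .get() is a second safety net for anything outside 00-99.
    return FAULT_OPTIONS.get(code, f"Fault {code}")
```

With this shape, `decode_fault("32")` yields `"Fault 32"` instead of crashing, and adding a real name later is a one-line change to the known-names dict.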
## Net effect
| Layer | Before | After |
|---|---|---|
| Morning-blip recovery | manual `systemctl restart` (~10 h gap) | automatic ~2 min gap |
| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
| Resolver runs | only on udev hot-plug | + every 5 min via timer |
| Powermon ↔ resolver coupling | `Requires=` (failure cascades) | `Wants=` (independent retry) |
| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (`StartLimitIntervalSec=0`) |
| Fault-code → MQTT | not polled | FWS every 30 s |
| Unknown fault code | `KeyError` crash | shows as `"Fault NN"` |
| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
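
The "always 0 (best-effort)" and relink-on-change rows could be implemented along these lines. This is a sketch under assumptions: the real `lvx-resolve-links` script is not shown in this log, and `relink_if_changed` and `resolve_all` are hypothetical helper names.

```python
import os
from typing import Optional


def relink_if_changed(link_path: str, target: str) -> bool:
    """Point link_path at target; return True only if something changed."""
    try:
        if os.readlink(link_path) == target:
            return False  # already correct: leave it alone ("unchanged")
    except OSError:
        pass  # missing, or not a symlink: fall through and (re)create it
    tmp = f"{link_path}.tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link_path)  # atomic rename over any existing link
    return True


def resolve_all(wanted: dict) -> tuple:
    """wanted maps link path -> resolved device path, or None if unresolved.

    Returns (any_link_moved, list_of_missing_links). Never raises for a
    missing device, so the caller can always exit 0.
    """
    moved = False
    missing = []
    for link, target in wanted.items():
        resolved: Optional[str] = target
        if resolved is None:
            missing.append(link)  # log a warning and carry on
            continue
        moved = relink_if_changed(link, resolved) or moved
    return moved, missing
```

A caller would restart powermon only when the first return value is true, log a warning for each missing link, and exit 0 either way.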
## Files touched
```
M LVX6048/bin/lvx-resolve-links (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
M LVX6048/config/powermon/powermon.yaml (added FWS poll @ 30s)
M LVX6048/config/powermon/powermon2.yaml (same)
M LVX6048/etc/systemd/system/powermon.service (StartLimitIntervalSec=0)
M LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf (Requires= -> Wants=)
M LVX6048/etc/systemd/system/powermon2.service (same)
M LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf (same)
M LVX6048/install.sh (installs the timer)
M LVX6048/powermon-patches/pi18.py (defensive fault-code decoder, 00-99)
A LVX6048/etc/systemd/system/lvx-resolve-links.timer (every 5 min)
A LVX6048/2026-04-28.md (this file)
```
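
For orientation, the systemd side of these changes might look roughly like the fragments below. This is a sketch, not a copy of the real files: only the directive names and values (`Wants=`, `StartLimitIntervalSec=0`, `Restart=always`, `OnBootSec=2min`, `OnUnitActiveSec=5min`) come from this log, while the section layout and the `[Install]` stanza are assumptions.

```ini
# powermon.service.d/10-resolver.conf (sketch)
[Unit]
Wants=lvx-resolve-links.service

# powermon.service (relevant lines only)
[Unit]
StartLimitIntervalSec=0

[Service]
Restart=always

# lvx-resolve-links.timer (sketch)
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min

[Install]
# Assumption: enabled through the standard timers target.
WantedBy=timers.target
```

A timer unit activates the service with the same base name by default, so no explicit `Unit=` line should be needed here.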
## How to verify
```bash
# Timer running?
systemctl list-timers lvx-resolve-links.timer
# Resolver idempotent? (no powermon bounce on no-change)
sudo systemctl restart lvx-resolve-links.service
journalctl -u lvx-resolve-links.service -n 8 --no-pager
# Expect: "(unchanged)" lines, no "restarting powermon" line.
# Wants= confirmed?
systemctl show powermon.service -p Wants -p Requires
# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
# include lvx-resolve-links.
# Synthetic morning-blip test (optional, disruptive):
sudo rm /dev/lvx6048-{1,2}
sudo systemctl restart powermon.service
# Expect: powermon enters Restart loop; within 5 min the timer fires the
# resolver, recreates symlinks, powermon's next retry succeeds.
# Fault polling alive
mosquitto_sub -h <broker> -u mqtt -P <pass> -W 35 -v \
-t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
-t 'homeassistant/sensor/lvx6048_2_fault_code/state'
# Expect: "No fault" published every 30 s on each unit.
```
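
The `Wants=` check above can also be scripted. The sketch below parses the `KEY=VALUE` lines that `systemctl show -p Wants -p Requires` prints; the function names are hypothetical, and the sample strings in the test are illustrative rather than captured output.

```python
def parse_show(output: str) -> dict:
    """Parse `systemctl show -p ...` KEY=VALUE lines into {key: set of units}."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '='
            props[key.strip()] = set(value.split())
    return props


def resolver_is_soft_dependency(
    output: str, resolver: str = "lvx-resolve-links.service"
) -> bool:
    """True if the resolver appears under Wants= and not under Requires=."""
    props = parse_show(output)
    wanted = resolver in props.get("Wants", set())
    required = resolver in props.get("Requires", set())
    return wanted and not required
```

Feed it the captured text of the `systemctl show` command from the block above; a `False` result means the drop-in did not take effect or a stale `Requires=` is still present.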
## Loose ends / next pass
- **FWS warning booleans aren't getting individual HA entities.** The 16
`enabled/disabled` flags (PV loss, over-temp, fan locked, …) are
decoded by powermon's screen formatter but the hass formatter filters
them because they lack `unit_of_measurement` / `device_class` /
`reading_type` metadata. Worth a separate pass to expose them as
`binary_sensor` entities so HA gets a real "PV loss" indicator
alongside `fault_code`.
- **LCD code format confirmation.** Next time you can look at the LCD,
check whether warning codes display as bare "32", "W32", or
something else — informs whether to add a friendly mapping for
warning codes in the docs.
- **Faster timer cadence?** Currently 5 min (worst-case ~2 min gap).
Could drop to 1-2 min if morning blips become more frequent. Trade-off:
more steady-state resolver runs (each costs one PI18 ID query
per inverter, harmless but visible in logs).
- **eg4-battery discovery-config retention.** The configs DID get
purged at some point — not certain what triggered it. Next
investigation: confirm whether `eg4-purge-orphans` is over-eager,
whether the broker is dropping retains, or whether a daemon restart
is needed. Worth a `mosquitto_sub` watch on the broker for retain
changes.
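
For the first loose end, a Home Assistant MQTT discovery payload for one warning flag might be built like this. It is a sketch under assumptions: `warning_discovery` is a hypothetical helper, the topic layout simply mirrors the existing `homeassistant/sensor/lvx6048_*_fault_code/state` naming, and the `problem` device class is one plausible choice rather than a decision made in this log.

```python
import json


def warning_discovery(unit: str, flag: str) -> tuple:
    """Build (config topic, JSON payload) for one FWS warning flag."""
    object_id = f"lvx6048_{unit}_{flag}"
    topic = f"homeassistant/binary_sensor/{object_id}/config"
    payload = {
        "name": flag.replace("_", " "),
        "unique_id": object_id,
        # Assumption: state is published next to the config topic.
        "state_topic": f"homeassistant/binary_sensor/{object_id}/state",
        # FWS flags decode to these two strings (see pv_loss_warning=enabled).
        "payload_on": "enabled",
        "payload_off": "disabled",
        "device_class": "problem",
    }
    return topic, json.dumps(payload)
```

Publishing the payload retained on the returned topic, once per flag per inverter, would give HA one `binary_sensor` for each of the 16 warnings.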