# 2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh

Mostly resilience work today, prompted by waking up to find both powermon
services inactive and no PV / battery data flowing from the inverters since
~06:11. Root-caused, recovered, and hardened the failure mode so it can't
strand the stack again. While in there: closed two related polling gaps
(faults weren't being captured at all, and EG4 discovery configs had
fallen out of the broker, so the InfluxDB exporter had stopped seeing pack
data).

## What happened

1. **Morning incident: powermon offline since 06:11.** udev fired the
   resolver during what looks like a sunrise cold-start window: both
   inverters were briefly unresponsive on USB-HID (`/dev/hidraw0`
   returned no PI18 reply; `/dev/hidraw1` returned 22 zero bytes for the
   serial query). Resolver hit its 10 s internal retry, gave up, and exited
   with code 2. Powermon services had `Requires=lvx-resolve-links`, so
   systemd refused to start them, and there was no retry path. The stack
   stayed dead until manual intervention ~10 h later.

2. **Recovery.** Re-ran the resolver manually; both inverters responded
   normally on the second attempt. Started powermon services; PV
   voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the
   usual telemetry flowed back within seconds.

3. **Bulletproofing the resolver path.** Four coordinated changes so a
   transient morning blip can't latch the whole stack offline:

   a. **Resolver → always-exit-0.** Best-effort policy: write whatever
      symlinks it can resolve, log warnings for the missing ones,
      always return success. Only-relink-on-change semantics so a
      no-op timer-driven run doesn't churn the symlinks. When symlinks
      *do* move (post-power-cycle), resolver auto-restarts powermon
      so its open hidraw fds re-bind.

   b. **`Wants=` instead of `Requires=`** on both powermon services.
      Resolver failure no longer propagates as "permanently failed"
      to dependent units. Powermon's own `Restart=always` handles the
      retry loop until `/dev/lvx6048-{1,2}` appear.

   c. **Periodic systemd timer** for the resolver
      (`lvx-resolve-links.timer`): `OnBootSec=2min`,
      `OnUnitActiveSec=5min`. Belt-and-suspenders against any
      hot-plug case where udev didn't fire (stuck-USB scenario like
      this morning's).

   d. **`StartLimitIntervalSec=0`** on both powermon services. Disables
      systemd's default "stop trying after 5 fails in 10 s" rate-limiter,
      so the retry loop continues indefinitely until success.

   Worst-case data gap is now ~2 min (timer cycle + a couple of polls) vs.
   the previous "until a human notices and runs systemctl restart".

4. **FWS fault polling added to powermon.** A separate gap surfaced
   during the morning's troubleshooting: the front panel showed a "32"
   indicator, and we discovered our `powermon.yaml` was polling
   `GS / MOD / PIRI / ET` only, with no `FWS`. Faults / warnings weren't
   reaching MQTT at all. Added FWS (every 30 s) to both inverter
   configs.

5. **Defensive FWS fault-code decoder.** Same shape as the MOD `06`/`07`
   crashes from earlier in the week: an unknown fault code would raise
   `KeyError` and crash powermon. Pre-populated the OPTION dict with
   placeholder labels for codes 00–99, then layered the known names on
   top. Future unknown codes show as `"Fault 32"` rather than crashing.
   Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to
   the shim later, using the same pattern.

6. **About the LCD's "32".** Direct FWS queries to both units returned
   `fault_code=No fault` while `pv_loss_warning=enabled`. The LCD was
   most likely showing the PV-loss warning code (no F/W prefix on this
   firmware's display), correlating with the cold-start window. Not a
   real fault, but worth re-confirming the LCD format next time so we
   know whether 32-with-prefix is a thing on this hardware.

7. **EG4 lifepower discovery configs vanished.** Independent finding:
   when checking why `lvx6048_*_fault_code` wasn't appearing in HA, I
   noticed zero retained `homeassistant/sensor/lifepower4_*/config`
   topics on the broker. HA had cached entity registrations and was
   showing live state, but the canonical discovery configs were gone,
   likely cleared by an earlier `eg4-purge-orphans` invocation or
   broker retention loss. The InfluxDB exporter relies on those configs to
   know what entities to publish, so pack voltages / SoCs had stopped
   reaching InfluxDB.

   Fix: `systemctl restart eg4-battery.service`. The daemon re-publishes
   all 207 retained discovery configs at startup (3 packs × ~69
   entities). HA refreshes its registry, InfluxDB resumes capturing.
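
The resolver policy from item 3a (best-effort, relink only on change, report whether anything moved) can be sketched roughly as below. This is a minimal illustration, not the contents of `lvx-resolve-links`: the function names, the `wanted` mapping, and the temp-file rename are all assumptions made for the example.

```python
import os
from pathlib import Path


def relink_if_changed(link: Path, target: str) -> bool:
    """Point `link` at `target`; return True only if the link actually moved."""
    if link.is_symlink() and os.readlink(link) == target:
        return False  # already correct: leave it untouched (no churn)
    # Build the new link under a temporary name, then rename it over the old
    # one so readers never observe a missing link.
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    os.replace(tmp, link)
    return True


def resolve_links(wanted: dict, link_dir: Path) -> tuple:
    """Best-effort pass over {link_name: target_or_None} (hypothetical shape).

    Returns (changed, missing). An unresolved device is recorded and skipped,
    never raised, so the caller can always exit 0.
    """
    changed, missing = [], []
    for name, target in wanted.items():
        if target is None:
            missing.append(name)  # caller logs a warning for these
            continue
        if relink_if_changed(link_dir / name, target):
            changed.append(name)
    return changed, missing
```

A caller would log `missing`, restart powermon only when `changed` is non-empty, and then exit 0 regardless, which is what keeps a no-op timer run from bouncing the services.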

## Net effect

| Layer | Before | After |
|---|---|---|
| Morning-blip recovery | manual `systemctl restart` (~10 h gap) | automatic ~2 min gap |
| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
| Resolver runs | only on udev hot-plug | + every 5 min via timer |
| Powermon ↔ resolver coupling | `Requires=` (failure cascades) | `Wants=` (independent retry) |
| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (`StartLimitIntervalSec=0`) |
| Fault-code → MQTT | not polled | FWS every 30 s |
| Unknown fault code | `KeyError` crash | shows as `"Fault NN"` |
| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
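
The "Unknown fault code" row comes down to a lookup table that cannot miss. A minimal sketch of the pattern, assuming two-digit string codes: the names `KNOWN_FAULTS`, `build_fault_options`, and `decode_fault` are made up for illustration, and the single known entry is a placeholder rather than the real table in `pi18.py`.

```python
# Placeholder entry only; the real code-to-name table lives in pi18.py.
# Mapping "00" to "No fault" is an assumption for this sketch.
KNOWN_FAULTS = {
    "00": "No fault",
}


def build_fault_options(known: dict) -> dict:
    """Pre-populate 00-99 with generic labels, then layer known names on top."""
    options = {f"{code:02d}": f"Fault {code:02d}" for code in range(100)}
    options.update(known)
    return options


FAULT_OPTIONS = build_fault_options(KNOWN_FAULTS)


def decode_fault(raw: str) -> str:
    """Decode a fault code without ever raising KeyError."""
    # .get() with a fallback also covers values outside 00-99.
    return FAULT_OPTIONS.get(raw, f"Fault {raw}")
```

With this shape, `decode_fault("32")` yields `"Fault 32"` until someone adds a proper name for code 32 to the known table.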

## Files touched

```
M  LVX6048/bin/lvx-resolve-links                                      (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
M  LVX6048/config/powermon/powermon.yaml                              (added FWS poll @ 30s)
M  LVX6048/config/powermon/powermon2.yaml                             (same)
M  LVX6048/etc/systemd/system/powermon.service                        (StartLimitIntervalSec=0)
M  LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf     (Requires= -> Wants=)
M  LVX6048/etc/systemd/system/powermon2.service                       (same)
M  LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf    (same)
M  LVX6048/install.sh                                                 (installs the timer)
M  LVX6048/powermon-patches/pi18.py                                   (defensive fault-code decoder, 00-99)
A  LVX6048/etc/systemd/system/lvx-resolve-links.timer                 (every 5 min)
A  LVX6048/2026-04-28.md                                              (this file)
```

## How to verify

```bash
# Timer running?
systemctl list-timers lvx-resolve-links.timer

# Resolver idempotent? (no powermon bounce on no-change)
sudo systemctl restart lvx-resolve-links.service
journalctl -u lvx-resolve-links.service -n 8 --no-pager
# Expect: "(unchanged)" lines, no "restarting powermon" line.

# Wants= confirmed?
systemctl show powermon.service -p Wants -p Requires
# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
# include lvx-resolve-links.

# Synthetic morning-blip test (optional, disruptive):
sudo rm /dev/lvx6048-{1,2}
sudo systemctl restart powermon.service
# Expect: powermon enters Restart loop; within 5 min the timer fires the
# resolver, recreates symlinks, powermon's next retry succeeds.

# Fault polling alive
mosquitto_sub -h <broker> -u mqtt -P <pass> -W 35 -v \
  -t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
  -t 'homeassistant/sensor/lvx6048_2_fault_code/state'
# Expect: "No fault" published every 30 s on each unit.
```
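
If the `Wants=` check ever needs to run unattended, the `Key=value` output of `systemctl show` is easy to parse. The following is a rough sketch under that assumption; `parse_show` and `resolver_is_soft_dependency` are hypothetical helper names, not part of this repo.

```python
def parse_show(output: str) -> dict:
    """Parse `systemctl show -p ...` output into {property: [unit, ...]}."""
    props = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip anything that is not a Key=value line
            props[key] = value.split()
    return props


def resolver_is_soft_dependency(
    output: str, resolver: str = "lvx-resolve-links.service"
) -> bool:
    """True when the resolver is listed under Wants= and not under Requires=."""
    props = parse_show(output)
    return (
        resolver in props.get("Wants", [])
        and resolver not in props.get("Requires", [])
    )
```

The `output` argument would be the captured stdout of `systemctl show powermon.service -p Wants -p Requires`, and the same check applies to `powermon2.service`.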

## Loose ends / next pass

- **FWS warning booleans aren't getting individual HA entities.** The 16
  `enabled/disabled` flags (PV loss, over-temp, fan locked, …) are
  decoded by powermon's screen formatter but the hass formatter filters
  them because they lack `unit_of_measurement` / `device_class` /
  `reading_type` metadata. Worth a separate pass to expose them as
  `binary_sensor` entities so HA gets a real "PV loss" indicator
  alongside `fault_code`.
- **LCD code format confirmation.** Next time you can look at the LCD,
  check whether warning codes display as bare "32", "W32", or
  something else; this informs whether to add a friendly mapping for
  warning codes in the docs.
- **Faster timer cadence?** Currently 5 min (worst-case ~2 min gap).
  Could drop to 1–2 min if morning blips become more frequent.
  Trade-off: more steady-state resolver runs (each costs one PI18 ID query
  per inverter, harmless but visible in logs).
- **eg4-battery discovery-config retention.** The configs DID get
  purged at some point; not certain what triggered it. Next
  investigation: confirm whether `eg4-purge-orphans` is over-eager,
  whether the broker is dropping retains, or whether a daemon restart
  is needed. Worth a `mosquitto_sub` watch on the broker for retain
  changes.
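
For the first loose end, one possible shape for the `binary_sensor` discovery messages is sketched below. It is a starting point only: the state-topic layout, the `problem` device class, and the `warning_discovery` helper are assumptions, and powermon's hass formatter may want something different.

```python
import json


def warning_discovery(unit: int, flag: str) -> tuple:
    """Build (topic, payload) for one FWS warning flag as an HA binary_sensor.

    `flag` is the decoded field name, e.g. "pv_loss_warning". The topic layout
    mirrors the existing sensor topics but is assumed, not taken from powermon.
    """
    object_id = f"lvx6048_{unit}_{flag}"
    topic = f"homeassistant/binary_sensor/{object_id}/config"
    payload = {
        "name": flag.replace("_", " "),
        "unique_id": object_id,
        "state_topic": f"homeassistant/binary_sensor/{object_id}/state",
        # FWS flags decode to the strings "enabled" / "disabled".
        "payload_on": "enabled",
        "payload_off": "disabled",
        "device_class": "problem",
    }
    return topic, json.dumps(payload)
```

Each config message would need to be published with the retain flag set, like the existing discovery configs, so HA can pick it up after a restart.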