Resilience pass: bulletproof recovery + FWS / fault polling + EG4 discovery refresh

2026-04-28 18:10:00 -04:00
parent 04720c3b92
commit 7688fc1dd3
11 changed files with 267 additions and 16 deletions

LVX6048/2026-04-28.md Normal file

@@ -0,0 +1,170 @@
# 2026-04-28 — bulletproof recovery + FWS / fault polling + EG4 discovery refresh
Mostly resilience work today, prompted by waking up to find both powermon
services inactive and no PV / battery data flowing from the inverters since
~06:11. Root-caused, recovered, and hardened the failure mode so it can't
strand the stack again. While in there: closed two related polling gaps
(faults weren't being captured at all, and EG4 discovery configs had
fallen out of the broker — InfluxDB exporter had stopped seeing pack data).
## What happened
1. **Morning incident: powermon offline since 06:11.** udev fired the
resolver during what looks like a sunrise cold-start window — both
inverters were briefly unresponsive on USB-HID (`/dev/hidraw0`
returned no PI18 reply; `/dev/hidraw1` returned 22 zero bytes for the
serial query). Resolver hit its 10 s internal retry, gave up, exited
with code 2. Powermon services had `Requires=lvx-resolve-links` so
systemd refused to start them, and there was no retry path — stack
stayed dead until manual intervention ~10 h later.
2. **Recovery.** Re-ran the resolver manually; both inverters responded
normally on the second attempt. Started powermon services; PV
voltages (305.2 V / 307.2 V), battery voltages (54.0 V), and all the
usual telemetry flowed back within seconds.
3. **Bulletproofing the resolver path.** Four coordinated changes so a
transient morning blip can't latch the whole stack offline:
a. **Resolver → always-exit-0.** Best-effort policy: write whatever
symlinks it can resolve, log warnings for the missing ones,
always return success. Only-relink-on-change semantics so a
no-op timer-driven run doesn't churn the symlinks. When symlinks
*do* move (post-power-cycle), resolver auto-restarts powermon
so its open hidraw fds re-bind.
b. **`Wants=` instead of `Requires=`** on both powermon services.
Resolver failure no longer propagates as "permanently failed"
to dependent units. Powermon's own `Restart=always` handles the
retry loop until `/dev/lvx6048-{1,2}` appear.
c. **Periodic systemd timer** for the resolver
(`lvx-resolve-links.timer`): `OnBootSec=2min`,
`OnUnitActiveSec=5min`. Belt-and-suspenders for any case where
udev never fires, or fires while the inverter is still
unresponsive (this morning's scenario).
d. **`StartLimitIntervalSec=0`** on both powermon services — disables
systemd's default "stop trying after 5 fails in 10 s" rate-limiter,
so the retry loop continues indefinitely until success.
Worst-case data gap is now ~5 min (one timer cycle plus a couple of
polls) vs. the previous "until a human notices and runs systemctl restart".
4. **FWS fault polling added to powermon.** A separate gap surfaced
during the morning's troubleshooting: front-panel showed a "32"
indicator, and we discovered our `powermon.yaml` was polling
`GS / MOD / PIRI / ET` only — no `FWS`. Faults / warnings weren't
reaching MQTT at all. Added FWS (every 30 s) to both inverter
configs.
5. **Defensive FWS fault-code decoder.** Same shape as the MOD `06`/`07`
crashes from earlier in the week — an unknown fault code would
`KeyError` and crash powermon. Pre-populated the OPTION dict with
placeholder labels for codes 00-99, then layered the known names on
top. Future unknown codes show as `"Fault 32"` rather than crashing.
Also unblocks adding the four PI18 setters (PE / PD / POPM / PF) to
the shim later — same pattern.
6. **About the LCD's "32".** Direct FWS queries to both units returned
`fault_code=No fault` while `pv_loss_warning=enabled`. The LCD was
most likely showing the PV-loss warning code (no F/W prefix on this
firmware's display), correlating with the cold-start window. Not a
real fault — but worth re-confirming the LCD format next time so we
know whether 32-with-prefix is a thing on this hardware.
7. **EG4 lifepower discovery configs vanished.** Independent finding:
when checking why `lvx6048_*_fault_code` wasn't appearing in HA, I
noticed zero retained `homeassistant/sensor/lifepower4_*/config`
topics on the broker. HA had cached entity registrations and was
showing live state, but the canonical discovery configs were gone —
likely cleared by an earlier `eg4-purge-orphans` invocation or
broker retention loss. InfluxDB exporter relies on those configs to
know what entities to publish, so pack voltages / SoCs had stopped
reaching InfluxDB.
Fix: `systemctl restart eg4-battery.service`. Daemon re-publishes
all 207 retained discovery configs at startup (3 packs × ~69
entities). HA refreshes its registry, InfluxDB resumes capturing.
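
The placeholder-then-overlay pattern from item 5 can be sketched in isolation. This is a minimal illustration, not the actual `pi18.py` code: `KNOWN_FAULTS`, `FAULT_OPTIONS`, and `decode_fault` are hypothetical names, and only three known labels are included.

```python
# Hypothetical stand-in for the FWS fault-code option map in pi18.py.
KNOWN_FAULTS = {
    "00": "No fault",
    "01": "Fan is locked",
    "02": "Over temperature",
}

# Placeholders first, known names second: in a dict merge the later key wins,
# so every code from 00 to 99 has a label and known codes keep their real name.
FAULT_OPTIONS = {
    **{f"{i:02d}": f"Fault {i:02d}" for i in range(100)},
    **KNOWN_FAULTS,
}


def decode_fault(code: str) -> str:
    """Return a label for a fault code without raising on unknown values."""
    # .get() also covers malformed input that falls outside 00-99 entirely.
    return FAULT_OPTIONS.get(code, f"Fault {code}")


print(decode_fault("00"))
print(decode_fault("32"))
```

The ordering matters: if the known names came first, the placeholder comprehension would overwrite them.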
## Net effect
| Layer | Before | After |
|---|---|---|
| Morning-blip recovery | manual `systemctl restart` (~10 h gap) | automatic, ~5 min worst-case gap |
| Resolver exit on missing serial | code 2 (systemd marks unit failed) | always 0 (best-effort) |
| Resolver runs | only on udev hot-plug | + every 5 min via timer |
| Powermon ↔ resolver coupling | `Requires=` (failure cascades) | `Wants=` (independent retry) |
| Powermon retry budget | ~5 attempts (default StartLimit) | unbounded (`StartLimitIntervalSec=0`) |
| Fault-code → MQTT | not polled | FWS every 30 s |
| Unknown fault code | `KeyError` crash | shows as `"Fault NN"` |
| EG4 discovery configs | missing (purged) | 207 retained, InfluxDB unblocked |
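
The relink-on-change behaviour behind the resolver rows can be sketched on its own. This is a simplified illustration under assumptions: `relink_if_changed` is a hypothetical helper, the atomic temp-link swap is one common way to do it (the real `_relink` is not shown in this commit), and it compares the raw `readlink` value rather than a joined path.

```python
import os
import tempfile


def relink_if_changed(link: str, target: str) -> bool:
    """Point `link` at `target`; return True only if the symlink actually moved."""
    current = os.readlink(link) if os.path.islink(link) else None
    if current == target:
        return False  # steady state: leave the link (and its consumers) alone
    tmp = f"{link}.tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old link, atomic on POSIX
    return True


with tempfile.TemporaryDirectory() as d:
    link = os.path.join(d, "lvx6048-1")
    first = relink_if_changed(link, "/dev/hidraw0")   # link created
    second = relink_if_changed(link, "/dev/hidraw0")  # no-op run
    third = relink_if_changed(link, "/dev/hidraw1")   # index flipped
    print(first, second, third)
```

Only a `True` return would justify bouncing the dependent services, which is what keeps timer-driven runs quiet.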
## Files touched
```
M LVX6048/bin/lvx-resolve-links (always-exit-0; relink-on-change; auto-restart powermon if symlinks move)
M LVX6048/config/powermon/powermon.yaml (added FWS poll @ 30s)
M LVX6048/config/powermon/powermon2.yaml (same)
M LVX6048/etc/systemd/system/powermon.service (StartLimitIntervalSec=0)
M LVX6048/etc/systemd/system/powermon.service.d/10-resolver.conf (Requires= -> Wants=)
M LVX6048/etc/systemd/system/powermon2.service (same)
M LVX6048/etc/systemd/system/powermon2.service.d/10-resolver.conf (same)
M LVX6048/install.sh (installs the timer)
M LVX6048/powermon-patches/pi18.py (defensive fault-code decoder, 00-99)
A LVX6048/etc/systemd/system/lvx-resolve-links.timer (every 5 min)
A LVX6048/2026-04-28.md (this file)
```
## How to verify
```bash
# Timer running?
systemctl list-timers lvx-resolve-links.timer
# Resolver idempotent? (no powermon bounce on no-change)
sudo systemctl restart lvx-resolve-links.service
journalctl -u lvx-resolve-links.service -n 8 --no-pager
# Expect: "(unchanged)" lines, no "restarting powermon" line.
# Wants= confirmed?
systemctl show powermon.service -p Wants -p Requires
# Expect: Wants=...lvx-resolve-links.service... ; Requires= line should NOT
# include lvx-resolve-links.
# Synthetic morning-blip test (optional, disruptive):
sudo rm /dev/lvx6048-{1,2}
sudo systemctl restart powermon.service
# Expect: powermon enters Restart loop; within 5 min the timer fires the
# resolver, recreates symlinks, powermon's next retry succeeds.
# Fault polling alive
mosquitto_sub -h <broker> -u mqtt -P <pass> -W 35 -v \
-t 'homeassistant/sensor/lvx6048_1_fault_code/state' \
-t 'homeassistant/sensor/lvx6048_2_fault_code/state'
# Expect: "No fault" published every 30 s on each unit.
```
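
The `Wants=` check above can also be scripted. A minimal sketch, assuming the `Key=value value ...` line format that `systemctl show -p` prints; `parse_show`, `resolver_is_soft_dependency`, and the `SAMPLE` text are illustrative, not captured from the host.

```python
def parse_show(output: str) -> dict[str, set[str]]:
    """Parse `systemctl show -p ...` output into {property: {unit, ...}}."""
    props: dict[str, set[str]] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = set(value.split())
    return props


def resolver_is_soft_dependency(
    output: str, unit: str = "lvx-resolve-links.service"
) -> bool:
    """True when `unit` is listed under Wants= and absent from Requires=."""
    props = parse_show(output)
    return unit in props.get("Wants", set()) and unit not in props.get("Requires", set())


# Illustrative sample only; real output will list more units.
SAMPLE = "Requires=system.slice sysinit.target\nWants=lvx-resolve-links.service\n"
print(resolver_is_soft_dependency(SAMPLE))
```

On the host, the `output` argument would come from something like `subprocess.run(["systemctl", "show", "powermon.service", "-p", "Wants", "-p", "Requires"], capture_output=True, text=True).stdout`.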
## Loose ends / next pass
- **FWS warning booleans aren't getting individual HA entities.** The 16
`enabled/disabled` flags (PV loss, over-temp, fan locked, …) are
decoded by powermon's screen formatter but the hass formatter filters
them because they lack `unit_of_measurement` / `device_class` /
`reading_type` metadata. Worth a separate pass to expose them as
`binary_sensor` entities so HA gets a real "PV loss" indicator
alongside `fault_code`.
- **LCD code format confirmation.** Next time you can look at the LCD,
check whether warning codes display as bare "32", "W32", or
something else — informs whether to add a friendly mapping for
warning codes in the docs.
- **Faster timer cadence?** Currently 5 min (worst-case ~5 min gap).
Could drop to 1-2 min if morning blips become more frequent.
Trade-off: more steady-state resolver runs (each costs one PI18 ID query
per inverter, harmless but visible in logs).
- **eg4-battery discovery-config retention.** The configs DID get
purged at some point — not certain what triggered it. Next
investigation: confirm whether `eg4-purge-orphans` is over-eager,
whether the broker is dropping retains, or whether a daemon restart
is needed. Worth a `mosquitto_sub` watch on the broker for retain
changes.
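
For the first loose end, a Home Assistant MQTT discovery payload for one warning flag might look like the sketch below. Everything here is an assumption for illustration: `warning_discovery` is a hypothetical helper, the topic layout simply mirrors the existing `homeassistant/sensor/..._fault_code/state` pattern, and `device_class: problem` is one plausible choice. Powermon's hass formatter may need a different shape.

```python
import json


def warning_discovery(prefix: str, entity_prefix: str, flag: str) -> tuple[str, str]:
    """Build (config_topic, payload_json) for one FWS warning flag."""
    object_id = f"{entity_prefix}_{flag}"
    base = f"{prefix}/binary_sensor/{object_id}"
    payload = {
        "name": object_id,
        "unique_id": object_id,
        "state_topic": f"{base}/state",
        # FWS flags decode to "enabled"/"disabled", so map those to on/off.
        "payload_on": "enabled",
        "payload_off": "disabled",
        "device_class": "problem",
    }
    return f"{base}/config", json.dumps(payload)


topic, payload = warning_discovery("homeassistant", "lvx6048_1", "pv_loss_warning")
print(topic)
print(payload)
```

Publishing the payload retained on the config topic is what would make HA create the entity; the state topic then carries the `enabled`/`disabled` values.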


@@ -89,25 +89,44 @@ async def main() -> int:
             break
         await asyncio.sleep(0.5)
-    if not sn_to_path:
-        print("no LVX6048 devices found on /dev/hidraw*", file=sys.stderr)
-        return 1
-    missing = []
+    # Always-best-effort policy: exit 0 even if no expected serials were
+    # found, so a transient inverter blip (e.g. the unit being mid-cold-start
+    # at sunrise) doesn't permanently latch dependent services into a failed
+    # state via systemd `Requires=`. The periodic timer + powermon's own
+    # Restart=always handles convergence once the inverter recovers.
+    changed = False
     for sn, link in LINK_FOR_SERIAL.items():
         if sn in sn_to_path:
-            _relink(link, sn_to_path[sn])
-            print(f"symlink {link} -> {os.path.basename(sn_to_path[sn])}")
+            target = sn_to_path[sn]
+            current = os.readlink(link) if os.path.islink(link) else None
+            current_full = os.path.join(os.path.dirname(link), current) if current else None
+            if current_full != target:
+                _relink(link, target)
+                print(f"symlink {link} -> {os.path.basename(target)}")
+                changed = True
+            else:
+                print(f"symlink {link} -> {os.path.basename(target)} (unchanged)")
         else:
-            missing.append((link, sn))
-            try:
-                if os.path.islink(link):
+            if os.path.islink(link):
+                try:
                     os.unlink(link)
-            except FileNotFoundError:
-                pass
+                    changed = True
+                except FileNotFoundError:
+                    pass
+            print(f"WARNING: {link} serial {sn} not found on any /dev/hidraw*")
-    return 0 if not missing else 2
+    # If symlinks actually moved (e.g. hidraw indices flipped post-power-cycle),
+    # bounce powermon so its open hidraw fds re-bind to the right physical unit.
+    # Idempotent runs (no symlink change) leave powermon alone.
+    if changed:
+        print("symlinks changed — restarting powermon services")
+        import subprocess
+        subprocess.Popen(
+            ["/bin/systemctl", "--no-block", "restart",
+             "powermon.service", "powermon2.service"]
+        )
+    return 0
 if __name__ == "__main__":


@@ -66,3 +66,17 @@ commands:
           type: hass
           discovery_prefix: homeassistant
           entity_id_prefix: lvx6048_1
+  # Fault + warning bits (2-digit fault_code + 16 boolean warning flags).
+  # Polled every 30 s — fast enough to be useful in HA, slow enough not to
+  # flood the bus during normal "No fault" steady state.
+  - command: FWS
+    trigger:
+      every: 30
+    outputs:
+      - type: mqtt
+        topic: powermon/lvx6048_1/FWS
+        format:
+          type: hass
+          discovery_prefix: homeassistant
+          entity_id_prefix: lvx6048_1


@@ -63,3 +63,17 @@ commands:
type: hass type: hass
discovery_prefix: homeassistant discovery_prefix: homeassistant
entity_id_prefix: lvx6048_2 entity_id_prefix: lvx6048_2
# Fault + warning bits (2-digit fault_code + 16 boolean warning flags).
# Polled every 30 s — fast enough to be useful in HA, slow enough not to
# flood the bus during normal "No fault" steady state.
- command: FWS
trigger:
every: 30
outputs:
- type: mqtt
topic: powermon/lvx6048_2/FWS
format:
type: hass
discovery_prefix: homeassistant
entity_id_prefix: lvx6048_2


@@ -0,0 +1,15 @@
[Unit]
Description=Periodic re-resolve of LVX6048 hidraw symlinks (transient-failure backstop)
# Belt-and-suspenders against the udev RUN+= path: if the inverter is still
# enumerating at hot-plug time and the resolver finds no PI18-responsive
# device, this timer keeps trying every few minutes. The resolver itself is
# idempotent — runs that don't change anything are no-ops.
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
AccuracySec=15s
Unit=lvx-resolve-links.service
[Install]
WantedBy=timers.target


@@ -10,6 +10,11 @@ Group=dialout
 ExecStart=/home/noise/.local/bin/powermon -C /home/noise/.config/powermon/powermon.yaml
 Restart=always
 RestartSec=10
+# Disable systemd's default rate-limiter: when the inverter is briefly
+# unreachable (sunrise cold-start / USB hiccup), we want to keep retrying
+# every 10s indefinitely — not give up after 5 failures in 10s like the
+# default StartLimitBurst=5 / StartLimitIntervalSec=10s would.
+StartLimitIntervalSec=0
 [Install]
 WantedBy=multi-user.target


@@ -1,3 +1,7 @@
 [Unit]
+# Wants= instead of Requires=: a failed/inactive resolver should NOT latch
+# powermon into a permanently-failed state. powermon's own Restart=always
+# handles the retry loop until /dev/lvx6048-1 reappears (e.g. after the
+# resolve-links periodic timer next runs successfully).
 After=lvx-resolve-links.service
-Requires=lvx-resolve-links.service
+Wants=lvx-resolve-links.service


@@ -10,6 +10,8 @@ Group=dialout
 ExecStart=/home/noise/.local/bin/powermon -C /home/noise/.config/powermon/powermon2.yaml
 Restart=always
 RestartSec=10
+# Disable systemd's default rate-limiter (see powermon.service for rationale).
+StartLimitIntervalSec=0
 [Install]
 WantedBy=multi-user.target


@@ -1,3 +1,7 @@
 [Unit]
+# Wants= instead of Requires=: a failed/inactive resolver should NOT latch
+# powermon into a permanently-failed state. powermon's own Restart=always
+# handles the retry loop until /dev/lvx6048-2 reappears (e.g. after the
+# resolve-links periodic timer next runs successfully).
 After=lvx-resolve-links.service
-Requires=lvx-resolve-links.service
+Wants=lvx-resolve-links.service


@@ -61,6 +61,7 @@ sudo install -m 755 "$TMP_RESOLVER" /usr/local/sbin/lvx-resolve-links
 # --- 5. systemd units + drop-ins -------------------------------------------
 msg "Installing systemd units"
 sudo install -m 644 "${BASE}/etc/systemd/system/lvx-resolve-links.service" /etc/systemd/system/
+sudo install -m 644 "${BASE}/etc/systemd/system/lvx-resolve-links.timer" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/lvx-control.service" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/powermon.service" /etc/systemd/system/
 sudo install -m 644 "${BASE}/etc/systemd/system/powermon2.service" /etc/systemd/system/
@@ -94,7 +95,7 @@ done
 # --- 7. enable services ----------------------------------------------------
 msg "Enabling services"
-sudo systemctl enable --now lvx-resolve-links.service
+sudo systemctl enable --now lvx-resolve-links.service lvx-resolve-links.timer
 # Only auto-start powermon services if credentials look real (no placeholder present)
 if grep -q '<MQTT_PASSWORD>' "${HOME}/.config/powermon/powermon.yaml" 2>/dev/null; then
     cat <<EOF


@@ -461,9 +461,12 @@ QUERY_COMMANDS = {
         "command_type": CommandType.PI18_QUERY,
         "result_type": ResultType.COMMA_DELIMITED,
         "reading_definitions": [
+            # Defensive: any 2-digit code outside this map shows as
+            # "Fault NN" rather than crashing the entire poller.
             {"description": "Fault code", "reading_type": ReadingType.MESSAGE,
              "response_type": ResponseType.OPTION,
              "options": {
+                 **{f"{i:02d}": f"Fault {i:02d}" for i in range(0, 100)},
                  "00": "No fault",
                  "01": "Fan is locked",
                  "02": "Over temperature",