shaggy-solar/.claude/skills/solar-health-check/SKILL.md
noise aa97d65b0c Add solar monitoring/troubleshooting skills for agents
Four Skill-tool skills under .claude/skills/ that let an agent monitor and
troubleshoot the install (2x LVX6048, 6x EG4 LifePower4, OpenEVSE), grounded
in the real MQTT/HA topology rather than generic advice:

- solar-health-check  : whole-system sweep + cross-checks + R/Y/G verdict,
                        incl. cross-unit "silently-dead inverter" detection
- troubleshoot-inverter: FWS fault decode, parallel sync, USB link recovery
- troubleshoot-battery : per-pack imbalance vs SoC-counter-drift, RS485 silence
- power-usage         : PV/load/grid/battery balance + EVSE sessions

Shared lib:
- solar-snapshot : live MQTT capture (creds from powermon.yaml, no hardcoding)
- ha-history     : HA recorder lookback (token from ~/.config/ha/token)
REFERENCE.md documents topology, real HA entity_ids (doubled slug), known
issues, and a safe-remediation-only action policy (restarts yes; setters no).

Action boundary: diagnose + restart wedged daemons / recover USB links;
never touches inverter/battery setters or flash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 11:46:20 -04:00

---
name: solar-health-check
description: >-
  Top-level health snapshot of the whole solar/power install (2 LVX6048
  inverters, 6 EG4 LifePower4 packs, and the OpenEVSE charger) with cross-checks
  and a green/yellow/red verdict. Use when the user asks "how's the solar /
  battery / power system doing", "is everything ok", "check the install", wants a
  status report, or as the first step before deeper troubleshooting. For deep
  dives into one subsystem, hand off to troubleshoot-inverter, troubleshoot-battery,
  or power-usage.
---
# solar-health-check
A fast, read-only sweep of every subsystem that ends in a clear verdict. Do NOT
change settings here; if something needs a restart, that's allowed (see policy).
## 0. Load context
Skills run with the shell cwd at the repo root, so anchor paths there:
```bash
ROOT="$(git rev-parse --show-toplevel)"; SNAP="$ROOT/.claude/skills/lib/solar-snapshot"; HIST="$ROOT/.claude/skills/lib/ha-history"
```
Read `$ROOT/.claude/skills/REFERENCE.md` (system map, entity names, snapshot helper,
action policy) before proceeding.
## 1. Services up?
```bash
systemctl is-active powermon.service powermon2.service eg4-battery.service lvx-control.service lvx-resolve-links.service
```
`lvx-resolve-links` is a oneshot → expect `active` with sub-state `exited`
(`is-active` itself prints only `active`), never `failed`. Any
`failed`/`inactive` on the others is RED. For a wedged data-plane daemon, a
restart is allowed (see §6).
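The rule above can be sketched as a small classifier. This is a minimal sketch, not part of the skill: the function names `classify_service` and `check_services` are hypothetical, and treating `inactive` as acceptable for the oneshot (a oneshot without `RemainAfterExit=` goes back to `inactive` once it finishes) and mapping transitional states to YELLOW are both assumptions.

```bash
# Map one `systemctl is-active` result to a verdict colour.
# Usage: classify_service <unit> <state>
classify_service() {
  unit=$1 state=$2
  case "$unit:$state" in
    *:active) echo GREEN ;;
    # Assumption: the oneshot may legitimately read `inactive` after it has run.
    lvx-resolve-links.service:inactive) echo GREEN ;;
    *:failed|*:inactive) echo RED ;;
    # Assumption: transitional states (activating, deactivating, ...) are YELLOW.
    *) echo YELLOW ;;
  esac
}

# Print one "<unit> <colour>" line per unit passed as an argument.
check_services() {
  for u in "$@"; do
    printf '%-28s %s\n' "$u" "$(classify_service "$u" "$(systemctl is-active "$u")")"
  done
}
```

Calling `check_services powermon.service powermon2.service eg4-battery.service lvx-control.service lvx-resolve-links.service` would then give one coloured line per unit for the verdict table.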
## 2. Capture live telemetry
```bash
"$SNAP" -w 10 -g 'lvx6048_[12]_(device_mode|fault_code|battery_voltage|battery_capacity|ac_output_active_power|mppt1_input_power|mppt2_input_power|grid_voltage|inverter_heat_sink_temperature|parallel_instance_number)/' 'homeassistant/sensor/+/state'
"$SNAP" -w 16 -g 'lifepower4_[1-6]_(soc|pack_voltage|pack_current|cell_voltage_delta_mv|temperature_pcb)/' 'homeassistant/sensor/+/state'
"$SNAP" -w 6 'openevse/status' 'openevse/amp' 'openevse/power' 'openevse/session_energy'
```
If a family returns "(no messages)", the feeding daemon is silent → that subsystem
is RED regardless of `is-active` (the service is running but not publishing). An idle
or unplugged EVSE publishing nothing is normal; confirm via `openevse/status`.
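One way to apply the silence rule mechanically is to pipe each capture through a filter. A minimal sketch, assuming the helper prints the literal `(no messages)` marker quoted above when nothing arrives; `family_state` is a hypothetical name, and the filter is meant for the inverter and battery families only, since a silent EVSE can be normal.

```bash
# Classify one family's capture, read from stdin.
# Assumption: an empty capture shows up as "(no messages)" or as no output at all.
family_state() {
  out=$(cat)
  case "$out" in
    ''|*'(no messages)'*) echo RED ;;   # daemon running but not publishing
    *) echo GREEN ;;                    # at least one message was captured
  esac
}
```

For example, `"$SNAP" -w 16 -g 'lifepower4_[1-6]_soc/' 'homeassistant/sensor/+/state' | family_state` would report RED for a silent battery daemon.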
## 3. Cross-checks (this is the value-add — single sensors can each look fine)
- **Battery voltage agreement**: each inverter's `battery_voltage` should be within
~1 V of the pack stack voltage (`pack_voltage` ≈ 51-55 V). **Known anomaly:** the
inverter reading is *intermittently* wrong (correct ~54 V on 2026-06-20, ~9-10 V
after the Jun 22 reboot) — a post-reboot glitch, not a permanent bug. If it reads
~10 V, note it and suggest a powermon restart; use the `lifepower4_*` pack entities,
never the inverter reading, for any battery math (see REFERENCE known-issues).
- **Cross-unit PV production (catches a silently-dead inverter)**: compare
`lvx6048_1_mppt1_input_power` vs `lvx6048_2_mppt1_input_power`. In daylight (the
*other* unit clearly producing), one unit pinned at **0 W** = that inverter is down
and being masked by its sibling — RED → troubleshoot-inverter. This is exactly the
2026-06-20 fault-08 failure mode (unit 1 sat at 0 W for ~1.8 days). At night/heavy
shade both at 0 W is normal.
- **SoC spread across packs**: `max(soc) - min(soc)` over the 6 packs. BUT first
cross-check against `pack_voltage`/`cell_voltage_max`: the packs are paralleled, so
if all `pack_voltage` agree (±0.1 V) the packs are physically at the same charge and
any SoC spread is **counter drift**, not real imbalance (pack 6 ran 76 % while
reading the same 53.4 V / 3.337 V/cell as packs at 50-55 % on 2026-06-24). Real
imbalance = pack voltages actually diverge. Drift → note it, recommend a calibration
charge; >20 % spread with diverging voltages = RED → troubleshoot-battery.
- **Cell imbalance**: any pack with `cell_voltage_delta_mv` > 50 = YELLOW, > 100 = RED.
- **Parallel master/slave**: exactly one inverter should report
`parallel_instance_number` 0 (master); the other 1+. Two masters or two slaves = RED.
- **Faults**: any `fault_code` non-zero, or `device_mode` = Fault = RED → troubleshoot-inverter.
- **Temps**: pack `temperature_pcb` > 55 °C or inverter heat-sink > 75 °C = YELLOW.
- **Power balance sanity**: PV in (`mppt*_input_power`) vs AC out vs pack
`pack_current` should roughly conserve. Gross mismatch = investigate via power-usage.
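Three of the cross-checks above reduce to simple threshold logic, sketched here as shell functions. This is an illustrative sketch only: the function names are hypothetical, the 200 W "clearly producing" floor (`PV_PRODUCING_W`) is an assumption, and reusing the 20 % spread figure as the point at which drift is worth reporting is also an assumption. The ±0.1 V agreement band is read as a max-min spread of at most 0.2 V.

```bash
# Cell imbalance: > 100 mV = RED, > 50 mV = YELLOW, else GREEN.
cell_delta_state() {
  awk -v mv="$1" 'BEGIN {
    mv += 0
    if (mv > 100) print "RED"; else if (mv > 50) print "YELLOW"; else print "GREEN"
  }'
}

# Silently-dead inverter: one unit at 0 W while its sibling clearly produces.
# Usage: pv_state <unit1_watts> <unit2_watts>
pv_state() {
  awk -v a="$1" -v b="$2" -v min="${PV_PRODUCING_W:-200}" 'BEGIN {
    a += 0; b += 0; min += 0
    if ((a == 0 && b >= min) || (b == 0 && a >= min)) print "RED"
    else print "GREEN"   # both producing, or both dark (night / heavy shade)
  }'
}

# SoC spread vs counter drift. Reads "<soc> <pack_voltage>" lines, one per pack.
soc_spread_state() {
  awk '
    NR == 1 { smin = smax = $1 + 0; vmin = vmax = $2 + 0 }
    { s = $1 + 0; v = $2 + 0
      if (s < smin) smin = s; if (s > smax) smax = s
      if (v < vmin) vmin = v; if (v > vmax) vmax = v }
    END {
      if (NR == 0) { print "NODATA"; exit }
      agree = (vmax - vmin <= 0.2 + 1e-9)   # small epsilon for float rounding
      if (smax - smin <= 20) print "OK"
      else if (agree)        print "DRIFT"  # same voltage, counters disagree
      else                   print "RED"    # voltages really diverge
    }'
}
```

A single sample cannot show that a unit is "pinned" at 0 W, so `pv_state` is best run on a few snapshots taken some minutes apart before calling RED.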
## 4. Verdict
Print a compact table (subsystem → state → one-line reason), then an overall
GREEN / YELLOW / RED with the top 1-3 issues and which deeper skill to run.
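The roll-up can be sketched as a worst-of reduction over the per-subsystem rows. This is a minimal sketch under assumptions: the `subsystem|state|reason` input format and the name `print_verdict` are hypothetical, chosen only for illustration.

```bash
# Read "subsystem|state|reason" rows on stdin, print them as a table,
# then print the overall verdict (the worst state seen).
print_verdict() {
  awk -F'|' '
    BEGIN { rank["GREEN"] = 0; rank["YELLOW"] = 1; rank["RED"] = 2; worst = "GREEN" }
    { printf "%-12s %-7s %s\n", $1, $2, $3
      if (rank[$2] > rank[worst]) worst = $2 }
    END { print "OVERALL: " worst }'
}
```

Usage: `printf 'inverters|GREEN|both producing\nbatteries|YELLOW|SoC counter drift\n' | print_verdict` prints the two rows followed by an `OVERALL: YELLOW` line.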
## 5. Recent error scan (only if anything looked off)
```bash
for s in powermon powermon2 eg4-battery lvx-control; do
  echo "== $s =="
  # Last 5 suspicious lines per unit over the past 15 minutes
  journalctl -u "$s.service" --since "15 min ago" --no-pager \
    | grep -iE 'error|timeout|fail|crc|nak|reconnect' | tail -5
done
```
## 5b. Historical sanity — did anything fail while unattended? (needs HA token)
Live snapshots miss faults that already cleared and silent-unit spells that ended.
If `~/.config/ha/token` exists (see REFERENCE), scan the recorder for the last few
days. Use the REAL HA entity_ids (doubled slug — see REFERENCE), not MQTT names:
```bash
"$HIST" -s "5 days ago" -m fault sensor.lvx6048_01_lvx6048_1_fault_code sensor.lvx6048_02_lvx6048_2_fault_code
# silent-unit hunt: sample midday PV both units across recent days; one pinned 0 while
# the other produced = it was down. e.g. check a midday window per day:
"$HIST" -s "2 days ago" sensor.lvx6048_lvx6048_1_mppt1_input_power sensor.lvx6048_lvx6048_2_mppt1_input_power | head -40
```
Any fault-08 / silent-unit episode → report with timestamps and hand off to
troubleshoot-inverter §2-§5. No token → say so and point the user at REFERENCE to add one.
## 6. Allowed remediation
If a daemon is `failed` or running-but-silent, restarting it is permitted:
```bash
sudo systemctl restart eg4-battery.service # or powermon / powermon2 / lvx-control
```
Re-run the relevant snapshot to confirm data resumes. Anything beyond a restart
(settings, flash, cabling) → report and hand the user the exact command. Never
publish to `solar/control/lvx6048/*` from this skill.
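To keep restarts inside this boundary, an agent could gate them behind an allowlist. A minimal sketch, not part of the skill: `restart_allowed` and `safe_restart` are hypothetical names, the allowlist is taken from the comment in the restart command above, and the dry-run default (`DRY_RUN=1`) is an assumption added for safety.

```bash
# Succeed only for the daemons this skill is permitted to restart.
restart_allowed() {
  case "$1" in
    powermon.service|powermon2.service|eg4-battery.service|lvx-control.service) return 0 ;;
    *) return 1 ;;
  esac
}

# Restart an allowlisted unit; by default only print the command (dry run).
# Set DRY_RUN=0 to actually run it.
safe_restart() {
  if ! restart_allowed "$1"; then
    echo "refused: $1 is outside the restart policy" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "0" ]; then
    sudo systemctl restart "$1"
  else
    echo "sudo systemctl restart $1"
  fi
}
```

After a real restart, re-running the matching snapshot from §2 remains the confirmation step; the guard only decides whether the restart may be attempted.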