shaggy-solar/.claude/skills/solar-health-check/SKILL.md
noise aa97d65b0c Add solar monitoring/troubleshooting skills for agents
Four Skill-tool skills under .claude/skills/ that let an agent monitor and
troubleshoot the install (2x LVX6048, 6x EG4 LifePower4, OpenEVSE), grounded
in the real MQTT/HA topology rather than generic advice:

- solar-health-check  : whole-system sweep + cross-checks + R/Y/G verdict,
                        incl. cross-unit "silently-dead inverter" detection
- troubleshoot-inverter: FWS fault decode, parallel sync, USB link recovery
- troubleshoot-battery : per-pack imbalance vs SoC-counter-drift, RS485 silence
- power-usage         : PV/load/grid/battery balance + EVSE sessions

Shared lib:
- solar-snapshot : live MQTT capture (creds from powermon.yaml, no hardcoding)
- ha-history     : HA recorder lookback (token from ~/.config/ha/token)
REFERENCE.md documents topology, real HA entity_ids (doubled slug), known
issues, and a safe-remediation-only action policy (restarts yes; setters no).

Action boundary: diagnose + restart wedged daemons / recover USB links;
never touches inverter/battery setters or flash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 11:46:20 -04:00


name: solar-health-check
description: Top-level health snapshot of the whole solar/power install (2 LVX6048 inverters, 6 EG4 LifePower4 packs, and the OpenEVSE charger) with cross-checks and a green/yellow/red verdict. Use when the user asks "how's the solar / battery / power system doing", "is everything ok", "check the install", wants a status report, or as the first step before deeper troubleshooting. For deep dives into one subsystem, hand off to troubleshoot-inverter, troubleshoot-battery, or power-usage.

solar-health-check

A fast sweep of every subsystem that ends in a clear verdict. It is read-only apart from daemon restarts: do NOT change settings here; restarting a wedged daemon is allowed (see §6 and the action policy in REFERENCE.md).

0. Load context

Skills run with the shell cwd at the repo root, so anchor paths there:

ROOT="$(git rev-parse --show-toplevel)"; SNAP="$ROOT/.claude/skills/lib/solar-snapshot"; HIST="$ROOT/.claude/skills/lib/ha-history"

Read $ROOT/.claude/skills/REFERENCE.md (system map, entity names, snapshot helper, action policy) before proceeding.

1. Services up?

systemctl is-active powermon.service powermon2.service eg4-battery.service lvx-control.service lvx-resolve-links.service

lvx-resolve-links is a oneshot → expect active/exited (not failed). Any failed/inactive on the others is RED. For a wedged data-plane daemon, a restart is allowed (see §6).
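The rule above can be sketched as a small filter. This is a minimal illustration, not part of the skill: `classify_services` is a hypothetical helper name, and it assumes every unit (including the oneshot, once it has exited cleanly) reports `active` from `systemctl is-active`.

```shell
# Sketch: classify "unit state" pairs read from stdin, one pair per line.
# Assumption: a healthy unit, oneshot included, reports the state "active".
classify_services() {
  awk '
    $2 == "active" { printf "%-28s GREEN\n", $1; next }
                   { printf "%-28s RED (%s)\n", $1, $2; bad = 1 }
    END            { exit bad }
  '
}

# Usage (pairs each unit with its own is-active result):
#   for u in powermon powermon2 eg4-battery lvx-control lvx-resolve-links; do
#     printf '%s %s\n' "$u.service" "$(systemctl is-active "$u.service")"
#   done | classify_services
```

The function exits non-zero when any unit is not active, so it can gate later steps.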

2. Capture live telemetry

"$SNAP" -w 10 -g 'lvx6048_[12]_(device_mode|fault_code|battery_voltage|battery_capacity|ac_output_active_power|mppt1_input_power|mppt2_input_power|grid_voltage|inverter_heat_sink_temperature|parallel_instance_number)/' 'homeassistant/sensor/+/state'
"$SNAP" -w 16 -g 'lifepower4_[1-6]_(soc|pack_voltage|pack_current|cell_voltage_delta_mv|temperature_pcb)/' 'homeassistant/sensor/+/state'
"$SNAP" -w 6 'openevse/status' 'openevse/amp' 'openevse/power' 'openevse/session_energy'

If a family returns "(no messages)": the feeding daemon is silent → that subsystem is RED regardless of is-active (running but not publishing). The exception is the EVSE: when idle or unplugged it may publish nothing, which is normal; confirm its state via openevse/status.
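A minimal sketch of the silent-family test for the inverter and battery captures. `family_state` is a hypothetical helper; it relies on the literal "(no messages)" marker quoted above, and it additionally treats completely empty output as silent, which is an assumption. It is not meant for the EVSE capture, where silence can be normal.

```shell
# Sketch: read one family's snapshot output on stdin and report its state.
# Assumptions: "(no messages)" marks an empty capture; empty output also counts as silent.
family_state() {
  family=$1
  capture=$(cat)   # whole snapshot output for this family
  case $capture in
    '' | *'(no messages)'*)
      echo "$family RED: no telemetry (daemon silent?)"
      return 1 ;;
  esac
  echo "$family GREEN: telemetry flowing"
}

# Usage:
#   "$SNAP" -w 16 -g 'lifepower4_[1-6]_(soc|pack_voltage)/' 'homeassistant/sensor/+/state' \
#     | family_state lifepower4
```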

3. Cross-checks (this is the value-add — single sensors can each look fine)

  • Battery voltage agreement: each inverter's battery_voltage should be within ~1 V of the pack stack voltage (pack_voltage ≈ 51–55 V). Known anomaly: the inverter reading is intermittently wrong (correct at ~54 V on 2026-06-20, ~9–10 V after the Jun 22 reboot); this is a post-reboot glitch, not a permanent bug. If it reads ~10 V, note it and suggest a powermon restart; use the lifepower4_* pack entities, never the inverter reading, for any battery math (see REFERENCE known-issues).
  • Cross-unit PV production (catches a silently-dead inverter): compare lvx6048_1_mppt1_input_power vs lvx6048_2_mppt1_input_power. In daylight (the other unit clearly producing), one unit pinned at 0 W = that inverter is down and being masked by its sibling — RED → troubleshoot-inverter. This is exactly the 2026-06-20 fault-08 failure mode (unit 1 sat at 0 W for ~1.8 days). At night/heavy shade both at 0 W is normal.
  • SoC spread across packs: max(soc) - min(soc) over the 6 packs. BUT first cross-check against pack_voltage/cell_voltage_max: the packs are paralleled, so if all pack_voltage agree (±0.1 V) the packs are physically at the same charge and any SoC spread is counter drift, not real imbalance (pack 6 ran 76 % while reading the same 53.4 V / 3.337 V/cell as packs at 50–55 % on 2026-06-24). Real imbalance = pack voltages actually diverge. Drift → note it, recommend a calibration charge; >20 % spread with diverging voltages = RED → troubleshoot-battery.
  • Cell imbalance: any pack with cell_voltage_delta_mv > 50 = YELLOW, > 100 = RED.
  • Parallel master/slave: exactly one inverter should report parallel_instance_number 0 (master); the other 1+. Two masters or two slaves = RED.
  • Faults: any fault_code non-zero, or device_mode = Fault = RED → troubleshoot-inverter.
  • Temps: pack temperature_pcb > 55 °C or inverter heat-sink > 75 °C = YELLOW.
  • Power balance sanity: PV in (mppt*_input_power) vs AC out vs pack pack_current should roughly conserve. Gross mismatch = investigate via power-usage.
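Three of the cross-checks above reduce to simple arithmetic once the values are extracted from the snapshot. The sketch below is illustrative only: the function names and the input layout are hypothetical. The 200 W "clearly producing" floor and the YELLOW result for diverging voltages with a spread of 20 % or less are assumptions, since the text does not specify either.

```shell
# Cross-unit PV: args are unit-1 watts, unit-2 watts, optional floor (assumed 200 W).
pv_cross_check() {
  awk -v a="$1" -v b="$2" -v min_w="${3:-200}" 'BEGIN {
    a += 0; b += 0; min_w += 0
    if      (a == 0 && b >= min_w) print "RED unit 1 at 0 W while unit 2 produces"
    else if (b == 0 && a >= min_w) print "RED unit 2 at 0 W while unit 1 produces"
    else                           print "GREEN"
  }'
}

# SoC spread vs voltage agreement. Stdin: "<pack> <soc_pct> <pack_voltage>" per line.
soc_check() {
  awk '
    NR == 1 { smin = smax = $2 + 0; vmin = vmax = $3 + 0 }
    {
      s = $2 + 0; v = $3 + 0
      if (s < smin) smin = s; if (s > smax) smax = s
      if (v < vmin) vmin = v; if (v > vmax) vmax = v
    }
    END {
      if (NR == 0) { print "RED no pack data"; exit }
      ss = smax - smin; vs = vmax - vmin
      if (vs <= 0.2 + 1e-9) state = (ss > 0) ? "DRIFT" : "GREEN"  # voltages agree within +/-0.1 V
      else if (ss > 20)     state = "RED"                         # real imbalance
      else                  state = "YELLOW"                      # assumption, not in the text
      printf "%s soc_spread=%d v_spread=%.1f\n", state, ss, vs
    }'
}

# Cell imbalance: arg is cell_voltage_delta_mv for one pack.
cell_delta_level() {
  awk -v d="$1" 'BEGIN {
    d += 0
    if      (d > 100) print "RED"
    else if (d > 50)  print "YELLOW"
    else              print "GREEN"
  }'
}
```

Feeding `soc_check` six packs that all read 53.4 V, with one at 76 % and the rest between 50 and 55 %, yields a DRIFT line rather than RED, which mirrors the 2026-06-24 observation.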

4. Verdict

Print a compact table (subsystem → state → one-line reason), then an overall GREEN / YELLOW / RED with the top 1–3 issues and which deeper skill to run.
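One possible way to produce that table and roll the rows up to the worst state. `verdict` and the pipe-separated input layout are hypothetical choices for this sketch, not something the skill prescribes.

```shell
# Sketch: stdin is "subsystem|STATE|reason" per line; prints the table, then the worst state.
verdict() {
  awk -F'|' '
    BEGIN { rank["GREEN"] = 0; rank["YELLOW"] = 1; rank["RED"] = 2; worst = "GREEN" }
    {
      printf "%-12s %-7s %s\n", $1, $2, $3
      if (rank[$2] > rank[worst]) worst = $2
    }
    END { print "OVERALL: " worst }
  '
}

# Usage (rows are illustrative):
#   printf '%s\n' 'inverters|GREEN|both units producing' \
#                 'battery|YELLOW|pack 6 SoC counter drift' \
#                 'evse|GREEN|idle' | verdict
```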

5. Recent error scan (only if anything looked off)

for s in powermon powermon2 eg4-battery lvx-control; do
  echo "== $s =="; journalctl -u $s.service --since "15 min ago" --no-pager | grep -iE 'error|timeout|fail|crc|nak|reconnect' | tail -5
done

5b. Historical sanity — did anything fail while unattended? (needs HA token)

Live snapshots miss faults that already cleared and silent-unit spells that ended. If ~/.config/ha/token exists (see REFERENCE), scan the recorder for the last few days. Use the REAL HA entity_ids (doubled slug — see REFERENCE), not MQTT names:

"$HIST" -s "5 days ago" -m fault sensor.lvx6048_01_lvx6048_1_fault_code sensor.lvx6048_02_lvx6048_2_fault_code
# silent-unit hunt: sample midday PV both units across recent days; one pinned 0 while
# the other produced = it was down. e.g. check a midday window per day:
"$HIST" -s "2 days ago" sensor.lvx6048_lvx6048_1_mppt1_input_power sensor.lvx6048_lvx6048_2_mppt1_input_power | head -40

Any fault-08 / silent-unit episode → report with timestamps and hand off to troubleshoot-inverter §2–§5. No token → say so and point the user at REFERENCE to add one.
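The token precondition can be checked before calling the history helper. A minimal sketch, assuming the token is a non-empty file at the path given above; `ha_token_ready` is a hypothetical name.

```shell
# Sketch: succeed only if the HA token file exists and is non-empty.
# The default path comes from the text above; pass another path as $1 to override.
ha_token_ready() {
  token=${1:-$HOME/.config/ha/token}
  if [ -s "$token" ]; then
    return 0
  fi
  echo "No HA token at $token: skipping history scan (see REFERENCE.md to add one)" >&2
  return 1
}

# Usage:
#   ha_token_ready && "$HIST" -s "5 days ago" -m fault <entity_ids...>
```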

6. Allowed remediation

If a daemon is failed or running-but-silent, restarting it is permitted:

sudo systemctl restart eg4-battery.service     # or powermon / powermon2 / lvx-control

Re-run the relevant snapshot to confirm data resumes. Anything beyond a restart (settings, flash, cabling) → report and hand the user the exact command. Never publish to solar/control/lvx6048/* from this skill.