shaggy-solar/.claude/skills/troubleshoot-battery/SKILL.md
noise aa97d65b0c Add solar monitoring/troubleshooting skills for agents
Four Skill-tool skills under .claude/skills/ that let an agent monitor and
troubleshoot the install (2x LVX6048, 6x EG4 LifePower4, OpenEVSE), grounded
in the real MQTT/HA topology rather than generic advice:

- solar-health-check  : whole-system sweep + cross-checks + R/Y/G verdict,
                        incl. cross-unit "silently-dead inverter" detection
- troubleshoot-inverter: FWS fault decode, parallel sync, USB link recovery
- troubleshoot-battery : per-pack imbalance vs SoC-counter-drift, RS485 silence
- power-usage         : PV/load/grid/battery balance + EVSE sessions

Shared lib:
- solar-snapshot : live MQTT capture (creds from powermon.yaml, no hardcoding)
- ha-history     : HA recorder lookback (token from ~/.config/ha/token)
REFERENCE.md documents topology, real HA entity_ids (doubled slug), known
issues, and a safe-remediation-only action policy (restarts yes; setters no).

Action boundary: diagnose + restart wedged daemons / recover USB links;
never touches inverter/battery setters or flash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-24 11:46:20 -04:00

---
name: troubleshoot-battery
description: >-
Diagnose the 6× EG4 LifePower4 v2 battery packs — per-pack SoC, cell imbalance,
SoC drift, RS485/Modbus comms silence, temperature, and warning/protection bits.
Use when a pack reads oddly, packs disagree on SoC, a pack stopped reporting,
cells look imbalanced, the user mentions "battery problem / pack down / SoC wrong
/ imbalance / one battery", or after solar-health-check flags the stack. Read-only
plus a safe eg4-battery daemon restart; never writes BMS settings.
---
# troubleshoot-battery
## 0. Load context
Shell cwd is the repo root; anchor paths there:
```bash
ROOT="$(git rev-parse --show-toplevel)"; SNAP="$ROOT/.claude/skills/lib/solar-snapshot"
```
Read `$ROOT/.claude/skills/REFERENCE.md`. There are **6 packs** `lifepower4_1..6_*`,
all served by `eg4-battery.service` (one FTDI RS485 adapter per pack). Pack config:
`~/.config/eg4-battery/eg4-battery.yaml`.
## 1. Are all 6 packs reporting?
```bash
systemctl is-active eg4-battery.service
"$SNAP" -w 16 -g 'lifepower4_[1-6]_(soc|pack_voltage|pack_current)/' 'homeassistant/sensor/+/state'
```
- Fewer than 6 packs in the output → a pack is **silent on RS485**, go to §4.
- All 6 present → go to §2/§3.
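Counting packs by eye is error-prone when the capture interleaves several sensors. A minimal sketch of a filter that names the packs missing from captured output; `missing_packs` is a hypothetical helper, and it assumes only that each output line contains the entity name (for example `lifepower4_3_soc`), not any particular `solar-snapshot` output layout:

```bash
# Hypothetical helper: read snapshot text on stdin and print the pack numbers
# (1-6) that never appear. Prints an empty line when all six are present.
# Assumption: every captured line carries an entity name like lifepower4_3_soc.
missing_packs() {
  local seen n out=""
  # Collect the distinct pack prefixes seen in the capture.
  seen="$(grep -oE 'lifepower4_[1-6]_' | sort -u || true)"
  for n in 1 2 3 4 5 6; do
    grep -q "lifepower4_${n}_" <<<"$seen" || out+="${out:+ }$n"
  done
  printf '%s\n' "$out"
}
```

Piping the §1 capture into it (`"$SNAP" ... | missing_packs`) would print something like `4` if only pack 4 were silent, which is the pack to chase in §4.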
## 2. SoC spread & drift
```bash
"$SNAP" -w 16 -g 'lifepower4_[1-6]_(soc|soc_alt|pack_voltage)/' 'homeassistant/sensor/+/state'
```
- Compute `max(soc) - min(soc)`. >10 % = imbalance worth noting; >20 % = significant.
**Pack 6 historically runs high** (it's the oddball: Modbus addr `0x01`/115200 vs
`0x40`/9600 for packs 1–5); judge it on its own, don't assume it tracks packs 1–5.
- **SoC drift is a known design limitation**, not a live fault: the coulomb counter
rarely re-anchors because the bank seldom reaches the 100 % point where it resets. See memory
`project_eg4_soc_drift_remediation`. If SoC looks wrong but `pack_voltage` is
sane, suspect drift, not a dead pack. Voltage→SoC sanity for LFP at rest:
~51.2 V ≈ low, ~53.5 V ≈ mid, ~54+ V ≈ high (loaded/charging skews this).
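The spread rule above is simple arithmetic, so it can be scripted. A minimal sketch, assuming the SoC values have already been extracted to one number per line; `soc_spread` is a hypothetical helper, not part of the shared lib:

```bash
# Hypothetical helper: read one SoC percentage per line on stdin and print
# "<spread> <verdict>" using this section's thresholds (>10 % noteworthy,
# >20 % significant). Assumes every input line is a number.
soc_spread() {
  awk '
    { v = $1 + 0
      if (NR == 1 || v < min) min = v
      if (NR == 1 || v > max) max = v }
    END {
      if (NR == 0) { print "no-data"; exit }
      spread = max - min
      if (spread > 20)      verdict = "significant"
      else if (spread > 10) verdict = "noteworthy"
      else                  verdict = "ok"
      printf "%.1f %s\n", spread, verdict
    }'
}
```

Because pack 6 tends to read high, it may be worth running the check twice: once over all six values and once over packs 1–5 only, so a pack-6 offset does not mask or exaggerate the spread of the rest.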
## 3. Cell imbalance, temperature, protection bits
```bash
"$SNAP" -w 16 -g 'lifepower4_[1-6]_(cell_voltage_delta_mv|cell_voltage_min|cell_voltage_max|temperature_pcb)/' 'homeassistant/sensor/+/state'
# For a specific suspect pack (pack 3 shown; substitute its number), pull all 16 cells + bits:
"$SNAP" -w 14 -g 'lifepower4_3_(cell_[0-9]+_voltage|warning|protection|temperature)' 'homeassistant/sensor/+/state'
```
- `cell_voltage_delta_mv`: <30 mV good, 30–50 mV watch, >50 mV imbalanced, >100 mV
bad (a weak/failing cell, or the pack simply needs a long absorb to balance).
- Any warning/protection bit set → read its name; over-/under-voltage,
over-temp, and over-current protections will also explain a pack dropping current.
- Temps: `temperature_pcb` > 55 °C = watch.
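The delta bands above map cleanly onto a small classifier. A minimal sketch, assuming the sensor value is in millivolts and that any fractional part can be truncated; `classify_delta_mv` is a hypothetical name:

```bash
# Hypothetical helper: map a cell_voltage_delta_mv reading to this section's
# bands: <30 good, 30-50 watch, >50 imbalanced, >100 bad.
# Assumption: the value is in mV; a fractional part, if any, is dropped.
classify_delta_mv() {
  local mv="${1%.*}"   # "42.7" -> "42"; "42" is unchanged
  if   (( mv > 100 )); then echo bad
  elif (( mv > 50 ));  then echo imbalanced
  elif (( mv >= 30 )); then echo watch
  else                      echo good
  fi
}
```

For example, `classify_delta_mv 64` prints `imbalanced`. The label is only a starting point: a high delta can still be a pack that needs a long absorb rather than a failing cell.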
## 4. RS485 / Modbus comms silence (a pack missing from §1)
```bash
journalctl -u eg4-battery.service --since "15 min ago" --no-pager | grep -iE 'timeout|crc|nak|error|no response|pack|addr' | tail -30
ls -l /dev/serial/by-id/ | grep -i ft232 # all 6 FTDI adapters enumerated?
grep -E 'name:|port:|address|baud' ~/.config/eg4-battery/eg4-battery.yaml
```
- Missing FTDI under `by-id` → USB/adapter/cable issue for that pack (hardware →
report to user; don't unplug things yourself).
- FTDI present but pack times out → check it isn't demoted by an inter-pack
daisy-chain (memory `project_eg4_daisy_chain_silences_slaves`: each pack must be on
its own dongle; a chain silences slaves). Also confirm the pack's `address`/`baud`
in config match the unit (pack 6 legitimately differs).
- Daemon wedged after a USB re-enumerate → restart is ALLOWED:
```bash
sudo systemctl restart eg4-battery.service
```
Then re-run §1 to confirm all 6 return.
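The adapter-enumeration check from the first bullet can be reduced to a count. A minimal sketch, assuming (as the `grep -i ft232` above does) that every pack's adapter shows up under `by-id` with `ft232` in its name; `count_ftdi` is a hypothetical helper and the directory argument exists only so it can be exercised against a test directory:

```bash
# Hypothetical helper: count FTDI adapters enumerated under a by-id directory.
# Defaults to /dev/serial/by-id; prints 0 if the directory is empty or missing.
count_ftdi() {
  local dir="${1:-/dev/serial/by-id}"
  # grep -c exits non-zero on zero matches, so neutralise that exit status.
  ls "$dir" 2>/dev/null | grep -ci 'ft232' || true
}
```

A result below 6 points at the hardware path (adapter, cable, or USB port) rather than the daemon, so it belongs in the report for the user instead of triggering a restart.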
## 5. Report
Per-pack table (SoC, voltage, current, delta_mv, max temp, any bits) + the stack
spread, whether it's drift vs a real fault, comms state, what you restarted, and any
hardware action left for the user. Do not write BMS registers or thresholds.