docs/QA-INSTRUCTIONS.md

# Codebase Quality Assessment -- LLM Instructions

**Purpose:** Reproducible, automated quality scoring for the Impakt codebase.
Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into `docs/QA-<DATETIME>.md` using the template in `docs/QA-TEMPLATE.md`.

---

## Prerequisites

```bash
cd /Users/noise/Code/impakt
uv sync --dev
```

Python tools below are invoked through `uv run`; all commands assume the project root as the working directory.

---

## Step 1: Collect Raw Metrics

Run these commands and record the output. They are independent and can run in parallel.

### 1a. Inventory

```bash
# Source file count and line count
find src -name "*.py" | wc -l
find src -name "*.py" -exec cat {} + | wc -l

# Test file count and line count
find tests -name "*.py" | wc -l
find tests -name "*.py" -exec cat {} + | wc -l
```

### 1b. Test Suite

```bash
# Test count and pass/fail
uv run python -m pytest --tb=no -q 2>&1 | tail -5

# Test collection (verify count)
uv run python -m pytest --co -q 2>&1 | tail -3
```

### 1c. Type Safety

```bash
# mypy error count and file breakdown (assumes strict mode is enabled in the project's mypy config; otherwise add --strict)
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 | tail -3

# Error category breakdown
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 \
  | grep "error:" | sed 's/.*\[/[/' | sort | uniq -c | sort -rn
```

### 1d. Lint

```bash
# Ruff violation count
uv run python -m ruff check src/ 2>&1 | tail -5

# Violation breakdown by rule
uv run python -m ruff check src/ 2>&1 \
  | grep -owE '[A-Z]+[0-9]+' | sort | uniq -c | sort -rn
```

### 1e. Complexity

```bash
# File size distribution
find src/impakt -name "*.py" -exec wc -l {} + | grep -v total \
  | awk '{print $1}' | sort -n | awk '
BEGIN {n=0; sum=0}
{a[n++]=$1; sum+=$1}
END {
  printf "Min: %d, Max: %d, Mean: %.0f, Median: %d\n", a[0], a[n-1], sum/n, a[int(n/2)]
  over300=0; for(i=0;i<n;i++) if(a[i]>300) over300++
  printf "Files >300 lines: %d / %d\n", over300, n
}'

# High-complexity files (branch density)
find src/impakt -name "*.py" ! -path "*__pycache__*" | while IFS= read -r f; do
  ifs=$(grep -cE '^\s*(if |elif )' "$f" || true)
  fors=$(grep -cE '^\s*(for |while )' "$f" || true)
  excepts=$(grep -cE '^\s*except' "$f" || true)
  complexity=$((ifs + fors + excepts + 1))
  if [ "$complexity" -gt 15 ]; then
    lines=$(wc -l < "$f")
    echo "  $complexity  $f ($lines lines)"
  fi
done | sort -rn
```

### 1f. Documentation

```bash
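# Docstring coverage (approximate: counts indented lines that start with triple quotes, so a closing """ on its own line is counted too)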
echo "Docstrings: $(grep -rcE '^\s+"""' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
echo "Definitions: $(grep -rcE '^\s*(def |class )' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"

# __all__ coverage
for d in src/impakt/*/; do
  [ -d "$d" ] || continue
  mod=$(basename "$d")
  init="${d}__init__.py"
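  # grep -c already prints 0 when nothing matches (and exits 1), so use `|| true` rather than `|| echo 0`
  [ -f "$init" ] && has_all=$(grep -c "__all__" "$init" || true) || has_all="-"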
  echo "  $mod: __all__=$has_all"
done
```
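
The grep counts above are only a rough proxy. As a cross-check, docstring coverage can be measured per definition with Python's standard `ast` module. The following is a minimal sketch, not part of the prescribed metrics; the function name `docstring_coverage` and the `SAMPLE` source are hypothetical and exist only for illustration.

```python
import ast


def docstring_coverage(source: str) -> tuple[int, int]:
    """Return (documented, total) over the functions and classes in `source`."""
    documented = total = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            total += 1
            if ast.get_docstring(node) is not None:
                documented += 1
    return documented, total


# Hypothetical sample module, used only to exercise the function.
SAMPLE = '''
class Reader:
    """Reads channel files."""

    def load(self, path):
        """Load one file."""
        return path

    def close(self):
        pass

def helper():
    return 1
'''

documented, total = docstring_coverage(SAMPLE)
print(f"{documented}/{total} definitions documented ({documented / total:.0%})")
# prints: 2/4 definitions documented (50%)
```

To apply it to the real tree, one could read each `*.py` file under `src/impakt` and sum the two counts; the grep output should still be recorded so that results stay comparable across assessments.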

### 1g. Security

```bash
# Dangerous patterns
echo "eval/exec: $(grep -rn 'eval(\|exec(' src/impakt/ --include='*.py' | grep -v '# noqa' | wc -l | tr -d ' ')"
echo "subprocess/os.system: $(grep -rn 'subprocess\|os\.system\|os\.popen' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "bare except: $(grep -rn 'except:' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "hardcoded secrets: $(grep -rin 'password\s*=\|secret\s*=\|api_key\s*=\|token\s*=' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
```
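
The `eval(\|exec(` pattern also matches safe calls such as `ast.literal_eval(...)`, so each hit needs manual review. As an illustrative sketch (an assumption about how one might triage hits, not a replacement for the grep metric), direct calls to the built-ins can be isolated with the standard `ast` module. The name `dangerous_calls` and the `SAMPLE` source are hypothetical.

```python
import ast


def dangerous_calls(source: str) -> list[tuple[str, int]]:
    """Return (name, line) for direct calls to the built-in eval/exec."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)  # bare name, not an attribute like ast.literal_eval
            and node.func.id in {"eval", "exec"}
        ):
            hits.append((node.func.id, node.lineno))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical sample: line 2 is a safe literal_eval, line 3 is a real eval.
SAMPLE = 'import ast\nvalue = ast.literal_eval("[1, 2]")\nresult = eval("1 + 1")\n'

print(dangerous_calls(SAMPLE))
# prints: [('eval', 3)]
```

This only finds calls written as plain `eval(...)` or `exec(...)`; aliased or dynamically resolved calls would be missed, which is one reason the surrounding context still has to be read.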

### 1h. Maintainability

```bash
# Debt markers
echo "TODO: $(grep -rc 'TODO' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "FIXME: $(grep -rc 'FIXME' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "HACK: $(grep -rc 'HACK' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Logging usage
echo "logging calls: $(grep -rc 'logger\.\|logging\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Error handling
echo "try/except blocks: $(grep -rc 'try:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "bare excepts: $(grep -rc 'except:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Internal coupling
echo "internal imports: $(grep -rc 'from impakt\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
```

---

## Step 2: Score Each Dimension

Use the raw metrics to assign a 0.0-10.0 score per dimension using these rubrics.

### Test Health (weight: 20%)

| Score | Criteria |
|-------|----------|
| 10 | 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present |
| 8 | 100% pass, ratio 0.2-0.5, integration tests present |
| 6 | 100% pass, ratio < 0.2 or missing coverage measurement |
| 4 | >95% pass |
| 2 | >80% pass |
| 0 | <=80% pass or no tests |
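
The two numeric inputs to this rubric can be derived from the Step 1 output. The sketch below assumes that "test:source ratio" means test lines divided by source lines (the rubric does not say whether lines or files are meant) and that the pytest `-q` summary line has the usual `N failed, M passed ... in Xs` shape. The function names are hypothetical.

```python
import re


def pass_rate(summary: str) -> float:
    """Fraction of executed tests that passed, from a pytest summary line."""
    counts = {kind: int(n) for n, kind in re.findall(r"(\d+) (passed|failed|error)", summary)}
    passed = counts.get("passed", 0)
    total = passed + counts.get("failed", 0) + counts.get("error", 0)
    return passed / total if total else 0.0


def line_ratio(test_lines: int, source_lines: int) -> float:
    """Test:source ratio, assuming the ratio is measured in lines."""
    return test_lines / source_lines if source_lines else 0.0


# Hypothetical numbers, not real Impakt results.
print(f"{pass_rate('3 failed, 412 passed, 2 skipped in 12.34s'):.1%}")
print(f"{line_ratio(4200, 12000):.2f}")
```

Skipped tests are left out of the denominator here; whether they should count is a judgment call worth noting in the report.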

### Type Safety (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | mypy strict, 0 errors |
| 8 | mypy strict, <10 errors |
| 6 | mypy strict, <50 errors OR mypy non-strict with 0 errors |
| 4 | mypy configured but >=50 errors |
| 2 | no type checker configured |
| 0 | no type hints |

### Lint Hygiene (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 violations |
| 8 | <10 violations, all auto-fixable |
| 6 | <100 violations, mostly auto-fixable |
| 4 | 100-500 violations |
| 2 | >500 violations |
| 0 | no linter configured |

### Architecture (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | Clear layering, all modules have `__all__`, low coupling, plugin system |
| 9 | Clear layering, most modules have `__all__`, plugin system |
| 7 | Identifiable layers, some modules lack public API definition |
| 5 | Flat structure, unclear module boundaries |
| 3 | God objects, circular dependencies |
| 0 | Single-file or completely unstructured |

Evaluate qualitatively: read `__init__.py` files, check import patterns, assess layer violations (does `io` import from `web`? does `channel` depend on `plot`?).
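
A layer check of this kind can be partly mechanized. The sketch below uses the standard `ast` module to list which `impakt` subpackages a file imports and compares them with a rule table. The `FORBIDDEN` table only encodes the two example questions above; it is a hypothetical rule set, not the project's actual layering policy, and the function names are likewise illustrative.

```python
import ast

# Hypothetical rules: the key package must not import from the listed packages.
FORBIDDEN = {"io": {"web"}, "channel": {"plot"}}


def imported_subpackages(source: str, root: str = "impakt") -> set[str]:
    """Return first-level subpackages of `root` named in absolute imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        elif isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        else:
            continue  # relative imports and non-import nodes are ignored in this sketch
        for name in names:
            parts = name.split(".")
            if parts[0] == root and len(parts) > 1:
                found.add(parts[1])
    return found


def layer_violations(package: str, source: str) -> set[str]:
    """Subpackages imported by `source` that `package` is not allowed to use."""
    return imported_subpackages(source) & FORBIDDEN.get(package, set())


# Hypothetical file contents from the `io` package.
print(layer_violations("io", "from impakt.web import app\nimport impakt.channel.core\n"))
# prints: {'web'}
```

Forms such as `from impakt import web` and relative imports are not detected here, so the result is a starting point for reading the import graph rather than a verdict.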

### Documentation (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | >90% docstring coverage, generated API docs, architectural docs with diagrams |
| 8 | >80% docstring coverage, README with architecture, no generated API docs |
| 6 | >60% docstring coverage, basic README |
| 4 | <=60% docstring coverage |
| 2 | README only |
| 0 | No documentation |

### Complexity (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | Median file <150 lines, 0 files >300, max complexity <15 |
| 8 | Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly) |
| 6 | Median <200, <=10 files >300 |
| 4 | Median >=200 or many files >500 |
| 2 | Multiple files >1000 lines |
| 0 | Monolithic files, no decomposition |

### Security (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | No eval/exec, no subprocess, no secrets, input validation at boundaries |
| 9 | eval/exec present but sandboxed (restricted builtins, blocklist), no secrets |
| 7 | eval/exec partially sandboxed, or subprocess with shell=False |
| 5 | Unsandboxed eval/exec or subprocess with shell=True |
| 3 | Hardcoded secrets present |
| 0 | Multiple critical security anti-patterns |

### Maintainability (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 debt markers, consistent error handling, logging, modern tooling |
| 8 | <5 debt markers, no bare excepts, logging present |
| 6 | <20 debt markers, some bare excepts |
| 4 | >=20 debt markers or many bare excepts |
| 2 | Inconsistent patterns, no logging |
| 0 | No discernible standards |

---

## Step 3: Compute Composite Score

```
composite = (
    test_health     * 0.20 +
    type_safety     * 0.15 +
    lint_hygiene    * 0.10 +
    architecture    * 0.15 +
    documentation   * 0.10 +
    complexity      * 0.10 +
    security        * 0.10 +
    maintainability * 0.10
) * 10
```
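
As a worked example, the formula can be written as a small runnable function. The weights are the ones given above; the dimension scores in `example` are hypothetical and only illustrate the arithmetic.

```python
WEIGHTS = {
    "test_health": 0.20,
    "type_safety": 0.15,
    "lint_hygiene": 0.10,
    "architecture": 0.15,
    "documentation": 0.10,
    "complexity": 0.10,
    "security": 0.10,
    "maintainability": 0.10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of 0-10 dimension scores, scaled to 0-100."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return round(sum(scores[name] * weight for name, weight in WEIGHTS.items()) * 10, 1)


# Hypothetical scores, not a real assessment.
example = {
    "test_health": 8, "type_safety": 6, "lint_hygiene": 8, "architecture": 9,
    "documentation": 6, "complexity": 8, "security": 9, "maintainability": 8,
}
print(composite_score(example))
# prints: 77.5
```

The weights sum to 1.0, so a codebase scoring 10 in every dimension reaches exactly 100.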

### Grade Scale

| Range | Grade | Meaning |
|-------|-------|---------|
| 90-100 | A | Production-grade, minimal tech debt |
| 80-89 | B+ | Strong codebase, minor gaps |
| 70-79 | B | Good foundation, actionable improvements exist |
| 60-69 | C+ | Functional but accumulating debt |
| 50-59 | C | Significant quality concerns |
| <50 | D | Major remediation needed |
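
The table lists whole-number ranges, while the composite is usually fractional. The sketch below assumes each band's lower bound is inclusive, so that 89.9 is still a B+ and 90.0 is an A; that reading is an assumption, and the function name `grade` is hypothetical.

```python
# (inclusive lower bound, grade), highest band first; anything below 50 is a D.
GRADE_BANDS = [(90, "A"), (80, "B+"), (70, "B"), (60, "C+"), (50, "C")]


def grade(composite: float) -> str:
    """Map a 0-100 composite score to its letter grade."""
    for lower, letter in GRADE_BANDS:
        if composite >= lower:
            return letter
    return "D"


print(grade(77.5))
# prints: B
```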

---

## Step 4: Write Report

Copy `docs/QA-TEMPLATE.md` to `docs/QA-<YYYY-MM-DD_HHMM>.md` and fill in:

1. **Summary block at the top** (immediately after the `#` heading):
   - Blockquote with composite score, grade, and one-sentence summary
   - Quick-reference dimension score table
   - The summary sentence should note overall quality posture and the most significant change since the last assessment (or "Baseline assessment" if first run)
2. All raw metric values
3. All dimension scores with brief justification
4. Composite score and grade
5. Delta from previous assessment (if one exists)
6. Top 3-5 actionable improvements
7. Acknowledgment of any scoring judgment calls
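
The report filename and the deltas in item 5 can be produced mechanically. This is a minimal sketch under stated assumptions: the timestamp is taken from local time, and deltas are simple per-dimension differences. The helper names `report_path` and `score_deltas` are hypothetical.

```python
from datetime import datetime
from pathlib import Path


def report_path(now: datetime, docs_dir: Path = Path("docs")) -> Path:
    """Build docs/QA-<YYYY-MM-DD_HHMM>.md for the given timestamp."""
    return docs_dir / f"QA-{now:%Y-%m-%d_%H%M}.md"


def score_deltas(current: dict[str, float], previous: dict[str, float]) -> dict[str, float]:
    """Per-dimension change, for dimensions present in both assessments."""
    return {name: round(current[name] - previous[name], 1) for name in current if name in previous}


print(report_path(datetime(2026, 4, 11, 5, 5)))
# Hypothetical scores from two assessments.
print(score_deltas({"security": 9.0, "lint_hygiene": 8.0}, {"security": 7.0, "lint_hygiene": 8.0}))
# prints: {'security': 2.0, 'lint_hygiene': 0.0}
```

Because the timestamp is zero-padded and ordered from year to minute, sorting the report filenames alphabetically also sorts them chronologically.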

---

## Notes for LLM Execution

- Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores.
- When scoring Architecture, actually read the import graph. Do not guess from file names alone.
- For Security, read the context around any `eval`/`exec`/`subprocess` hits to assess sandboxing.
- Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly.
- If the project adds coverage measurement (`pytest-cov`), factor it into Test Health.
- Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas.
- Check for previous `QA-*.md` files in `docs/` to compute improvement trends.
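
Linear interpolation between rubric rows can be sketched as follows. The rubrics describe bands rather than points, so treating each threshold as an anchor point is an assumption, as are the Lint Hygiene anchors chosen here and the function name `interpolate_score`.

```python
def interpolate_score(value: float, anchors: list[tuple[float, float]]) -> float:
    """Piecewise-linear score; `anchors` are (metric, score) pairs sorted by metric."""
    if value <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    # Past the last anchor the score is clamped here; the rubric's final rows still apply.
    return anchors[-1][1]


# Assumed anchors for Lint Hygiene: 0 violations -> 10, 10 -> 8, 100 -> 6, 500 -> 4.
LINT_ANCHORS = [(0, 10), (10, 8), (100, 6), (500, 4)]

print(interpolate_score(55, LINT_ANCHORS))
# prints: 7.0
```

With 55 violations the metric sits halfway between the 10 and 100 anchors, so the score lands halfway between 8 and 6. Whichever anchors are chosen, they should be recorded under scoring judgment calls so later assessments reuse them.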