Added AQA Process and Scoring

2026-04-11 05:05:50 -04:00
parent 68422dd304
commit 40c8b0f132
3 changed files with 608 additions and 0 deletions
--- a/docs/QA-2026-04-11_0459.md
+++ b/docs/QA-2026-04-11_0459.md
@@ -0,0 +1,174 @@
+# Quality Assessment -- 2026-04-11
+
+**Version:** 0.1.0
+**Assessed by:** Claude Opus 4.6
+**Previous assessment:** None (baseline)
+
+---
+
+## Inventory
+
+| Metric | Value |
+|--------|-------|
+| Source files | 72 |
+| Source lines | 10,325 |
+| Test files | 30 |
+| Test lines | 2,736 |
+| Test:source ratio | 0.26 |
+| Direct dependencies | 10 core + 1 optional + 4 dev |
+
+---
+
+## Raw Metrics
+
+### Test Suite
+
+```
+240 passed, 7 warnings in 9.15s
+```
+
+- Tests collected: 240
+- Tests passed: 240
+- Tests failed: 0
+- Test duration: 9.15s
+
+### Type Safety (mypy --strict)
+
+```
+Found 49 errors in 20 files (checked 72 source files)
+```
+
+- Total errors: 49
+- Files with errors: 20 / 72 (72% clean)
+- Top error categories:
+  - `[type-arg]` 17 -- missing generic parameters on `dict`, `list`
+  - `[attr-defined]` 9 -- attribute access on loosely typed objects
+  - `[no-any-return]` 5 -- returning `Any` from typed functions
+  - `[var-annotated]` 3
+  - `[import-untyped]` 3
+  - `[assignment]` 3
+
+### Lint (ruff)
+
+```
+Found 89 errors.
+[*] 73 fixable with the `--fix` option (9 hidden fixes can be enabled with the `--unsafe-fixes` option).
+```
+
+- Total violations: 89
+- Auto-fixable: 73 (82%)
+- Top violation rules:
+  - `F401` 61 -- unused imports
+  - `I001` 9 -- unsorted imports
+  - `F601` 8 -- duplicate dictionary keys
+  - `E501` 5 -- line too long
+  - `F841` 3 -- unused variables
+  - `F541` 1 -- f-string without placeholder
+
+### Complexity
+
+- File size: min=1 / median=133 / mean=143 / max=693
+- Files >300 lines: 6 / 72
+- High-complexity files (branch density >15):
+
+```
+  80  src/impakt/io/mme.py (693 lines)           -- ISO 13499 parser, justified
+  44  src/impakt/web/components/criteria.py (343) -- UI assembly with protocol logic
+  30  src/impakt/channel/model.py (456)           -- core data model, multiple classes
+  27  src/impakt/web/state.py (274)               -- app state with multi-test support
+  27  src/impakt/protocol/euro_ncap.py (238)      -- sliding-scale scoring tables
+  25  src/impakt/web/callbacks/plot_callbacks.py (249) -- transform pipeline orchestration
+  21  src/impakt/protocol/iihs.py (180)           -- G/A/M/P rating logic
+  20  src/impakt/plot/engine.py (257)             -- Plotly rendering with corridors
+  19  src/impakt/script/cli.py (140)              -- CLI arg parsing
+  17  src/impakt/web/components/channel_grid.py (368) -- DataTable assembly
+  16  src/impakt/web/callbacks/channel_callbacks.py (195) -- selection/filter callbacks
+```
+
+### Documentation
+
+- Docstring coverage: 414 / 454 definitions (91%)
+- Modules with `__all__`: 6 / 11 public modules
+- README: 1,266 lines with 16+ Mermaid diagrams
+- Architectural diagrams: yes
+
+### Security
+
+- eval/exec (sandboxed): 1 -- `math_expr.py:151`, restricted builtins `{}` + token blocklist
+- eval/exec (unsandboxed): 0
+- subprocess: 0
+- Hardcoded secrets: 0
+- Bare except: 0
+
+### Maintainability
+
+- TODO: 0
+- FIXME: 0
+- HACK: 0
+- Logging calls: 48
+- try/except blocks: 52
+- Internal imports (coupling): 190
+
+---
+
+## Scorecard
+
+| # | Dimension | Weight | Score | Weighted | Justification |
+|---|-----------|--------|-------|----------|---------------|
+| 1 | Test Health | 20% | 8.0 | 16.0 | 240/240 pass. 0.26 ratio (below 0.5 ideal). Integration tests with 5 real datasets. No coverage % configured. |
+| 2 | Type Safety | 15% | 6.5 | 9.8 | mypy strict enabled (strong). 49 errors remain, concentrated in web layer. Largest category is cosmetic (`type-arg`, 17 of 49). |
+| 3 | Lint Hygiene | 10% | 6.0 | 6.0 | 89 violations but 82% auto-fixable. Dominated by unused imports (F401). 8 duplicate dict keys need manual fix. |
+| 4 | Architecture | 15% | 9.0 | 13.5 | Clean 4-layer design. Plugin system. Immutable data patterns. 6/11 modules export `__all__`. No layer violations detected. |
+| 5 | Documentation | 10% | 9.0 | 9.0 | 91% docstring coverage. Comprehensive README with Mermaid diagrams. STATUS.md. No generated API reference. |
+| 6 | Complexity | 10% | 7.0 | 7.0 | Median 133 lines (healthy). 6 files >300 lines. `mme.py` (693 lines, branch count 80) is the outlier -- justified as a format parser. |
+| 7 | Security | 10% | 8.5 | 8.5 | Single eval with `{"__builtins__": {}}` + blocklist. No subprocess, no secrets. AST-based eval would eliminate residual risk. |
+| 8 | Maintainability | 10% | 8.5 | 8.5 | Zero debt markers. Zero bare excepts. Consistent logging. Modern tooling (uv, hatchling, ruff, mypy). |
+
+### Composite Score: **78.3 / 100**
+### Grade: **B**
+
+---
+
+## Delta from Previous Assessment
+
+| Dimension | Previous | Current | Change |
+|-----------|----------|---------|--------|
+| Test Health | -- | 8.0 | -- |
+| Type Safety | -- | 6.5 | -- |
+| Lint Hygiene | -- | 6.0 | -- |
+| Architecture | -- | 9.0 | -- |
+| Documentation | -- | 9.0 | -- |
+| Complexity | -- | 7.0 | -- |
+| Security | -- | 8.5 | -- |
+| Maintainability | -- | 8.5 | -- |
+| **Composite** | **--** | **78.3** | **baseline** |
+
+---
+
+## Top Improvements Since Last Assessment
+
+Baseline assessment -- no prior data.
+
+---
+
+## Recommended Actions (Priority Order)
+
+| # | Action | Effort | Impact | Dimensions Affected |
+|---|--------|--------|--------|---------------------|
+| 1 | Run `uv run ruff check --fix src/` to clear 73 auto-fixable violations | 1 min | +2.0 lint | Lint |
+| 2 | Fix 8 duplicate dict keys in `channel/lookup.py` (F601) | 10 min | +0.5 lint | Lint |
+| 3 | Add `--cov --cov-report=term` to pytest, target 80%+ | 30 min | +1.0 test | Test Health |
+| 4 | Resolve 17 `[type-arg]` mypy errors (add `dict[str, X]` generics) | 1 hr | +1.0 type | Type Safety |
+| 5 | Add `__all__` to `io`, `plugin`, `report`, `script`, `template` | 30 min | +0.5 arch | Architecture |
+
+**Projected score after all 5 actions: ~85 (B+)**
+
+---
+
+## Notes
+
+- **Architecture score is qualitative.** Import graph was inspected: no layer violations found (data layer does not import from web/plot). The `web` module correctly sits at the top of the dependency tree.
+- **Complexity scoring for `mme.py`** was lenient because format parsers inherently have high branch density. If it grows beyond ~800 lines, consider extracting sub-parsers.
+- **The eval in `math_expr.py`** is sandboxed (empty `__builtins__`, token blocklist for `import`, `exec`, `eval`, `subprocess`, `os`, `sys`, `__`). An AST-based evaluator would be safer but is lower priority given the blocklist approach.
+- **Test:source ratio of 0.26** is common for alpha-stage projects with stable computation modules. The ratio should improve as edge-case tests are added. Protocol and criteria modules are the highest-value targets for additional test coverage.
+- **Unused imports (F401)** account for 69% of all lint violations. These are likely remnants from refactoring and are trivially fixable.
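The AST-based evaluator mentioned in the notes above could look roughly like the sketch below. It is a hypothetical illustration, not the code in `math_expr.py`: the name `safe_eval`, the operator whitelist, and the `variables` mapping are all assumptions, and real channel math would presumably need more operators and functions.

```python
import ast
import operator

# Whitelisted operators (assumed set); anything not listed here is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str, variables: dict[str, float]) -> float:
    """Evaluate an arithmetic expression by walking its AST, without eval()."""

    def visit(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return visit(node.body)
        # Only plain int/float literals are accepted (bool, str, etc. are not).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        # Names resolve only against the caller-supplied mapping.
        if isinstance(node, ast.Name) and node.id in variables:
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](visit(node.operand))
        # Calls, attribute access, subscripts, etc. all end up here.
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return visit(ast.parse(expression, mode="eval"))


print(safe_eval("2 * (x + 3)", {"x": 4}))  # 14
```

Because only whitelisted node types are evaluated, an input such as `__import__('os')` is rejected as an unsupported `Call` node rather than filtered by a token blocklist.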
--- a/docs/QA-INSTRUCTIONS.md
+++ b/docs/QA-INSTRUCTIONS.md
@@ -0,0 +1,287 @@
+# Codebase Quality Assessment -- LLM Instructions
+
+**Purpose:** Reproducible, automated quality scoring for the Impakt codebase.
+Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into `docs/QA-<YYYY-MM-DD_HHMM>.md` using the template in `docs/QA-TEMPLATE.md`.
+
+---
+
+## Prerequisites
+
+```bash
+cd /Users/noise/Code/impakt
+uv sync --dev
+```
+
+All commands below are run from the project root; tool invocations use the `uv run` prefix.
+
+---
+
+## Step 1: Collect Raw Metrics
+
+Run these commands and record the output. They are independent and can run in parallel.
+
+### 1a. Inventory
+
+```bash
+# Source file count and line count
+find src -name "*.py" | wc -l
+find src -name "*.py" -exec cat {} + | wc -l
+
+# Test file count and line count
+find tests -name "*.py" | wc -l
+find tests -name "*.py" -exec cat {} + | wc -l
+```
+
+### 1b. Test Suite
+
+```bash
+# Test count and pass/fail
+uv run python -m pytest --tb=no -q 2>&1 | tail -5
+
+# Test collection (verify count)
+uv run python -m pytest --co -q 2>&1 | tail -3
+```
+
+### 1c. Type Safety
+
+```bash
+# mypy strict error count and file breakdown
+uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 | tail -3
+
+# Error category breakdown
+uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 \
+  | grep "error:" | sed 's/.*\[/[/' | sort | uniq -c | sort -rn
+```
+
+### 1d. Lint
+
+```bash
+# Ruff violation count
+uv run python -m ruff check src/ 2>&1 | tail -5
+
+# Violation breakdown by rule (--statistics prints one count per rule code)
+uv run python -m ruff check src/ --statistics 2>&1
+```
+
+### 1e. Complexity
+
+```bash
+# File size distribution
+find src/impakt -name "*.py" -exec wc -l {} + | grep -v total \
+  | awk '{print $1}' | sort -n | awk '
+BEGIN {n=0; sum=0}
+{a[n++]=$1; sum+=$1}
+END {
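+  # True median: middle value for odd n, mean of the two middle values for even n
+  median = (n % 2) ? a[(n-1)/2] : (a[n/2-1] + a[n/2]) / 2
+  printf "Min: %d, Max: %d, Mean: %.0f, Median: %g\n", a[0], a[n-1], sum/n, median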
+  over300=0; for(i=0;i<n;i++) if(a[i]>300) over300++
+  printf "Files >300 lines: %d / %d\n", over300, n
+}'
+
+# High-complexity files (branch count = if/elif/for/while/except lines + 1; listed when > 15)
+find src/impakt -name "*.py" ! -path "*__pycache__*" | while IFS= read -r f; do
+  ifs=$(grep -cE '^\s*(if |elif )' "$f" || true)
+  fors=$(grep -cE '^\s*(for |while )' "$f" || true)
+  excepts=$(grep -cE '^\s*except' "$f" || true)
+  complexity=$((ifs + fors + excepts + 1))
+  if [ "$complexity" -gt 15 ]; then
+    lines=$(wc -l < "$f")
+    echo "  $complexity  $f ($lines lines)"
+  fi
+done | sort -rn
+```
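The branch count above is line-based, so a docstring or string line that happens to start with `if ` or `for ` is counted too. As a cross-check, the same count can be sketched on the syntax tree. This is an illustrative alternative rather than part of the documented procedure; `branch_count` and `BRANCH_NODES` are hypothetical names, and switching methods between assessments would break comparability with the baseline numbers.

```python
import ast

# Node types corresponding to the keywords the grep patterns look for.
# An `elif` is represented as a nested ast.If, so it is counted as well.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler)


def branch_count(source: str) -> int:
    """Return 1 + the number of if/elif/for/while/except statements in source."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


sample = '''
def classify(values):
    """Label each value.

    if a docstring line starts with a keyword, a line-based grep counts it.
    """
    for v in values:
        if v < 0:
            yield "negative"
        elif v == 0:
            yield "zero"
        else:
            yield "positive"
'''

print(branch_count(sample))  # 4
```

For this sample the grep-based count would be 5, because the docstring line beginning with `if ` matches the pattern.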
+
+### 1f. Documentation
+
+```bash
+# Docstring coverage (approximate)
+echo "Docstrings: $(grep -rcE '^\s+"""' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
+echo "Definitions: $(grep -rcE '^\s*(def |class )' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
+
+# __all__ coverage
+for d in src/impakt/*/; do
+  [ -d "$d" ] || continue
+  mod=$(basename "$d")
+  init="${d}__init__.py"
+  [ -f "$init" ] && has_all=$(grep -c "__all__" "$init" || true) || has_all="-"
+  echo "  $mod: __all__=$has_all"
+done
+```
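The docstring grep above is approximate: it counts every indented line that starts with `"""`, so the closing delimiter of a multi-line docstring can be counted as a second docstring. A more precise count can be sketched with the standard `ast` module. The function name `docstring_coverage` is hypothetical, and this is an optional cross-check under those assumptions, not a replacement for the recorded command.

```python
import ast


def docstring_coverage(source: str) -> tuple[int, int]:
    """Return (documented, total) for function and class definitions in source."""
    tree = ast.parse(source)
    definitions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    documented = sum(1 for node in definitions if ast.get_docstring(node) is not None)
    return documented, len(definitions)


sample = '''
class Channel:
    """A measured signal."""

    def peak(self):
        """Return the maximum
        absolute value.
        """
        return 0.0

    def undocumented(self):
        return None
'''

documented, total = docstring_coverage(sample)
print(f"{documented} / {total}")  # 2 / 3
```

On the same sample the grep pattern matches three lines (two opening delimiters and one closing delimiter), which is how the shell figure can overstate coverage.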
+
+### 1g. Security
+
+```bash
+# Dangerous patterns
+echo "eval/exec: $(grep -rn 'eval(\|exec(' src/impakt/ --include='*.py' | grep -v '# noqa' | wc -l | tr -d ' ')"
+echo "subprocess/os.system: $(grep -rn 'subprocess\|os\.system\|os\.popen' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
+echo "bare except: $(grep -rn 'except:' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
+echo "hardcoded secrets: $(grep -rin 'password\s*=\|secret\s*=\|api_key\s*=\|token\s*=' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
+```
+
+### 1h. Maintainability
+
+```bash
+# Debt markers
+echo "TODO: $(grep -rc 'TODO' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+echo "FIXME: $(grep -rc 'FIXME' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+echo "HACK: $(grep -rc 'HACK' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+
+# Logging usage
+echo "logging calls: $(grep -rc 'logger\.\|logging\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+
+# Error handling
+echo "try/except blocks: $(grep -rc 'try:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+echo "bare excepts: $(grep -rc 'except:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+
+# Internal coupling
+echo "internal imports: $(grep -rc 'from impakt\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
+```
+
+---
+
+## Step 2: Score Each Dimension
+
+Use the raw metrics to assign a 0.0-10.0 score per dimension using these rubrics.
+
+### Test Health (weight: 20%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present |
+| 8 | 100% pass, ratio 0.2-0.5, integration tests present |
+| 6 | 100% pass, ratio < 0.2 or missing coverage measurement |
+| 4 | >95% pass |
+| 2 | >=80% pass |
+| 0 | <80% pass or no tests |
+
+### Type Safety (weight: 15%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | mypy strict, 0 errors |
+| 8 | mypy strict, <10 errors |
+| 6 | mypy strict, <50 errors OR mypy non-strict with 0 errors |
+| 4 | mypy configured but >=50 errors |
+| 2 | no type checker configured |
+| 0 | no type hints |
+
+### Lint Hygiene (weight: 10%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | 0 violations |
+| 8 | <10 violations, all auto-fixable |
+| 6 | <100 violations, mostly auto-fixable |
+| 4 | 100-500 violations |
+| 2 | >500 violations |
+| 0 | no linter configured |
+
+### Architecture (weight: 15%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | Clear layering, all modules have `__all__`, low coupling, plugin system |
+| 9 | Clear layering, most modules have `__all__`, plugin system |
+| 7 | Identifiable layers, some modules lack public API definition |
+| 5 | Flat structure, unclear module boundaries |
+| 3 | God objects, circular dependencies |
+| 0 | Single-file or completely unstructured |
+
+Evaluate qualitatively: read `__init__.py` files, check import patterns, assess layer violations (does `io` import from `web`? does `channel` depend on `plot`?).
+
+### Documentation (weight: 10%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | >90% docstring coverage, generated API docs, architectural docs with diagrams |
+| 8 | >80% docstring coverage, README with architecture, no generated API docs |
+| 6 | >60% docstring coverage, basic README |
+| 4 | <=60% docstring coverage |
+| 2 | README only |
+| 0 | No documentation |
+
+### Complexity (weight: 10%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | Median file <150 lines, 0 files >300, max complexity <15 |
+| 8 | Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly) |
+| 6 | Median <200, <=10 files >300 |
+| 4 | Median >=200 or many files >500 |
+| 2 | Multiple files >1000 lines |
+| 0 | Monolithic files, no decomposition |
+
+### Security (weight: 10%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | No eval/exec, no subprocess, no secrets, input validation at boundaries |
+| 9 | eval/exec present but sandboxed (restricted builtins, blocklist), no secrets |
+| 7 | eval/exec partially sandboxed, or subprocess with shell=False |
+| 5 | Unsandboxed eval/exec or subprocess with shell=True |
+| 3 | Hardcoded secrets present |
+| 0 | Multiple critical security anti-patterns |
+
+### Maintainability (weight: 10%)
+
+| Score | Criteria |
+|-------|----------|
+| 10 | 0 debt markers, consistent error handling, logging, modern tooling |
+| 8 | <5 debt markers, no bare excepts, logging present |
+| 6 | <20 debt markers, some bare excepts |
+| 4 | >=20 debt markers or many bare excepts |
+| 2 | Inconsistent patterns, no logging |
+| 0 | No discernible standards |
+
+---
+
+## Step 3: Compute Composite Score
+
+```
+composite = (
+    test_health     * 0.20 +
+    type_safety     * 0.15 +
+    lint_hygiene    * 0.10 +
+    architecture    * 0.15 +
+    documentation   * 0.10 +
+    complexity      * 0.10 +
+    security        * 0.10 +
+    maintainability * 0.10
+) * 10
+```
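As a worked check of the formula above, the weighted sum can be written as runnable Python. The scores are the baseline values from the 2026-04-11 report in this commit; the names `WEIGHTS` and `composite_score` are illustrative and not part of the Impakt codebase.

```python
# Weights from the rubric headings above; they sum to 1.0.
WEIGHTS = {
    "test_health": 0.20,
    "type_safety": 0.15,
    "lint_hygiene": 0.10,
    "architecture": 0.15,
    "documentation": 0.10,
    "complexity": 0.10,
    "security": 0.10,
    "maintainability": 0.10,
}


def composite_score(scores: dict[str, float]) -> float:
    """Return the 0-100 composite for a dict of 0-10 dimension scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items()) * 10


# Baseline scores from docs/QA-2026-04-11_0459.md.
baseline = {
    "test_health": 8.0,
    "type_safety": 6.5,
    "lint_hygiene": 6.0,
    "architecture": 9.0,
    "documentation": 9.0,
    "complexity": 7.0,
    "security": 8.5,
    "maintainability": 8.5,
}

print(round(composite_score(baseline), 2))  # 78.25
```

The unrounded result is 78.25, which the baseline report shows as 78.3.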
+
+### Grade Scale
+
+| Range | Grade | Meaning |
+|-------|-------|---------|
+| 90-100 | A | Production-grade, minimal tech debt |
+| 80-89 | B+ | Strong codebase, minor gaps |
+| 70-79 | B | Good foundation, actionable improvements exist |
+| 60-69 | C+ | Functional but accumulating debt |
+| 50-59 | C | Significant quality concerns |
+| <50 | D | Major remediation needed |
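The grade lookup can be sketched as below. The table lists integer ranges, so this sketch assumes that a fractional composite (for example 79.5) belongs to the band whose lower bound it reaches; that tie-breaking rule, and the names `GRADE_BANDS` and `grade_for`, are assumptions rather than something the table specifies.

```python
# (lower bound, grade) pairs taken from the table above, highest first.
GRADE_BANDS = [
    (90.0, "A"),
    (80.0, "B+"),
    (70.0, "B"),
    (60.0, "C+"),
    (50.0, "C"),
]


def grade_for(composite: float) -> str:
    """Map a 0-100 composite score to its letter grade."""
    if not 0.0 <= composite <= 100.0:
        raise ValueError(f"composite out of range: {composite}")
    for lower_bound, grade in GRADE_BANDS:
        if composite >= lower_bound:
            return grade
    return "D"


print(grade_for(78.3))  # B
```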
+
+---
+
+## Step 4: Write Report
+
+Copy `docs/QA-TEMPLATE.md` to `docs/QA-<YYYY-MM-DD_HHMM>.md` and fill in:
+
+1. All raw metric values
+2. All dimension scores with brief justification
+3. Composite score and grade
+4. Delta from previous assessment (if one exists)
+5. Top 3-5 actionable improvements
+6. Acknowledgment of any scoring judgment calls
+
+---
+
+## Notes for LLM Execution
+
+- Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores.
+- When scoring Architecture, actually read the import graph. Do not guess from file names alone.
+- For Security, read the context around any `eval`/`exec`/`subprocess` hits to assess sandboxing.
+- Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly.
+- If the project adds coverage measurement (`pytest-cov`), factor it into Test Health.
+- Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas.
+- Check for previous `QA-*.md` files in `docs/` to compute improvement trends.
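The linear interpolation mentioned in the notes above can be sketched as follows. The rubric rows give thresholds rather than exact anchor points, so which metric values are pinned to which scores is a judgment call; the function name `interpolate_score` and the anchors in the example are hypothetical.

```python
def interpolate_score(
    value: float,
    low_value: float,
    low_score: float,
    high_value: float,
    high_score: float,
) -> float:
    """Linearly interpolate a score for a metric lying between two anchor points."""
    if low_value == high_value:
        raise ValueError("anchor values must differ")
    lower, upper = sorted((low_value, high_value))
    if not lower <= value <= upper:
        raise ValueError("value lies outside the two anchors")
    fraction = (value - low_value) / (high_value - low_value)
    return low_score + fraction * (high_score - low_score)


# Hypothetical example: 85% docstring coverage, anchored at 80% -> 8.0 and 90% -> 10.0.
print(interpolate_score(85, 80, 8.0, 90, 10.0))  # 9.0
```

The same function handles metrics where a higher value is worse (such as error counts), since the score difference between the anchors is then negative. Whatever anchors are chosen should be recorded in the report's Notes section so later assessments can reuse them.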
--- a/docs/QA-TEMPLATE.md
+++ b/docs/QA-TEMPLATE.md
@@ -0,0 +1,147 @@
+# Quality Assessment -- [DATE]
+
+**Version:** [VERSION]
+**Assessed by:** [human / LLM model]
+**Previous assessment:** [filename or "None (baseline)"]
+
+---
+
+## Inventory
+
+| Metric | Value |
+|--------|-------|
+| Source files | |
+| Source lines | |
+| Test files | |
+| Test lines | |
+| Test:source ratio | |
+| Direct dependencies | |
+
+---
+
+## Raw Metrics
+
+### Test Suite
+
+```
+[paste pytest output]
+```
+
+- Tests collected:
+- Tests passed:
+- Tests failed:
+- Test duration:
+
+### Type Safety (mypy --strict)
+
+```
+[paste mypy summary]
+```
+
+- Total errors:
+- Files with errors: / total
+- Top error categories:
+
+### Lint (ruff)
+
+```
+[paste ruff summary]
+```
+
+- Total violations:
+- Auto-fixable:
+- Top violation rules:
+
+### Complexity
+
+- File size: min= / median= / mean= / max=
+- Files >300 lines: / total
+- High-complexity files (branch density >15):
+
+```
+[paste high-complexity file list]
+```
+
+### Documentation
+
+- Docstring coverage: / definitions ( %)
+- Modules with `__all__`:  / total
+- README: lines
+- Architectural diagrams: yes/no
+
+### Security
+
+- eval/exec (sandboxed):
+- eval/exec (unsandboxed):
+- subprocess:
+- Hardcoded secrets:
+- Bare except:
+
+### Maintainability
+
+- TODO:
+- FIXME:
+- HACK:
+- Logging calls:
+- try/except blocks:
+- Internal imports (coupling):
+
+---
+
+## Scorecard
+
+| # | Dimension | Weight | Score | Weighted | Justification |
+|---|-----------|--------|-------|----------|---------------|
+| 1 | Test Health | 20% | /10 | | |
+| 2 | Type Safety | 15% | /10 | | |
+| 3 | Lint Hygiene | 10% | /10 | | |
+| 4 | Architecture | 15% | /10 | | |
+| 5 | Documentation | 10% | /10 | | |
+| 6 | Complexity | 10% | /10 | | |
+| 7 | Security | 10% | /10 | | |
+| 8 | Maintainability | 10% | /10 | | |
+
+### Composite Score: **[ ] / 100**
+### Grade: **[ ]**
+
+---
+
+## Delta from Previous Assessment
+
+| Dimension | Previous | Current | Change |
+|-----------|----------|---------|--------|
+| Test Health | | | |
+| Type Safety | | | |
+| Lint Hygiene | | | |
+| Architecture | | | |
+| Documentation | | | |
+| Complexity | | | |
+| Security | | | |
+| Maintainability | | | |
+| **Composite** | | | |
+
+---
+
+## Top Improvements Since Last Assessment
+
+1.
+2.
+3.
+
+---
+
+## Recommended Actions (Priority Order)
+
+| # | Action | Effort | Impact | Dimensions Affected |
+|---|--------|--------|--------|---------------------|
+| 1 | | | | |
+| 2 | | | | |
+| 3 | | | | |
+| 4 | | | | |
+| 5 | | | | |
+
+---
+
+## Notes
+
+[Any judgment calls, caveats, or context for this assessment.]