diff --git a/docs/QA-2026-04-11_0459.md b/docs/QA-2026-04-11_0459.md new file mode 100644 index 0000000..38227be --- /dev/null +++ b/docs/QA-2026-04-11_0459.md @@ -0,0 +1,174 @@ +# Quality Assessment -- 2026-04-11 + +**Version:** 0.1.0 +**Assessed by:** Claude Opus 4.6 +**Previous assessment:** None (baseline) + +--- + +## Inventory + +| Metric | Value | +|--------|-------| +| Source files | 72 | +| Source lines | 10,325 | +| Test files | 30 | +| Test lines | 2,736 | +| Test:source ratio | 0.26 | +| Direct dependencies | 10 core + 1 optional + 4 dev | + +--- + +## Raw Metrics + +### Test Suite + +``` +240 passed, 7 warnings in 9.15s +``` + +- Tests collected: 240 +- Tests passed: 240 +- Tests failed: 0 +- Test duration: 9.15s + +### Type Safety (mypy --strict) + +``` +Found 49 errors in 20 files (checked 72 source files) +``` + +- Total errors: 49 +- Files with errors: 20 / 72 (72% clean) +- Top error categories: + - `[type-arg]` 17 -- missing generic parameters on `dict`, `list` + - `[attr-defined]` 9 -- attribute access on loosely typed objects + - `[no-any-return]` 5 -- returning `Any` from typed functions + - `[var-annotated]` 3 + - `[import-untyped]` 3 + - `[assignment]` 3 + +### Lint (ruff) + +``` +Found 89 errors. +[*] 73 fixable with the `--fix` option (9 hidden fixes can be enabled with the `--unsafe-fixes` option). 
+``` + +- Total violations: 89 +- Auto-fixable: 73 (82%) +- Top violation rules: + - `F401` 61 -- unused imports + - `I001` 9 -- unsorted imports + - `F601` 8 -- duplicate dictionary keys + - `E501` 5 -- line too long + - `F841` 3 -- unused variables + - `F541` 1 -- f-string without placeholder + +### Complexity + +- File size: min=1 / median=133 / mean=143 / max=693 +- Files >300 lines: 6 / 72 +- High-complexity files (branch density >15): + +``` + 80 src/impakt/io/mme.py (693 lines) -- ISO 13499 parser, justified + 44 src/impakt/web/components/criteria.py (343) -- UI assembly with protocol logic + 30 src/impakt/channel/model.py (456) -- core data model, multiple classes + 27 src/impakt/web/state.py (274) -- app state with multi-test support + 27 src/impakt/protocol/euro_ncap.py (238) -- sliding-scale scoring tables + 25 src/impakt/web/callbacks/plot_callbacks.py (249) -- transform pipeline orchestration + 21 src/impakt/protocol/iihs.py (180) -- G/A/M/P rating logic + 20 src/impakt/plot/engine.py (257) -- Plotly rendering with corridors + 19 src/impakt/script/cli.py (140) -- CLI arg parsing + 17 src/impakt/web/components/channel_grid.py (368) -- DataTable assembly + 16 src/impakt/web/callbacks/channel_callbacks.py (195) -- selection/filter callbacks +``` + +### Documentation + +- Docstring coverage: 414 / 454 definitions (91%) +- Modules with `__all__`: 6 / 11 public modules +- README: 1,266 lines with 16+ Mermaid diagrams +- Architectural diagrams: yes + +### Security + +- eval/exec (sandboxed): 1 -- `math_expr.py:151`, restricted builtins `{}` + token blocklist +- eval/exec (unsandboxed): 0 +- subprocess: 0 +- Hardcoded secrets: 0 +- Bare except: 0 + +### Maintainability + +- TODO: 0 +- FIXME: 0 +- HACK: 0 +- Logging calls: 48 +- try/except blocks: 52 +- Internal imports (coupling): 190 + +--- + +## Scorecard + +| # | Dimension | Weight | Score | Weighted | Justification | +|---|-----------|--------|-------|----------|---------------| +| 1 | Test Health | 20% | 
8.0 | 16.0 | 240/240 pass. 0.26 ratio (below 0.5 ideal). Integration tests with 5 real datasets. No coverage % configured. | +| 2 | Type Safety | 15% | 6.5 | 9.8 | mypy strict enabled (strong). 49 errors remain, concentrated in web layer. Mostly cosmetic (`type-arg`). | +| 3 | Lint Hygiene | 10% | 6.0 | 6.0 | 89 violations but 82% auto-fixable. Dominated by unused imports (F401). 8 duplicate dict keys need manual fix. | +| 4 | Architecture | 15% | 9.0 | 13.5 | Clean 4-layer design. Plugin system. Immutable data patterns. 6/11 modules export `__all__`. No layer violations detected. | +| 5 | Documentation | 10% | 9.0 | 9.0 | 91% docstring coverage. Comprehensive README with Mermaid diagrams. STATUS.md. No generated API reference. | +| 6 | Complexity | 10% | 7.0 | 7.0 | Median 133 (healthy). 6 files >300. `mme.py` at 693/80 complexity is the outlier -- justified as a format parser. | +| 7 | Security | 10% | 8.5 | 8.5 | Single eval with `{"__builtins__": {}}` + blocklist. No subprocess, no secrets. AST-based eval would eliminate residual risk. | +| 8 | Maintainability | 10% | 8.5 | 8.5 | Zero debt markers. Zero bare excepts. Consistent logging. Modern tooling (uv, hatchling, ruff, mypy). | + +### Composite Score: **78.3 / 100** +### Grade: **B** + +--- + +## Delta from Previous Assessment + +| Dimension | Previous | Current | Change | +|-----------|----------|---------|--------| +| Test Health | -- | 8.0 | -- | +| Type Safety | -- | 6.5 | -- | +| Lint Hygiene | -- | 6.0 | -- | +| Architecture | -- | 9.0 | -- | +| Documentation | -- | 9.0 | -- | +| Complexity | -- | 7.0 | -- | +| Security | -- | 8.5 | -- | +| Maintainability | -- | 8.5 | -- | +| **Composite** | **--** | **78.3** | **baseline** | + +--- + +## Top Improvements Since Last Assessment + +Baseline assessment -- no prior data. 
+ +--- + +## Recommended Actions (Priority Order) + +| # | Action | Effort | Impact | Dimensions Affected | +|---|--------|--------|--------|---------------------| +| 1 | Run `uv run ruff check --fix src/` to clear 73 auto-fixable violations | 1 min | +2.0 lint | Lint | +| 2 | Fix 8 duplicate dict keys in `channel/lookup.py` (F601) | 10 min | +0.5 lint | Lint | +| 3 | Add `--cov --cov-report=term` to pytest, target 80%+ | 30 min | +1.0 test | Test Health | +| 4 | Resolve 17 `[type-arg]` mypy errors (add `dict[str, X]` generics) | 1 hr | +1.0 type | Type Safety | +| 5 | Add `__all__` to `io`, `plugin`, `report`, `script`, `template` | 30 min | +0.5 arch | Architecture | + +**Projected score after all 5 actions: ~85 (B+)**, i.e. 78.3 plus weighted gains of 2.5 (lint), 2.0 (test), 1.5 (type), and 0.75 (arch) + +--- + +## Notes + +- **Architecture score is qualitative.** Import graph was inspected: no layer violations found (data layer does not import from web/plot). The `web` module correctly sits at the top of the dependency tree. +- **Complexity scoring for `mme.py`** was lenient because format parsers inherently have high branch density. If it grows beyond ~800 lines, consider extracting sub-parsers. +- **The eval in `math_expr.py`** is sandboxed (empty `__builtins__`, token blocklist for `import`, `exec`, `eval`, `subprocess`, `os`, `sys`, `__`). An AST-based evaluator would be safer but is lower priority given the blocklist approach. +- **Test:source ratio of 0.26** is common for alpha-stage projects with stable computation modules. The ratio should improve as edge-case tests are added. Protocol and criteria modules are the highest-value targets for additional test coverage. +- **Unused imports (F401)** account for 69% of all lint violations. These are likely remnants from refactoring and are trivially fixable. 
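The AST-based evaluator mentioned in the `math_expr.py` note could look roughly like the sketch below. This is an illustration only, not the project's code: `safe_eval` and its allowlist are hypothetical names, and it assumes expressions need only numeric literals, named variables, and basic arithmetic. The idea is to allow a fixed set of node types and reject everything else, instead of blocking known-bad tokens.

```python
import ast
import operator

# Allowlisted operators (an assumption about what expressions need).
# Exponentiation is left out on purpose: inputs like 9**9**9 are expensive.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression, variables=None):
    """Evaluate an arithmetic expression by walking its AST (hypothetical helper)."""
    variables = variables or {}

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept numeric literals only; bool is excluded explicitly.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.Name):
            if node.id in variables:
                return variables[node.id]
        elif isinstance(node, ast.BinOp):
            if type(node.op) in _BINARY_OPS:
                return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp):
            if type(node.op) in _UNARY_OPS:
                return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Calls, attribute access, subscripts, etc. all end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))
```

With this shape, `safe_eval("a / 2", {"a": 10.0})` evaluates normally, while `safe_eval("__import__('os')")` raises `ValueError` because `ast.Call` is not on the allowlist. Anything not explicitly permitted is rejected, which is the main advantage over a token blocklist.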
diff --git a/docs/QA-INSTRUCTIONS.md b/docs/QA-INSTRUCTIONS.md new file mode 100644 index 0000000..626065e --- /dev/null +++ b/docs/QA-INSTRUCTIONS.md @@ -0,0 +1,287 @@ +# Codebase Quality Assessment -- LLM Instructions + +**Purpose:** Reproducible, automated quality scoring for the Impakt codebase. +Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into `docs/QA-.md` using the template in `docs/QA-TEMPLATE.md`. + +--- + +## Prerequisites + +```bash +cd /Users/noise/Code/impakt +uv sync --dev +``` + +All commands below assume `uv run` prefix and the project root as working directory. + +--- + +## Step 1: Collect Raw Metrics + +Run these commands and record the output. They are independent and can run in parallel. + +### 1a. Inventory + +```bash +# Source file count and line count +find src -name "*.py" | wc -l +find src -name "*.py" -exec cat {} + | wc -l + +# Test file count and line count +find tests -name "*.py" | wc -l +find tests -name "*.py" -exec cat {} + | wc -l +``` + +### 1b. Test Suite + +```bash +# Test count and pass/fail +uv run python -m pytest --tb=no -q 2>&1 | tail -5 + +# Test collection (verify count) +uv run python -m pytest --co -q 2>&1 | tail -3 +``` + +### 1c. Type Safety + +```bash +# mypy strict error count and file breakdown +uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 | tail -3 + +# Error category breakdown +uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 \ + | grep "error:" | sed 's/.*\[/[/' | sort | uniq -c | sort -rn +``` + +### 1d. Lint + +```bash +# Ruff violation count +uv run python -m ruff check src/ 2>&1 | tail -5 + +# Violation breakdown by rule +uv run python -m ruff check src/ 2>&1 \ + | grep -oE '[A-Z][0-9]+' | sort | uniq -c | sort -rn +``` + +### 1e. 
Complexity + +```bash +# File size distribution +find src/impakt -name "*.py" -exec wc -l {} + | grep -v total \ + | awk '{print $1}' | sort -n | awk ' +BEGIN {n=0; sum=0} +{a[n++]=$1; sum+=$1} +END { + printf "Min: %d, Max: %d, Mean: %.0f, Median: %d\n", a[0], a[n-1], sum/n, a[int(n/2)] + over300=0; for(i=0;i<n;i++) if(a[i]>300) over300++ + printf "Files >300 lines: %d / %d\n", over300, n +}' + +# High-complexity files (branch density) +find src/impakt -name "*.py" ! -path "*__pycache__*" | while read f; do + ifs=$(grep -cE '^\s*(if |elif )' "$f" || true) + fors=$(grep -cE '^\s*(for |while )' "$f" || true) + excepts=$(grep -cE '^\s*except' "$f" || true) + complexity=$((ifs + fors + excepts + 1)) + if [ "$complexity" -gt 15 ]; then + lines=$(wc -l < "$f") + echo " $complexity $f ($lines lines)" + fi +done | sort -rn +``` + +### 1f. Documentation + +```bash +# Docstring coverage (approximate) +echo "Docstrings: $(grep -rcE '^\s+"""' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')" +echo "Definitions: $(grep -rcE '^\s*(def |class )' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')" + +# __all__ coverage ("-" means the package has no __init__.py) +for d in src/impakt/*/; do + [ -d "$d" ] || continue + mod=$(basename "$d") + init="${d}__init__.py" + # grep -c already prints 0 when nothing matches; "|| true" only absorbs its non-zero exit status + [ -f "$init" ] && has_all=$(grep -c "__all__" "$init" || true) || has_all="-" + echo " $mod: __all__=$has_all" +done +``` + +### 1g. Security + +```bash +# Dangerous patterns +echo "eval/exec: $(grep -rn 'eval(\|exec(' src/impakt/ --include='*.py' | grep -v '# noqa' | wc -l | tr -d ' ')" +echo "subprocess/os.system: $(grep -rn 'subprocess\|os\.system\|os\.popen' src/impakt/ --include='*.py' | wc -l | tr -d ' ')" +echo "bare except: $(grep -rn 'except:' src/impakt/ --include='*.py' | wc -l | tr -d ' ')" +echo "hardcoded secrets: $(grep -rin 'password\s*=\|secret\s*=\|api_key\s*=\|token\s*=' src/impakt/ --include='*.py' | wc -l | tr -d ' ')" +``` + +### 1h. 
Maintainability + +```bash +# Debt markers +echo "TODO: $(grep -rc 'TODO' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" +echo "FIXME: $(grep -rc 'FIXME' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" +echo "HACK: $(grep -rc 'HACK' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" + +# Logging usage +echo "logging calls: $(grep -rc 'logger\.\|logging\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" + +# Error handling +echo "try/except blocks: $(grep -rc 'try:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" +echo "bare excepts: $(grep -rc 'except:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" + +# Internal coupling +echo "internal imports: $(grep -rc 'from impakt\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')" +``` + +--- + +## Step 2: Score Each Dimension + +Use the raw metrics to assign a 0.0-10.0 score per dimension using these rubrics. + +### Test Health (weight: 20%) + +| Score | Criteria | +|-------|----------| +| 10 | 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present | +| 8 | 100% pass, ratio 0.2-0.5, integration tests present | +| 6 | 100% pass, ratio < 0.2 or missing coverage measurement | +| 4 | >95% pass | +| 2 | >80% pass | +| 0 | <80% pass or no tests | + +### Type Safety (weight: 15%) + +| Score | Criteria | +|-------|----------| +| 10 | mypy strict, 0 errors | +| 8 | mypy strict, <10 errors | +| 6 | mypy strict, <50 errors OR mypy non-strict with 0 errors | +| 4 | mypy configured but >50 errors | +| 2 | no type checker configured | +| 0 | no type hints | + +### Lint Hygiene (weight: 10%) + +| Score | Criteria | +|-------|----------| +| 10 | 0 violations | +| 8 | <10 violations, all auto-fixable | +| 6 | <100 violations, mostly auto-fixable | +| 4 | 100-500 violations | +| 2 | >500 violations | +| 0 | no linter configured | + +### Architecture (weight: 15%) + +| Score | Criteria | 
+|-------|----------| +| 10 | Clear layering, all modules have `__all__`, low coupling, plugin system | +| 9 | Clear layering, most modules have `__all__`, plugin system | +| 7 | Identifiable layers, some modules lack public API definition | +| 5 | Flat structure, unclear module boundaries | +| 3 | God objects, circular dependencies | +| 0 | Single-file or completely unstructured | + +Evaluate qualitatively: read `__init__.py` files, check import patterns, assess layer violations (does `io` import from `web`? does `channel` depend on `plot`?). + +### Documentation (weight: 10%) + +| Score | Criteria | +|-------|----------| +| 10 | >90% docstring coverage, generated API docs, architectural docs with diagrams | +| 8 | >80% docstring coverage, README with architecture, no generated API docs | +| 6 | >60% docstring coverage, basic README | +| 4 | <60% docstring coverage | +| 2 | README only | +| 0 | No documentation | + +### Complexity (weight: 10%) + +| Score | Criteria | +|-------|----------| +| 10 | Median file <150 lines, 0 files >300, max complexity <15 | +| 8 | Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly) | +| 6 | Median <200, <=10 files >300 | +| 4 | Median >200 or many files >500 | +| 2 | Multiple files >1000 lines | +| 0 | Monolithic files, no decomposition | + +### Security (weight: 10%) + +| Score | Criteria | +|-------|----------| +| 10 | No eval/exec, no subprocess, no secrets, input validation at boundaries | +| 9 | eval/exec present but sandboxed (restricted builtins, blocklist), no secrets | +| 7 | eval/exec partially sandboxed, or subprocess with shell=False | +| 5 | Unsandboxed eval/exec or subprocess with shell=True | +| 3 | Hardcoded secrets present | +| 0 | Multiple critical security anti-patterns | + +### Maintainability (weight: 10%) + +| Score | Criteria | +|-------|----------| +| 10 | 0 debt markers, consistent error handling, logging, modern tooling | +| 8 | <5 debt markers, no bare excepts, logging 
present | +| 6 | <20 debt markers, some bare excepts | +| 4 | >20 debt markers or many bare excepts | +| 2 | Inconsistent patterns, no logging | +| 0 | No discernible standards | + +--- + +## Step 3: Compute Composite Score + +``` +composite = ( + test_health * 0.20 + + type_safety * 0.15 + + lint_hygiene * 0.10 + + architecture * 0.15 + + documentation * 0.10 + + complexity * 0.10 + + security * 0.10 + + maintainability * 0.10 +) * 10 +``` + +### Grade Scale + +| Range | Grade | Meaning | +|-------|-------|---------| +| 90-100 | A | Production-grade, minimal tech debt | +| 80-89 | B+ | Strong codebase, minor gaps | +| 70-79 | B | Good foundation, actionable improvements exist | +| 60-69 | C+ | Functional but accumulating debt | +| 50-59 | C | Significant quality concerns | +| <50 | D | Major remediation needed | + +--- + +## Step 4: Write Report + +Copy `docs/QA-TEMPLATE.md` to `docs/QA-.md` and fill in: + +1. All raw metric values +2. All dimension scores with brief justification +3. Composite score and grade +4. Delta from previous assessment (if one exists) +5. Top 3-5 actionable improvements +6. Acknowledgment of any scoring judgment calls + +--- + +## Notes for LLM Execution + +- Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores. +- When scoring Architecture, actually read the import graph. Do not guess from file names alone. +- For Security, read the context around any `eval`/`exec`/`subprocess` hits to assess sandboxing. +- Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly. +- If the project adds coverage measurement (`pytest-cov`), factor it into Test Health. +- Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas. +- Check for previous `QA-*.md` files in `docs/` to compute improvement trends. 
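The Step 3 formula and grade scale can be expressed as a small helper so that every assessment computes the composite the same way. This is a minimal sketch, not part of the repository: `composite_score`, `grade`, and `GRADE_FLOORS` are hypothetical names. It assumes a composite that falls between two listed ranges (for example 79.5) takes the lower band's grade, which the grade table does not state.

```python
# Weights copied from the Step 3 formula; they sum to 1.0.
WEIGHTS = {
    "test_health": 0.20,
    "type_safety": 0.15,
    "lint_hygiene": 0.10,
    "architecture": 0.15,
    "documentation": 0.10,
    "complexity": 0.10,
    "security": 0.10,
    "maintainability": 0.10,
}

# Lower bound of each grade band, highest first.
# Assumption: bands are half-open, so 79.5 is still a "B".
GRADE_FLOORS = [(90, "A"), (80, "B+"), (70, "B"), (60, "C+"), (50, "C")]


def composite_score(scores):
    """Return the 0-100 composite for a mapping of dimension -> 0-10 score."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return 10 * sum(scores[name] * weight for name, weight in WEIGHTS.items())


def grade(composite):
    """Map a composite score to the letter grade from the grade scale."""
    for floor, letter in GRADE_FLOORS:
        if composite >= floor:
            return letter
    return "D"


# Dimension scores taken from the 2026-04-11 baseline scorecard.
baseline = {
    "test_health": 8.0,
    "type_safety": 6.5,
    "lint_hygiene": 6.0,
    "architecture": 9.0,
    "documentation": 9.0,
    "complexity": 7.0,
    "security": 8.5,
    "maintainability": 8.5,
}

total = composite_score(baseline)
print(f"{total:.2f} -> {grade(total)}")
```

For the baseline scores the unrounded composite is 78.25, which the report shows as 78.3 with grade B. Keeping the weights in a single mapping also makes it easy to check that they still sum to 1.0 if a dimension is added later.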
diff --git a/docs/QA-TEMPLATE.md b/docs/QA-TEMPLATE.md new file mode 100644 index 0000000..b34609f --- /dev/null +++ b/docs/QA-TEMPLATE.md @@ -0,0 +1,147 @@ +# Quality Assessment -- [DATE] + +**Version:** [VERSION] +**Assessed by:** [human / LLM model] +**Previous assessment:** [filename or "None (baseline)"] + +--- + +## Inventory + +| Metric | Value | +|--------|-------| +| Source files | | +| Source lines | | +| Test files | | +| Test lines | | +| Test:source ratio | | +| Direct dependencies | | + +--- + +## Raw Metrics + +### Test Suite + +``` +[paste pytest output] +``` + +- Tests collected: +- Tests passed: +- Tests failed: +- Test duration: + +### Type Safety (mypy --strict) + +``` +[paste mypy summary] +``` + +- Total errors: +- Files with errors: / total +- Top error categories: + +### Lint (ruff) + +``` +[paste ruff summary] +``` + +- Total violations: +- Auto-fixable: +- Top violation rules: + +### Complexity + +- File size: min= / median= / mean= / max= +- Files >300 lines: / total +- High-complexity files (branch density >15): + +``` +[paste high-complexity file list] +``` + +### Documentation + +- Docstring coverage: / definitions ( %) +- Modules with `__all__`: / total +- README: lines +- Architectural diagrams: yes/no + +### Security + +- eval/exec (sandboxed): +- eval/exec (unsandboxed): +- subprocess: +- Hardcoded secrets: +- Bare except: + +### Maintainability + +- TODO: +- FIXME: +- HACK: +- Logging calls: +- try/except blocks: +- Internal imports (coupling): + +--- + +## Scorecard + +| # | Dimension | Weight | Score | Weighted | Justification | +|---|-----------|--------|-------|----------|---------------| +| 1 | Test Health | 20% | /10 | | | +| 2 | Type Safety | 15% | /10 | | | +| 3 | Lint Hygiene | 10% | /10 | | | +| 4 | Architecture | 15% | /10 | | | +| 5 | Documentation | 10% | /10 | | | +| 6 | Complexity | 10% | /10 | | | +| 7 | Security | 10% | /10 | | | +| 8 | Maintainability | 10% | /10 | | | + +### Composite Score: **[ ] / 100** +### 
Grade: **[ ]** + +--- + +## Delta from Previous Assessment + +| Dimension | Previous | Current | Change | +|-----------|----------|---------|--------| +| Test Health | | | | +| Type Safety | | | | +| Lint Hygiene | | | | +| Architecture | | | | +| Documentation | | | | +| Complexity | | | | +| Security | | | | +| Maintainability | | | | +| **Composite** | | | | + +--- + +## Top Improvements Since Last Assessment + +1. +2. +3. + +--- + +## Recommended Actions (Priority Order) + +| # | Action | Effort | Impact | Dimensions Affected | +|---|--------|--------|--------|---------------------| +| 1 | | | | | +| 2 | | | | | +| 3 | | | | | +| 4 | | | | | +| 5 | | | | | + +--- + +## Notes + +[Any judgment calls, caveats, or context for this assessment.]