# Codebase Quality Assessment -- LLM Instructions
Purpose: Reproducible, automated quality scoring for the Impakt codebase.

Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into `docs/QA-<YYYY-MM-DD_HHMM>.md` using the template in `docs/QA-TEMPLATE.md`.
## Prerequisites
All commands below assume the `uv run` prefix and the project root as the working directory.
## Step 1: Collect Raw Metrics
Run these commands and record the output. They are independent and can run in parallel.
### 1a. Inventory
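File and line counts to size the codebase and to compute the test:source ratio used in the Test Health rubric. A sketch assuming a `src/` plus `tests/` layout (adjust paths to the actual tree):

```bash
# Count source files, test files, and total Python lines
find src -name '*.py' | wc -l
find tests -name '*.py' | wc -l
find src tests -name '*.py' -print0 | xargs -0 cat | wc -l
```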
### 1b. Test Suite
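Pass/fail counts, plus coverage if the project measures it. Assumes pytest is the runner; the `--cov` flag and its `src` target only apply when `pytest-cov` is installed:

```bash
uv run pytest -q              # pass/fail/skip summary
uv run pytest -q --cov=src    # coverage percentage, only if pytest-cov is present
```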
### 1c. Type Safety
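Error count from mypy, which the Type Safety rubric assumes. Note whether the project's configuration enables strict mode, since the rubric scores strict and non-strict differently:

```bash
uv run mypy src               # adjust the target to the configured paths
```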
### 1d. Lint
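Total violation count and the per-rule breakdown. A sketch assuming ruff; substitute whichever linter the project configures:

```bash
uv run ruff check .                 # list all violations
uv run ruff check . --statistics    # counts grouped by rule
```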
### 1e. Complexity
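File-length distribution plus cyclomatic complexity. radon is one option for the latter (an assumption, not a declared dependency):

```bash
find src -name '*.py' -exec wc -l {} + | sort -n   # longest files last
uv run radon cc src -s -a                          # per-function complexity and average
```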
### 1f. Documentation
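Docstring coverage percentage. interrogate is one tool that reports it (again an assumption; any docstring-coverage tool works):

```bash
uv run interrogate -v src     # percent of modules/classes/functions with docstrings
```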
### 1g. Security
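Grep for the patterns the Security rubric scores; bandit adds a broader scan if it happens to be installed:

```bash
grep -rn -E 'eval\(|exec\(|subprocess' src --include='*.py'
uv run bandit -r src          # optional, only if bandit is available
```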
### 1h. Maintainability
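Debt markers and bare excepts, both referenced by the Maintainability rubric:

```bash
grep -rn -E 'TODO|FIXME|HACK|XXX' src --include='*.py' | wc -l   # debt markers
grep -rn 'except[[:space:]]*:' src --include='*.py'              # bare excepts
```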
## Step 2: Score Each Dimension
Use the raw metrics to assign a 0.0-10.0 score per dimension using these rubrics.
### Test Health (weight: 20%)

| Score | Criteria |
|-------|----------|
| 10 | 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present |
| 8 | 100% pass, ratio 0.2-0.5, integration tests present |
| 6 | 100% pass, ratio < 0.2 or missing coverage measurement |
| 4 | >95% pass |
| 2 | >80% pass |
| 0 | <80% pass or no tests |
### Type Safety (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | mypy strict, 0 errors |
| 8 | mypy strict, <10 errors |
| 6 | mypy strict, <50 errors OR mypy non-strict with 0 errors |
| 4 | mypy configured but >50 errors |
| 2 | no type checker configured |
| 0 | no type hints |
### Lint Hygiene (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 violations |
| 8 | <10 violations, all auto-fixable |
| 6 | <100 violations, mostly auto-fixable |
| 4 | 100-500 violations |
| 2 | >500 violations |
| 0 | no linter configured |
### Architecture (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | Clear layering, all modules have `__all__`, low coupling, plugin system |
| 9 | Clear layering, most modules have `__all__`, plugin system |
| 7 | Identifiable layers, some modules lack public API definition |
| 5 | Flat structure, unclear module boundaries |
| 3 | God objects, circular dependencies |
| 0 | Single-file or completely unstructured |
Evaluate qualitatively: read `__init__.py` files, check import patterns, and assess layer violations (does `io` import from `web`? does `channel` depend on `plot`?).
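One quick way to actually read the import graph, assuming the package lives at `src/impakt` (an assumed path; adjust to the real layout):

```bash
# List every import with its module, then scan for lower layers importing higher ones
grep -rn -E '^[[:space:]]*(from|import) ' src/impakt --include='*.py' | sort
```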
### Documentation (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | >90% docstring coverage, generated API docs, architectural docs with diagrams |
| 8 | >80% docstring coverage, README with architecture, no generated API docs |
| 6 | >60% docstring coverage, basic README |
| 4 | <60% docstring coverage |
| 2 | README only |
| 0 | No documentation |
### Complexity (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | Median file <150 lines, 0 files >300, max complexity <15 |
| 8 | Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly) |
| 6 | Median <200, <=10 files >300 |
| 4 | Median >200 or many files >500 |
| 2 | Multiple files >1000 lines |
| 0 | Monolithic files, no decomposition |
### Security (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | No `eval`/`exec`, no `subprocess`, no secrets, input validation at boundaries |
| 9 | `eval`/`exec` present but sandboxed (restricted builtins, blocklist), no secrets |
| 7 | `eval`/`exec` partially sandboxed, or `subprocess` with `shell=False` |
| 5 | Unsandboxed `eval`/`exec` or `subprocess` with `shell=True` |
| 3 | Hardcoded secrets present |
| 0 | Multiple critical security anti-patterns |
### Maintainability (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 debt markers, consistent error handling, logging, modern tooling |
| 8 | <5 debt markers, no bare excepts, logging present |
| 6 | <20 debt markers, some bare excepts |
| 4 | >20 debt markers or many bare excepts |
| 2 | Inconsistent patterns, no logging |
| 0 | No discernible standards |
## Step 3: Compute Composite Score
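Multiply each dimension score (0-10) by its weight, sum, and scale to 0-100. A minimal sketch, assuming the straightforward weighted-average reading of the rubric (the weights above sum to 100%):

```
composite = 10 * (0.20*TestHealth + 0.15*TypeSafety + 0.10*Lint + 0.15*Architecture
                  + 0.10*Documentation + 0.10*Complexity + 0.10*Security + 0.10*Maintainability)
```

For example, hypothetical dimension scores of 8, 7, 9, 8, 6, 8, 9, and 7 (in the order above) give 10 * 7.75 = 77.5, grade B on the scale below.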
### Grade Scale

| Range | Grade | Meaning |
|-------|-------|---------|
| 90-100 | A | Production-grade, minimal tech debt |
| 80-89 | B+ | Strong codebase, minor gaps |
| 70-79 | B | Good foundation, actionable improvements exist |
| 60-69 | C+ | Functional but accumulating debt |
| 50-59 | C | Significant quality concerns |
| <50 | D | Major remediation needed |
## Step 4: Write Report

Copy `docs/QA-TEMPLATE.md` to `docs/QA-<YYYY-MM-DD_HHMM>.md` and fill in:
- Summary block at the top (immediately after the `#` heading):
  - Blockquote with composite score, grade, and one-sentence summary
  - Quick-reference dimension score table
  - The summary sentence should note the overall quality posture and the most significant change since the last assessment (or "Baseline assessment" if this is the first run)
- All raw metric values
- All dimension scores with brief justification
- Composite score and grade
- Delta from previous assessment (if one exists)
- Top 3-5 actionable improvements
- Acknowledgment of any scoring judgment calls
## Notes for LLM Execution
- Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores.
- When scoring Architecture, actually read the import graph. Do not guess from file names alone.
- For Security, read the context around any `eval`/`exec`/`subprocess` hits to assess sandboxing.
- Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly (for example, 30 mypy strict errors sits midway between the <50 row, scored 6, and the <10 row, scored 8, so score it 7).
- If the project adds coverage measurement (`pytest-cov`), factor it into Test Health.
- Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas.
- Check for previous `QA-*.md` files in `docs/` to compute improvement trends.