
# Codebase Quality Assessment -- LLM Instructions

**Purpose:** Reproducible, automated quality scoring for the Impakt codebase. Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into `docs/QA-<DATETIME>.md` using the template in `docs/QA-TEMPLATE.md`.


## Prerequisites

```bash
cd /Users/noise/Code/impakt
uv sync --dev
```

All commands below are run from the project root; Python tooling is invoked with the `uv run` prefix.


## Step 1: Collect Raw Metrics

Run these commands and record the output. They are independent and can run in parallel.

### 1a. Inventory

```bash
# Source file count and line count
find src -name "*.py" | wc -l
find src -name "*.py" -exec cat {} + | wc -l

# Test file count and line count
find tests -name "*.py" | wc -l
find tests -name "*.py" -exec cat {} + | wc -l
```

### 1b. Test Suite

```bash
# Test count and pass/fail
uv run python -m pytest --tb=no -q 2>&1 | tail -5

# Test collection (verify count)
uv run python -m pytest --co -q 2>&1 | tail -3
```

### 1c. Type Safety

```bash
# mypy strict error count and file breakdown
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 | tail -3

# Error category breakdown
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 \
  | grep "error:" | sed 's/.*\[/[/' | sort | uniq -c | sort -rn
```

### 1d. Lint

```bash
# Ruff violation count
uv run python -m ruff check src/ 2>&1 | tail -5

# Violation breakdown by rule (codes may have multi-letter prefixes, e.g. UP035)
uv run python -m ruff check src/ 2>&1 \
  | grep -oE '[A-Z]+[0-9]+' | sort | uniq -c | sort -rn
```

### 1e. Complexity

```bash
# File size distribution
find src/impakt -name "*.py" -exec wc -l {} + | grep -v total \
  | awk '{print $1}' | sort -n | awk '
BEGIN {n=0; sum=0}
{a[n++]=$1; sum+=$1}
END {
  printf "Min: %d, Max: %d, Mean: %.0f, Median: %d\n", a[0], a[n-1], sum/n, a[int(n/2)]
  over300=0; for(i=0;i<n;i++) if(a[i]>300) over300++
  printf "Files >300 lines: %d / %d\n", over300, n
}'

# High-complexity files (approximate branch density: branches + loops + excepts + 1)
find src/impakt -name "*.py" ! -path "*__pycache__*" | while read -r f; do
  ifs=$(grep -cE '^\s*(if |elif )' "$f" || true)
  fors=$(grep -cE '^\s*(for |while )' "$f" || true)
  excepts=$(grep -cE '^\s*except' "$f" || true)
  complexity=$((ifs + fors + excepts + 1))
  if [ "$complexity" -gt 15 ]; then
    lines=$(wc -l < "$f")
    echo "  $complexity  $f ($lines lines)"
  fi
done | sort -rn
```

### 1f. Documentation

```bash
# Docstring coverage (approximate)
echo "Docstrings: $(grep -rcE '^\s+"""' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
echo "Definitions: $(grep -rcE '^\s*(def |class )' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
```

```bash
# __all__ coverage
for d in src/impakt/*/; do
  [ -d "$d" ] || continue
  mod=$(basename "$d")
  init="${d}__init__.py"
  [ -f "$init" ] && has_all=$(grep -c "__all__" "$init" || true) || has_all="-"
  echo "  $mod: __all__=$has_all"
done
```

### 1g. Security

```bash
# Dangerous patterns (word boundary excludes safe ast.literal_eval hits)
echo "eval/exec: $(grep -rnE '\b(eval|exec)\(' src/impakt/ --include='*.py' | grep -v '# noqa' | wc -l | tr -d ' ')"
echo "subprocess/os.system: $(grep -rnE 'subprocess|os\.system|os\.popen' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "bare except: $(grep -rn 'except:' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "hardcoded secrets: $(grep -rinE '(password|secret|api_key|token)\s*=' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
```

### 1h. Maintainability

```bash
# Debt markers
echo "TODO: $(grep -rc 'TODO' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "FIXME: $(grep -rc 'FIXME' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "HACK: $(grep -rc 'HACK' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Logging usage
echo "logging calls: $(grep -rcE 'logger\.|logging\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Error handling
echo "try/except blocks: $(grep -rc 'try:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "bare excepts: $(grep -rc 'except:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Internal coupling
echo "internal imports: $(grep -rc 'from impakt\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
```

## Step 2: Score Each Dimension

Use the raw metrics to assign each dimension a 0.0-10.0 score according to the rubrics below.

### Test Health (weight: 20%)

| Score | Criteria |
|-------|----------|
| 10 | 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present |
| 8 | 100% pass, ratio 0.2-0.5, integration tests present |
| 6 | 100% pass, ratio < 0.2 or missing coverage measurement |
| 4 | >95% pass |
| 2 | >80% pass |
| 0 | <80% pass or no tests |
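
The ratio can be derived from the 1a counts. A minimal sketch, assuming a line-based ratio and hypothetical numbers:

```python
# Hypothetical outputs from the 1a line counts
source_lines = 9800
test_lines = 4200

print(f"test:source ratio = {test_lines / source_lines:.2f}")  # 0.43 -> the 0.2-0.5 band
```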

### Type Safety (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | mypy strict, 0 errors |
| 8 | mypy strict, <10 errors |
| 6 | mypy strict, <50 errors OR mypy non-strict with 0 errors |
| 4 | mypy configured but >50 errors |
| 2 | no type checker configured |
| 0 | no type hints |

### Lint Hygiene (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 violations |
| 8 | <10 violations, all auto-fixable |
| 6 | <100 violations, mostly auto-fixable |
| 4 | 100-500 violations |
| 2 | >500 violations |
| 0 | no linter configured |

### Architecture (weight: 15%)

| Score | Criteria |
|-------|----------|
| 10 | Clear layering, all modules have `__all__`, low coupling, plugin system |
| 9 | Clear layering, most modules have `__all__`, plugin system |
| 7 | Identifiable layers, some modules lack public API definition |
| 5 | Flat structure, unclear module boundaries |
| 3 | God objects, circular dependencies |
| 0 | Single-file or completely unstructured |

Evaluate qualitatively: read `__init__.py` files, check import patterns, and assess layer violations (does `io` import from `web`? does `channel` depend on `plot`?). A sketch for dumping the import graph follows.
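
One way to make the import graph concrete is to count cross-subpackage imports. A rough sketch (illustrative; assumes the `src/impakt` layout used throughout, and only sees `from impakt.…` absolute imports, matching the 1h coupling metric):

```python
import ast
import pathlib
from collections import Counter

# Count cross-subpackage imports: importing subpackage -> imported subpackage.
root = pathlib.Path("src/impakt")
edges: Counter[tuple[str, str]] = Counter()
for path in root.rglob("*.py"):
    rel = path.relative_to(root)
    src_pkg = rel.parts[0] if len(rel.parts) > 1 else rel.stem
    for node in ast.walk(ast.parse(path.read_text(encoding="utf-8"))):
        if isinstance(node, ast.ImportFrom) and node.module and node.module.startswith("impakt."):
            dst_pkg = node.module.split(".")[1]
            if dst_pkg != src_pkg:
                edges[(src_pkg, dst_pkg)] += 1

for (src, dst), count in edges.most_common():
    print(f"{src} -> {dst} ({count})")
```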

### Documentation (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | >90% docstring coverage, generated API docs, architectural docs with diagrams |
| 8 | >80% docstring coverage, README with architecture, no generated API docs |
| 6 | >60% docstring coverage, basic README |
| 4 | <60% docstring coverage |
| 2 | README only |
| 0 | No documentation |

### Complexity (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | Median file <150 lines, 0 files >300, max complexity <15 |
| 8 | Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly) |
| 6 | Median <200, <=10 files >300 |
| 4 | Median >200 or many files >500 |
| 2 | Multiple files >1000 lines |
| 0 | Monolithic files, no decomposition |

### Security (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | No eval/exec, no subprocess, no secrets, input validation at boundaries |
| 9 | eval/exec present but sandboxed (restricted builtins, blocklist), no secrets |
| 7 | eval/exec partially sandboxed, or subprocess with shell=False |
| 5 | Unsandboxed eval/exec or subprocess with shell=True |
| 3 | Hardcoded secrets present |
| 0 | Multiple critical security anti-patterns |

### Maintainability (weight: 10%)

| Score | Criteria |
|-------|----------|
| 10 | 0 debt markers, consistent error handling, logging, modern tooling |
| 8 | <5 debt markers, no bare excepts, logging present |
| 6 | <20 debt markers, some bare excepts |
| 4 | >20 debt markers or many bare excepts |
| 2 | Inconsistent patterns, no logging |
| 0 | No discernible standards |

## Step 3: Compute Composite Score

```python
composite = (
    test_health     * 0.20 +
    type_safety     * 0.15 +
    lint_hygiene    * 0.10 +
    architecture    * 0.15 +
    documentation   * 0.10 +
    complexity      * 0.10 +
    security        * 0.10 +
    maintainability * 0.10
) * 10
```
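
For instance, with hypothetical dimension scores plugged into the formula above:

```python
# Hypothetical dimension scores on the 0-10 scale
test_health, type_safety, lint_hygiene, architecture = 8.0, 6.0, 8.0, 9.0
documentation, complexity, security, maintainability = 8.0, 8.0, 9.0, 8.0

composite = (
    test_health * 0.20 + type_safety * 0.15 + lint_hygiene * 0.10
    + architecture * 0.15 + documentation * 0.10 + complexity * 0.10
    + security * 0.10 + maintainability * 0.10
) * 10
print(composite)  # 79.5 -> grade B on the scale below
```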

### Grade Scale

| Range | Grade | Meaning |
|-------|-------|---------|
| 90-100 | A | Production-grade, minimal tech debt |
| 80-89 | B+ | Strong codebase, minor gaps |
| 70-79 | B | Good foundation, actionable improvements exist |
| 60-69 | C+ | Functional but accumulating debt |
| 50-59 | C | Significant quality concerns |
| <50 | D | Major remediation needed |
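
A minimal helper encoding this scale (illustrative, not required by the workflow):

```python
def grade(composite: float) -> str:
    """Map a 0-100 composite score to a letter grade per the scale above."""
    for cutoff, letter in ((90, "A"), (80, "B+"), (70, "B"), (60, "C+"), (50, "C")):
        if composite >= cutoff:
            return letter
    return "D"

print(grade(79.5))  # B
```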

## Step 4: Write Report

Copy `docs/QA-TEMPLATE.md` to `docs/QA-<YYYY-MM-DD_HHMM>.md` and fill in:

1. Summary block at the top, immediately after the `#` heading (a hedged example follows this list):
   - Blockquote with composite score, grade, and a one-sentence summary
   - Quick-reference dimension score table
   - The summary sentence should note the overall quality posture and the most significant change since the last assessment (or "Baseline assessment" if this is the first run)
2. All raw metric values
3. All dimension scores with brief justification
4. Composite score and grade
5. Delta from previous assessment (if one exists)
6. Top 3-5 actionable improvements
7. Acknowledgment of any scoring judgment calls
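
A sketch of the summary block from item 1, with hypothetical numbers; the authoritative layout is whatever `docs/QA-TEMPLATE.md` specifies:

```markdown
> **Composite: 79.5 / 100 -- Grade B.** Good foundation; type-safety errors halved since the previous assessment.

| Dimension | Score |
|-----------|-------|
| Test Health | 8.0 |
| Type Safety | 6.0 |
| Lint Hygiene | 8.0 |
| Architecture | 9.0 |
| Documentation | 8.0 |
| Complexity | 8.0 |
| Security | 9.0 |
| Maintainability | 8.0 |
```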

## Notes for LLM Execution

- Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores.
- When scoring Architecture, actually read the import graph. Do not guess from file names alone.
- For Security, read the context around any eval/exec/subprocess hits to assess sandboxing.
- Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly (see the sketch after these notes).
- If the project adds coverage measurement (pytest-cov), factor it into Test Health.
- Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas.
- Check for previous QA-*.md files in docs/ to compute improvement trends.
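
A worked example of the interpolation rule above (hypothetical numbers; the helper name is illustrative):

```python
def interpolate(value: float, lo_value: float, lo_score: float,
                hi_value: float, hi_score: float) -> float:
    """Linearly interpolate a dimension score between two rubric rows."""
    return lo_score + (hi_score - lo_score) * (value - lo_value) / (hi_value - lo_value)

# Hypothetical: 30 mypy errors sits between the 6-point row (<50 errors)
# and the 8-point row (<10 errors).
print(interpolate(30, 50, 6.0, 10, 8.0))  # 7.0
```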