impakt/docs/QA-INSTRUCTIONS.md

Codebase Quality Assessment -- LLM Instructions

Purpose: Reproducible, automated quality scoring for the Impakt codebase. Run this assessment at any milestone (feature complete, pre-release, periodic review) to track quality over time. Results go into docs/QA-<DATETIME>.md using the template in docs/QA-TEMPLATE.md.


Prerequisites

cd /Users/noise/Code/impakt
uv sync --dev

All commands below assume the uv run prefix where shown and the project root as the working directory.


Step 1: Collect Raw Metrics

Run these commands and record the output. They are independent and can run in parallel.

1a. Inventory

# Source file count and line count
find src -name "*.py" | wc -l
find src -name "*.py" -exec cat {} + | wc -l

# Test file count and line count
find tests -name "*.py" | wc -l
find tests -name "*.py" -exec cat {} + | wc -l

1b. Test Suite

# Test count and pass/fail
uv run python -m pytest --tb=no -q 2>&1 | tail -5

# Test collection (verify count)
uv run python -m pytest --co -q 2>&1 | tail -3

1c. Type Safety

# mypy error count (strict settings are expected to come from pyproject.toml)
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 | tail -3

# Error category breakdown
uv run python -m mypy src/impakt --ignore-missing-imports 2>&1 \
  | grep "error:" | sed 's/.*\[/[/' | sort | uniq -c | sort -rn

1d. Lint

# Ruff violation count
uv run python -m ruff check src/ 2>&1 | tail -5

# Violation breakdown by rule
uv run python -m ruff check src/ 2>&1 \
  | grep -oE '[A-Z]+[0-9]+' | sort | uniq -c | sort -rn

1e. Complexity

# File size distribution
find src/impakt -name "*.py" -exec wc -l {} + | grep -v total \
  | awk '{print $1}' | sort -n | awk '
BEGIN {n=0; sum=0}
{a[n++]=$1; sum+=$1}
END {
  printf "Min: %d, Max: %d, Mean: %.0f, Median: %d\n", a[0], a[n-1], sum/n, a[int(n/2)]
  over300=0; for(i=0;i<n;i++) if(a[i]>300) over300++
  printf "Files >300 lines: %d / %d\n", over300, n
}'

# High-complexity files (branch density)
find src/impakt -name "*.py" ! -path "*__pycache__*" | while read f; do
  ifs=$(grep -cE '^[[:space:]]*(if |elif )' "$f" || true)
  fors=$(grep -cE '^[[:space:]]*(for |while )' "$f" || true)
  excepts=$(grep -cE '^[[:space:]]*except' "$f" || true)
  complexity=$((ifs + fors + excepts + 1))
  if [ "$complexity" -gt 15 ]; then
    lines=$(wc -l < "$f")
    echo "  $complexity  $f ($lines lines)"
  fi
done | sort -rn

1f. Documentation

# Docstring coverage (rough upper bound: opening and closing triple quotes both match)
echo "Docstrings: $(grep -rcE '^[[:space:]]*"""' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"
echo "Definitions: $(grep -rcE '^[[:space:]]*(def |class )' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s}')"

# __all__ coverage
for d in src/impakt/*/; do
  [ -d "$d" ] || continue
  mod=$(basename "$d")
  init="${d}__init__.py"
  [ -f "$init" ] && has_all=$(grep -c "__all__" "$init" || true) || has_all="-"
  echo "  $mod: __all__=$has_all"
done

1g. Security

# Dangerous patterns
echo "eval/exec: $(grep -rn 'eval(\|exec(' src/impakt/ --include='*.py' | grep -v '# noqa' | wc -l | tr -d ' ')"
echo "subprocess/os.system: $(grep -rn 'subprocess\|os\.system\|os\.popen' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "bare except: $(grep -rn 'except:' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"
echo "hardcoded secrets: $(grep -rinE 'password[[:space:]]*=|secret[[:space:]]*=|api_key[[:space:]]*=|token[[:space:]]*=' src/impakt/ --include='*.py' | wc -l | tr -d ' ')"

1h. Maintainability

# Debt markers
echo "TODO: $(grep -rc 'TODO' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "FIXME: $(grep -rc 'FIXME' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "HACK: $(grep -rc 'HACK' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Logging usage
echo "logging calls: $(grep -rc 'logger\.\|logging\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Error handling
echo "try/except blocks: $(grep -rc 'try:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"
echo "bare excepts: $(grep -rc 'except:' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

# Internal coupling
echo "internal imports: $(grep -rc 'from impakt\.' src/impakt/ --include='*.py' | awk -F: '{s+=$2}END{print s+0}')"

Step 2: Score Each Dimension

Use the raw metrics to assign a 0.0-10.0 score per dimension using these rubrics.

Test Health (weight: 20%)

Score Criteria
10 100% pass, test:source ratio >= 0.5, coverage >= 80%, integration tests present
8 100% pass, ratio 0.2-0.5, integration tests present
6 100% pass, ratio < 0.2 or missing coverage measurement
4 >95% pass
2 >80% pass
0 <80% pass or no tests

Type Safety (weight: 15%)

Score Criteria
10 mypy strict, 0 errors
8 mypy strict, <10 errors
6 mypy strict, <50 errors OR mypy non-strict with 0 errors
4 mypy configured but >50 errors
2 no type checker configured
0 no type hints

Lint Hygiene (weight: 10%)

Score Criteria
10 0 violations
8 <10 violations, all auto-fixable
6 <100 violations, mostly auto-fixable
4 100-500 violations
2 >500 violations
0 no linter configured

Architecture (weight: 15%)

Score Criteria
10 Clear layering, all modules have __all__, low coupling, plugin system
9 Clear layering, most modules have __all__, plugin system
7 Identifiable layers, some modules lack public API definition
5 Flat structure, unclear module boundaries
3 God objects, circular dependencies
0 Single-file or completely unstructured

Evaluate qualitatively: read __init__.py files, check import patterns, assess layer violations (does io import from web? does channel depend on plot?).

Documentation (weight: 10%)

Score Criteria
10 >90% docstring coverage, generated API docs, architectural docs with diagrams
8 >80% docstring coverage, README with architecture, no generated API docs
6 >60% docstring coverage, basic README
4 <60% docstring coverage
2 README only
0 No documentation

Complexity (weight: 10%)

Score Criteria
10 Median file <150 lines, 0 files >300, max complexity <15
8 Median <150, <=3 files >300, high-complexity files are justified (parsers, UI assembly)
6 Median <200, <=10 files >300
4 Median >200 or many files >500
2 Multiple files >1000 lines
0 Monolithic files, no decomposition

Security (weight: 10%)

Score Criteria
10 No eval/exec, no subprocess, no secrets, input validation at boundaries
9 eval/exec present but sandboxed (restricted builtins, blocklist), no secrets
7 eval/exec partially sandboxed, or subprocess with shell=False
5 Unsandboxed eval/exec or subprocess with shell=True
3 Hardcoded secrets present
0 Multiple critical security anti-patterns

Maintainability (weight: 10%)

Score Criteria
10 0 debt markers, consistent error handling, logging, modern tooling
8 <5 debt markers, no bare excepts, logging present
6 <20 debt markers, some bare excepts
4 >20 debt markers or many bare excepts
2 Inconsistent patterns, no logging
0 No discernible standards

Step 3: Compute Composite Score

composite = (
    test_health     * 0.20 +
    type_safety     * 0.15 +
    lint_hygiene    * 0.10 +
    architecture    * 0.15 +
    documentation   * 0.10 +
    complexity      * 0.10 +
    security        * 0.10 +
    maintainability * 0.10
) * 10
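For instance, plugging in hypothetical dimension scores (made up for illustration, not from a real run):

```python
# Hypothetical dimension scores (0.0-10.0 each), for illustration only.
scores = {
    "test_health": 8.0, "type_safety": 10.0, "lint_hygiene": 10.0,
    "architecture": 9.0, "documentation": 6.0, "complexity": 8.0,
    "security": 9.0, "maintainability": 8.0,
}
weights = {
    "test_health": 0.20, "type_safety": 0.15, "lint_hygiene": 0.10,
    "architecture": 0.15, "documentation": 0.10, "complexity": 0.10,
    "security": 0.10, "maintainability": 0.10,
}
assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
composite = sum(scores[k] * weights[k] for k in weights) * 10
print(round(composite, 1))  # prints 85.5
```

Because each score is 0-10 and the weights sum to 1.0, the trailing * 10 rescales the composite to 0-100.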

Grade Scale

Range Grade Meaning
90-100 A Production-grade, minimal tech debt
80-89 B+ Strong codebase, minor gaps
70-79 B Good foundation, actionable improvements exist
60-69 C+ Functional but accumulating debt
50-59 C Significant quality concerns
<50 D Major remediation needed
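The scale is a simple floor lookup; a sketch, with boundaries taken from the table above:

```python
# Sketch: map a 0-100 composite score onto the grade scale above.
def grade(composite: float) -> str:
    for floor, letter in [(90, "A"), (80, "B+"), (70, "B"), (60, "C+"), (50, "C")]:
        if composite >= floor:
            return letter
    return "D"

print(grade(85.5))  # prints B+
```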

Step 4: Write Report

Copy docs/QA-TEMPLATE.md to docs/QA-<YYYY-MM-DD_HHMM>.md and fill in:

  1. All raw metric values
  2. All dimension scores with brief justification
  3. Composite score and grade
  4. Delta from previous assessment (if one exists)
  5. Top 3-5 actionable improvements
  6. Acknowledgment of any scoring judgment calls
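A sketch of building the timestamped filename; the docs/ prefix and the QA-<YYYY-MM-DD_HHMM>.md pattern come from this document, the rest is illustrative:

```python
# Sketch: build the report filename in the QA-<YYYY-MM-DD_HHMM>.md format.
from datetime import datetime

report_name = f"docs/QA-{datetime.now():%Y-%m-%d_%H%M}.md"
print(report_name)  # e.g. docs/QA-2025-06-01_1430.md
```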

Notes for LLM Execution

  • Run all Step 1 commands. Do not skip any -- partial data leads to unreliable scores.
  • When scoring Architecture, actually read the import graph. Do not guess from file names alone.
  • For Security, read the context around any eval/exec/subprocess hits to assess sandboxing.
  • Be consistent: same rubric, same thresholds, every time. If a metric falls between rubric rows, interpolate linearly.
  • If the project adds coverage measurement (pytest-cov), factor it into Test Health.
  • Record the exact command output, not just the derived score. Future assessments need the raw data to compute deltas.
  • Check for previous QA-*.md files in docs/ to compute improvement trends.
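The interpolation rule above can be made concrete. A minimal sketch, assuming the metric sits between two adjacent rubric rows (for example, 70% docstring coverage between the Documentation rows at 60% / score 6 and 80% / score 8):

```python
# Sketch: linearly interpolate a score for a metric between two rubric rows.
def interpolate(metric: float, lo_metric: float, lo_score: float,
                hi_metric: float, hi_score: float) -> float:
    """Linear interpolation between (lo_metric, lo_score) and (hi_metric, hi_score)."""
    frac = (metric - lo_metric) / (hi_metric - lo_metric)
    return round(lo_score + frac * (hi_score - lo_score), 1)

print(interpolate(70, 60, 6.0, 80, 8.0))  # prints 7.0
```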