noisedestroyers 2c4bfefa95 Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:35:10 -04:00


Local LLM Performance Testing Plan

Overview

This document outlines a testing framework for comparing the performance and quality of locally hosted LLMs on coding tasks. The framework measures both speed and quality metrics to provide actionable insights for model selection and optimization.

Objectives

  • Evaluate response time and throughput of different LLMs
  • Assess code quality and correctness of generated outputs
  • Measure resource utilization during code generation
  • Provide standardized comparison between different models
  • Enable reproducible testing environments

Key Metrics to Measure

Performance Metrics

  1. Response Time: Time from request submission to the first generated token (time to first token, TTFT)
  2. Tokens Per Second (TPS): Generation speed
  3. Total Latency: Complete generation time
  4. Resource Usage:
    • CPU utilization
    • Memory consumption
    • GPU utilization (for GPU-enabled models)
    • VRAM usage (for GPU-enabled models)
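The timing metrics above can be sketched as a small helper that wraps any token stream. This is a minimal sketch, not the framework's actual implementation; `measure_stream` and its return keys are hypothetical names, and the token iterable stands in for whatever streaming client the harness ends up using.

```python
import time

def measure_stream(token_stream):
    """Consume an iterable of tokens, recording TTFT, total latency, and TPS.

    TPS here is the decode rate: tokens after the first, divided by the
    time spent after the first token arrived.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    decode_time = total - ttft if ttft is not None else 0.0
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return {"ttft_s": ttft, "total_s": total, "tokens": count, "tps": tps}
```

The same wrapper works for any backend that yields tokens incrementally, so one measurement path can cover llama.cpp, vLLM, and Ollama.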

Quality Metrics

  1. Code Syntax Accuracy: Valid syntax according to language rules
  2. Functional Correctness: Code produces expected outputs
  3. Code Efficiency: Algorithmic efficiency and best practices
  4. Code Readability: Code structure and documentation
  5. Task Completion Rate: Fraction of coding challenges solved as specified
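The first two quality metrics lend themselves to automation. As a hedged sketch (function names and the test-case shape are assumptions, not the framework's actual API), syntax accuracy can be checked with Python's `ast` module and functional correctness by executing the generated code against known cases:

```python
import ast

def check_syntax(code: str) -> bool:
    """Return True if the code parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def check_function(code: str, func_name: str, cases) -> float:
    """Return the fraction of (args, expected) cases the generated code passes.

    NOTE: exec() runs untrusted model output; a real harness should run this
    in a subprocess or container sandbox with a timeout.
    """
    ns = {}
    try:
        exec(code, ns)
        fn = ns[func_name]
    except Exception:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime error counts as a failed case
    return passed / len(cases)
```

Efficiency and readability are harder to automate and may need static analysis tools or human review, as noted under Risks below.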

Testing Framework Architecture

1. Test Runner Component

  • Initialize and configure different LLM instances
  • Manage test execution workflow
  • Coordinate parallel execution of multiple models
  • Handle test parameters and configurations

2. Performance Monitoring

  • Real-time resource usage tracking
  • Timing measurements for each request
  • Memory allocation monitoring
  • GPU utilization tracking
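Background sampling like this is commonly done with a polling thread. Below is a minimal sketch; the `ResourceSampler` class and its `probe` callback are hypothetical, with the probe standing in for whatever actually reads the metrics (psutil for CPU/memory, sysfs or `nvidia-smi` parsing for GPU):

```python
import threading
import time

class ResourceSampler:
    """Poll a probe function at a fixed interval on a background thread.

    `probe` returns a dict of metrics; timestamped samples accumulate
    until stop() is called.
    """
    def __init__(self, probe, interval_s=0.5):
        self.probe = probe
        self.interval_s = interval_s
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self.probe()))
            self._stop.wait(self.interval_s)  # interruptible sleep

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
        return self.samples
```

Injecting the probe keeps the sampler testable and lets the same class cover CPU, memory, and GPU metrics with different probe implementations.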

3. Task Definition Catalog

  • Collection of coding challenges at various difficulty levels
  • Standardized test cases with expected outputs
  • Task categorization by programming language and complexity
  • Reproducible test scenarios
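A catalog entry could look like the record below. This is a hypothetical schema sketch; every field name (`id`, `entry_point`, `test_cases`, etc.) is an assumption about what the catalog might store, not a defined format:

```python
# Hypothetical task record; field names are illustrative only.
TASK = {
    "id": "py-basic-001",
    "language": "python",
    "difficulty": "basic",
    "prompt": (
        "Write a function fizzbuzz(n) that returns 'Fizz' if n is divisible "
        "by 3, 'Buzz' if divisible by 5, 'FizzBuzz' if both, else str(n)."
    ),
    "entry_point": "fizzbuzz",           # function the evaluator will call
    "test_cases": [
        {"args": [3], "expected": "Fizz"},
        {"args": [5], "expected": "Buzz"},
        {"args": [15], "expected": "FizzBuzz"},
        {"args": [7], "expected": "7"},
    ],
}
```

Storing tasks as plain data like this keeps scenarios reproducible and lets the same catalog drive every model under test.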

4. Evaluation Engine

  • Automated code quality assessment
  • Functional correctness validation
  • Syntax checking and linting
  • Comparison against expected outputs

5. Reporting System

  • Performance comparison dashboards
  • Quality assessment reports
  • Resource usage visualizations
  • Summary statistics and trend analysis

Test Scenarios

Basic Programming Tasks

  • Variable assignments and data structures
  • Function implementations
  • Control flow (loops, conditionals)
  • Error handling and validation

Intermediate Complexity

  • Algorithm implementation
  • Data structure manipulation
  • API integration examples
  • Modular code design

Advanced Challenges

  • Multi-threading and concurrency
  • Mathematical and scientific computing
  • System design and architecture
  • Code optimization and refactoring

Implementation Approach

Phase 1: Core Framework Development

  1. Create base testing infrastructure
  2. Implement model interface abstraction
  3. Build test runner and execution engine
  4. Develop basic performance monitoring

Phase 2: Task and Evaluation System

  1. Design coding challenge catalog
  2. Implement automated quality assessment
  3. Create validation mechanisms
  4. Develop reporting templates

Phase 3: Advanced Features

  1. Parallel execution support
  2. Advanced visualization capabilities
  3. Statistical analysis and confidence intervals
  4. Configuration management and customization

Expected Output

Test Reports

  • Detailed performance metrics per model
  • Quality assessment scores
  • Resource usage comparisons
  • Statistical significance analysis
  • Recommendations for model selection

Visualization Components

  • Performance comparison charts
  • Resource utilization graphs
  • Quality trend analysis
  • Multi-model side-by-side comparisons

Configuration Options

Model Parameters

  • Context window size
  • Sampling parameters (temperature, top-p, etc.)
  • System prompts and templates
  • Hardware specifications

Test Parameters

  • Number of iterations per task
  • Timeout thresholds
  • Resource monitoring intervals
  • Quality evaluation criteria weights
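These test parameters map naturally onto a single config object. As a sketch (the class name, fields, defaults, and weight categories are all assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class TestConfig:
    """Hypothetical per-run test parameters; defaults are illustrative."""
    iterations: int = 5               # runs per task, for statistical stability
    timeout_s: float = 120.0          # per-request timeout threshold
    monitor_interval_s: float = 0.5   # resource sampling period
    quality_weights: dict = field(default_factory=lambda: {
        "syntax": 0.2,
        "correctness": 0.5,
        "efficiency": 0.15,
        "readability": 0.15,
    })

    def weighted_score(self, scores: dict) -> float:
        """Combine per-criterion scores (0..1) into one weighted quality score."""
        return sum(self.quality_weights[k] * scores.get(k, 0.0)
                   for k in self.quality_weights)
```

A dataclass keeps the configuration declarative and easy to serialize to JSON for reproducibility.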

Integration Points

Existing Tools

  • Local LLM APIs (e.g., llama.cpp server, vLLM, Ollama, text-generation-webui)
  • Python-based LLM libraries
  • System monitoring tools (htop, nvidia-smi)
  • Code linting and validation tools
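Most of these local backends (llama.cpp's server, vLLM, and Ollama) expose an OpenAI-compatible `/v1/chat/completions` endpoint, so one request builder can target all of them. The sketch below uses only the standard library; the function names and the base URL are assumptions:

```python
import json
import urllib.request

def build_chat_payload(model, prompt, temperature=0.2, max_tokens=512,
                       system=None):
    """Assemble an OpenAI-style chat completion request body."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def send_chat(base_url, payload):
    """POST the payload to an OpenAI-compatible endpoint and return the JSON."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Because the payload builder is separate from the network call, the test runner can log or replay exact requests when debugging inconsistent results.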

Data Storage

  • CSV/JSON output files
  • Database for historical comparisons
  • Version control integration
  • Cloud storage for sharing results
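For the file-based option, append-only JSON Lines is a simple starting point before a database is warranted. A minimal sketch (function names are assumptions):

```python
import json

def append_result(path, result):
    """Append one run's metrics as a JSON line; one file per experiment."""
    with open(path, "a") as f:
        f.write(json.dumps(result) + "\n")

def load_results(path):
    """Load all recorded runs back into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

JSONL files diff cleanly, which also makes them a reasonable fit for the version-control integration mentioned above.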

Success Criteria

  • Reproducible test results
  • Clear performance differentiation between models
  • Comprehensive quality assessment coverage
  • Scalable architecture for additional models
  • User-friendly interface for result interpretation

Risks and Mitigation

  1. Hardware Variability: Address with controlled test environments
  2. Inconsistent Results: Use multiple iterations and statistical analysis
  3. Resource Limitations: Implement proper monitoring and resource allocation
  4. Quality Assessment Bias: Use multiple evaluation criteria and human validation
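For mitigation 2, the multiple-iteration approach reduces to reporting a mean with an uncertainty band. A minimal sketch using only the standard library (the function name is an assumption; the normal approximation is a simplification suitable for moderate sample counts):

```python
import math
import statistics

def mean_ci(samples, z=1.96):
    """Mean of the samples with a normal-approximation 95% CI half-width.

    With fewer than 2 samples the half-width is reported as 0.0, since no
    spread can be estimated.
    """
    m = statistics.mean(samples)
    if len(samples) < 2:
        return m, 0.0
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return m, half
```

Reporting each model's TPS or latency as `mean ± half` makes it obvious when two models' performance ranges overlap and the difference is not meaningful.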