Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Local LLM Performance Testing Plan
Overview
This document outlines a testing framework for comparing the performance and output quality of different locally hosted LLMs on coding tasks. The framework will measure both speed and quality metrics to provide actionable insights for model selection and optimization.
Objectives
- Evaluate response time and throughput of different LLMs
- Assess code quality and correctness of generated outputs
- Measure resource utilization during code generation
- Provide standardized comparison between different models
- Enable reproducible testing environments
Key Metrics to Measure
Performance Metrics
- Response Time: Time from request to first token
- Tokens Per Second (TPS): Generation speed
- Total Latency: Complete generation time
- Resource Usage:
  - CPU utilization
  - Memory consumption
  - GPU utilization (for GPU-enabled models)
  - VRAM usage (for GPU-enabled models)
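As a concrete starting point, the timing metrics above can be captured with a small streaming probe. The sketch below assumes an OpenAI-compatible chat endpoint (llama.cpp, vLLM, and Ollama can all expose one); the base URL, model name, and the chunk-per-token approximation are assumptions to adjust.

```python
"""Rough timing probe for a local OpenAI-compatible endpoint.

Assumptions to adjust: the server listens at http://localhost:8080/v1 and
serves a model named "qwen3-coder-30b"; each streamed chunk is treated as
roughly one token, which is only an approximation.
"""
import json
import time

import requests

BASE_URL = "http://localhost:8080/v1"   # llama.cpp / vLLM / Ollama-compatible
MODEL = "qwen3-coder-30b"                # placeholder model name


def time_completion(prompt: str) -> dict:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    start = time.perf_counter()
    first_token = None
    chunks = 0
    with requests.post(f"{BASE_URL}/chat/completions", json=payload,
                       stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"].get("content", "")
            if delta:
                if first_token is None:
                    first_token = time.perf_counter()
                chunks += 1
    end = time.perf_counter()
    gen_time = end - first_token if first_token else 0.0
    return {
        "time_to_first_token_s": (first_token or end) - start,
        "total_latency_s": end - start,
        "approx_tokens_per_second": chunks / gen_time if gen_time > 0 else 0.0,
    }


if __name__ == "__main__":
    print(time_completion("Write a Python function that reverses a string."))
```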
Quality Metrics
- Code Syntax Accuracy: Valid syntax according to language rules
- Functional Correctness: Code produces expected outputs
- Code Efficiency: Algorithmic efficiency and best practices
- Code Readability: Code structure and documentation
- Task Completion Rate: Proportion of tasks where the generated code fully solves the given challenge
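Syntax accuracy is the cheapest of these checks to automate. For Python outputs it can be approximated with the standard-library parser, as in this minimal sketch (other languages would substitute their own compiler or linter):

```python
import ast


def syntax_ok(code: str) -> bool:
    """Return True if the generated Python source parses cleanly."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```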
Testing Framework Architecture
1. Test Runner Component
- Initialize and configure different LLM instances
- Manage test execution workflow
- Coordinate parallel execution of multiple models
- Handle test parameters and configurations
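One way to keep the runner independent of any particular server is a small backend protocol. Everything below (class names, fields, result shape) is an illustrative sketch rather than a fixed interface:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GenerationResult:
    text: str
    time_to_first_token_s: float
    total_latency_s: float


class ModelBackend(Protocol):
    """Anything the runner can drive: llama.cpp, vLLM, Ollama, ..."""
    name: str

    def generate(self, prompt: str, **params) -> GenerationResult: ...


def run_suite(backends: list[ModelBackend], prompts: list[str]) -> list[dict]:
    """Drive every backend over every prompt and collect flat result rows."""
    rows = []
    for backend in backends:
        for prompt in prompts:
            result = backend.generate(prompt)
            rows.append({
                "model": backend.name,
                "prompt": prompt,
                "ttft_s": result.time_to_first_token_s,
                "latency_s": result.total_latency_s,
            })
    return rows
```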
2. Performance Monitoring
- Real-time resource usage tracking
- Timing measurements for each request
- Memory allocation monitoring
- GPU utilization tracking
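A background sampler along these lines could feed the monitoring component. The sketch uses psutil for CPU and memory and the usual amdgpu sysfs files for GPU load and VRAM; the card index and file paths are assumptions that depend on kernel and hardware:

```python
"""Background resource sampler (sketch). psutil covers CPU and memory; the
sysfs paths below are the usual amdgpu locations but may differ per system."""
import threading
import time
from pathlib import Path

import psutil

GPU_BUSY = Path("/sys/class/drm/card0/device/gpu_busy_percent")
VRAM_USED = Path("/sys/class/drm/card0/device/mem_info_vram_used")


class ResourceSampler:
    def __init__(self, interval_s: float = 0.5):
        self.interval_s = interval_s
        self.samples: list[dict] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            sample = {
                "t": time.time(),
                "cpu_percent": psutil.cpu_percent(interval=None),
                "mem_used_mb": psutil.virtual_memory().used / 2**20,
            }
            if GPU_BUSY.exists():
                sample["gpu_busy_percent"] = int(GPU_BUSY.read_text())
            if VRAM_USED.exists():
                sample["vram_used_mb"] = int(VRAM_USED.read_text()) / 2**20
            self.samples.append(sample)
            self._stop.wait(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Wrapping a generation call in `with ResourceSampler() as rs:` leaves per-interval samples in `rs.samples` for later aggregation.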
3. Task Definition Catalog
- Collection of coding challenges at various difficulty levels
- Standardized test cases with expected outputs
- Task categorization by programming language and complexity
- Reproducible test scenarios
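A catalog entry might look like the following; the field names and the example task are illustrative only:

```python
from dataclasses import dataclass, field


@dataclass
class CodingTask:
    """Illustrative catalog entry; the schema is an assumption, not fixed."""
    id: str
    language: str
    difficulty: str          # "basic" | "intermediate" | "advanced"
    prompt: str
    entry_point: str         # function name the evaluation engine will call
    test_cases: list[dict] = field(default_factory=list)


REVERSE_STRING = CodingTask(
    id="basic-001",
    language="python",
    difficulty="basic",
    prompt="Write a function reverse(s) that returns the string s reversed.",
    entry_point="reverse",
    test_cases=[
        {"args": ["abc"], "expected": "cba"},
        {"args": [""], "expected": ""},
    ],
)
```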
4. Evaluation Engine
- Automated code quality assessment
- Functional correctness validation
- Syntax checking and linting
- Comparison against expected outputs
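Functional correctness can be checked by executing the generated code against the task's test cases in a subprocess. This is only a sketch; a production harness would add stronger sandboxing (containers, per-case timeouts, resource limits):

```python
import json
import subprocess
import sys


def run_test_cases(code: str, entry_point: str, test_cases: list[dict],
                   timeout_s: float = 10.0) -> float:
    """Return the fraction of test cases the generated code passes."""
    # Append a tiny driver that calls the entry point on each case and
    # prints a JSON list of booleans.
    harness = (
        code
        + "\n\nimport json, sys\n"
          "results = []\n"
          "for case in json.load(sys.stdin):\n"
          "    try:\n"
        + f"        results.append({entry_point}(*case['args']) == case['expected'])\n"
        + "    except Exception:\n"
          "        results.append(False)\n"
          "print(json.dumps(results))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", harness],
            input=json.dumps(test_cases),
            capture_output=True, text=True, timeout=timeout_s,
        )
        results = json.loads(proc.stdout)
    except (subprocess.TimeoutExpired, json.JSONDecodeError):
        return 0.0
    return sum(results) / len(results) if results else 0.0
```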
5. Reporting System
- Performance comparison dashboards
- Quality assessment reports
- Resource usage visualizations
- Summary statistics and trend analysis
Test Scenarios
Basic Programming Tasks
- Variable assignments and data structures
- Function implementations
- Control flow (loops, conditionals)
- Error handling and validation
Intermediate Complexity
- Algorithm implementation
- Data structure manipulation
- API integration examples
- Modular code design
Advanced Challenges
- Multi-threading and concurrency
- Mathematical and scientific computing
- System design and architecture
- Code optimization and refactoring
Implementation Approach
Phase 1: Core Framework Development
- Create base testing infrastructure
- Implement model interface abstraction
- Build test runner and execution engine
- Develop basic performance monitoring
Phase 2: Task and Evaluation System
- Design coding challenge catalog
- Implement automated quality assessment
- Create validation mechanisms
- Develop reporting templates
Phase 3: Advanced Features
- Parallel execution support
- Advanced visualization capabilities
- Statistical analysis and confidence intervals
- Configuration management and customization
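For the confidence-interval item above, a percentile bootstrap over repeated per-task measurements is one simple option; the sketch below assumes the samples are tokens-per-second readings for a single model/task pair:

```python
import random
import statistics


def bootstrap_ci(samples: list[float], confidence: float = 0.95,
                 n_resamples: int = 10_000) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of a list of measurements."""
    means = sorted(
        statistics.fmean(random.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((1 - confidence) / 2 * n_resamples)]
    hi = means[int((1 + confidence) / 2 * n_resamples)]
    return lo, hi


# e.g. bootstrap_ci([41.2, 39.8, 40.5, 42.1, 38.9]) -> (lower, upper) mean TPS
```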
Expected Output
Test Reports
- Detailed performance metrics per model
- Quality assessment scores
- Resource usage comparisons
- Statistical significance analysis
- Recommendations for model selection
Visualization Components
- Performance comparison charts
- Resource utilization graphs
- Quality trend analysis
- Multi-model side-by-side comparisons
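Any plotting stack would do; a minimal matplotlib sketch of the throughput comparison chart might look like this (model names and numbers are placeholders, not measured results):

```python
import matplotlib.pyplot as plt

models = ["model-a", "model-b", "model-c"]     # placeholders, not real models
mean_tps = [42.0, 35.5, 51.2]                  # placeholder measurements
ci_halfwidth = [1.8, 2.3, 1.1]                 # placeholder 95% CI half-widths

fig, ax = plt.subplots()
ax.bar(models, mean_tps, yerr=ci_halfwidth, capsize=4)
ax.set_ylabel("tokens / second")
ax.set_title("Generation throughput by model")
fig.savefig("tps_comparison.png", dpi=150)
```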
Configuration Options
Model Parameters
- Context window size
- Sampling parameters (temperature, top-p, etc.)
- System prompts and templates
- Hardware specifications
Test Parameters
- Number of iterations per task
- Timeout thresholds
- Resource monitoring intervals
- Quality evaluation criteria weights
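These knobs could be collected into a single run configuration, for example as a plain Python dictionary; every value below is an illustrative default, not a recommendation:

```python
# Every value here is an illustrative default, not a recommendation.
RUN_CONFIG = {
    "model": {
        "context_window": 32768,
        "sampling": {"temperature": 0.2, "top_p": 0.95, "max_tokens": 2048},
        "system_prompt": "You are a careful coding assistant.",
    },
    "test": {
        "iterations_per_task": 5,
        "timeout_s": 120,
        "monitor_interval_s": 0.5,
        "quality_weights": {
            "syntax": 0.2,
            "correctness": 0.5,
            "efficiency": 0.15,
            "readability": 0.15,
        },
    },
}
```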
Integration Points
Existing Tools
- Local LLM serving APIs (llama.cpp, vLLM, Ollama, text-generation-webui)
- Python-based LLM libraries
- System monitoring tools (htop, rocm-smi / amdgpu sysfs on the AMD target, nvidia-smi elsewhere)
- Code linting and validation tools
Data Storage
- CSV/JSON output files
- Database for historical comparisons
- Version control integration
- Cloud storage for sharing results
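A minimal persistence layer covering the first two bullets might write one JSON record per run plus a flat CSV summary (the paths and field handling are assumptions):

```python
import csv
import json
from pathlib import Path


def save_results(results: list[dict], out_dir: str = "results") -> None:
    """Write the full run as JSON plus a flat CSV summary."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "run.json").write_text(json.dumps(results, indent=2))
    if results:
        with open(out / "run.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(results[0]))
            writer.writeheader()
            writer.writerows(results)
```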
Success Criteria
- Reproducible test results
- Clear performance differentiation between models
- Comprehensive quality assessment coverage
- Scalable architecture for additional models
- User-friendly interface for result interpretation
Risks and Mitigation
- Hardware Variability: Address with controlled test environments
- Inconsistent Results: Use multiple iterations and statistical analysis
- Resource Limitations: Implement proper monitoring and resource allocation
- Quality Assessment Bias: Use multiple evaluation criteria and human validation