Containerized local LLM stack for the Framework Desktop / Strix Halo, plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md: documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
# Local LLM Performance Testing Framework
A framework for comparing the performance and output quality of locally hosted LLMs on coding tasks.

## Features
- **Performance Metrics**: Response time, tokens per second, CPU/memory/GPU usage
- **Quality Assessment**: Automated evaluation of generated code quality
- **Multi-Model Testing**: Compare multiple LLMs against the same task set
- **Parallel Execution**: Run tests concurrently to reduce total runtime
- **Comprehensive Reporting**: Detailed per-test results and cross-model comparisons

## Installation
```bash
# Install required dependencies
pip install psutil
```

## Usage
### Quick Start

1. Create a test suite, add models, define test tasks, and run them:
```python
from llm_test_framework import TestSuite, MockLLM

# Create test suite
suite = TestSuite()

# Add models to test
suite.add_model(MockLLM("gpt4all", {"model_path": "/path/to/model"}))
suite.add_model(MockLLM("llama.cpp", {"model_path": "/path/to/llama"}))

# Define test tasks
tasks = [
    {
        "name": "Simple function",
        "prompt": "Write a Python function to add two numbers",
        "expected_output": "return a + b"
    }
]

# Run tests
suite.run_all_tests(tasks)
suite.save_results()
```
### Running the Example
```bash
python llm_test_framework.py
```

This will run sample tests with mock models and save results to JSON.
## Components

### 1. TestSuite

Main class managing the testing process (a measurement sketch follows the list), including:

- Model registration
- Test execution
- Result collection and saving
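
The actual internals live in `llm_test_framework.py`. As a rough sketch of how timing and resource sampling can be wired around a `generate()` call with `psutil` (the function and field names below are illustrative, not the framework's real API, and GPU usage needs a separate source such as vendor tooling):

```python
import time
import psutil

def measure_generation(model, prompt: str) -> dict:
    """Illustrative only: time one generate() call and sample this
    process's CPU and memory around it with psutil."""
    proc = psutil.Process()
    proc.cpu_percent(interval=None)          # prime the CPU counter
    rss_before = proc.memory_info().rss

    start = time.perf_counter()
    output = model.generate(prompt)          # LLMInterface.generate()
    elapsed = time.perf_counter() - start

    # Crude token estimate; a real implementation would use the model's tokenizer.
    tokens = len(output.split())
    return {
        "response_time_s": elapsed,
        "tokens_per_second": tokens / elapsed if elapsed > 0 else 0.0,
        "cpu_percent": proc.cpu_percent(interval=None),
        "memory_delta_mb": (proc.memory_info().rss - rss_before) / 2**20,
        "output": output,
    }
```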
### 2. LLMInterface

Abstract base class for LLM implementations (sketched below) with methods:

- `generate()`: Generate text from a prompt
- `get_model_info()`: Get model information
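
A minimal sketch of what the base class likely looks like (signatures and docstrings are assumptions; only the two method names come from the list above):

```python
from abc import ABC, abstractmethod

class LLMInterface(ABC):
    """Sketch of the interface; not the actual source."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Generate text from a prompt."""

    @abstractmethod
    def get_model_info(self) -> dict:
        """Return model metadata such as name, backend, and configuration."""
```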
### 3. TestResult

Data class storing individual test results (a field sketch follows the list), including:

- Performance metrics
- Quality scores
- Raw outputs
- Success status
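
A hypothetical shape for the data class, derived from the bullets above (every field name here is an assumption; check the source for the real schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TestResult:
    """Sketch only; field names may differ from the framework's."""
    model_name: str
    task_name: str
    success: bool
    response_time_s: float
    tokens_per_second: float
    cpu_percent: float
    memory_mb: float
    gpu_percent: Optional[float]
    quality_score: float
    raw_output: str
```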
## Extending the Framework

To add support for a new LLM:

1. Create a new class inheriting from `LLMInterface`
2. Implement the `generate()` and `get_model_info()` methods
3. Add your model to the test suite (see the example below)
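
For example, a backend that wraps a locally running llama.cpp server, assuming `LLMInterface` is importable from `llm_test_framework` and the server exposes an OpenAI-compatible `/v1/completions` endpoint (the class name, URL, and payload fields are illustrative; adjust them to whatever your server actually serves):

```python
import json
import urllib.request

from llm_test_framework import LLMInterface

class LlamaCppServerLLM(LLMInterface):
    """Hypothetical LLMInterface implementation for a local llama.cpp server."""

    def __init__(self, name: str, base_url: str = "http://localhost:8080"):
        self.name = name
        self.base_url = base_url

    def generate(self, prompt: str) -> str:
        # POST the prompt to the server's completion endpoint and return the text.
        payload = json.dumps({"prompt": prompt, "max_tokens": 512}).encode()
        req = urllib.request.Request(
            f"{self.base_url}/v1/completions",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            body = json.load(resp)
        return body["choices"][0]["text"]

    def get_model_info(self) -> dict:
        return {"name": self.name, "backend": "llama.cpp server", "url": self.base_url}
```

The new backend is then registered like any other model, e.g. `suite.add_model(LlamaCppServerLLM("qwen3-coder-30b"))`.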
## Test Types

The framework supports multiple types of coding tasks (sample task definitions follow the list):

- Basic function implementations
- Code refactoring examples
- Algorithm challenges
- System design questions
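
Illustrative task definitions in the same dictionary format as the Quick Start example (the prompts and `expected_output` substrings here are made up; real acceptance criteria would be stricter):

```python
tasks = [
    {
        "name": "Refactoring",
        "prompt": "Rewrite a for-loop that squares each number in a list as a list comprehension",
        "expected_output": "for",
    },
    {
        "name": "Algorithm challenge",
        "prompt": "Implement iterative binary search over a sorted Python list",
        "expected_output": "while",
    },
    {
        "name": "System design",
        "prompt": "Sketch a token-bucket rate limiter class in Python",
        "expected_output": "class",
    },
]
```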
## Output

Results are saved in JSON format (see the loading snippet after this list) and include:

- Performance metrics (response time, tokens per second)
- Resource usage (CPU, memory, GPU)
- Quality assessment scores
- Raw model outputs
- Success/failure indicators
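
Once a run finishes, the saved JSON can be summarized with a few lines of Python (the filename and field names below are assumptions; check what `save_results()` actually writes):

```python
import json

# Adjust the filename to whatever save_results() produced.
with open("test_results.json") as f:
    results = json.load(f)

# Assumes a list of per-test result dicts; adapt the keys to the real schema.
for r in results:
    print(
        f"{r.get('model_name', '?'):<14}"
        f"{r.get('task_name', '?'):<24}"
        f"{r.get('tokens_per_second', 0.0):>7.1f} tok/s  "
        f"quality={r.get('quality_score', 'n/a')}"
    )
```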
## Future Enhancements

- Integration with local LLM APIs (text-generation-webui, llama.cpp, etc.)
- Advanced code quality evaluation using static analysis
- Web-based dashboard for results visualization
- Database storage for historical comparisons
- Automated statistical analysis of results