Initial commit: localgenai stack

Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.

- pyinfra/framework/: pyinfra deploy targeting the box
  - llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
    for gfx1151), OpenWebUI
  - Beszel (host + container + AMD GPU dashboard via sysfs)
  - OpenLIT (LLM fleet metrics)
  - Phoenix (per-trace agent waterfall)
  - OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
  - install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
  documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:35:10 -04:00
commit 2c4bfefa95
36 changed files with 5265 additions and 0 deletions


@@ -0,0 +1,311 @@
# Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory
This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on your AMD Strix Halo hardware with 128GB unified memory.
## Hardware Specifications
- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CU RDNA 3.5)
- **Architecture**: Zen 5 CPU with unified memory access
## Prerequisites
### Hardware Requirements
1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
2. 128GB of unified memory (the configuration assumed throughout this guide)
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration
### Software Requirements
1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. HuggingFace CLI client
## Deployment Steps
### Phase 1: Hardware Configuration
#### 1.1 BIOS Configuration
Before installation, you must set these BIOS options:
- **UMA Frame Buffer Size**: 512MB (for display)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance
#### 1.2 Kernel Configuration
Ensure you're running kernel 6.16.9 or later:
```bash
# Check current kernel
uname -r
# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```
#### 1.3 GRUB Boot Parameters
Edit `/etc/default/grub`:
```bash
sudo nano /etc/default/grub
```
Add to `GRUB_CMDLINE_LINUX_DEFAULT`:
```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```
Apply the changes:
```bash
sudo update-grub
sudo reboot
```
#### 1.4 GPU Access Configuration
Create udev rules for proper GPU permissions:
```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```
### Phase 2: ROCm Installation
#### 2.1 Install ROCm from AMD Repository
```bash
# Add AMD ROCm repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```
#### 2.2 Verify ROCm Installation
```bash
# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"
# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
```
### Phase 3: Model Setup
#### 3.1 Download Kimi Linear GGUF Model
```bash
# Install huggingface-cli
pip install "huggingface_hub[cli]"
# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear
# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
--include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
--local-dir ./
# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
# --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
# --local-dir ./
```
#### 3.2 Validate Download
```bash
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
```
### Phase 4: llama.cpp Setup for AMD ROCm
#### 4.1 Build llama.cpp with ROCm Support
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y
# Build for ROCm
cmake -B build -S . \
-DGGML_HIP=ON \
-DAMDGPU_TARGETS="gfx1151"
cmake --build build --config Release -j$(nproc)
```
#### 4.2 Alternative: Use Pre-built Container
Instead of building, you can use a container with ROCm pre-configured:
```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh
# Create LLM container with ROCm
distrobox create llama-rocm \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```
### Phase 5: Test Inference
#### 5.1 Run a Basic Test
```bash
# If building locally:
./build/bin/llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
# If using container:
distrobox enter llama-rocm
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
```
### Phase 6: Benchmarking and Optimization
#### 6.1 Run Performance Benchmarks
```bash
# Run llama-bench for performance testing
llama-bench \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
-mmp 0 \
-ngl 99 \
-p 512 \
-n 128
```
#### 6.2 Monitor GPU Resources
```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi
# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```
## Performance Optimization Tips
### Memory Management
1. **Maximize GTT**: With `amdgpu.gttsize=117760` set in GRUB, up to ~115GB of unified memory is available to the GPU
2. **Choose appropriate quantization**:
   - Q4_K_M: Good balance of quality and size (~30GB)
   - IQ4_XS: Smaller but with some quality loss (~26GB)
   - Q5_K_M: Higher quality but larger (~35GB)
### Inference Parameters
```bash
# Optimal flags for AMD Strix Halo
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Your prompt text" \
-n 512 \
--temp 0.7 \
--top-p 0.9 \
--repeat-penalty 1.1
```
## Expected Performance on Strix Halo
| Model Quantization | Predicted Performance | Notes |
| ------------------ | --------------------- | ----- |
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient |
| Q5_K_M (~35GB) | 12-18 tokens/sec | Higher quality |
## Troubleshooting
### Common Issues
1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Check udev rules
2. **Slow model loading**: Add `--no-mmap` flag
3. **Memory allocation errors**: Verify GTT size in GRUB
4. **Container GPU access**: Check device permissions
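For the permission-related items above (1 and 4), it is worth confirming directly that the device nodes exist and that your user (or container) can reach them; this is a quick sanity check, not a full diagnostic:
```bash
# Device nodes should exist and be accessible (mode 0666 per the udev rules above)
ls -l /dev/kfd /dev/dri/renderD*
# Your user should be in the video and render groups (log out/in after usermod)
groups $USER
# Inside a container, verify the devices were actually passed through
ls -l /dev/kfd /dev/dri 2>/dev/null || echo "GPU devices not visible in this container"
```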
### Verification Commands
```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"
# Test basic GPU functionality
rocminfo
```
## Additional Tools and Integration
### For Development/Testing
1. **llama-cpp-python**: For Python integration
```python
# Install the package first: pip install llama-cpp-python
# Then test from Python:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
    filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf"
)
```
2. **OpenWebUI**: Web interface for LLMs
3. **Ollama**: Alternative for simpler deployment
4. **vLLM**: High-throughput inference (for API deployment)
## Monitoring and Resource Usage
### System Resources to Monitor
- **GPU Utilization**: Use `watch -n 1 rocm-smi`
- **Memory Usage**: Check `/sys/class/drm/card*/device/mem_info_gtt_total`
- **CPU Load**: Monitor with `htop`
- **Temperature**: Monitor GPU temperature
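The list above gives no command for temperature; `rocm-smi` and `lm-sensors` both cover it on this platform (flag names can vary slightly between ROCm releases):
```bash
# GPU temperature via ROCm
rocm-smi --showtemp
# Or use lm-sensors for APU-wide readings
sudo apt install lm-sensors -y
sensors
```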
### Storage Considerations
- Model files require ~30-50GB on disk
- Additional swap space recommended for complex tasks
- Consider SSD storage for optimal I/O performance
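If you do add swap as suggested above, a plain swap file is the simplest route; the 32GB size below is only an example:
```bash
# Create and enable a 32GB swap file (adjust size to your workload)
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```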
## Recommendations for Different Use Cases
### For Coding Tasks
- Use `--temp 0.3` for more deterministic outputs
- Enable `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for long context generation
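Put together, a coding-oriented run might look like this (the prompt is only an example):
```bash
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  --temp 0.3 \
  --repeat-penalty 1.2 \
  -n 1024 \
  -p "Write a Python function that parses an ISO 8601 timestamp."
```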
### For Research/Testing
- Use `Q4_K_M` for reliable performance
- Try `IQ4_XS` to test memory efficiency
- Monitor with `llama-bench` for reproducible results
### For Production Use
- Consider containerized deployment for stability
- Set up automated monitoring for resource usage
- Create model-specific configuration files
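For the model-specific configuration files suggested above, even a small per-model wrapper script keeps invocations reproducible; the name, paths, and defaults below are illustrative:
```bash
#!/usr/bin/env bash
# kimi-linear-q4.sh -- example per-model wrapper (paths and flags are illustrative)
exec llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap -ngl 99 --temp 0.7 --top-p 0.9 --repeat-penalty 1.1 "$@"
```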
## Conclusion
Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.
With this setup, you should be able to run Kimi Linear and achieve performance that rivals much more expensive hardware setups.


@@ -0,0 +1,109 @@
# Local LLM Performance Testing Framework
A comprehensive framework for comparing the performance and quality of different locally hosted LLMs on coding tasks.
## Features
- **Performance Metrics**: Response time, tokens per second, CPU/memory/GPU usage
- **Quality Assessment**: Automated code quality evaluation
- **Multi-Model Testing**: Test multiple LLMs simultaneously
- **Parallel Execution**: Run tests efficiently
- **Comprehensive Reporting**: Detailed test results and comparisons
## Installation
```bash
# Install required dependencies
pip install psutil
```
## Usage
### Quick Start
1. Create a test suite and add models:
```python
from llm_test_framework import TestSuite, MockLLM
# Create test suite
suite = TestSuite()
# Add models to test
suite.add_model(MockLLM("gpt4all", {"model_path": "/path/to/model"}))
suite.add_model(MockLLM("llama.cpp", {"model_path": "/path/to/llama"}))
# Define test tasks
tasks = [
    {
        "name": "Simple function",
        "prompt": "Write a Python function to add two numbers",
        "expected_output": "return a + b"
    }
]
# Run tests
suite.run_all_tests(tasks)
suite.save_results()
```
### Running the Example
```bash
python llm_test_framework.py
```
This will run sample tests with mock models and save results to JSON.
## Components
### 1. TestSuite
Main class managing the testing process, including:
- Model registration
- Test execution
- Result collection and saving
### 2. LLMInterface
Abstract base class for LLM implementations with methods:
- `generate()`: Generate text from prompt
- `get_model_info()`: Get model information
### 3. TestResult
Data class storing individual test results with:
- Performance metrics
- Quality scores
- Raw outputs
- Success status
## Extending the Framework
To add support for a new LLM:
1. Create a new class inheriting from `LLMInterface`
2. Implement `generate()` and `get_model_info()` methods
3. Add your model to the test suite
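As a sketch of those three steps, here is a hypothetical adapter for an OpenAI-compatible local server (llama.cpp's server, vLLM, and Ollama can all expose one); the base URL, payload fields, and response shape are assumptions to adapt to your setup:
```python
import json
import urllib.request

from llm_test_framework import LLMInterface, TestSuite


class LocalServerLLM(LLMInterface):
    """Hypothetical adapter for an OpenAI-compatible completion endpoint."""

    def generate(self, prompt: str, **kwargs) -> str:
        payload = {
            "model": self.config.get("model", "default"),
            "prompt": prompt,
            "max_tokens": kwargs.get("max_tokens", 256),
            "temperature": kwargs.get("temperature", 0.7),
        }
        req = urllib.request.Request(
            self.config["base_url"] + "/v1/completions",
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            body = json.loads(resp.read().decode("utf-8"))
        return body["choices"][0]["text"]

    def get_model_info(self):
        return {"name": self.model_name, "type": "openai-compatible", "config": self.config}


# Register it like any other model
suite = TestSuite()
suite.add_model(LocalServerLLM("qwen3-coder-30b",
                               {"base_url": "http://localhost:8080", "model": "qwen3-coder-30b"}))
```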
## Test Types
The framework supports multiple types of coding tasks:
- Basic function implementations
- Code refactoring examples
- Algorithm challenges
- System design questions
## Output
Results are saved in JSON format with:
- Performance metrics (response time, TPS)
- Resource usage (CPU, memory, GPU)
- Quality assessment scores
- Raw model outputs
- Success/failure indicators
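Concretely, each record in the saved JSON has this shape (values below are illustrative only):
```json
{
  "model_name": "Model_A",
  "task_name": "Simple function",
  "response_time": 0.52,
  "tps": 11.5,
  "cpu_usage": 55.0,
  "memory_usage": 1280.0,
  "gpu_usage": 0.0,
  "quality_score": 80.0,
  "raw_output": "def add(a, b):\n    return a + b",
  "expected_output": "return a + b",
  "success": true,
  "timestamp": 1746712510.0
}
```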
## Future Enhancements
- Integration with local LLM APIs (text-generation-webui, llama.cpp, etc.)
- Advanced code quality evaluation using static analysis
- Web-based dashboard for results visualization
- Database storage for historical comparisons
- Automated statistical analysis of results


@@ -0,0 +1,158 @@
# Local LLM Performance Testing Plan
## Overview
This document outlines a comprehensive testing framework to compare the performance and quality of different locally hosted LLMs on coding tasks. The framework will measure both speed and quality metrics to provide actionable insights for model selection and optimization.
## Objectives
- Evaluate response time and throughput of different LLMs
- Assess code quality and correctness of generated outputs
- Measure resource utilization during code generation
- Provide standardized comparison between different models
- Enable reproducible testing environments
## Key Metrics to Measure
### Performance Metrics
1. **Response Time**: Time from request to first token
2. **Tokens Per Second (TPS)**: Generation speed
3. **Total Latency**: Complete generation time
4. **Resource Usage**:
- CPU utilization
- Memory consumption
- GPU utilization (for GPU-enabled models)
- VRAM usage (for GPU-enabled models)
### Quality Metrics
1. **Code Syntax Accuracy**: Valid syntax according to language rules
2. **Functional Correctness**: Code produces expected outputs
3. **Code Efficiency**: Algorithmic efficiency and best practices
4. **Code Readability**: Code structure and documentation
5. **Task Completion Rate**: Properly solving the given coding challenge
## Testing Framework Architecture
### 1. Test Runner Component
- Initialize and configure different LLM instances
- Manage test execution workflow
- Coordinate parallel execution of multiple models
- Handle test parameters and configurations
### 2. Performance Monitoring
- Real-time resource usage tracking
- Timing measurements for each request
- Memory allocation monitoring
- GPU utilization tracking
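A minimal version of this monitor, assuming `psutil` is the only dependency, can sample in a background thread while a generation call runs; the sampling interval and the lack of GPU counters are deliberate simplifications:
```python
import threading
import time

import psutil


class ResourceMonitor:
    """Sample CPU and memory usage in the background while a test runs."""

    def __init__(self, interval: float = 0.5):
        self.interval = interval
        self.samples = []  # (cpu_percent, memory_used_bytes) tuples
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((psutil.cpu_percent(interval=None),
                                 psutil.virtual_memory().used))
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

    def averages(self):
        if not self.samples:
            return 0.0, 0.0
        cpu = sum(s[0] for s in self.samples) / len(self.samples)
        mem = sum(s[1] for s in self.samples) / len(self.samples)
        return cpu, mem


# Usage: wrap the generation call
with ResourceMonitor() as mon:
    time.sleep(2)  # stand-in for model.generate(prompt)
cpu_avg, mem_avg = mon.averages()
print(f"avg CPU {cpu_avg:.1f}%, avg memory {mem_avg / 1e9:.1f} GB")
```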
### 3. Task Definition Catalog
- Collection of coding challenges at various difficulty levels
- Standardized test cases with expected outputs
- Task categorization by programming language and complexity
- Reproducible test scenarios
### 4. Evaluation Engine
- Automated code quality assessment
- Functional correctness validation
- Syntax checking and linting
- Comparison against expected outputs
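A first cut at the syntax-checking and expected-output parts of this engine, limited to Python output, could look like the sketch below; real functional-correctness validation would execute the generated code against test cases in a sandbox:
```python
import ast


def evaluate_python_output(code: str, expected_snippet: str = "") -> dict:
    """Very small evaluation pass: syntax validity plus expected-pattern match."""
    result = {"syntax_ok": False, "matches_expected": False, "score": 0.0}

    # 1. Syntax check: does the generated code parse at all?
    try:
        ast.parse(code)
        result["syntax_ok"] = True
        result["score"] += 50
    except SyntaxError:
        return result

    # 2. Expected-output heuristic: does the code contain the reference pattern?
    if expected_snippet and expected_snippet.lower() in code.lower():
        result["matches_expected"] = True
        result["score"] += 50

    return result


print(evaluate_python_output("def add(a, b):\n    return a + b", "return a + b"))
# {'syntax_ok': True, 'matches_expected': True, 'score': 100.0}
```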
### 5. Reporting System
- Performance comparison dashboards
- Quality assessment reports
- Resource usage visualizations
- Summary statistics and trend analysis
## Test Scenarios
### Basic Programming Tasks
- Variable assignments and data structures
- Function implementations
- Control flow (loops, conditionals)
- Error handling and validation
### Intermediate Complexity
- Algorithm implementation
- Data structure manipulation
- API integration examples
- Modular code design
### Advanced Challenges
- Multi-threading and concurrency
- Mathematical and scientific computing
- System design and architecture
- Code optimization and refactoring
## Implementation Approach
### Phase 1: Core Framework Development
1. Create base testing infrastructure
2. Implement model interface abstraction
3. Build test runner and execution engine
4. Develop basic performance monitoring
### Phase 2: Task and Evaluation System
1. Design coding challenge catalog
2. Implement automated quality assessment
3. Create validation mechanisms
4. Develop reporting templates
### Phase 3: Advanced Features
1. Parallel execution support
2. Advanced visualization capabilities
3. Statistical analysis and confidence intervals
4. Configuration management and customization
## Expected Output
### Test Reports
- Detailed performance metrics per model
- Quality assessment scores
- Resource usage comparisons
- Statistical significance analysis
- Recommendations for model selection
### Visualization Components
- Performance comparison charts
- Resource utilization graphs
- Quality trend analysis
- Multi-model side-by-side comparisons
## Configuration Options
### Model Parameters
- Context window size
- Sampling parameters (temperature, top-p, etc.)
- System prompts and templates
- Hardware specifications
### Test Parameters
- Number of iterations per task
- Timeout thresholds
- Resource monitoring intervals
- Quality evaluation criteria weights
## Integration Points
### Existing Tools
- Local LLM APIs (e.g., text-generation-webui)
- Python-based LLM libraries
- System monitoring tools (htop, nvidia-smi)
- Code linting and validation tools
### Data Storage
- CSV/JSON output files
- Database for historical comparisons
- Version control integration
- Cloud storage for sharing results
## Success Criteria
- Reproducible test results
- Clear performance differentiation between models
- Comprehensive quality assessment coverage
- Scalable architecture for additional models
- User-friendly interface for result interpretation
## Risks and Mitigation
1. **Hardware Variability**: Address with controlled test environments
2. **Inconsistent Results**: Use multiple iterations and statistical analysis
3. **Resource Limitations**: Implement proper monitoring and resource allocation
4. **Quality Assessment Bias**: Use multiple evaluation criteria and human validation


@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Example usage of the LLM Test Framework with real models
"""
import time
import json
from pathlib import Path

from llm_test_framework import TestSuite, MockLLM

# Example tasks for testing
SAMPLE_TASKS = [
    {
        "name": "Hello World Function",
        "prompt": "Write a Python function that prints 'Hello, World!'",
        "expected_output": "print('Hello, World!')",
        "parameters": {"temperature": 0.3}
    },
    {
        "name": "Sum Function",
        "prompt": "Write a Python function to calculate the sum of two numbers",
        "expected_output": "return a + b",
        "parameters": {"temperature": 0.5}
    },
    {
        "name": "FizzBuzz",
        "prompt": "Write a Python function to solve the FizzBuzz problem",
        "expected_output": "if i % 3 == 0 and i % 5 == 0",
        "parameters": {"temperature": 0.7}
    }
]


def main():
    print("Starting LLM Performance Testing Framework")
    print("=" * 50)

    # Create test suite
    suite = TestSuite("sample_test_results")

    # Add some sample models
    print("Adding sample models...")
    suite.add_model(MockLLM("Model_A", {"context": 2048}))
    suite.add_model(MockLLM("Model_B", {"context": 4096}))

    # Run tests
    print("Running sample tests...")
    suite.run_all_tests(SAMPLE_TASKS)

    # Save results
    print("Saving results...")
    suite.save_results("sample_results.json")

    # Show summary
    print("\n" + "=" * 50)
    print("TEST SUMMARY")
    print("=" * 50)
    for result in suite.results:
        print(f"Model: {result.model_name}")
        print(f"Task: {result.task_name}")
        print(f"Response Time: {result.response_time:.2f}s")
        print(f"Quality Score: {result.quality_score:.1f}/100")
        print(f"Success: {'Yes' if result.success else 'No'}")
        print("-" * 30)

    print(f"\nTotal tests run: {len(suite.results)}")
    print(f"Results saved to: {suite.output_dir}")


if __name__ == "__main__":
    main()


@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Find duplicate files in a directory tree by content hash, with parallelization support.
"""
import hashlib
import os
import sys
from collections import defaultdict
from pathlib import Path
from multiprocessing import Pool, cpu_count
from typing import Dict, List, Tuple


def get_file_hash(filepath: Path, chunk_size: int = 8192) -> str:
    """
    Calculate the SHA256 hash of a file.

    Args:
        filepath: Path to the file
        chunk_size: Size of chunks to read at a time

    Returns:
        SHA256 hash of the file as a hexadecimal string
    """
    hash_sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            # Read file in chunks to handle large files efficiently
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hash_sha256.update(chunk)
        return hash_sha256.hexdigest()
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
        return ""


def process_file(filepath: Path) -> Tuple[str, Path]:
    """
    Process a single file and return its hash and path.

    Args:
        filepath: Path to the file

    Returns:
        Tuple of (file_hash, file_path)
    """
    file_hash = get_file_hash(filepath)
    return (file_hash, filepath)


def find_duplicates_parallel(root_dir: Path, num_processes: int = None) -> Dict[str, List[Path]]:
    """
    Find duplicate files in a directory tree by content hash using parallel processing.

    Args:
        root_dir: Root directory to search
        num_processes: Number of processes to use (default: CPU count)

    Returns:
        Dictionary mapping file hashes to lists of file paths with that hash
    """
    if num_processes is None:
        num_processes = cpu_count()

    # Get all files in the directory tree
    files = []
    for file_path in root_dir.rglob('*'):
        if file_path.is_file():
            files.append(file_path)

    print(f"Found {len(files)} files to process")

    # Process files in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_file, files)

    # Group files by hash
    hash_to_files = defaultdict(list)
    for file_hash, file_path in results:
        if file_hash:  # Only add files that were successfully hashed
            hash_to_files[file_hash].append(file_path)

    # Filter out unique files (only keep duplicates)
    duplicates = {hash_val: file_list for hash_val, file_list in hash_to_files.items()
                  if len(file_list) > 1}

    return duplicates


def print_duplicates(duplicates: Dict[str, List[Path]]) -> None:
    """
    Print the duplicate files in a formatted way.

    Args:
        duplicates: Dictionary mapping file hashes to lists of file paths
    """
    if not duplicates:
        print("No duplicates found.")
        return

    print(f"Found {len(duplicates)} sets of duplicate files:")
    for hash_val, file_list in duplicates.items():
        print(f"\nHash: {hash_val}")
        for file_path in file_list:
            print(f"  {file_path}")


def main():
    """Main function to run the duplicate file finder."""
    if len(sys.argv) < 2:
        print("Usage: python find_duplicates.py <root_directory> [num_processes]")
        sys.exit(1)

    root_dir = Path(sys.argv[1])
    if not root_dir.exists():
        print(f"Error: Directory {root_dir} does not exist")
        sys.exit(1)

    num_processes = None
    if len(sys.argv) > 2:
        try:
            num_processes = int(sys.argv[2])
        except ValueError:
            print("Error: num_processes must be an integer")
            sys.exit(1)

    print(f"Searching for duplicates in {root_dir} using {num_processes or cpu_count()} processes")
    duplicates = find_duplicates_parallel(root_dir, num_processes)
    print_duplicates(duplicates)


if __name__ == "__main__":
    main()


@@ -0,0 +1,246 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework
"""
import time
import psutil
import json
import subprocess
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    gpu_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float
    gpu_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from prompt"""
        pass

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""
        pass


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        # Start resource monitoring
        start_time = time.time()

        # Monitor resources
        cpu_start = psutil.cpu_percent(interval=1)
        memory_start = psutil.virtual_memory().used
        gpu_start = self._get_gpu_usage()

        # Generate response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {str(e)}"
            success = False

        # Measure completion
        end_time = time.time()
        response_time = end_time - start_time

        # Calculate TPS
        tps = 0
        if success and response:
            # Estimate tokens (very rough approximation)
            tps = len(response.split()) / response_time if response_time > 0 else 0

        # Monitor end resources
        cpu_end = psutil.cpu_percent(interval=1)
        memory_end = psutil.virtual_memory().used
        gpu_end = self._get_gpu_usage()

        # Calculate averages
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2
        gpu_avg = (gpu_start + gpu_end) / 2

        # Quality score - this is a placeholder
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            gpu_usage=gpu_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time()
        )

        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _get_gpu_usage(self) -> float:
        """Get GPU usage if available"""
        try:
            # This is a placeholder - in real implementation,
            # you'd use nvidia-smi or similar
            return 0.0
        except:
            return 0.0

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate quality score based on response"""
        # This is a placeholder implementation
        # In practice, this could use static analysis, code execution, etc.
        if not response:
            return 0.0

        # Very basic quality scoring
        score = 50  # Base score

        # Add points for presence of expected patterns
        if expected and expected.lower() in response.lower():
            score += 20

        # Add points for valid syntax (simplified)
        if not response.startswith("Error:") and len(response) > 10:
            score += 10

        # Cap score at 100
        return min(score, 100.0)

    def save_results(self, filename: str = None):
        """Save test results to JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = []
        for result in self.results:
            results_data.append({
                "model_name": result.model_name,
                "task_name": result.task_name,
                "response_time": result.response_time,
                "tps": result.tps,
                "cpu_usage": result.cpu_usage,
                "memory_usage": result.memory_usage,
                "gpu_usage": result.gpu_usage,
                "quality_score": result.quality_score,
                "raw_output": result.raw_output,
                "expected_output": result.expected_output,
                "success": result.success,
                "timestamp": result.timestamp
            })

        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)

        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.5)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0"
        }


class ExampleTask:
    """Example task class for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {}
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {}
            }
        ]


if __name__ == "__main__":
    # Create test suite
    suite = TestSuite()

    # Add mock models
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run sample tests
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)

    # Save results
    suite.save_results()
    print("Testing completed!")


@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework (Simplified Version)
"""
import time
import json
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from prompt"""
        pass

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""
        pass


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        # Start timing
        start_time = time.time()

        # Simulate resource usage
        cpu_start = 50        # Percentage (simulated)
        memory_start = 1024   # MB (simulated)

        # Generate response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {str(e)}"
            success = False

        # Measure completion
        end_time = time.time()
        response_time = end_time - start_time

        # Calculate TPS (simulated)
        tps = 0
        if success and response:
            # Estimate tokens (very rough approximation)
            tps = len(response.split()) / response_time if response_time > 0 else 0

        # Simulate end resources
        cpu_end = 60          # Percentage (simulated)
        memory_end = 1536     # MB (simulated)

        # Calculate averages
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2

        # Quality score - this is a placeholder
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time()
        )

        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate quality score based on response"""
        # This is a placeholder implementation
        # In practice, this could use static analysis, code execution, etc.
        if not response:
            return 0.0

        # Very basic quality scoring
        score = 50  # Base score

        # Add points for presence of expected patterns
        if expected and expected.lower() in response.lower():
            score += 20

        # Add points for valid syntax (simplified)
        if not response.startswith("Error:") and len(response) > 10:
            score += 10

        # Cap score at 100
        return min(score, 100.0)

    def save_results(self, filename: str = None):
        """Save test results to JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = []
        for result in self.results:
            results_data.append({
                "model_name": result.model_name,
                "task_name": result.task_name,
                "response_time": result.response_time,
                "tps": result.tps,
                "cpu_usage": result.cpu_usage,
                "memory_usage": result.memory_usage,
                "quality_score": result.quality_score,
                "raw_output": result.raw_output,
                "expected_output": result.expected_output,
                "success": result.success,
                "timestamp": result.timestamp
            })

        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)

        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.1)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0"
        }


class ExampleTask:
    """Example task class for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {}
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {}
            }
        ]


if __name__ == "__main__":
    # Create test suite
    suite = TestSuite()

    # Add mock models
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run sample tests
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)

    # Save results
    suite.save_results()
    print("Testing completed!")


@@ -0,0 +1,3 @@
# LLM Performance Testing Framework Requirements
psutil>=5.8.0