Initial commit: localgenai stack
Containerized local LLM stack for the Framework Desktop / Strix Halo,
plus the OpenCode harness on the Mac side.
- pyinfra/framework/: pyinfra deploy targeting the box
- llama.cpp (Vulkan), vLLM (ROCm), Ollama (ROCm with HSA override
for gfx1151), OpenWebUI
- Beszel (host + container + AMD GPU dashboard via sysfs)
- OpenLIT (LLM fleet metrics)
- Phoenix (per-trace agent waterfall)
- OpenHands (autonomous agent in a Docker sandbox)
- opencode/: OpenCode config + Phoenix bridge plugin (OTel exporter)
- install.sh deploys to ~/.config/opencode/
- StrixHaloSetup.md / StrixHaloMemory.md / Roadmap.md / TODO.md:
documentation and planning
- testing/qwen3-coder-30b/: small evaluation harness
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
testing/qwen3-coder-30b/DEPLOYMENT_PLAN.md (new file, 311 lines)
@@ -0,0 +1,311 @@
# Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on your AMD Strix Halo hardware with 128GB unified memory.

## Hardware Specifications

- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CU RDNA 3.5)
- **Architecture**: Zen 5 CPU with unified memory access

## Prerequisites

### Hardware Requirements
1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
2. Minimum 128GB RAM (as specified)
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration

### Software Requirements
1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. HuggingFace CLI client

## Deployment Steps

### Phase 1: Hardware Configuration

#### 1.1 BIOS Configuration
Before installation, set these BIOS options:
- **UMA Frame Buffer Size**: 512MB (for display)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance

#### 1.2 Kernel Configuration
Ensure you're running kernel 6.16.9 or later:
```bash
# Check current kernel
uname -r

# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```

#### 1.3 GRUB Boot Parameters
Edit `/etc/default/grub`:
```bash
sudo nano /etc/default/grub
```

Add to `GRUB_CMDLINE_LINUX_DEFAULT`:
```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```

Apply the changes:
```bash
sudo update-grub
sudo reboot
```
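The `amdgpu.gttsize` value is given in MiB, so 117760 corresponds to the ~115GB of GPU-addressable memory referenced later in this guide. A quick sanity check after rebooting (the sysfs value is reported in bytes):

```bash
# 117760 MiB / 1024 = 115 GiB requested for the GTT domain
echo $((117760 / 1024))

# Confirm what the driver actually reserved (value in bytes)
cat /sys/class/drm/card*/device/mem_info_gtt_total
```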
#### 1.4 GPU Access Configuration
Create udev rules for proper GPU permissions:
```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```

### Phase 2: ROCm Installation

#### 2.1 Install ROCm from AMD Repository
```bash
# Add AMD ROCm repository
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```

#### 2.2 Verify ROCm Installation
```bash
# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"

# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
```
### Phase 3: Model Setup

#### 3.1 Download Kimi Linear GGUF Model
```bash
# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
  --local-dir ./

# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#   --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#   --local-dir ./
```

#### 3.2 Validate Download
```bash
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
```

### Phase 4: llama.cpp Setup for AMD ROCm

#### 4.1 Build llama.cpp with ROCm Support
```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Build for ROCm
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151"

cmake --build build --config Release -j$(nproc)
```
#### 4.2 Alternative: Use Pre-built Container
Instead of building, you can use a container with ROCm pre-configured:
```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create LLM container with ROCm
distrobox create llama-rocm \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```

### Phase 5: Test Inference

#### 5.1 Run a Basic Test
```bash
# If building locally:
./build/bin/llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# If using container:
distrobox enter llama-rocm
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256
```

### Phase 6: Benchmarking and Optimization

#### 6.1 Run Performance Benchmarks
```bash
# Run llama-bench for performance testing
llama-bench \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -mmp 0 \
  -ngl 99 \
  -p 512 \
  -n 128
```

#### 6.2 Monitor GPU Resources
```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```
## Performance Optimization Tips

### Memory Management
1. **Use the full GTT allocation**: With `amdgpu.gttsize=117760`, the system can dedicate up to ~115GB of unified memory to GPU workloads
2. **Choose an appropriate quantization**:
   - Q4_K_M: Good balance of quality and size (~30GB)
   - IQ4_XS: Smaller but with some quality loss (~26GB)
   - Q5_K_M: Higher quality but larger (~35GB)

### Inference Parameters
```bash
# Suggested flags for AMD Strix Halo
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Your prompt text" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```

## Expected Performance on Strix Halo

| Model Quantization | Predicted Performance | Notes |
| ------------------ | --------------------- | ----- |
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient |
| Q5_K_M (~35GB) | 12-18 tokens/sec | Higher quality |
## Troubleshooting

### Common Issues
1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Check the udev rules
2. **Slow model loading**: Add the `--no-mmap` flag
3. **Memory allocation errors**: Verify the GTT size set in GRUB
4. **Container GPU access**: Check device permissions

### Verification Commands
```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"

# Test basic GPU functionality
rocminfo
```

## Additional Tools and Integration

### For Development/Testing
1. **llama-cpp-python**: For Python integration
```bash
# Install package
pip install llama-cpp-python
```
```python
# Test with Python
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
    filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf"
)
```

2. **OpenWebUI**: Web interface for LLMs
3. **Ollama**: Alternative for simpler deployment
4. **vLLM**: High-throughput inference (for API deployment)
## Monitoring and Resource Usage

### System Resources to Monitor
- **GPU Utilization**: Use `watch -n 1 rocm-smi`
- **Memory Usage**: Check `/sys/class/drm/card*/device/mem_info_gtt_used` against `mem_info_gtt_total`
- **CPU Load**: Monitor with `htop`
- **Temperature**: Monitor GPU temperature via `rocm-smi`

### Storage Considerations
- Model files require ~30-50GB on disk
- Additional swap space is recommended for complex tasks
- Consider SSD storage for optimal I/O performance

## Recommendations for Different Use Cases

### For Coding Tasks
- Use `--temp 0.3` for more deterministic outputs
- Enable `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for longer generations (a combined example follows below)
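Putting those flags together, a coding-oriented run might look like this (model path and prompt are placeholders carried over from the earlier steps):

```bash
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  --temp 0.3 \
  --repeat-penalty 1.2 \
  -n 1024 \
  -p "Write a Python function that merges two sorted lists."
```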
### For Research/Testing
- Use `Q4_K_M` for reliable performance
- Try `IQ4_XS` to test memory efficiency
- Benchmark with `llama-bench` for reproducible results

### For Production Use
- Consider containerized deployment for stability
- Set up automated monitoring for resource usage
- Create model-specific configuration files

## Conclusion

Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and the inference tools so that the unified memory architecture is fully utilized.

With this setup, you should be able to run Kimi Linear with performance that rivals much more expensive hardware.
testing/qwen3-coder-30b/README.md (new file, 109 lines)
@@ -0,0 +1,109 @@
# Local LLM Performance Testing Framework

A comprehensive framework for comparing the performance and quality of different locally hosted LLMs on coding tasks.

## Features

- **Performance Metrics**: Response time, tokens per second, CPU/memory/GPU usage
- **Quality Assessment**: Automated code quality evaluation
- **Multi-Model Testing**: Compare multiple LLMs in a single run
- **Parallel Execution**: Run tests efficiently
- **Comprehensive Reporting**: Detailed test results and comparisons

## Installation

```bash
# Install required dependencies
pip install psutil
```

## Usage

### Quick Start

1. Create a test suite and add models:

```python
from llm_test_framework import TestSuite, MockLLM

# Create test suite
suite = TestSuite()

# Add models to test
suite.add_model(MockLLM("gpt4all", {"model_path": "/path/to/model"}))
suite.add_model(MockLLM("llama.cpp", {"model_path": "/path/to/llama"}))

# Define test tasks
tasks = [
    {
        "name": "Simple function",
        "prompt": "Write a Python function to add two numbers",
        "expected_output": "return a + b"
    }
]

# Run tests
suite.run_all_tests(tasks)
suite.save_results()
```

### Running the Example

```bash
python llm_test_framework.py
```

This will run sample tests with mock models and save results to JSON.

## Components

### 1. TestSuite
Main class managing the testing process, including:
- Model registration
- Test execution
- Result collection and saving

### 2. LLMInterface
Abstract base class for LLM implementations with methods:
- `generate()`: Generate text from a prompt
- `get_model_info()`: Get model information

### 3. TestResult
Data class storing individual test results with:
- Performance metrics
- Quality scores
- Raw outputs
- Success status

## Extending the Framework

To add support for a new LLM:

1. Create a new class inheriting from `LLMInterface`
2. Implement the `generate()` and `get_model_info()` methods
3. Add your model to the test suite (see the sketch below)
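For example, a minimal adapter that points the suite at a locally running llama.cpp server might look like this sketch. The server URL, the `/completion` endpoint, and its `n_predict`/`temperature`/`content` fields are assumptions about a local llama-server setup, not part of this framework:

```python
import json
import urllib.request

from llm_test_framework import LLMInterface, TestSuite


class LlamaCppServerLLM(LLMInterface):
    """Adapter for a local llama.cpp server (hypothetical setup, native /completion API)."""

    def generate(self, prompt: str, **kwargs) -> str:
        # Build the request payload from task parameters, with conservative defaults
        payload = json.dumps({
            "prompt": prompt,
            "n_predict": kwargs.get("max_tokens", 256),
            "temperature": kwargs.get("temperature", 0.7),
        }).encode()
        req = urllib.request.Request(
            self.config.get("url", "http://localhost:8080/completion"),
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            return json.loads(resp.read())["content"]

    def get_model_info(self) -> dict:
        return {"name": self.model_name, "type": "llama.cpp-server"}


# Register it like any other model
suite = TestSuite()
suite.add_model(LlamaCppServerLLM("kimi-linear-q4km", {"url": "http://localhost:8080/completion"}))
```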
## Test Types

The framework supports multiple types of coding tasks:
- Basic function implementations
- Code refactoring examples
- Algorithm challenges
- System design questions

## Output

Results are saved in JSON format (a sample record is shown below) with:
- Performance metrics (response time, TPS)
- Resource usage (CPU, memory, GPU)
- Quality assessment scores
- Raw model outputs
- Success/failure indicators
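Each entry in the saved JSON list follows the shape produced by `TestSuite.save_results()`; the values here are purely illustrative:

```json
{
  "model_name": "Model_A",
  "task_name": "Simple function",
  "response_time": 0.52,
  "tps": 11.5,
  "cpu_usage": 55.0,
  "memory_usage": 1280.0,
  "gpu_usage": 0.0,
  "quality_score": 80.0,
  "raw_output": "def add(a, b):\n    return a + b",
  "expected_output": "return a + b",
  "success": true,
  "timestamp": 1735000000.0
}
```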
## Future Enhancements

- Integration with local LLM APIs (text-generation-webui, llama.cpp, etc.)
- Advanced code quality evaluation using static analysis
- Web-based dashboard for results visualization
- Database storage for historical comparisons
- Automated statistical analysis of results
testing/qwen3-coder-30b/TestingPlan.md (new file, 158 lines)
@@ -0,0 +1,158 @@
# Local LLM Performance Testing Plan

## Overview
This document outlines a comprehensive testing framework to compare the performance and quality of different locally hosted LLMs on coding tasks. The framework will measure both speed and quality metrics to provide actionable insights for model selection and optimization.

## Objectives
- Evaluate response time and throughput of different LLMs
- Assess code quality and correctness of generated outputs
- Measure resource utilization during code generation
- Provide standardized comparison between different models
- Enable reproducible testing environments

## Key Metrics to Measure

### Performance Metrics
1. **Response Time**: Time from request to first token
2. **Tokens Per Second (TPS)**: Generation speed (see the example below)
3. **Total Latency**: Complete generation time
4. **Resource Usage**:
   - CPU utilization
   - Memory consumption
   - GPU utilization (for GPU-enabled models)
   - VRAM usage (for GPU-enabled models)
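The initial framework approximates TPS from whitespace-separated words rather than real tokenizer tokens; a minimal sketch of that calculation (a tokenizer-based count would be more accurate):

```python
def approximate_tps(response: str, elapsed_seconds: float) -> float:
    """Rough tokens-per-second estimate using word count as a stand-in for tokens."""
    if elapsed_seconds <= 0:
        return 0.0
    return len(response.split()) / elapsed_seconds


# Example: 96 "tokens" generated in 8 seconds -> 12.0 TPS
print(approximate_tps(" ".join(["tok"] * 96), 8.0))
```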
### Quality Metrics
1. **Code Syntax Accuracy**: Valid syntax according to language rules
2. **Functional Correctness**: Code produces expected outputs
3. **Code Efficiency**: Algorithmic efficiency and best practices
4. **Code Readability**: Code structure and documentation
5. **Task Completion Rate**: Properly solving the given coding challenge

## Testing Framework Architecture

### 1. Test Runner Component
- Initialize and configure different LLM instances
- Manage test execution workflow
- Coordinate parallel execution of multiple models
- Handle test parameters and configurations

### 2. Performance Monitoring
- Real-time resource usage tracking
- Timing measurements for each request
- Memory allocation monitoring
- GPU utilization tracking

### 3. Task Definition Catalog
- Collection of coding challenges at various difficulty levels
- Standardized test cases with expected outputs
- Task categorization by programming language and complexity
- Reproducible test scenarios

### 4. Evaluation Engine
- Automated code quality assessment
- Functional correctness validation
- Syntax checking and linting (see the sketch below)
- Comparison against expected outputs
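As a first cut at the syntax-accuracy metric, generated Python can be checked with the standard library's `ast` module; a minimal sketch (linting with an external tool such as ruff or pylint would be a follow-up step):

```python
import ast


def python_syntax_ok(code: str) -> bool:
    """Return True if the generated snippet parses as valid Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


print(python_syntax_ok("def add(a, b):\n    return a + b"))  # True
print(python_syntax_ok("def add(a, b) return a + b"))        # False
```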
### 5. Reporting System
- Performance comparison dashboards
- Quality assessment reports
- Resource usage visualizations
- Summary statistics and trend analysis

## Test Scenarios

### Basic Programming Tasks
- Variable assignments and data structures
- Function implementations
- Control flow (loops, conditionals)
- Error handling and validation

### Intermediate Complexity
- Algorithm implementation
- Data structure manipulation
- API integration examples
- Modular code design

### Advanced Challenges
- Multi-threading and concurrency
- Mathematical and scientific computing
- System design and architecture
- Code optimization and refactoring
## Implementation Approach

### Phase 1: Core Framework Development
1. Create base testing infrastructure
2. Implement model interface abstraction
3. Build test runner and execution engine
4. Develop basic performance monitoring

### Phase 2: Task and Evaluation System
1. Design coding challenge catalog
2. Implement automated quality assessment
3. Create validation mechanisms
4. Develop reporting templates

### Phase 3: Advanced Features
1. Parallel execution support
2. Advanced visualization capabilities
3. Statistical analysis and confidence intervals
4. Configuration management and customization

## Expected Output

### Test Reports
- Detailed performance metrics per model
- Quality assessment scores
- Resource usage comparisons
- Statistical significance analysis
- Recommendations for model selection

### Visualization Components
- Performance comparison charts
- Resource utilization graphs
- Quality trend analysis
- Multi-model side-by-side comparisons

## Configuration Options

### Model Parameters
- Context window size
- Sampling parameters (temperature, top-p, etc.)
- System prompts and templates
- Hardware specifications

### Test Parameters
- Number of iterations per task
- Timeout thresholds
- Resource monitoring intervals
- Quality evaluation criteria weights

## Integration Points

### Existing Tools
- Local LLM APIs (e.g., text-generation-webui)
- Python-based LLM libraries
- System monitoring tools (htop, nvidia-smi)
- Code linting and validation tools

### Data Storage
- CSV/JSON output files
- Database for historical comparisons
- Version control integration
- Cloud storage for sharing results

## Success Criteria
- Reproducible test results
- Clear performance differentiation between models
- Comprehensive quality assessment coverage
- Scalable architecture for additional models
- User-friendly interface for result interpretation

## Risks and Mitigation
1. **Hardware Variability**: Address with controlled test environments
2. **Inconsistent Results**: Use multiple iterations and statistical analysis
3. **Resource Limitations**: Implement proper monitoring and resource allocation
4. **Quality Assessment Bias**: Use multiple evaluation criteria and human validation
testing/qwen3-coder-30b/example_usage.py (new executable file, 70 lines)
@@ -0,0 +1,70 @@
#!/usr/bin/env python3
"""
Example usage of the LLM Test Framework with real models
"""

import time
import json
from pathlib import Path
from llm_test_framework import TestSuite, MockLLM

# Example tasks for testing
SAMPLE_TASKS = [
    {
        "name": "Hello World Function",
        "prompt": "Write a Python function that prints 'Hello, World!'",
        "expected_output": "print('Hello, World!')",
        "parameters": {"temperature": 0.3}
    },
    {
        "name": "Sum Function",
        "prompt": "Write a Python function to calculate the sum of two numbers",
        "expected_output": "return a + b",
        "parameters": {"temperature": 0.5}
    },
    {
        "name": "FizzBuzz",
        "prompt": "Write a Python function to solve the FizzBuzz problem",
        "expected_output": "if i % 3 == 0 and i % 5 == 0",
        "parameters": {"temperature": 0.7}
    }
]


def main():
    print("Starting LLM Performance Testing Framework")
    print("=" * 50)

    # Create test suite
    suite = TestSuite("sample_test_results")

    # Add some sample models
    print("Adding sample models...")
    suite.add_model(MockLLM("Model_A", {"context": 2048}))
    suite.add_model(MockLLM("Model_B", {"context": 4096}))

    # Run tests
    print("Running sample tests...")
    suite.run_all_tests(SAMPLE_TASKS)

    # Save results
    print("Saving results...")
    suite.save_results("sample_results.json")

    # Show summary
    print("\n" + "=" * 50)
    print("TEST SUMMARY")
    print("=" * 50)

    for result in suite.results:
        print(f"Model: {result.model_name}")
        print(f"Task: {result.task_name}")
        print(f"Response Time: {result.response_time:.2f}s")
        print(f"Quality Score: {result.quality_score:.1f}/100")
        print(f"Success: {'Yes' if result.success else 'No'}")
        print("-" * 30)

    print(f"\nTotal tests run: {len(suite.results)}")
    print(f"Results saved to: {suite.output_dir}")


if __name__ == "__main__":
    main()
testing/qwen3-coder-30b/find_duplicates.py (new executable file, 129 lines)
@@ -0,0 +1,129 @@
#!/usr/bin/env python3
"""
Find duplicate files in a directory tree by content hash, with parallelization support.
"""

import hashlib
import os
import sys
from collections import defaultdict
from pathlib import Path
from multiprocessing import Pool, cpu_count
from typing import Dict, List, Tuple


def get_file_hash(filepath: Path, chunk_size: int = 8192) -> str:
    """
    Calculate the SHA256 hash of a file.

    Args:
        filepath: Path to the file
        chunk_size: Size of chunks to read at a time

    Returns:
        SHA256 hash of the file as a hexadecimal string
    """
    hash_sha256 = hashlib.sha256()
    try:
        with open(filepath, 'rb') as f:
            # Read file in chunks to handle large files efficiently
            for chunk in iter(lambda: f.read(chunk_size), b""):
                hash_sha256.update(chunk)
        return hash_sha256.hexdigest()
    except Exception as e:
        print(f"Error reading {filepath}: {e}")
        return ""


def process_file(filepath: Path) -> Tuple[str, Path]:
    """
    Process a single file and return its hash and path.

    Args:
        filepath: Path to the file

    Returns:
        Tuple of (file_hash, file_path)
    """
    file_hash = get_file_hash(filepath)
    return (file_hash, filepath)


def find_duplicates_parallel(root_dir: Path, num_processes: int = None) -> Dict[str, List[Path]]:
    """
    Find duplicate files in a directory tree by content hash using parallel processing.

    Args:
        root_dir: Root directory to search
        num_processes: Number of processes to use (default: CPU count)

    Returns:
        Dictionary mapping file hashes to lists of file paths with that hash
    """
    if num_processes is None:
        num_processes = cpu_count()

    # Get all files in the directory tree
    files = []
    for file_path in root_dir.rglob('*'):
        if file_path.is_file():
            files.append(file_path)

    print(f"Found {len(files)} files to process")

    # Process files in parallel
    with Pool(processes=num_processes) as pool:
        results = pool.map(process_file, files)

    # Group files by hash
    hash_to_files = defaultdict(list)
    for file_hash, file_path in results:
        if file_hash:  # Only add files that were successfully hashed
            hash_to_files[file_hash].append(file_path)

    # Filter out unique files (only keep duplicates)
    duplicates = {hash_val: file_list for hash_val, file_list in hash_to_files.items()
                  if len(file_list) > 1}

    return duplicates


def print_duplicates(duplicates: Dict[str, List[Path]]) -> None:
    """
    Print the duplicate files in a formatted way.

    Args:
        duplicates: Dictionary mapping file hashes to lists of file paths
    """
    if not duplicates:
        print("No duplicates found.")
        return

    print(f"Found {len(duplicates)} sets of duplicate files:")
    for hash_val, file_list in duplicates.items():
        print(f"\nHash: {hash_val}")
        for file_path in file_list:
            print(f"  {file_path}")


def main():
    """Main function to run the duplicate file finder."""
    if len(sys.argv) < 2:
        print("Usage: python find_duplicates.py <root_directory> [num_processes]")
        sys.exit(1)

    root_dir = Path(sys.argv[1])
    if not root_dir.exists():
        print(f"Error: Directory {root_dir} does not exist")
        sys.exit(1)

    num_processes = None
    if len(sys.argv) > 2:
        try:
            num_processes = int(sys.argv[2])
        except ValueError:
            print("Error: num_processes must be an integer")
            sys.exit(1)

    print(f"Searching for duplicates in {root_dir} using {num_processes or cpu_count()} processes")

    duplicates = find_duplicates_parallel(root_dir, num_processes)
    print_duplicates(duplicates)


if __name__ == "__main__":
    main()
testing/qwen3-coder-30b/llm_test_framework.py (new executable file, 246 lines)
@@ -0,0 +1,246 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework
"""

import time
import psutil
import json
import subprocess
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    gpu_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float
    gpu_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from prompt"""
        pass

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""
        pass


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        # Start resource monitoring
        start_time = time.time()

        # Monitor resources
        cpu_start = psutil.cpu_percent(interval=1)
        memory_start = psutil.virtual_memory().used
        gpu_start = self._get_gpu_usage()

        # Generate response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {str(e)}"
            success = False

        # Measure completion
        end_time = time.time()
        response_time = end_time - start_time

        # Calculate TPS
        tps = 0
        if success and response:
            # Estimate tokens (very rough approximation)
            tps = len(response.split()) / response_time if response_time > 0 else 0

        # Monitor end resources
        cpu_end = psutil.cpu_percent(interval=1)
        memory_end = psutil.virtual_memory().used
        gpu_end = self._get_gpu_usage()

        # Calculate averages
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2
        gpu_avg = (gpu_start + gpu_end) / 2

        # Quality score - this is a placeholder
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            gpu_usage=gpu_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time()
        )

        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _get_gpu_usage(self) -> float:
        """Get GPU usage if available"""
        try:
            # This is a placeholder - in real implementation,
            # you'd use nvidia-smi or similar
            return 0.0
        except:
            return 0.0

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate quality score based on response"""
        # This is a placeholder implementation
        # In practice, this could use static analysis, code execution, etc.
        if not response:
            return 0.0

        # Very basic quality scoring
        score = 50  # Base score

        # Add points for presence of expected patterns
        if expected and expected.lower() in response.lower():
            score += 20

        # Add points for valid syntax (simplified)
        if not response.startswith("Error:") and len(response) > 10:
            score += 10

        # Cap score at 100
        return min(score, 100.0)

    def save_results(self, filename: str = None):
        """Save test results to JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = []
        for result in self.results:
            results_data.append({
                "model_name": result.model_name,
                "task_name": result.task_name,
                "response_time": result.response_time,
                "tps": result.tps,
                "cpu_usage": result.cpu_usage,
                "memory_usage": result.memory_usage,
                "gpu_usage": result.gpu_usage,
                "quality_score": result.quality_score,
                "raw_output": result.raw_output,
                "expected_output": result.expected_output,
                "success": result.success,
                "timestamp": result.timestamp
            })

        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)

        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.5)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0"
        }


class ExampleTask:
    """Example task class for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {}
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {}
            }
        ]


if __name__ == "__main__":
    # Create test suite
    suite = TestSuite()

    # Add mock models
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run sample tests
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)

    # Save results
    suite.save_results()

    print("Testing completed!")
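On this AMD box, the `_get_gpu_usage()` placeholder above could be backed by the amdgpu sysfs counter rather than nvidia-smi; a rough sketch, assuming the iGPU is `card0` and the kernel exposes `gpu_busy_percent` (not part of the committed file):

```python
from pathlib import Path


def get_amdgpu_usage(card: str = "card0") -> float:
    """Read the GPU busy percentage from the amdgpu sysfs interface; 0.0 if unavailable."""
    busy_file = Path(f"/sys/class/drm/{card}/device/gpu_busy_percent")
    try:
        return float(busy_file.read_text().strip())
    except (OSError, ValueError):
        return 0.0
```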
testing/qwen3-coder-30b/llm_test_framework_simple.py (new executable file, 228 lines)
@@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""
LLM Performance Testing Framework (Simplified Version)
"""

import time
import json
import os
from abc import ABC, abstractmethod
from typing import Dict, List, Any, Tuple
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TestResult:
    """Container for test results"""
    model_name: str
    task_name: str
    response_time: float
    tps: float
    cpu_usage: float
    memory_usage: float
    quality_score: float
    raw_output: str
    expected_output: str
    success: bool
    timestamp: float


@dataclass
class PerformanceMetrics:
    """Container for performance metrics"""
    response_time: float
    tps: float
    cpu_avg: float
    memory_avg: float


class LLMInterface(ABC):
    """Abstract base class for LLM interfaces"""

    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.config = config

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Generate text from prompt"""
        pass

    @abstractmethod
    def get_model_info(self) -> Dict[str, Any]:
        """Get model information"""
        pass


class TestSuite:
    """Main test suite class"""

    def __init__(self, output_dir: str = "test_results"):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.results: List[TestResult] = []
        self.models: List[LLMInterface] = []

    def add_model(self, model: LLMInterface):
        """Add a model to test"""
        self.models.append(model)

    def run_single_test(self, model: LLMInterface, task: Dict[str, Any]) -> TestResult:
        """Run a single test on a model"""
        # Start timing
        start_time = time.time()

        # Simulate resource usage
        cpu_start = 50  # Percentage (simulated)
        memory_start = 1024  # MB (simulated)

        # Generate response
        try:
            response = model.generate(task["prompt"], **task.get("parameters", {}))
            success = True
        except Exception as e:
            response = f"Error: {str(e)}"
            success = False

        # Measure completion
        end_time = time.time()
        response_time = end_time - start_time

        # Calculate TPS (simulated)
        tps = 0
        if success and response:
            # Estimate tokens (very rough approximation)
            tps = len(response.split()) / response_time if response_time > 0 else 0

        # Simulate end resources
        cpu_end = 60  # Percentage (simulated)
        memory_end = 1536  # MB (simulated)

        # Calculate averages
        cpu_avg = (cpu_start + cpu_end) / 2
        memory_avg = (memory_start + memory_end) / 2

        # Quality score - this is a placeholder
        quality_score = self._calculate_quality_score(response, task.get("expected_output", ""))

        result = TestResult(
            model_name=model.model_name,
            task_name=task["name"],
            response_time=response_time,
            tps=tps,
            cpu_usage=cpu_avg,
            memory_usage=memory_avg,
            quality_score=quality_score,
            raw_output=response,
            expected_output=task.get("expected_output", ""),
            success=success,
            timestamp=time.time()
        )

        self.results.append(result)
        return result

    def run_all_tests(self, tasks: List[Dict[str, Any]]):
        """Run all tests across all models"""
        for model in self.models:
            print(f"Running tests on {model.model_name}")
            for task in tasks:
                result = self.run_single_test(model, task)
                print(f"  {task['name']}: {result.response_time:.2f}s, {result.quality_score:.2f}/100")

    def _calculate_quality_score(self, response: str, expected: str) -> float:
        """Calculate quality score based on response"""
        # This is a placeholder implementation
        # In practice, this could use static analysis, code execution, etc.
        if not response:
            return 0.0

        # Very basic quality scoring
        score = 50  # Base score

        # Add points for presence of expected patterns
        if expected and expected.lower() in response.lower():
            score += 20

        # Add points for valid syntax (simplified)
        if not response.startswith("Error:") and len(response) > 10:
            score += 10

        # Cap score at 100
        return min(score, 100.0)

    def save_results(self, filename: str = None):
        """Save test results to JSON file"""
        if filename is None:
            filename = f"results_{int(time.time())}.json"

        results_data = []
        for result in self.results:
            results_data.append({
                "model_name": result.model_name,
                "task_name": result.task_name,
                "response_time": result.response_time,
                "tps": result.tps,
                "cpu_usage": result.cpu_usage,
                "memory_usage": result.memory_usage,
                "quality_score": result.quality_score,
                "raw_output": result.raw_output,
                "expected_output": result.expected_output,
                "success": result.success,
                "timestamp": result.timestamp
            })

        with open(self.output_dir / filename, 'w') as f:
            json.dump(results_data, f, indent=2)

        print(f"Results saved to {self.output_dir / filename}")


# Example model implementations
class MockLLM(LLMInterface):
    """Mock LLM for testing purposes"""

    def generate(self, prompt: str, **kwargs) -> str:
        # Simulate generation time
        time.sleep(0.1)
        return f"Mock response for: {prompt}"

    def get_model_info(self) -> Dict[str, Any]:
        return {
            "name": self.model_name,
            "type": "mock",
            "version": "1.0"
        }


class ExampleTask:
    """Example task class for testing"""

    @staticmethod
    def get_sample_tasks() -> List[Dict[str, Any]]:
        return [
            {
                "name": "Simple function",
                "prompt": "Write a Python function to add two numbers",
                "expected_output": "return a + b",
                "parameters": {}
            },
            {
                "name": "Loop example",
                "prompt": "Write a Python loop that prints numbers 1 to 10",
                "expected_output": "for i in range(1, 11)",
                "parameters": {}
            }
        ]


if __name__ == "__main__":
    # Create test suite
    suite = TestSuite()

    # Add mock models
    suite.add_model(MockLLM("mock1", {}))
    suite.add_model(MockLLM("mock2", {}))

    # Run sample tests
    tasks = ExampleTask.get_sample_tasks()
    suite.run_all_tests(tasks)

    # Save results
    suite.save_results()

    print("Testing completed!")
testing/qwen3-coder-30b/requirements.txt (new file, 3 lines)
@@ -0,0 +1,3 @@
# LLM Performance Testing Framework Requirements

psutil>=5.8.0