Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory
This document walks through deploying the Kimi Linear 48B-A3B model (48B total parameters, ~3B active per token) on AMD Strix Halo hardware with 128GB of unified memory.
Hardware Specifications
- Processor: AMD Ryzen AI Max+ 395 (Strix Halo)
- Memory: 128GB LPDDR5X unified memory
- GPU: Integrated Radeon 8060S (40 CU RDNA 3.5)
- Architecture: Zen 5 CPU with unified memory access
Prerequisites
Hardware Requirements
- AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
- 128GB unified memory (the configuration this guide targets)
- Ubuntu 24.04 LTS or later with kernel 6.16.9+
- BIOS access for configuration
Software Requirements
- Python 3.10+
- ROCm 6.4.4 or higher
- llama.cpp with ROCm support
- HuggingFace CLI client
Deployment Steps
Phase 1: Hardware Configuration
1.1 BIOS Configuration
Before installation, you must set these BIOS options:
- UMA Frame Buffer Size: 512MB (for display)
- IOMMU: Disabled (for performance)
- Power Mode: 85W TDP for balanced performance
1.2 Kernel Configuration
Ensure you're running kernel 6.16.9 or later:
# Check current kernel
uname -r
# If upgrading needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
1.3 GRUB Boot Parameters
Edit /etc/default/grub:
sudo nano /etc/default/grub
Add to GRUB_CMDLINE_LINUX_DEFAULT:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
Apply the changes:
sudo update-grub
sudo reboot
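The `amdgpu.gttsize` parameter is specified in MiB. A quick arithmetic sketch shows why 117760 is a sensible ceiling on a 128GB machine:

```python
# amdgpu.gttsize is given in MiB.
gtt_mib = 117760
gtt_gib = gtt_mib / 1024            # 115.0 GiB of GPU-addressable memory
total_gib = 128
reserved_gib = total_gib - gtt_gib  # 13.0 GiB left for the OS and display
print(f"GTT: {gtt_gib:.0f} GiB, reserved for system: {reserved_gib:.0f} GiB")
```

Leaving roughly 13GB outside the GTT keeps the OS, desktop, and display buffer from competing with model weights.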
1.4 GPU Access Configuration
Create udev rules for proper GPU permissions:
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
Phase 2: ROCm Installation
2.1 Install ROCm from AMD Repository
# Add AMD ROCm repository (apt-key is deprecated; use a signed-by keyring instead)
sudo mkdir -p /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
2.2 Verify ROCm Installation
# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"
# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
Phase 3: Model Setup
3.1 Download Kimi Linear GGUF Model
# Install huggingface-cli
pip install "huggingface_hub[cli]"
# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear
# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
--include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
--local-dir ./
# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
# --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
# --local-dir ./
3.2 Validate Download
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
Phase 4: llama.cpp Setup for AMD ROCm
4.1 Build llama.cpp with ROCm Support
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y
# Build for ROCm (set the build type at configure time; --config only
# applies to multi-config generators such as Visual Studio)
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151"
cmake --build build -j$(nproc)
4.2 Alternative: Use Pre-built Container
Instead of building, you can use a container with ROCm pre-configured:
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh
# Create LLM container with ROCm
distrobox create llama-rocm \
--image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
--additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
Phase 5: Test Inference
5.1 Run a Basic Test
# If building locally:
./build/bin/llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
# If using container:
distrobox enter llama-rocm
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Explain quantum computing in simple terms:" \
-n 256
Phase 6: Benchmarking and Optimization
6.1 Run Performance Benchmarks
# Run llama-bench for performance testing
llama-bench \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
-mmp 0 \
-ngl 99 \
-p 512 \
-n 128
6.2 Monitor GPU Resources
# Watch GPU usage during inference
watch -n 1 rocm-smi
# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
Performance Optimization Tips
Memory Management
- GTT budget: with amdgpu.gttsize=117760, the GPU can address up to ~115GB of the 128GB unified memory
- Choose appropriate quantization:
- Q4_K_M: Good balance of quality and size (~30GB)
- IQ4_XS: Smaller but with some quality loss (~26GB)
- Q5_K_M: Higher quality but larger (~35GB)
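A quick fit check helps when choosing a quantization: weights plus key-value-cache headroom must stay inside the ~115GB GTT budget. The sketch below uses the approximate GGUF sizes quoted above; the headroom figure is an assumption, not a measured value.

```python
# Rough fit check: model weights + KV-cache headroom vs. the ~115 GiB GTT budget.
GTT_BUDGET_GIB = 115

# Approximate GGUF file sizes from the list above.
QUANT_SIZES_GIB = {"Q4_K_M": 30, "IQ4_XS": 26, "Q5_K_M": 35}

def fits(quant: str, kv_headroom_gib: float = 10.0) -> bool:
    """Return True if weights plus KV-cache headroom fit in the GTT budget."""
    return QUANT_SIZES_GIB[quant] + kv_headroom_gib <= GTT_BUDGET_GIB

for q in QUANT_SIZES_GIB:
    print(q, "fits" if fits(q) else "does not fit")
```

All three quantizations fit comfortably here; the check matters more when raising the context length, since KV-cache growth eats into the same budget.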
Inference Parameters
# Optimal flags for AMD Strix Halo
llama-cli \
-m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
--no-mmap \
-ngl 99 \
-p "Your prompt text" \
-n 512 \
--temp 0.7 \
--top-p 0.9 \
--repeat-penalty 1.1
Expected Performance on Strix Halo
| Model Quantization | Predicted Performance | Notes |
|---|---|---|
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient; IQ dequantization costs some speed |
| Q5_K_M (~35GB) | 8-12 tokens/sec | Higher quality, slower due to larger weights |
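As a sanity check on the predictions above, decode speed on an iGPU is usually memory-bandwidth-bound: each generated token must stream the active weights from memory once. The bandwidth and weight figures below are assumptions for illustration, not measurements.

```python
# Back-of-envelope decode-speed ceiling for a bandwidth-bound GPU.
BANDWIDTH_GBPS = 256  # assumed LPDDR5X bandwidth for Strix Halo

def max_tokens_per_sec(active_weights_gb: float) -> float:
    """Upper bound: memory bandwidth / bytes streamed per token."""
    return BANDWIDTH_GBPS / active_weights_gb

# If all 30 GB of Q4_K_M weights were read per token (dense model):
print(round(max_tokens_per_sec(30.0), 1))  # -> 8.5 tok/s ceiling
# Kimi Linear is an MoE (~3B active params), so only a fraction of the
# weights stream per token, e.g. ~2 GB at Q4, raising the ceiling sharply:
print(round(max_tokens_per_sec(2.0), 1))
```

This is why an MoE model of this size can feel much faster than a dense 48B model on the same hardware.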
Troubleshooting
Common Issues
- HSA_STATUS_ERROR_OUT_OF_RESOURCES: check the udev rules
- Slow model loading: add the --no-mmap flag
- Memory allocation errors: verify the GTT size in GRUB
- Container GPU access: check device permissions (/dev/dri, /dev/kfd)
Verification Commands
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"
# Test basic GPU functionality
rocminfo
Additional Tools and Integration
For Development/Testing
- llama-cpp-python: For Python integration
# Install with ROCm support (a plain pip install builds a CPU-only wheel)
CMAKE_ARGS="-DGGML_HIP=ON" pip install llama-cpp-python
# Test with Python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
    filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf",
)
- OpenWebUI: Web interface for LLMs
- Ollama: Alternative for simpler deployment
- vLLM: High-throughput inference (for API deployment)
Monitoring and Resource Usage
System Resources to Monitor
- GPU Utilization: use watch -n 1 rocm-smi
- Memory Usage: compare /sys/class/drm/card*/device/mem_info_gtt_used against mem_info_gtt_total
- CPU Load: monitor with htop
- Temperature: watch the temperature columns in rocm-smi
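The sysfs counters above can be polled with a short script. This is a hypothetical helper, not part of the stack; it reads amdgpu's GTT counters (reported in bytes) and prints usage per card:

```python
"""Poll AMD GPU GTT memory usage from sysfs (illustrative helper)."""
import glob

def to_gib(num_bytes: int) -> float:
    """Convert a byte count (as sysfs reports) to GiB, rounded to 2 decimals."""
    return round(num_bytes / 1024**3, 2)

def read_gtt() -> list[tuple[float, float]]:
    """Return (used_gib, total_gib) per card from amdgpu's sysfs counters."""
    stats = []
    for total_path in glob.glob("/sys/class/drm/card*/device/mem_info_gtt_total"):
        used_path = total_path.replace("gtt_total", "gtt_used")
        with open(total_path) as t, open(used_path) as u:
            stats.append((to_gib(int(u.read())), to_gib(int(t.read()))))
    return stats

if __name__ == "__main__":
    # Empty list on machines without an amdgpu device.
    for used, total in read_gtt():
        print(f"GTT: {used:.2f} / {total:.2f} GiB")
```

Run it in a loop (or under watch) alongside inference to confirm the model actually lands in GTT rather than spilling.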
Storage Considerations
- Model files require ~30-50GB on disk
- Additional swap space recommended for complex tasks
- Consider SSD storage for optimal I/O performance
Recommendations for Different Use Cases
For Coding Tasks
- Use --temp 0.3 for more deterministic outputs
- Enable --repeat-penalty 1.2 to reduce repetition
- Set -n 1024 for longer generations
For Research/Testing
- Use Q4_K_M for reliable performance
- Try IQ4_XS to test memory efficiency
- Benchmark with llama-bench for reproducible results
For Production Use
- Consider containerized deployment for stability
- Set up automated monitoring for resource usage
- Create model-specific configuration files
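One way to keep model-specific configuration files, as suggested above, is a small profile loader. The profile format and field names here are illustrative assumptions, not an established convention:

```python
"""Hypothetical per-model profile: keep launch flags in one place per model."""
import json

# Illustrative profile; values mirror the flags used earlier in this guide.
PROFILE_JSON = """
{
  "model": "~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf",
  "ngl": 99,
  "no_mmap": true,
  "temp": 0.7,
  "repeat_penalty": 1.1
}
"""

def to_cli_args(profile: dict) -> list[str]:
    """Translate a profile dict into llama-cli/llama-server flags."""
    args = [
        "-m", profile["model"],
        "-ngl", str(profile["ngl"]),
        "--temp", str(profile["temp"]),
        "--repeat-penalty", str(profile["repeat_penalty"]),
    ]
    if profile.get("no_mmap"):
        args.append("--no-mmap")
    return args

PROFILE = json.loads(PROFILE_JSON)
print(" ".join(to_cli_args(PROFILE)))
```

Checking a profile like this into version control makes launches reproducible across models and restarts.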
Conclusion
Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.
With this setup, you should be able to run Kimi Linear and achieve performance that rivals much more expensive hardware setups.