
Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on your AMD Strix Halo hardware with 128GB unified memory.

Hardware Specifications

  • Processor: AMD Ryzen AI Max+ 395 (Strix Halo)
  • Memory: 128GB LPDDR5X unified memory
  • GPU: Integrated Radeon 8060S (40 CU RDNA 3.5)
  • Architecture: Zen 5 CPU with unified memory access

Prerequisites

Hardware Requirements

  1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
  2. 128GB LPDDR5X unified memory
  3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
  4. BIOS access for configuration

Software Requirements

  1. Python 3.10+
  2. ROCm 6.4.4 or higher
  3. llama.cpp with ROCm support
  4. HuggingFace CLI client

Deployment Steps

Phase 1: Hardware Configuration

1.1 BIOS Configuration

Before installation, set these BIOS options (a quick verification sketch follows the list):

  • UMA Frame Buffer Size: 512MB (a small fixed carve-out for the display; compute memory is allocated dynamically through GTT)
  • IOMMU: Disabled (avoids DMA-remapping overhead on the unified memory path)
  • Power Mode: 85W TDP for balanced performance
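
Once the system is back up, you can sanity-check the IOMMU setting from Linux. A minimal verification sketch (the sysfs path is standard; an empty directory means no active IOMMU):

# The directory should be empty (or absent) when the IOMMU is disabled
ls /sys/class/iommu/

# Kernel messages should show no active AMD-Vi translation
sudo dmesg | grep -iE "amd-vi|iommu"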

1.2 Kernel Configuration

Ensure you're running kernel 6.16.9 or later:

# Check current kernel
uname -r

# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9

1.3 GRUB Boot Parameters

Edit /etc/default/grub:

sudo nano /etc/default/grub

Add to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"

Apply the changes:

sudo update-grub
sudo reboot
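
The gttsize value is given in MiB: 117760 MiB is exactly 115GiB, leaving roughly 13GB of the 128GB for the OS. After rebooting, confirm the parameters took effect (sysfs reports bytes, so 115GiB shows up as about 123480309760):

# Boot parameters should include amd_iommu=off and amdgpu.gttsize=117760
cat /proc/cmdline

# Should report about 123480309760 bytes (117760 MiB)
cat /sys/class/drm/card*/device/mem_info_gtt_total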

1.4 GPU Access Configuration

Create udev rules for proper GPU permissions:

sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
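
The new group memberships only apply to fresh login sessions, so log out and back in (or reboot) before testing. A quick check afterwards:

# Both groups should be listed after re-login
groups | grep -E "video|render"

# The compute and render device nodes should exist and be accessible
ls -l /dev/kfd /dev/dri/renderD*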

Phase 2: ROCm Installation

2.1 Install ROCm from AMD Repository

# Add AMD ROCm repository
# apt-key is deprecated on Ubuntu 24.04; use a keyring file with signed-by instead
sudo mkdir -p /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main' | sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y

2.2 Verify ROCm Installation

# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"

# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
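
On this hardware the iGPU should enumerate as target gfx1151; a minimal confirmation (the rocm-smi memory view is optional and depends on the tool version shipped with your ROCm):

# The iGPU agent should report gfx1151
rocminfo | grep -i gfx1151

# Optional: rocm-smi's view of VRAM and GTT
rocm-smi --showmeminfo vram gtt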

Phase 3: Model Setup

3.1 Download Kimi Linear GGUF Model

# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
  --local-dir ./

# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#   --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#   --local-dir ./

3.2 Validate Download

ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf

Phase 4: llama.cpp Setup for AMD ROCm

4.1 Build llama.cpp with ROCm Support

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Build for ROCm (HIPCXX/HIP_PATH per the llama.cpp HIP build docs)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)
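
Before downloading a 30GB model, it is worth a quick smoke test that the binary runs at all:

# Confirms the build completed; prints the llama.cpp build number and commit
./build/bin/llama-cli --version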

4.2 Alternative: Use Pre-built Container

Instead of building, you can use a container with ROCm pre-configured:

# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create LLM container with ROCm
distrobox create llama-rocm \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
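
Before pulling a model into the container, confirm it can actually see the GPU; distrobox lets you run a single command without an interactive shell:

# Should print the gfx1151 agent if device passthrough worked
distrobox enter llama-rocm -- rocminfo | grep -i gfx1151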

Phase 5: Test Inference

5.1 Run a Basic Test

# If building locally:
./build/bin/llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# If using container:
distrobox enter llama-rocm
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256
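
For anything beyond one-off prompts, the same build ships llama-server, which exposes an OpenAI-compatible HTTP API. A minimal sketch; the port and bind address here are arbitrary choices, not requirements:

# Start the server with the same offload/mmap flags as llama-cli
./build/bin/llama-server \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  --host 127.0.0.1 --port 8080

# Query the OpenAI-compatible chat endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'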

Phase 6: Benchmarking and Optimization

6.1 Run Performance Benchmarks

# Run llama-bench for performance testing
llama-bench \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -mmp 0 \
  -ngl 99 \
  -p 512 \
  -n 128

6.2 Monitor GPU Resources

# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
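
The sysfs counters are raw bytes, which are awkward to read at a glance; a small helper loop that converts them to GiB (plain bash/awk, no extra tooling assumed):

# Print GTT used and total in GiB
for f in /sys/class/drm/card*/device/mem_info_gtt_{used,total}; do
  printf '%s: %.1f GiB\n' "$f" "$(awk '{print $1/1073741824}' "$f")"
done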

Performance Optimization Tips

Memory Management

  1. Use the full GTT budget: with amdgpu.gttsize=117760 the GPU can address up to ~115GB of the 128GB for weights and KV cache
  2. Choose an appropriate quantization (a rough fit check follows this list):
    • Q4_K_M: Good balance of quality and size (~30GB)
    • IQ4_XS: Smaller but with some quality loss (~26GB)
    • Q5_K_M: Higher quality but larger (~35GB)
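
A rough fit check before committing to a download: compare the file size against free GTT. This is only a sketch; it assumes the model path from Phase 3 and card0 as the iGPU (adjust the index if your system enumerates differently), and it ignores KV-cache growth, so leave several GB of headroom:

MODEL=~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
GTT_TOTAL=$(cat /sys/class/drm/card0/device/mem_info_gtt_total)
GTT_USED=$(cat /sys/class/drm/card0/device/mem_info_gtt_used)
MODEL_BYTES=$(stat -c %s "$MODEL")

# Weights must fit in free GTT, with headroom for KV cache and activations
echo "free GTT: $(( (GTT_TOTAL - GTT_USED) / 1073741824 )) GiB"
echo "model:    $(( MODEL_BYTES / 1073741824 )) GiB"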

Inference Parameters

# Optimal flags for AMD Strix Halo
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Your prompt text" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1

Expected Performance on Strix Halo

Quantization    Size    Predicted Performance    Notes
Q4_K_M          ~30GB   10-15 tokens/sec         Good balance
IQ4_XS          ~26GB   8-12 tokens/sec          Memory efficient
Q5_K_M          ~35GB   8-12 tokens/sec          Higher quality, slower decode

Decode on this hardware is memory-bandwidth bound, so a larger quantization reads more bytes per token and cannot decode faster than a smaller one.

Troubleshooting

Common Issues

  1. HSA_STATUS_ERROR_OUT_OF_RESOURCES: Check udev rules and group membership (workaround sketch after this list)
  2. Slow model loading: Add --no-mmap flag
  3. Memory allocation errors: Verify GTT size in GRUB
  4. Container GPU access: Check device permissions
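
If the runtime still refuses to enumerate the GPU, a commonly reported workaround on Strix Halo is to pin the HSA target explicitly. This is an unofficial override rather than a documented requirement, so treat it as something to test and then drop if native detection works:

# Force the HSA runtime to treat the iGPU as gfx1151 (version 11.5.1)
export HSA_OVERRIDE_GFX_VERSION=11.5.1
rocminfo | grep -i gfx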

Verification Commands

# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing" 

# Test basic GPU functionality  
rocminfo

Additional Tools and Integration

For Development/Testing

  1. llama-cpp-python: For Python integration
# Install package (CMAKE_ARGS builds the wheel with the ROCm/HIP backend)
CMAKE_ARGS="-DGGML_HIP=on" pip install llama-cpp-python

# Test with Python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
    filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers, mirroring -ngl 99
)
  2. OpenWebUI: Web interface for LLMs
  3. Ollama: Alternative for simpler deployment
  4. vLLM: High-throughput inference (for API deployment)

Monitoring and Resource Usage

System Resources to Monitor

  • GPU Utilization: Use watch -n 1 rocm-smi
  • Memory Usage: Compare /sys/class/drm/card*/device/mem_info_gtt_used against mem_info_gtt_total
  • CPU Load: Monitor with htop
  • Temperature: rocm-smi also reports the GPU temperature

Storage Considerations

  • Model files require ~30-50GB on disk
  • Swap headroom is useful when model loading competes with other workloads for RAM
  • Consider SSD storage for optimal I/O performance

Recommendations for Different Use Cases

For Coding Tasks

  • Use --temp 0.3 for more deterministic outputs
  • Enable --repeat-penalty 1.2 to reduce repetition
  • Set -n 1024 for longer outputs (all three combined in the sketch after this list)
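
Combined, an invocation for coding tasks looks like this (model path as in Phase 3; the prompt is just an example):

llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap -ngl 99 \
  --temp 0.3 --repeat-penalty 1.2 -n 1024 \
  -p "Write a Python function that parses an ISO 8601 timestamp."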

For Research/Testing

  • Use Q4_K_M for reliable performance
  • Try IQ4_XS to test memory efficiency
  • Monitor with llama-bench for reproducible results

For Production Use

  • Consider containerized or service-managed deployment for stability (a systemd sketch follows this list)
  • Set up automated monitoring for resource usage
  • Create model-specific configuration files
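
One way to get service-managed deployment is a systemd unit around llama-server. A minimal sketch, assuming the build and model paths from earlier phases under a dedicated llama user (both are assumptions; adjust to your layout):

sudo tee /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp server (Kimi Linear Q4_K_M)
After=network.target

[Service]
User=llama
# GPU access requires the same groups as the interactive setup
SupplementaryGroups=render video
ExecStart=/home/llama/llama.cpp/build/bin/llama-server \
  -m /home/llama/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap -ngl 99 --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server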

Conclusion

Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.

With this setup, you should be able to run Kimi Linear entirely on the iGPU, a workload that would otherwise call for considerably more expensive discrete-GPU hardware with enough memory to hold the model.