
Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B parameter model on your AMD Strix Halo hardware with 128GB unified memory.

Hardware Specifications

  • Processor: AMD Ryzen AI Max+ 395 (Strix Halo)
  • Memory: 128GB LPDDR5X unified memory
  • GPU: Integrated Radeon 8060S (40 CU RDNA 3.5)
  • Architecture: Zen 5 CPU with unified memory access

Prerequisites

Hardware Requirements

  1. AMD Ryzen AI Max+ 395 processor (or compatible Strix Halo APU)
  2. 128GB LPDDR5X unified memory
  3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
  4. BIOS access for configuration

Software Requirements

  1. Python 3.10+
  2. ROCm 6.4.4 or higher
  3. llama.cpp with ROCm support
  4. HuggingFace CLI client

Deployment Steps

Phase 1: Hardware Configuration

1.1 BIOS Configuration

Before installation, set these BIOS options (a quick verification sketch follows the list):

  • UMA Frame Buffer Size: 512MB (a small fixed carve-out for the display; compute memory is allocated dynamically through GTT)
  • IOMMU: Disabled (avoids DMA-remapping overhead on the unified memory path)
  • Power Mode: 85W TDP for balanced performance
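
Once the system is back up, you can sanity-check the IOMMU setting from Linux. A minimal verification sketch (the sysfs path is standard; an empty directory means no active IOMMU):

# The directory should be empty (or absent) when the IOMMU is disabled
ls /sys/class/iommu/

# Kernel messages should show no active AMD-Vi translation
sudo dmesg | grep -iE "amd-vi|iommu"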

1.2 Kernel Configuration

Ensure you're running kernel 6.16.9 or later:

# Check current kernel
uname -r

# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9

1.3 GRUB Boot Parameters

Edit /etc/default/grub:

sudo nano /etc/default/grub

Add to GRUB_CMDLINE_LINUX_DEFAULT:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"

Apply the changes:

sudo update-grub
sudo reboot
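
The gttsize value is given in MiB: 117760 MiB is exactly 115GiB, leaving roughly 13GB of the 128GB for the OS. After rebooting, confirm the parameters took effect (sysfs reports bytes, so 115GiB shows up as about 123480309760):

# Boot parameters should include amd_iommu=off and amdgpu.gttsize=117760
cat /proc/cmdline

# Should report about 123480309760 bytes (117760 MiB)
cat /sys/class/drm/card*/device/mem_info_gtt_total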

1.4 GPU Access Configuration

Create udev rules for proper GPU permissions:

sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
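
The new group memberships only apply to fresh login sessions, so log out and back in (or reboot) before testing. A quick check afterwards:

# Both groups should be listed after re-login
groups | grep -E "video|render"

# The compute and render device nodes should exist and be accessible
ls -l /dev/kfd /dev/dri/renderD*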

Phase 2: ROCm Installation

2.1 Install ROCm from AMD Repository

# Add AMD ROCm repository
# apt-key is deprecated on Ubuntu 24.04; use a keyring file with signed-by instead
sudo mkdir -p /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest noble main' | sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y

2.2 Verify ROCm Installation

# Check ROCm version
rocminfo | grep -E "Agent|Name:|Marketing"

# Check GTT memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total
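
On this hardware the iGPU should enumerate as target gfx1151; a minimal confirmation (the rocm-smi memory view is optional and depends on the tool version shipped with your ROCm):

# The iGPU agent should report gfx1151
rocminfo | grep -i gfx1151

# Optional: rocm-smi's view of VRAM and GTT
rocm-smi --showmeminfo vram gtt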

Phase 3: Model Setup

3.1 Download Kimi Linear GGUF Model

# Install huggingface-cli
pip install "huggingface_hub[cli]"

# Create model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for your hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
  --local-dir ./

# Or download a lower quality version if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#   --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#   --local-dir ./

3.2 Validate Download

ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf

Phase 4: llama.cpp Setup for AMD ROCm

4.1 Build llama.cpp with ROCm Support

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Build for ROCm (HIPCXX/HIP_PATH per the llama.cpp HIP build docs)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151" \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release -j$(nproc)
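
Before downloading a 30GB model, it is worth a quick smoke test that the binary runs at all:

# Confirms the build completed; prints the llama.cpp build number and commit
./build/bin/llama-cli --version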

4.2 Alternative: Use Pre-built Container

Instead of building, you can use a container with ROCm pre-configured:

# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create LLM container with ROCm
distrobox create llama-rocm \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
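
Before pulling a model into the container, confirm it can actually see the GPU; distrobox lets you run a single command without an interactive shell:

# Should print the gfx1151 agent if device passthrough worked
distrobox enter llama-rocm -- rocminfo | grep -i gfx1151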

Phase 5: Test Inference

5.1 Run a Basic Test

# If building locally:
./build/bin/llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# If using container:
distrobox enter llama-rocm
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256
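
For anything beyond one-off prompts, the same build ships llama-server, which exposes an OpenAI-compatible HTTP API. A minimal sketch; the port and bind address here are arbitrary choices, not requirements:

# Start the server with the same offload/mmap flags as llama-cli
./build/bin/llama-server \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  --host 127.0.0.1 --port 8080

# Query the OpenAI-compatible chat endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'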

Phase 6: Benchmarking and Optimization

6.1 Run Performance Benchmarks

# Run llama-bench for performance testing
llama-bench \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -mmp 0 \
  -ngl 99 \
  -p 512 \
  -n 128

6.2 Monitor GPU Resources

# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check memory usage
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
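
The sysfs counters are raw bytes, which are awkward to read at a glance; a small helper loop that converts them to GiB (plain bash/awk, no extra tooling assumed):

# Print GTT used and total in GiB
for f in /sys/class/drm/card*/device/mem_info_gtt_{used,total}; do
  printf '%s: %.1f GiB\n' "$f" "$(awk '{print $1/1073741824}' "$f")"
done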

Performance Optimization Tips

Memory Management

  1. Use the full GTT budget: with amdgpu.gttsize=117760 the GPU can address up to ~115GB of the 128GB for weights and KV cache
  2. Choose an appropriate quantization (a rough fit check follows this list):
    • Q4_K_M: Good balance of quality and size (~30GB)
    • IQ4_XS: Smaller but with some quality loss (~26GB)
    • Q5_K_M: Higher quality but larger (~35GB)
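
A rough fit check before committing to a download: compare the file size against free GTT. This is only a sketch; it assumes the model path from Phase 3 and card0 as the iGPU (adjust the index if your system enumerates differently), and it ignores KV-cache growth, so leave several GB of headroom:

MODEL=~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
GTT_TOTAL=$(cat /sys/class/drm/card0/device/mem_info_gtt_total)
GTT_USED=$(cat /sys/class/drm/card0/device/mem_info_gtt_used)
MODEL_BYTES=$(stat -c %s "$MODEL")

# Weights must fit in free GTT, with headroom for KV cache and activations
echo "free GTT: $(( (GTT_TOTAL - GTT_USED) / 1073741824 )) GiB"
echo "model:    $(( MODEL_BYTES / 1073741824 )) GiB"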

Inference Parameters

# Optimal flags for AMD Strix Halo
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Your prompt text" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1

Expected Performance on Strix Halo

Quantization    Size    Predicted Performance    Notes
Q4_K_M          ~30GB   10-15 tokens/sec         Good balance
IQ4_XS          ~26GB   8-12 tokens/sec          Memory efficient
Q5_K_M          ~35GB   8-12 tokens/sec          Higher quality, slower decode

Decode on this hardware is memory-bandwidth bound, so a larger quantization reads more bytes per token and cannot decode faster than a smaller one.

Troubleshooting

Common Issues

  1. HSA_STATUS_ERROR_OUT_OF_RESOURCES: Check udev rules and group membership (workaround sketch after this list)
  2. Slow model loading: Add --no-mmap flag
  3. Memory allocation errors: Verify GTT size in GRUB
  4. Container GPU access: Check device permissions
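
If the runtime still refuses to enumerate the GPU, a commonly reported workaround on Strix Halo is to pin the HSA target explicitly. This is an unofficial override rather than a documented requirement, so treat it as something to test and then drop if native detection works:

# Force the HSA runtime to treat the iGPU as gfx1151 (version 11.5.1)
export HSA_OVERRIDE_GFX_VERSION=11.5.1
rocminfo | grep -i gfx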

Verification Commands

# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing" 

# Test basic GPU functionality  
rocminfo

Additional Tools and Integration

For Development/Testing

  1. llama-cpp-python: For Python integration
# Install package (CMAKE_ARGS builds the wheel with the ROCm/HIP backend)
CMAKE_ARGS="-DGGML_HIP=on" pip install llama-cpp-python

# Test with Python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
    filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers, mirroring -ngl 99
)
  2. OpenWebUI: Web interface for LLMs
  3. Ollama: Alternative for simpler deployment
  4. vLLM: High-throughput inference (for API deployment)

Monitoring and Resource Usage

System Resources to Monitor

  • GPU Utilization: Use watch -n 1 rocm-smi
  • Memory Usage: Compare /sys/class/drm/card*/device/mem_info_gtt_used against mem_info_gtt_total
  • CPU Load: Monitor with htop
  • Temperature: rocm-smi also reports the GPU temperature

Storage Considerations

  • Model files require ~30-50GB on disk
  • Swap headroom is useful when model loading competes with other workloads for RAM
  • Consider SSD storage for optimal I/O performance

Recommendations for Different Use Cases

For Coding Tasks

  • Use --temp 0.3 for more deterministic outputs
  • Enable --repeat-penalty 1.2 to reduce repetition
  • Set -n 1024 for longer outputs (all three combined in the sketch after this list)
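
Combined, an invocation for coding tasks looks like this (model path as in Phase 3; the prompt is just an example):

llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap -ngl 99 \
  --temp 0.3 --repeat-penalty 1.2 -n 1024 \
  -p "Write a Python function that parses an ISO 8601 timestamp."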

For Research/Testing

  • Use Q4_K_M for reliable performance
  • Try IQ4_XS to test memory efficiency
  • Monitor with llama-bench for reproducible results

For Production Use

  • Consider containerized or service-managed deployment for stability (a systemd sketch follows this list)
  • Set up automated monitoring for resource usage
  • Create model-specific configuration files
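
One way to get service-managed deployment is a systemd unit around llama-server. A minimal sketch, assuming the build and model paths from earlier phases under a dedicated llama user (both are assumptions; adjust to your layout):

sudo tee /etc/systemd/system/llama-server.service << 'EOF'
[Unit]
Description=llama.cpp server (Kimi Linear Q4_K_M)
After=network.target

[Service]
User=llama
# GPU access requires the same groups as the interactive setup
SupplementaryGroups=render video
ExecStart=/home/llama/llama.cpp/build/bin/llama-server \
  -m /home/llama/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap -ngl 99 --host 127.0.0.1 --port 8080
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server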

Conclusion

Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.

With this setup, you should be able to run Kimi Linear entirely on the iGPU, a workload that would otherwise call for considerably more expensive discrete-GPU hardware with enough memory to hold the model.