# Deploying the Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B-A3B model (48B total parameters, ~3B active per token via its mixture-of-experts design) on AMD Strix Halo hardware with 128GB of unified memory.

## Hardware Specifications

- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CUs, RDNA 3.5)
- **Architecture**: Zen 5 CPU cores with unified memory access

## Prerequisites

### Hardware Requirements

1. AMD Ryzen AI Max+ 395 processor (or a compatible Strix Halo APU)
2. 128GB of RAM (as specified)
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration

### Software Requirements

1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. Hugging Face CLI (`huggingface_hub[cli]`)

## Deployment Steps

### Phase 1: Hardware Configuration

#### 1.1 BIOS Configuration

Before installation, set these BIOS options:

- **UMA Frame Buffer Size**: 512MB (for display; compute memory is allocated dynamically from GTT)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance

#### 1.2 Kernel Configuration

Ensure you're running kernel 6.16.9 or later:

```bash
# Check current kernel
uname -r

# If an upgrade is needed:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```

#### 1.3 GRUB Boot Parameters

Edit `/etc/default/grub`:

```bash
sudo nano /etc/default/grub
```

Add to `GRUB_CMDLINE_LINUX_DEFAULT`:

```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```

Apply the changes:

```bash
sudo update-grub
sudo reboot
```

#### 1.4 GPU Access Configuration

Create udev rules for proper GPU permissions:

```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'
sudo udevadm control \
  --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```

### Phase 2: ROCm Installation

#### 2.1 Install ROCm from the AMD Repository

```bash
# Add the AMD ROCm repository (apt-key is deprecated on Ubuntu 24.04; use a keyring file)
sudo mkdir -p /etc/apt/keyrings
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
  gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/latest ubuntu main' | \
  sudo tee /etc/apt/sources.list.d/rocm.list
sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```

#### 2.2 Verify ROCm Installation

```bash
# Check that ROCm sees the GPU
rocminfo | grep -E "Agent|Name:|Marketing"

# Check the GTT memory allocation (reported in bytes)
cat /sys/class/drm/card*/device/mem_info_gtt_total
```

### Phase 3: Model Setup

#### 3.1 Download the Kimi Linear GGUF Model

```bash
# Install the Hugging Face CLI
pip install "huggingface_hub[cli]"

# Create a model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for this hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
  --local-dir ./

# Or download a smaller quantization if memory is tight
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#   --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#   --local-dir ./
```

#### 3.2 Validate the Download

```bash
ls -la ~/models/kimi-linear/
# You should see a file named moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf
```

### Phase 4: llama.cpp Setup for AMD ROCm

#### 4.1 Build llama.cpp with ROCm Support

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Build for ROCm (gfx1151 is the Strix Halo GPU target)
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151"
cmake --build build --config Release -j$(nproc)
```

#### 4.2 Alternative: Use a Pre-built Container

Instead of building, you can use a container with ROCm pre-configured:

```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create an LLM container with ROCm
distrobox create llama-rocm \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```

### Phase 5: Test Inference

#### 5.1 Run a Basic Test

```bash
# If built locally:
./build/bin/llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# If using the container:
distrobox enter llama-rocm
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256
```

### Phase 6: Benchmarking and Optimization

#### 6.1 Run Performance Benchmarks

```bash
# Run llama-bench for performance testing
llama-bench \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -mmp 0 \
  -ngl 99 \
  -p 512 \
  -n 128
```

#### 6.2 Monitor GPU Resources

```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check GTT memory usage (bytes)
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```

## Performance Optimization Tips

### Memory Management

1. **Use the full GTT**: Your system can utilize up to ~115GB of unified memory for compute.
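   The GRUB value used earlier is easy to sanity-check, since the `amdgpu.gttsize` kernel parameter is specified in MiB:

   ```shell
   # 115 GiB expressed in MiB, leaving ~13GB of the 128GB for the OS and display
   echo $((115 * 1024))   # prints 117760, the value used in the GRUB parameter
   ```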
2. **Choose appropriate quantization**:
   - Q4_K_M: good balance of quality and size (~30GB)
   - IQ4_XS: smaller, with some quality loss (~26GB)
   - Q5_K_M: higher quality but larger (~35GB)

### Inference Parameters

```bash
# Suggested flags for AMD Strix Halo
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Your prompt text" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```

## Expected Performance on Strix Halo

These are rough estimates, not measurements; decoding is memory-bandwidth-bound, so larger quantizations stream more data per token and tend to run slower:

| Model Quantization | Predicted Performance | Notes |
| ------------------ | --------------------- | ----- |
| Q4_K_M (~30GB) | 10-15 tokens/sec | Good balance |
| IQ4_XS (~26GB) | 8-12 tokens/sec | Memory efficient |
| Q5_K_M (~35GB) | 8-12 tokens/sec | Higher quality, slower decode |

## Troubleshooting

### Common Issues

1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: check the udev rules and group membership
2. **Slow model loading**: add the `--no-mmap` flag
3. **Memory allocation errors**: verify the GTT size in GRUB
4. **Container GPU access**: check device permissions (`/dev/dri`, `/dev/kfd`)

### Verification Commands

```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"

# Test basic GPU functionality
rocminfo
```

## Additional Tools and Integration

### For Development/Testing

1. **llama-cpp-python**: for Python integration

   ```python
   # Install first: pip install llama-cpp-python
   from llama_cpp import Llama

   llm = Llama.from_pretrained(
       repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
       filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf"
   )
   ```

2. **OpenWebUI**: web interface for LLMs
3. **Ollama**: alternative for simpler deployment
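   As a hedged sketch (the model name `kimi-linear` below is arbitrary, not an official Ollama model), a locally downloaded GGUF can be imported into Ollama through a Modelfile:

   ```shell
   # Create the model directory used throughout this guide and write a Modelfile
   # pointing at the GGUF downloaded earlier
   mkdir -p ~/models/kimi-linear
   cd ~/models/kimi-linear
   printf 'FROM ./moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf\n' > Modelfile

   # Then register and chat with it (requires Ollama to be installed):
   # ollama create kimi-linear -f Modelfile
   # ollama run kimi-linear "Explain quantum computing in simple terms:"
   ```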
4. **vLLM**: high-throughput inference (for API deployment)

## Monitoring and Resource Usage

### System Resources to Monitor

- **GPU utilization**: use `watch -n 1 rocm-smi`
- **Memory usage**: compare `/sys/class/drm/card*/device/mem_info_gtt_used` against `mem_info_gtt_total`
- **CPU load**: monitor with `htop`
- **Temperature**: watch the GPU temperature reported by `rocm-smi`

### Storage Considerations

- Model files require ~30-50GB on disk
- Additional swap space is recommended for complex tasks
- Use SSD storage for optimal I/O performance

## Recommendations for Different Use Cases

### For Coding Tasks

- Use `--temp 0.3` for more deterministic outputs
- Use `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for longer generations

### For Research/Testing

- Use `Q4_K_M` for reliable performance
- Try `IQ4_XS` to test memory efficiency
- Benchmark with `llama-bench` for reproducible results

### For Production Use

- Consider containerized deployment for stability
- Set up automated monitoring of resource usage
- Create model-specific configuration files

## Conclusion

A Strix Halo system with 128GB of unified memory is an excellent platform for running large language models like Kimi Linear. The key to success is configuring both the system and the inference tools to fully exploit the unified memory architecture. With this setup, Kimi Linear should run at speeds competitive with far more expensive hardware.
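The verification steps scattered through this guide can be collected into a single sanity-check script. This is a sketch, not a definitive tool: the model path follows the assumptions above, and the `sanity_check` function name is ours.

```shell
#!/usr/bin/env bash
# Hypothetical one-shot sanity check for the setup described in this guide.
sanity_check() {
  # Model path from this guide; pass another path as $1 to override
  local model="${1:-$HOME/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf}"

  if [ -f "$model" ]; then
    echo "model: OK"
  else
    echo "model: MISSING ($model)"
  fi

  # Report GTT size in GiB where the sysfs node exists
  local f
  for f in /sys/class/drm/card*/device/mem_info_gtt_total; do
    [ -r "$f" ] && awk '{ printf "GTT total: %.1f GiB\n", $1 / 1024 / 1024 / 1024 }' "$f"
  done

  # Confirm ROCm sees the GPU, if rocminfo is installed
  if command -v rocminfo > /dev/null; then
    rocminfo | grep -m1 "Marketing"
  fi
  echo "sanity check complete"
}

sanity_check "$@"
```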