# Deploying Kimi Linear Model on Strix Halo with 128GB Unified Memory

This document provides a comprehensive guide to deploying the Kimi Linear 48B-A3B-Instruct model on AMD Strix Halo hardware with 128GB of unified memory.

## Hardware Specifications

- **Processor**: AMD Ryzen AI Max+ 395 (Strix Halo)
- **Memory**: 128GB LPDDR5X unified memory
- **GPU**: Integrated Radeon 8060S (40 CU, RDNA 3.5)
- **Architecture**: Zen 5 CPU cores with unified memory access

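As a quick sanity check before changing anything, the following commands confirm the CPU, memory size, and GPU listed above (a minimal sketch; the exact marketing strings vary by firmware and kernel version):

```bash
# CPU model (should report Ryzen AI MAX+ 395)
lscpu | grep "Model name"

# Total system memory (should report roughly 128GB)
free -h | awk '/^Mem:/ {print $2}'

# Integrated GPU (should list the Radeon 8060S / Strix Halo graphics device)
lspci | grep -i vga
```
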
## Prerequisites

### Hardware Requirements

1. AMD Ryzen AI Max+ 395 processor (or a compatible Strix Halo APU)
2. Minimum 128GB of unified memory
3. Ubuntu 24.04 LTS or later with kernel 6.16.9+
4. BIOS access for configuration

### Software Requirements

1. Python 3.10+
2. ROCm 6.4.4 or higher
3. llama.cpp with ROCm support
4. HuggingFace CLI (`huggingface-cli`)

## Deployment Steps

### Phase 1: Hardware Configuration

#### 1.1 BIOS Configuration

Before installation, set these BIOS options:

- **UMA Frame Buffer Size**: 512MB (for display only; compute allocations come from the GTT pool configured in 1.3; see the post-boot check after this list)
- **IOMMU**: Disabled (for performance)
- **Power Mode**: 85W TDP for balanced performance

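The UMA setting cannot be verified until after the reboot in section 1.3, but once the system is back up, an optional sanity check (not part of the original procedure) is to read the dedicated VRAM carve-out, which should correspond to the 512MB frame buffer:

```bash
# Dedicated VRAM carve-out in bytes; ~512MB confirms the UMA Frame Buffer Size setting
cat /sys/class/drm/card*/device/mem_info_vram_total
```
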
#### 1.2 Kernel Configuration

Ensure you are running kernel 6.16.9 or later:

```bash
# Check the current kernel version
uname -r

# If an upgrade is needed, install a mainline kernel:
sudo add-apt-repository ppa:cappelikan/ppa -y
sudo apt update
sudo apt install mainline -y
sudo mainline --install 6.16.9
```

#### 1.3 GRUB Boot Parameters

Edit `/etc/default/grub`:

```bash
sudo nano /etc/default/grub
```

Add the following to `GRUB_CMDLINE_LINUX_DEFAULT` (the `amdgpu.gttsize` value is in MiB; 117760 MiB makes roughly 115GB of the unified memory addressable by the GPU as GTT):

```bash
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off amdgpu.gttsize=117760"
```

Apply the changes:

```bash
sudo update-grub
sudo reboot
```

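After rebooting, it is worth confirming the kernel actually picked up the new parameters (a quick optional check):

```bash
# Both amd_iommu=off and amdgpu.gttsize=117760 should appear in the output
cat /proc/cmdline
```
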
#### 1.4 GPU Access Configuration

Create udev rules so unprivileged users can access the GPU compute and render nodes:

```bash
sudo bash -c 'cat > /etc/udev/rules.d/99-amd-kfd.rules << EOF
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="card[0-9]*", GROUP="render", MODE="0666"
SUBSYSTEM=="drm", KERNEL=="renderD[0-9]*", GROUP="render", MODE="0666"
EOF'

sudo udevadm control --reload-rules
sudo udevadm trigger
sudo usermod -aG video,render $USER
```

Log out and back in for the group change to take effect.

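An optional check after logging back in confirms the permissions took effect:

```bash
# Your user should be listed in the video and render groups
groups $USER

# The compute (kfd) and render nodes should exist with the permissions set above
ls -l /dev/kfd /dev/dri/renderD*
```
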
### Phase 2: ROCm Installation

#### 2.1 Install ROCm from AMD Repository

```bash
# Add the AMD ROCm repository
# NOTE: apt-key is deprecated on newer Ubuntu releases; if this fails, use the
# signed-by keyring method from AMD's ROCm installation docs instead.
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | sudo apt-key add -
echo 'deb [arch=amd64] https://repo.radeon.com/rocm/apt/latest ubuntu main' | sudo tee /etc/apt/sources.list.d/rocm.list

sudo apt update
sudo apt install rocm-hip-runtime rocm-hip-sdk -y
```

#### 2.2 Verify ROCm Installation

```bash
# Check that ROCm detects the GPU
rocminfo | grep -E "Agent|Name:|Marketing"

# Check the GTT memory allocation (value is in bytes)
cat /sys/class/drm/card*/device/mem_info_gtt_total
```

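To put that number in human-readable terms, a small sketch (assuming a single card entry under `/sys/class/drm`) converts the byte count to GiB; it should land near the ~115GB configured via `amdgpu.gttsize`:

```bash
# GTT total in GiB (expect roughly 115)
awk '{printf "%.1f GiB\n", $1/1024/1024/1024}' /sys/class/drm/card*/device/mem_info_gtt_total
```
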
### Phase 3: Model Setup

#### 3.1 Download Kimi Linear GGUF Model

```bash
# Install the HuggingFace CLI
pip install "huggingface_hub[cli]"

# Create a model directory
mkdir -p ~/models/kimi-linear
cd ~/models/kimi-linear

# Download the Q4_K_M quantized GGUF model (recommended for this hardware)
huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
  --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf" \
  --local-dir ./

# Or download a smaller quantization if memory is tight:
# huggingface-cli download bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF \
#   --include "moonshotai_Kimi-Linear-48B-A3B-Instruct-IQ4_XS.gguf" \
#   --local-dir ./
```

#### 3.2 Validate Download

```bash
ls -la ~/models/kimi-linear/
# You should see moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf (roughly 30GB)
```

### Phase 4: llama.cpp Setup for AMD ROCm

#### 4.1 Build llama.cpp with ROCm Support

```bash
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Install build dependencies
sudo apt install cmake gcc g++ git libcurl4-openssl-dev -y

# Configure the ROCm HIP backend, targeting the Strix Halo GPU (gfx1151)
# NOTE: if CMake does not pick up the ROCm compiler, you may need to point
# HIPCXX at ROCm's clang++ (see llama.cpp's HIP build notes).
cmake -B build -S . \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS="gfx1151"

cmake --build build --config Release -j$(nproc)
```

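An optional check that the build produced the binaries used later in this guide:

```bash
# llama-cli, llama-server, and llama-bench should all be present
ls build/bin/ | grep -E "llama-cli|llama-server|llama-bench"
```
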
#### 4.2 Alternative: Use Pre-built Container

Instead of building, you can use a container with ROCm pre-configured:

```bash
# Install Podman and Distrobox
sudo apt install podman -y
curl -s https://raw.githubusercontent.com/89luca89/distrobox/main/install | sudo sh

# Create LLM container with ROCm
distrobox create llama-rocm \
  --image docker.io/kyuz0/amd-strix-halo-toolboxes:rocm-6.4.4-rocwmma \
  --additional-flags "--device /dev/dri --device /dev/kfd --group-add video --group-add render"
```

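Before running the model, it helps to confirm the GPU is visible from inside the container (an optional check; this assumes the toolbox image ships `rocminfo`, which ROCm toolboxes typically do):

```bash
# Run rocminfo inside the container without opening an interactive shell
distrobox enter llama-rocm -- rocminfo | grep -E "Name:|gfx"
```
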
### Phase 5: Test Inference

#### 5.1 Run a Basic Test

```bash
# If building locally:
./build/bin/llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256

# If using the container:
distrobox enter llama-rocm
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Explain quantum computing in simple terms:" \
  -n 256
```

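Beyond the one-shot CLI test, llama.cpp's bundled `llama-server` can expose the same model over an OpenAI-compatible HTTP endpoint, which is useful for the web UI and API tools mentioned later. A minimal sketch (the port and context size are arbitrary choices, not part of the original setup):

```bash
# Serve the model on an OpenAI-compatible endpoint (llama.cpp's built-in server)
./build/bin/llama-server \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -c 8192 \
  --host 0.0.0.0 --port 8080

# Quick smoke test from another shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in one sentence."}]}'
```
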
### Phase 6: Benchmarking and Optimization

#### 6.1 Run Performance Benchmarks

```bash
# Run llama-bench for performance testing (-mmp 0 disables mmap, matching the inference flags)
llama-bench \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  -mmp 0 \
  -ngl 99 \
  -p 512 \
  -n 128
```

#### 6.2 Monitor GPU Resources

```bash
# Watch GPU usage during inference
watch -n 1 rocm-smi

# Check GTT memory usage (values are in bytes)
cat /sys/class/drm/card*/device/mem_info_gtt_total
cat /sys/class/drm/card*/device/mem_info_gtt_used
```

## Performance Optimization Tips

### Memory Management

1. **Use the large GTT pool**: With `amdgpu.gttsize=117760`, roughly 115GB of the 128GB unified memory is available to the GPU for model weights and KV cache (a quick headroom check follows this list)
2. **Choose an appropriate quantization**:
   - Q4_K_M: Good balance of quality and size (~30GB)
   - IQ4_XS: Smaller, with some quality loss (~26GB)
   - Q5_K_M: Higher quality but larger (~35GB)

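As a rough headroom check (an illustrative sketch, not part of the original guide), subtract the model file size from the GTT total to see what remains for the KV cache and compute buffers:

```bash
# Illustrative arithmetic: remaining GTT after loading the Q4_K_M weights
model_bytes=$(stat -c %s ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf)
gtt_bytes=$(cat /sys/class/drm/card*/device/mem_info_gtt_total)
echo "Headroom for KV cache and buffers: $(( (gtt_bytes - model_bytes) / 1024 / 1024 / 1024 )) GiB"
```
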
### Inference Parameters

```bash
# Optimal flags for AMD Strix Halo
llama-cli \
  -m ~/models/kimi-linear/moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf \
  --no-mmap \
  -ngl 99 \
  -p "Your prompt text" \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```

## Expected Performance on Strix Halo

| Model Quantization | Predicted Performance | Notes            |
| ------------------ | --------------------- | ---------------- |
| Q4_K_M (~30GB)     | 10-15 tokens/sec      | Good balance     |
| IQ4_XS (~26GB)     | 8-12 tokens/sec       | Memory efficient |
| Q5_K_M (~35GB)     | 12-18 tokens/sec      | Higher quality   |

## Troubleshooting

### Common Issues

1. **HSA_STATUS_ERROR_OUT_OF_RESOURCES**: Check the udev rules and confirm your user is in the `render` group
2. **Slow model loading**: Add the `--no-mmap` flag
3. **Memory allocation errors**: Verify the `amdgpu.gttsize` setting in GRUB
4. **Container GPU access**: Check that `/dev/kfd` and `/dev/dri` are passed through with the right permissions

### Verification Commands

```bash
# Verify memory allocation
cat /sys/class/drm/card*/device/mem_info_gtt_total

# Verify ROCm detection
rocminfo | grep -E "Name:|Marketing"

# Test basic GPU functionality
rocminfo
```

## Additional Tools and Integration

### For Development/Testing

1. **llama-cpp-python**: For Python integration (use a build with the ROCm/HIP backend enabled if you want GPU offload)

   ```python
   # pip install llama-cpp-python
   from llama_cpp import Llama

   # Downloads the GGUF from the HuggingFace repo (needs huggingface_hub) and loads it
   llm = Llama.from_pretrained(
       repo_id="bartowski/moonshotai_Kimi-Linear-48B-A3B-Instruct-GGUF",
       filename="moonshotai_Kimi-Linear-48B-A3B-Instruct-Q4_K_M.gguf",
       n_gpu_layers=-1,  # offload all layers when built with GPU support
   )

   out = llm("Explain quantum computing in simple terms:", max_tokens=128)
   print(out["choices"][0]["text"])
   ```

2. **OpenWebUI**: Web interface for LLMs
3. **Ollama**: Alternative for simpler deployment
4. **vLLM**: High-throughput inference (for API deployment)

## Monitoring and Resource Usage

### System Resources to Monitor

- **GPU utilization**: `watch -n 1 rocm-smi`
- **Memory usage**: `/sys/class/drm/card*/device/mem_info_gtt_used` (compare against `mem_info_gtt_total`)
- **CPU load**: `htop`
- **Temperature**: GPU temperature as reported by `rocm-smi`

### Storage Considerations

- Model files require roughly 30-50GB on disk, depending on quantization
- Additional swap space is recommended for memory-heavy workloads
- SSD/NVMe storage is recommended for reasonable model load times

## Recommendations for Different Use Cases

### For Coding Tasks

- Use `--temp 0.3` for more deterministic outputs
- Use `--repeat-penalty 1.2` to reduce repetition
- Set `-n 1024` for longer generations

### For Research/Testing

- Use Q4_K_M for reliable performance
- Try IQ4_XS to test memory efficiency
- Benchmark with `llama-bench` for reproducible results

### For Production Use

- Consider containerized deployment for stability
- Set up automated monitoring for resource usage (a minimal logging sketch follows below)
- Create model-specific configuration files

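For the automated-monitoring item above, a minimal sketch (the log path and 10-second interval are arbitrary choices, and it assumes a single GPU card entry in sysfs) that samples GTT usage into a CSV:

```bash
#!/usr/bin/env bash
# Append a timestamped GTT-usage sample to a CSV every 10 seconds.
LOG=/var/log/llm-gtt-usage.csv
[ -f "$LOG" ] || echo "timestamp,gtt_used_gib" | sudo tee "$LOG" > /dev/null
while true; do
  used=$(awk '{printf "%.2f", $1/1024/1024/1024}' /sys/class/drm/card*/device/mem_info_gtt_used)
  echo "$(date -Is),$used" | sudo tee -a "$LOG" > /dev/null
  sleep 10
done
```
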
## Conclusion

Your Strix Halo with 128GB unified memory provides an excellent platform for running large language models like Kimi Linear. The key to success is proper configuration of both the system and inference tools to fully utilize the unified memory architecture.

With this setup, you should be able to run Kimi Linear and achieve performance that rivals much more expensive hardware setups.