Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).
## Overview
This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:
| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
**Note**: All shell scripts automatically stop any existing container with the same name before starting a new one. To stop and remove a server manually, run `docker rm -f <container_name>`.
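The stop-then-start behaviour described in the note can be sketched roughly as follows. This is an illustration only, not the project's actual scripts: the function name `restart_server` and the container name `qwen36-server` are hypothetical, and the `DOCKER` variable exists only so the commands can be dry-run.

```bash
#!/usr/bin/env bash
# Sketch of a start script: remove any old container, then bring up a compose file.
# DOCKER can be overridden (e.g. DOCKER=echo) to print the commands instead of running them.
DOCKER="${DOCKER:-docker}"

restart_server() {
  local container_name="$1"   # hypothetical; must match the name used in the compose file
  local compose_file="$2"

  # Force-remove a container with the same name; ignore the error if none exists.
  "$DOCKER" rm -f "$container_name" 2>/dev/null || true

  # Start the services defined in the compose file in the background.
  "$DOCKER" compose -f "$compose_file" up -d
}

# Example call (commented out so that sourcing this file has no side effects):
# restart_server qwen36-server docker-compose_Qwen3.6_Tools_RAG_faehig.yml
```

Running `DOCKER=echo restart_server <name> <file>` prints the two Docker commands without touching any container, which is a cheap way to check the arguments first.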
## Configuration Details
### Hardware Requirements
- **GPU**: 2x NVIDIA RTX 3090, or equivalent GPUs with 24 GB+ VRAM each
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache
### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access
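As a rough guide, the GPU settings listed above correspond to `llama-server` flags along these lines. This is a sketch, not a copy of the compose files: `MODEL_PATH` is a placeholder, and flag spellings can differ between llama.cpp versions (newer builds may expect `--flash-attn on` rather than the bare flag).

```bash
# Assemble llama-server flags matching the GPU setup listed above.
# MODEL_PATH is a placeholder; point it at the actual GGUF file.
MODEL_PATH="${MODEL_PATH:-/models/model.gguf}"

GPU_FLAGS="-ngl 999"                            # offload all layers to the GPUs
GPU_FLAGS="$GPU_FLAGS --main-gpu 0"             # primary GPU: device 0
GPU_FLAGS="$GPU_FLAGS --tensor-split 0.5,0.5"   # split evenly across both GPUs
GPU_FLAGS="$GPU_FLAGS --flash-attn"             # Flash Attention (version-dependent syntax)

# Print the resulting command line instead of running it.
echo "llama-server -m $MODEL_PATH $GPU_FLAGS"
```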
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user", "content": "Erkläre Quantencomputing in 3 Sätzen." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```
### Health Check
```bash
curl -fs http://localhost:8000/
```
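Loading a model of this size can take a while, so a single health check right after startup may fail. A minimal polling sketch built on the same `curl -fs` check is shown below. The function name `wait_for_server` and the `BASE_URL` and `CHECK_CMD` variables are hypothetical, and `CHECK_CMD` is overridable only so the loop can be exercised without a running server.

```bash
# Poll the server until it answers or the retry budget runs out.
BASE_URL="${BASE_URL:-http://localhost:8000}"
CHECK_CMD="${CHECK_CMD:-curl -fs -o /dev/null $BASE_URL/}"

wait_for_server() {
  max_attempts="${1:-30}"   # number of polls before giving up
  delay="${2:-2}"           # seconds to sleep between polls
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # CHECK_CMD is intentionally unquoted so it splits into command and arguments.
    if $CHECK_CMD; then
      echo "server ready after $attempt attempt(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "server not ready after $max_attempts attempts" >&2
  return 1
}
```

Calling `wait_for_server 60 5` would poll for up to five minutes. The non-zero return code on timeout lets a calling script abort early.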
## Integration with Pi
### Files
Pi reads files using the `read` tool and sends their content as prompt text. To load context automatically, use `AGENTS.md` or project-specific context files.
### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt
### Tools
Pi provides built-in tools (`read`, `write`, `edit`, `bash`) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls go through the OpenAI-compatible function-calling API, which the llama.cpp server enables via the `--jinja` flag.
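To illustrate what a function-calling request looks like on the wire, the sketch below builds a chat payload that advertises one tool in the OpenAI `tools` format. The tool `get_time` and its schema are hypothetical and are not one of Pi's built-in tools. The model name and endpoint are taken from the chat completion example in this README.

```bash
# Build a chat request that advertises one tool in OpenAI function-calling format.
# The tool "get_time" is a made-up example used only for illustration.
build_tool_request() {
  cat <<'EOF'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "What time is it in Berlin?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_time",
        "description": "Return the current time for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ]
}
EOF
}

# Send it to the local server (commented out; requires a running server):
# build_tool_request | curl -s "http://localhost:8000/v1/chat/completions" \
#   -H "Content-Type: application/json" -d @-
```

If the model decides to call the tool, the response carries a `tool_calls` array in `choices[0].message` instead of plain text content. The client is expected to run the tool and send the result back in a follow-up message with role `tool`.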
### MCP Servers
Add to `settings.json`:
```json
"packages": [
"npm:pi-llama-cpp",
"npm:@modelcontextprotocol/server-filesystem",
"npm:irgendein-mcp-server"
]
```
## Troubleshooting
### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify that the model file exists in `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`