Qwen3.6-MoE-35B-A3B Local Inference Server
Local deployment of the Carnice-Qwen3.6-MoE-35B-A3B model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).
Overview
This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:
| Configuration | Description |
|---|---|
| docker-compose_Qwen3.6_Tools_RAG_faehig.yml | RAG-optimized with long context support (default) |
| docker-compose_Qwen3.6_Tools_coding.yml | Coding-focused with tuned sampling parameters |
| docker-compose_Qwen3.6_Qwopus3.6_coding.yml | Qwopus3.6 coding variant with multimodal support |
| docker-compose_Qwen3.6_Uncensored.yml | Uncensored variant for unrestricted use |
| docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml | Uncensored + RAG support |
Model: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
Qwopus Model: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)
Uncensored Model: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
Image: ghcr.io/ggml-org/llama.cpp:server-cuda
API Endpoint: http://localhost:8000/v1/chat/completions
Architecture
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤ Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘ (API request with system prompt + tools)
The Docker containers serve as pure inference backends. File management, prompts, tools, and MCP servers are all handled on the Pi side.
Quick Start
Using Docker Compose (Recommended)
# Start RAG-optimized server (default)
docker compose up -d --force-recreate
# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
# Start Qwopus3.6 coding variant (multimodal)
docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate
# Stop and remove container
docker compose rm -s -f qwen35b
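To avoid typing the long compose file names, the mapping from the configuration table in the Overview can be wrapped in a small helper. This is a minimal sketch: the function names `compose_file_for` and `start_command` and the short mode names are hypothetical and not part of this project. The helper only prints the command so it can be reviewed before running.

```shell
# Hypothetical helper: map a short mode name to a compose file from the Overview table.
compose_file_for() {
  case "$1" in
    rag)            echo "docker-compose_Qwen3.6_Tools_RAG_faehig.yml" ;;
    coding)         echo "docker-compose_Qwen3.6_Tools_coding.yml" ;;
    qwopus)         echo "docker-compose_Qwen3.6_Qwopus3.6_coding.yml" ;;
    uncensored)     echo "docker-compose_Qwen3.6_Uncensored.yml" ;;
    uncensored-rag) echo "docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the start command for a mode.
start_command() {
  file="$(compose_file_for "$1")" || return 1
  echo "docker compose -f $file up -d --force-recreate"
}
```

For example, `start_command coding` prints the same command as the coding-optimized entry above; pipe the output to `sh` to execute it.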
Using Shell Scripts
Server Mode Scripts
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh
# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh
# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
CLI Mode Scripts
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh
# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
Embedding Server
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
Note: The Qwopus3.6 variant requires Docker Compose for startup due to multimodal support (mmproj file). Container name: qwopus35b-moe-coding.
Note: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use docker rm -f <container_name> to manually stop servers.
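The stop-then-start behavior described in the note can be sketched as follows. This is an illustration of the pattern, not the content of the actual scripts: `restart_container` and the `DOCKER` override are hypothetical names, and the real scripts pass many more `docker run` options.

```shell
# Hypothetical helper: remove any container with the given name, then start a new one.
# Set DOCKER=echo for a dry run that only prints the "run" command.
restart_container() {
  name="$1"
  shift
  # Ignore the error that docker reports when no such container exists.
  "${DOCKER:-docker}" rm -f "$name" >/dev/null 2>&1 || true
  "${DOCKER:-docker}" run -d --name "$name" "$@"
}
```

Everything after the container name is passed through to `docker run` unchanged, so GPU flags, volume mounts, and the image name go there.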
Configuration Details
Hardware Requirements
- GPU: NVIDIA RTX 3090 (2x) or equivalent with 24GB+ VRAM each
- RAM: 64GB+ system RAM recommended
- Storage: 100GB+ for model files and cache
GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (-ngl 999)
- Flash Attention enabled for optimized memory access
- Qwopus3.6: Uses 4 parallel slots (~2.5 GB KV-Cache per slot)
Context & Performance
- Context window: 262,144 tokens (256k)
- Max output: 16,384 tokens
- Parallel slots: 2 in the standard configurations (saves ~10 GB of KV cache compared with 4); Qwopus3.6 uses 4
- Batch size: 2,048 for long context processing
- Micro-batch size: 512 (standard); Qwopus3.6 uses 1024 for SSM layer efficiency
- KV cache: q8_0 quantization for speed/quality balance
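The GPU and context settings listed above roughly correspond to the following llama-server arguments. This is a sketch under assumptions: the compose files in this project may spell the options differently, `server_args` is a hypothetical helper, and the model path is a placeholder. Recent llama.cpp builds take `--flash-attn on`, while older builds accept a bare `-fa`.

```shell
# Hypothetical helper: print llama-server arguments matching the documented settings.
# Arguments: MODEL_PATH (placeholder default), PARALLEL slots, UBATCH size.
server_args() {
  model_path="${1:-/models/model.gguf}"  # placeholder, not a path from this project
  parallel="${2:-2}"                     # 2 slots standard, 4 for Qwopus3.6
  ubatch="${3:-512}"                     # 512 standard, 1024 for Qwopus3.6
  echo "-m $model_path" \
       "-ngl 999 --main-gpu 0 --tensor-split 0.5,0.5" \
       "--flash-attn on" \
       "-c 262144 -n 16384" \
       "--parallel $parallel -b 2048 -ub $ubatch" \
       "--cache-type-k q8_0 --cache-type-v q8_0" \
       "--jinja"
}
```

Calling `server_args /models/x.gguf 4 1024` prints the Qwopus3.6 variant of the argument list on a single line.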
Sampling Parameters
| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
|---|---|---|---|
| Temperature | 0.2 | 0.3 | 0.3 |
| Top-p | 0.95 | 0.95 | 0.95 |
| Top-k | 40 | 40 | 40 |
| Min-p | 0.01 | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 | 1.05 |
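As a sketch, the table can be expressed as llama-server sampling flags, where only the temperature differs between modes. The helper name `sampling_flags` and the mode names are hypothetical; the same values can alternatively be sent per request in the API payload.

```shell
# Hypothetical helper: print the sampling flags for a mode from the table above.
sampling_flags() {
  case "$1" in
    rag)           temp="0.2" ;;
    coding|qwopus) temp="0.3" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  echo "--temp $temp --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05"
}
```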
Qwopus3.6 Specifics
- Multimodal support: Requires mmproj file (see docker-compose for configuration)
- Parallel slots: 4 (vs 2 in standard); at ~2.5 GB of KV cache per slot, 4 slots are feasible
- Micro-batch size: 1024 (vs 512); the SSM layers process larger micro-batches more efficiently
- Container name: qwopus35b-moe-coding (avoids conflict with standard coding container)
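The KV cache budget for Qwopus3.6 follows directly from the slot count and the per-slot estimate. A minimal sketch of that arithmetic, using a hypothetical helper `kv_cache_total_gb`:

```shell
# Total KV cache in GB = parallel slots x GB per slot.
kv_cache_total_gb() {
  awk -v slots="$1" -v per_slot="$2" 'BEGIN { printf "%.1f\n", slots * per_slot }'
}

kv_cache_total_gb 4 2.5   # prints 10.0
```

With 4 slots at ~2.5 GB each, roughly 10 GB of VRAM across both GPUs goes to the KV cache, in addition to the model weights.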
API Usage
Chat Completions
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6-35b-a3b-moe",
"messages": [
{ "role": "system", "content": "You are a helpful German assistant." },
{ "role": "user", "content": "Explain quantum computing in 3 sentences." }
],
"max_tokens": 1024,
"temperature": 0.2,
"stream": false
}'
Health Check
curl -fs http://localhost:8000/
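Loading a 35B model can take a while, so a single health check right after startup may fail. The loop below is a sketch that polls until the server answers; `probe` and `wait_for_server` are hypothetical helper names, and the attempt count and delay are arbitrary defaults. llama.cpp's server also exposes a dedicated `/health` endpoint that can be used instead of `/`.

```shell
# probe URL: succeeds when the URL answers with a 2xx status
# (-f makes curl return a non-zero exit code on HTTP errors).
probe() {
  curl -fs -o /dev/null --max-time 5 "$1"
}

# wait_for_server URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
  url="$1"
  attempts="${2:-60}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after $attempts attempts" >&2
  return 1
}

# Example: wait_for_server http://localhost:8000/health 60 5
```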
Integration with Pi
Files
Pi reads files using the read tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.
Prompts
Configure via:
- ~/.pi/agent/SYSTEM.md: replaces the complete system prompt
- ~/.pi/agent/APPEND_SYSTEM.md: appended to the end of the system prompt
Tools
Built-in tools (read, write, edit, bash) plus custom extensions in ~/.pi/agent/extensions/. Tool calls go through the OpenAI-compatible function-calling API, which llama.cpp enables with the --jinja flag.
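To see what a function-calling request looks like on the wire, the sketch below prints an OpenAI-style payload that offers a single tool to the model. The tool name `read_file` and its schema are hypothetical and only illustrate the format; Pi generates the real tool definitions itself.

```shell
# Print an OpenAI-style chat request that offers one tool to the model.
# "read_file" and its schema are hypothetical, for illustration only.
tool_call_payload() {
  cat <<'EOF'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "Show me the first lines of README.md" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file and return its content",
        "parameters": {
          "type": "object",
          "properties": {
            "path": { "type": "string", "description": "Path of the file to read" }
          },
          "required": ["path"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
EOF
}

# Send it to the local server (requires a running container):
# tool_call_payload | curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" -d @-
```

If the model decides to use the tool, the response message should carry a `tool_calls` entry instead of plain text content.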
MCP Servers
Add to settings.json:
"packages": [
"npm:pi-llama-cpp",
"npm:@modelcontextprotocol/server-filesystem",
"npm:some-other-mcp-server"
]
Troubleshooting
Server Not Responding
- Check GPU availability: nvidia-smi
- Verify model file exists: /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/
- Check container logs: docker logs qwen35b-moe-rag-longctx
GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight
Connection Refused
- Ensure port 8000 is not in use: lsof -i :8000
- Check firewall settings
- Verify container is running: docker ps | grep qwen35b
Maintenance
Update Model
- Download the new GGUF file to the HF_HOME path
- Update the -m parameter in docker-compose.yml or the shell script
- Restart the container
Backup Configuration
# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
cp -r ~/.pi/agent/extensions/ ~/backup/
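Repeated backups into the same folder overwrite each other. A timestamped variant could look like the sketch below; `backup_pi_config` is a hypothetical helper, and the `pi-agent-<timestamp>` folder naming is an assumption rather than a project convention.

```shell
# backup_pi_config [SOURCE_DIR] [BACKUP_ROOT]
# Copies SYSTEM.md, APPEND_SYSTEM.md and extensions/ into a timestamped folder
# and prints the folder path.
backup_pi_config() {
  src="${1:-$HOME/.pi/agent}"
  dest="${2:-$HOME/backup}/pi-agent-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  for item in SYSTEM.md APPEND_SYSTEM.md extensions; do
    # Skip items that do not exist so a partial setup still backs up cleanly.
    if [ -e "$src/$item" ]; then
      cp -r "$src/$item" "$dest/" || return 1
    fi
  done
  echo "$dest"
}
```

Running `backup_pi_config` without arguments uses the same source and target directories as the commands above.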
License
This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.