# Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

## Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` | Qwopus3.6 coding variant with multimodal support |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

- **Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
- **Qwopus Model**: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)
- **Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
- **Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
- **API Endpoint**: http://localhost:8000/v1/chat/completions

## Architecture

```
MCP-Server  ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘       (API request with system prompt + tools)
```

The Docker containers serve as pure inference backends. All file management, prompts, tools, and MCP servers are handled at the Pi level.

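
To make the "API request with system prompt + tools" arrow concrete, the following is a minimal sketch of the kind of request a client such as Pi might send to the backend. It is an illustration under assumptions, not Pi's actual payload: the helper names `build_tool_request` and `send_tool_request`, the `read_file` tool, and its schema are all hypothetical, and the model name is taken from the API Usage section. Whether the server acts on the `tools` field depends on its configuration (this README mentions the `--jinja` flag).

```shell
#!/usr/bin/env bash
# Sketch only: the tool "read_file" and its schema are hypothetical placeholders.

API_URL="${API_URL:-http://localhost:8000/v1/chat/completions}"

# Print an OpenAI-style chat request that carries a system prompt and one tool.
build_tool_request() {
  cat <<'JSON'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "user", "content": "Show me the contents of README.md."}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "Path of the file to read."}
          },
          "required": ["path"]
        }
      }
    }
  ],
  "max_tokens": 1024,
  "stream": false
}
JSON
}

# POST the request to the server. Nothing is sent until this function is called.
send_tool_request() {
  build_tool_request | curl -fsS -X POST "$API_URL" \
    -H "Content-Type: application/json" \
    --data-binary @-
}
```

With a server running, calling `send_tool_request` posts the payload. If the model decides to use the tool, an OpenAI-compatible response would typically carry the call under `choices[0].message.tool_calls`, and the client is responsible for executing it and sending the result back.
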
## Quick Start

### Using Docker Compose (Recommended)

```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Start Qwopus3.6 coding variant (multimodal)
docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b
```

### Using Shell Scripts

#### Server Mode Scripts

```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```

#### CLI Mode Scripts

```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```

#### Embedding Server

```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: The Qwopus3.6 variant must be started with Docker Compose because of its multimodal support (mmproj file). Container name: `qwopus35b-moe-coding`.

**Note**: All shell scripts automatically stop any existing container with the same name before starting a new one. Use `docker rm -f <container-name>` to stop a server manually.

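
Loading a model of this size can take a while after the container starts, so it may help to wait until the server answers before sending requests. The sketch below polls the URL used in the Health Check section. `wait_for_server` is a hypothetical helper, not one of this project's scripts, and the 300-second default timeout is an assumption to adjust for your hardware.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_server is a hypothetical helper, not part of this project.

# Poll a URL once per second until it responds successfully or the timeout expires.
wait_for_server() {
  local url="${1:-http://localhost:8000/}"
  local timeout="${2:-300}"   # seconds to wait before giving up (assumed default)
  local waited=0

  until curl -fs -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Server at $url not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "Server at $url is ready"
}
```

A possible use is `docker compose up -d --force-recreate && wait_for_server`, which returns a non-zero status if the server never becomes reachable, so it can gate later steps in a script.
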
## Configuration Details

### Hardware Requirements

- **GPU**: 2x NVIDIA RTX 3090 or equivalent, with 24 GB+ VRAM each
- **RAM**: 64 GB+ system RAM recommended
- **Storage**: 100 GB+ for model files and cache

### GPU Setup

- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access
- **Qwopus3.6**: Uses 4 parallel slots (~2.5 GB KV cache per slot)

### Context & Performance

- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 by default (saves ~10 GB of KV cache compared with 4); Qwopus3.6 uses 4
- **Batch size**: 2,048 for long-context processing
- **Micro-batch size**: 512 by default; Qwopus3.6 uses 1,024 for SSM-layer efficiency
- **KV cache**: q8_0 quantization as a balance between speed and quality

### Sampling Parameters

| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
|-----------|----------|-------------|-----------|
| Temperature | 0.2 | 0.3 | 0.3 |
| Top-p | 0.95 | 0.95 | 0.95 |
| Top-k | 40 | 40 | 40 |
| Min-p | 0.01 | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 | 1.05 |

### Qwopus3.6 Specifics

- **Multimodal support**: Requires an mmproj file (see the Docker Compose file for configuration)
- **Parallel slots**: 4 (vs. 2 in standard); at ~2.5 GB of KV cache per slot, 4 slots are feasible
- **Micro-batch size**: 1,024 (vs. 512); the SSM layers process larger micro-batches more efficiently
- **Container name**: `qwopus35b-moe-coding` (avoids a conflict with the standard coding container)

## API Usage

### Chat Completions

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      {
        "role": "system",
        "content": "Du bist ein hilfreicher deutscher Assistent."
      },
      {
        "role": "user",
        "content": "Erkläre Quantencomputing in 3 Sätzen."
      }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```

### Health Check

```bash
curl -fs http://localhost:8000/
```

## Integration with Pi

### Files

Pi reads files using the `read` tool and sends their content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

### Prompts

Configure via:

- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt

### Tools

Pi provides built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. The model uses the OpenAI function-calling API, enabled on the server via the `--jinja` flag.

### MCP Servers

Add to `settings.json`:

```json
"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:irgendein-mcp-server"
]
```

The last entry, `npm:irgendein-mcp-server` ("any MCP server"), is a placeholder for the MCP server package you want to add.

## Troubleshooting

### Server Not Responding

1. Check GPU availability: `nvidia-smi`
2. Verify that the model file exists in `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Issues

- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight

### Connection Refused

- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify that the container is running: `docker ps | grep qwen35b`

## Maintenance

### Update Model

1. Download the new GGUF file to the HF_HOME path
2. Update the `-m` parameter in docker-compose.yml or the shell script
3. Restart the container

### Backup Configuration

```bash
# Create the backup directory first so cp copies into it
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
cp -r ~/.pi/agent/extensions/ ~/backup/
```

## License

This project uses llama.cpp (MIT license) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.