Initial commit: Qwen3.6-MoE-35B-A3B server configuration and documentation

2026-05-11 15:01:09 +02:00 · 2026-05-11 15:01:09 +02:00 · b039061615
commit b039061615
16 changed files with 1672 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,188 @@
+# Qwen3.6-MoE-35B-A3B Local Inference Server
+
+Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).
+
+## Overview
+
+This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:
+
+| Configuration | Description |
+|---------------|-------------|
+| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
+| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
+| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
+| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |
+
+**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)  
+**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf  
+**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda  
+**API Endpoint**: http://localhost:8000/v1/chat/completions
+
+## Architecture
+
+```
+MCP servers ←──┐
+ Extensions ←──┤
+  AGENTS.md ←──┤  Pi  ──→  llama-cpp Docker  ──→  GPU
+      Files ←──┘  (API request with system prompt + tools)
+```
+
+The Docker containers serve as pure inference backends. All file management, prompts, tools, and MCP servers are handled at the Pi level.
+
+## Quick Start
+
+### Using Docker Compose (Recommended)
+
+```bash
+# Start RAG-optimized server (default)
+docker compose up -d --force-recreate
+
+# Start coding-optimized server
+docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
+
+# Stop and remove container
+docker compose rm -s -f qwen35b
+```
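+
+The compose files from the Overview table can also be selected by a short mode name. The following is a minimal sketch, not part of the repository: the mode names (`rag`, `coding`, `uncensored`, `uncensored-rag`) and both function names are hypothetical, and the RAG mode is mapped to its explicit file name instead of relying on the default compose file.
+
+```bash
+# Map a hypothetical mode name to a compose file from the Overview table.
+compose_file_for_mode() {
+  case "$1" in
+    rag)            echo "docker-compose_Qwen3.6_Tools_RAG_faehig.yml" ;;
+    coding)         echo "docker-compose_Qwen3.6_Tools_coding.yml" ;;
+    uncensored)     echo "docker-compose_Qwen3.6_Uncensored.yml" ;;
+    uncensored-rag) echo "docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml" ;;
+    *)              echo "unknown mode: $1" >&2; return 1 ;;
+  esac
+}
+
+# Print (rather than run) the start command so it can be reviewed first.
+start_command() {
+  _file="$(compose_file_for_mode "$1")" || return 1
+  echo "docker compose -f ${_file} up -d --force-recreate"
+}
+```
+
+Running `start_command coding` prints the same command shown above for the coding-optimized server.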
+
+### Using Shell Scripts
+
+#### Server Mode Scripts
+```bash
+# Start tools server (coding-optimized)
+./run_qwen35b_server_tools.sh
+
+# Start RAG-optimized server (uncensored)
+./run_qwen35b_server_uncensored_rag_longctx.sh
+
+# Start uncensored server (no RAG)
+./run_qwen35b_server_uncensored.sh
+```
+
+#### CLI Mode Scripts
+```bash
+# Start CLI mode for RAG
+./run_qwen35b_cli_tools_rag_longctx.sh
+
+# Start CLI mode for uncensored RAG
+./run_qwen35b_cli_uncensored_rag_longctx.sh
+```
+
+#### Embedding Server
+```bash
+# Start BGE-M3 embedding server
+./run_bge_m3_embedding_server.sh
+```
+
+**Note**: All shell scripts automatically stop any existing container with the same name before starting a new one. Use `docker rm -f <container_name>` to stop and remove a server manually.
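+
+The scripts themselves are not reproduced in this README. As a rough sketch of the stop-before-start pattern the note describes, the helpers below remove a container by name and then start a replacement. The function names and the `DOCKER` override (useful for a dry run with `DOCKER=echo`) are assumptions made for illustration.
+
+```bash
+# Docker binary; override with DOCKER=echo to preview commands (hypothetical convenience).
+DOCKER="${DOCKER:-docker}"
+
+# Remove a container by name; ignore the error if it does not exist.
+remove_container() {
+  "$DOCKER" rm -f "$1" >/dev/null 2>&1 || true
+}
+
+# Remove any old container with this name, then start a new one.
+# Remaining arguments are passed to "docker run" unchanged.
+restart_container() {
+  _name="$1"; shift
+  remove_container "$_name"
+  "$DOCKER" run -d --name "$_name" "$@"
+}
+```
+
+For example, `restart_container qwen35b-moe-rag-longctx --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda` would replace the container named in the Troubleshooting section. The server arguments are omitted here.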
+
+## Configuration Details
+
+### Hardware Requirements
+- **GPU**: NVIDIA RTX 3090 (2x) or equivalent with 24GB+ VRAM each
+- **RAM**: 64GB+ system RAM recommended
+- **Storage**: 100GB+ for model files and cache
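+
+The GPU requirement can be checked before starting a container. The sketch below reads `nvidia-smi` CSV output from stdin and succeeds when at least two GPUs report enough memory. The function name and the 24000 MiB default threshold are assumptions (an RTX 3090 typically reports about 24576 MiB).
+
+```bash
+# Reads "name, memory_in_MiB" lines on stdin, as printed by
+#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
+# Succeeds if at least two GPUs report >= the given MiB (default 24000).
+has_two_large_gpus() {
+  _min_mib="${1:-24000}"
+  awk -F', *' -v min="$_min_mib" \
+    '$2 + 0 >= min { n++ } END { exit (n >= 2 ? 0 : 1) }'
+}
+```
+
+Usage: `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits | has_two_large_gpus || echo "GPU requirement not met"`.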
+
+### GPU Setup
+- Primary GPU: device 0 (first 3090)
+- Tensor split: 0.5,0.5 (symmetric across both GPUs)
+- All layers offloaded to GPU (`-ngl 999`)
+- Flash Attention enabled for optimized memory access
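+
+As a sketch, the settings above correspond to the following `llama-server` arguments. Only `-ngl 999` is quoted from this README; the other flags are the standard llama.cpp options for these settings. Recent builds expect a value after `--flash-attn` (`on`, `off`, or `auto`), while older builds take the flag without a value, so check `--help` for the image in use.
+
+```bash
+# Print the GPU-related llama-server arguments for the setup described above.
+gpu_args() {
+  # --main-gpu 0        : primary GPU is device 0
+  # --tensor-split      : split tensors evenly across both GPUs
+  # -ngl 999            : offload all layers to the GPUs
+  # --flash-attn on     : enable Flash Attention (value form is an assumption)
+  echo "--main-gpu 0 --tensor-split 0.5,0.5 -ngl 999 --flash-attn on"
+}
+```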
+
+### Context & Performance
+- **Context window**: 262,144 tokens (256k)
+- **Max output**: 16,384 tokens
+- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
+- **Batch size**: 2,048 for long context processing
+- **KV cache**: q8_0 quantization for speed/quality balance
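+
+The context numbers are related by simple arithmetic: 256 × 1024 = 262,144 tokens in total. On llama.cpp builds that divide the context evenly across parallel slots, two slots get 131,072 tokens each; builds that share one unified KV cache may behave differently, so treat the per-slot figure as an assumption. The flag mapping is likewise a sketch. In particular, setting the output limit with `-n` is an assumption about how the compose files are written.
+
+```bash
+CTX_TOTAL=262144   # 256 * 1024 tokens, from the list above
+SLOTS=2            # parallel slots
+
+# Tokens per slot if the window is divided evenly (assumption, see text).
+ctx_per_slot() {
+  echo $(( $1 / $2 ))
+}
+
+# Context-related llama-server arguments matching the list above.
+ctx_args() {
+  echo "-c ${CTX_TOTAL} -n 16384 -np ${SLOTS} -b 2048 -ctk q8_0 -ctv q8_0"
+}
+```
+
+`ctx_per_slot "$CTX_TOTAL" "$SLOTS"` prints `131072`; with four slots the same window would leave 65,536 tokens per slot.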
+
+### Sampling Parameters
+| Parameter | RAG Mode | Coding Mode |
+|-----------|----------|-------------|
+| Temperature | 0.2 | 0.3 |
+| Top-p | 0.95 | 0.95 |
+| Top-k | 40 | 40 |
+| Min-p | 0.01 | 0.01 |
+| Repeat penalty | 1.05 | 1.05 |
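+
+The two columns differ only in temperature. A minimal sketch of how the table could be turned into `llama-server` sampling arguments, with hypothetical mode names `rag` and `coding`:
+
+```bash
+# Print the sampling arguments for a mode from the table above.
+sampling_args() {
+  case "$1" in
+    rag)    _temp=0.2 ;;
+    coding) _temp=0.3 ;;
+    *)      echo "unknown mode: $1" >&2; return 1 ;;
+  esac
+  echo "--temp ${_temp} --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05"
+}
+```
+
+These values act as server-side defaults. A request can still override them, as the `temperature` field in the API example does.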
+
+## API Usage
+
+### Chat Completions
+
+```bash
+curl -X POST "http://localhost:8000/v1/chat/completions" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "qwen3.6-35b-a3b-moe",
+    "messages": [
+      { "role": "system", "content": "You are a helpful German-speaking assistant." },
+      { "role": "user",   "content": "Explain quantum computing in 3 sentences." }
+    ],
+    "max_tokens": 1024,
+    "temperature": 0.2,
+    "stream": false
+  }'
+```
+
+### Health Check
+
+```bash
+curl -fs http://localhost:8000/
+```
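+
+Loading a 35B model can take a while, so a single check right after startup may fail. The sketch below repeats the check until the server answers or the attempts run out. The function name and the defaults (30 attempts, 2 seconds apart) are assumptions. llama.cpp's server also exposes a `/health` endpoint that can be passed as the URL.
+
+```bash
+# Poll a URL until curl gets a non-error response or the attempts run out.
+# Arguments: URL, number of attempts, delay in seconds between attempts.
+wait_for_server() {
+  _url="${1:-http://localhost:8000/}"
+  _attempts="${2:-30}"
+  _delay="${3:-2}"
+  _i=1
+  while [ "$_i" -le "$_attempts" ]; do
+    if curl -fs -o /dev/null --max-time 5 "$_url"; then
+      echo "ready after ${_i} attempt(s)"
+      return 0
+    fi
+    sleep "$_delay"
+    _i=$((_i + 1))
+  done
+  echo "server did not respond at ${_url}" >&2
+  return 1
+}
+```
+
+Usage: `docker compose up -d --force-recreate && wait_for_server http://localhost:8000/health`.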
+
+## Integration with Pi
+
+### Files
+Pi reads files using the `read` tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.
+
+### Prompts
+Configure via:
+- `~/.pi/agent/SYSTEM.md` — replaces complete system prompt
+- `~/.pi/agent/APPEND_SYSTEM.md` — appended to end of system prompt
+
+### Tools
+Built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls use the OpenAI-compatible function-calling format, which the server enables through the `--jinja` flag.
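+
+Pi builds tool requests itself, but the format can also be tried directly against the server. The sketch below prints a chat request that offers one tool. The tool `get_weather` and its `city` parameter are hypothetical and exist only for illustration; the overall shape follows the OpenAI chat completions `tools` schema.
+
+```bash
+# Print a chat request that offers one (hypothetical) tool to the model.
+tool_call_payload() {
+  cat <<'JSON'
+{
+  "model": "qwen3.6-35b-a3b-moe",
+  "messages": [
+    { "role": "user", "content": "What is the weather in Berlin?" }
+  ],
+  "tools": [
+    {
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Look up the current weather for a city",
+        "parameters": {
+          "type": "object",
+          "properties": { "city": { "type": "string" } },
+          "required": ["city"]
+        }
+      }
+    }
+  ]
+}
+JSON
+}
+```
+
+Usage: `tool_call_payload | curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`. If the model decides to call the tool, the response should carry a `tool_calls` entry in the assistant message instead of plain text.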
+
+### MCP Servers
+Add to `settings.json`:
+```json
+"packages": [
+  "npm:pi-llama-cpp",
+  "npm:@modelcontextprotocol/server-filesystem",
+  "npm:some-mcp-server"
+]
+```
+
+## Troubleshooting
+
+### Server Not Responding
+1. Check GPU availability: `nvidia-smi`
+2. Verify model file exists: `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
+3. Check container logs: `docker logs qwen35b-moe-rag-longctx`
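+
+Step 2 can be scripted. Every GGUF file starts with the four ASCII bytes `GGUF`, so a quick check can catch a missing file as well as a download that was replaced by an error page. The function name is hypothetical.
+
+```bash
+# Succeed only if the path is a regular file that starts with the GGUF magic bytes.
+model_file_ok() {
+  [ -f "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
+}
+```
+
+Usage, assuming the standard model file sits directly in the directory from step 2: `model_file_ok /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf || echo "model file missing or invalid"`.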
+
+### GPU Memory Issues
+- Reduce parallel slots from 2 to 1
+- Lower batch size from 2048 to 1024
+- Use uncensored variant if VRAM is tight
+
+### Connection Refused
+- Ensure port 8000 is not in use: `lsof -i :8000`
+- Check firewall settings
+- Verify container is running: `docker ps | grep qwen35b`
+
+## Maintenance
+
+### Update Model
+1. Download new GGUF file to HF_HOME path
+2. Update the `-m` parameter in the compose file or shell script to point to the new file
+3. Restart container
+
+### Backup Configuration
+```bash
+mkdir -p ~/backup
+cp ~/.pi/agent/SYSTEM.md ~/backup/
+cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
+cp -r ~/.pi/agent/extensions ~/backup/
+```
+
+## License
+
+This project uses llama.cpp (MIT license) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.