Qwen3.6-MoE-35B-A3B server configuration and documentation

Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the Carnice-Qwen3.6-MoE-35B-A3B model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
| --- | --- |
| docker-compose_Qwen3.6_Tools_RAG_faehig.yml | RAG-optimized with long context support (default) |
| docker-compose_Qwen3.6_Tools_coding.yml | Coding-focused with tuned sampling parameters |
| docker-compose_Qwen3.6_Qwopus3.6_coding.yml | Qwopus3.6 coding variant with multimodal support |
| docker-compose_Qwen3.6_Uncensored.yml | Uncensored variant for unrestricted use |
| docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml | Uncensored + RAG support |

Model: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
Qwopus Model: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)
Uncensored Model: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
Image: ghcr.io/ggml-org/llama.cpp:server-cuda
API Endpoint: http://localhost:8000/v1/chat/completions

Architecture

MCP servers ←──┐
 Extensions ←──┤
  AGENTS.md ←──┤  Pi  ──→  llama-cpp Docker  ──→  GPU
      Files ←──┘  (API request with system prompt + tools)

The Docker containers serve as pure inference backends. File management, prompts, tools, and MCP servers are all handled on the Pi side.

Quick Start

# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Start Qwopus3.6 coding variant (multimodal)
docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b

Using Shell Scripts

Server Mode Scripts

# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh

CLI Mode Scripts

# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh

Embedding Server

# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh

Note: The Qwopus3.6 variant requires Docker Compose for startup due to multimodal support (mmproj file). Container name: qwopus35b-moe-coding.

Note: All shell scripts automatically stop any existing container with the same name before starting a new one. Use docker rm -f <container_name> to stop and remove a server manually.
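
The stop-then-start behaviour described in the note can be sketched as a small helper. This is not copied from the repository's scripts: the function name replace_container, the DOCKER override, and the port mapping, mount path, and server flags in the commented example are assumptions for illustration.

```shell
# Hypothetical sketch of the "remove the old container, then start a new one" pattern.
# DOCKER can be overridden (for example DOCKER=echo) to get a dry run.
DOCKER="${DOCKER:-docker}"

replace_container() {
  name="$1"; shift
  # Force-remove a container with the same name; ignore "no such container".
  "$DOCKER" rm -f "$name" >/dev/null 2>&1 || true
  # Start the replacement; all remaining arguments are passed to "docker run".
  "$DOCKER" run -d --name "$name" "$@"
}

# Example call (placeholder paths and port mapping, not the repository's values):
# replace_container qwen35b-moe-rag-longctx --gpus all -p 8000:8080 \
#   -v /path/to/models:/models ghcr.io/ggml-org/llama.cpp:server-cuda \
#   -m /models/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf --port 8080
```

Setting DOCKER=echo before calling the function prints the docker run command instead of executing it, which is a cheap way to review the arguments.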

Configuration Details

Hardware Requirements

  • GPU: 2x NVIDIA RTX 3090, or equivalent GPUs with 24 GB+ VRAM each
  • RAM: 64 GB+ system RAM recommended
  • Storage: 100 GB+ for model files and cache

GPU Setup

  • Primary GPU: device 0 (first 3090)
  • Tensor split: 0.5,0.5 (symmetric across both GPUs)
  • All layers offloaded to GPU (-ngl 999)
  • Flash Attention enabled for optimized memory access
  • Qwopus3.6: uses 4 parallel slots (~2.5 GB of KV cache per slot)
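
As a rough sketch, the list above corresponds to llama-server flags like the following. The function name gpu_flags is hypothetical and the authoritative values are the ones in the compose files and scripts. Flash Attention is left as a comment because its flag syntax differs between llama.cpp builds.

```shell
# Hypothetical helper: prints the llama-server GPU flags implied by the list above.
gpu_flags() {
  printf '%s\n' \
    "--main-gpu 0" \
    "--tensor-split 0.5,0.5" \
    "-ngl 999"
  # Flash Attention: older builds accept a bare --flash-attn, newer ones expect
  # a value such as "--flash-attn on"; check llama-server --help in your image.
}

gpu_flags
```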

Context & Performance

  • Context window: 262,144 tokens (256k)
  • Max output: 16,384 tokens
  • Parallel slots: 2 in the standard configurations (saves ~10 GB of KV cache compared with 4); Qwopus3.6 uses 4
  • Batch size: 2,048 for long context processing
  • Micro-batch size: 512 (standard); Qwopus3.6 uses 1024 for SSM layer efficiency
  • KV cache: q8_0 quantization for speed/quality balance
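
The interaction between context window and parallel slots can be worked out with simple arithmetic. The sketch below assumes that the 262,144-token window is the total passed to the server and that it is divided evenly across the slots; depending on the llama.cpp build and its KV cache settings this may not hold, so treat the numbers as an estimate. The function name ctx_per_slot is hypothetical.

```shell
# Assumption: total context is split evenly across parallel slots.
ctx_per_slot() {
  total_ctx="$1"; slots="$2"
  echo $(( total_ctx / slots ))
}

ctx_per_slot 262144 2   # standard configurations: 131072 tokens per slot
ctx_per_slot 262144 4   # 4 slots (as in Qwopus3.6): 65536 tokens per slot
```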

Sampling Parameters

| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
| --- | --- | --- | --- |
| Temperature | 0.2 | 0.3 | 0.3 |
| Top-p | 0.95 | 0.95 | 0.95 |
| Top-k | 40 | 40 | 40 |
| Min-p | 0.01 | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 | 1.05 |
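
Only the temperature differs between the modes. A minimal sketch of how the table maps to llama-server sampling flags, assuming the values are set server-side; the function name sampling_flags and the mode names are hypothetical:

```shell
# Hypothetical helper: map a mode name to the sampling flags from the table above.
sampling_flags() {
  case "$1" in
    rag)           temp=0.2 ;;
    coding|qwopus) temp=0.3 ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  echo "--temp $temp --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05"
}

sampling_flags rag
sampling_flags coding
```

A client can still override these per request, for example with the temperature field shown in the chat completions example.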

Qwopus3.6 Specifics

  • Multimodal support: Requires mmproj file (see docker-compose for configuration)
  • Parallel slots: 4 (vs 2 in standard); at ~2.5 GB of KV cache per slot, 4 slots are feasible
  • Micro-batch size: 1024 (vs 512); the SSM layers process larger micro-batches more efficiently
  • Container name: qwopus35b-moe-coding (avoids conflict with standard coding container)
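
The KV cache budget for the four slots follows directly from the per-slot figure. A small sketch of that arithmetic (the function name kv_cache_total_gb is hypothetical, and the ~2.5 GB per slot is the approximate value quoted above):

```shell
# Total KV cache in GB = parallel slots x KV cache per slot.
kv_cache_total_gb() {
  LC_ALL=C awk -v slots="$1" -v per_slot="$2" \
    'BEGIN { printf "%.1f\n", slots * per_slot }'
}

kv_cache_total_gb 4 2.5   # prints 10.0
```

Roughly 10 GB of the 48 GB available across the two 24 GB GPUs therefore goes to the KV cache; the rest has to hold the model weights, the mmproj file, and compute buffers.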

API Usage

Chat Completions

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user",   "content": "Erkläre Quantencomputing in 3 Sätzen." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'

Health Check

curl -fs http://localhost:8000/
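
A startup script may want to wait until the model has finished loading before sending requests. A minimal sketch, assuming the llama.cpp server's /health endpoint (the root URL used above can be passed instead); the function name wait_for_server and the default retry values are hypothetical:

```shell
# Poll the server until it answers with a success status, or give up.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_for_server() {
  url="${1:-http://localhost:8000/health}"
  tries="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model loads.
    if curl -fs --max-time 5 "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$(( i + 1 ))
    sleep "$delay"
  done
  return 1
}

# Example: wait_for_server && echo "server is ready"
```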

Integration with Pi

Files

Pi reads files using the read tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

Prompts

Configure via:

  • ~/.pi/agent/SYSTEM.md — replaces complete system prompt
  • ~/.pi/agent/APPEND_SYSTEM.md — appended to end of system prompt

Tools

Built-in tools (read, write, edit, bash) plus custom extensions in ~/.pi/agent/extensions/. Tool calls go through the OpenAI-compatible function-calling API, which the server enables via the --jinja flag.
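
Pi builds these requests itself, but it can help to see what a function-calling request looks like on the wire. The sketch below uses a made-up tool named get_time purely for illustration; the variable and function names are hypothetical, and only the endpoint and model name come from this document.

```shell
# Request body in the OpenAI function-calling format with one hypothetical tool.
TOOL_REQUEST='{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "What time is it in Berlin?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_time",
        "description": "Return the current time for a time zone",
        "parameters": {
          "type": "object",
          "properties": { "timezone": { "type": "string" } },
          "required": ["timezone"]
        }
      }
    }
  ]
}'

# Send the request to the local server (call this only when the server is running).
send_tool_request() {
  curl -s -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$TOOL_REQUEST"
}
```

If the model decides to call the tool, the reply should carry a tool_calls array inside choices[0].message instead of plain text content; executing the tool and returning its result is the client's job.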

MCP Servers

Add to settings.json:

"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:some-mcp-server"
]

Troubleshooting

Server Not Responding

  1. Check GPU availability: nvidia-smi
  2. Verify model file exists: /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/
  3. Check container logs: docker logs qwen35b-moe-rag-longctx
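
The three checks above can be bundled into one script. A sketch under the assumption that the GGUF files sit directly in the model directory; the function names has_gguf and diagnose are hypothetical:

```shell
# Return 0 if the directory contains at least one .gguf file.
has_gguf() {
  for f in "$1"/*.gguf; do
    [ -f "$f" ] && return 0
  done
  return 1
}

# Run all three checks; arguments: model directory, container name.
diagnose() {
  model_dir="$1"; container="$2"
  nvidia-smi || echo "nvidia-smi failed: GPU or driver not visible" >&2
  has_gguf "$model_dir" || echo "no .gguf file found in $model_dir" >&2
  docker logs --tail 50 "$container"
}

# Example:
# diagnose /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3 qwen35b-moe-rag-longctx
```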

GPU Memory Issues

  • Reduce parallel slots from 2 to 1
  • Lower batch size from 2048 to 1024
  • Use uncensored variant if VRAM is tight

Connection Refused

  • Ensure port 8000 is not in use: lsof -i :8000
  • Check firewall settings
  • Verify container is running: docker ps | grep qwen35b

Maintenance

Update Model

  1. Download the new GGUF file to the HF_HOME path
  2. Update the -m parameter in docker-compose.yml or in the shell script
  3. Restart the container
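
Step 2 can be scripted when the old filename appears literally in the compose file or script. The following is a sketch, not part of the repository: the function name swap_model is hypothetical, and it replaces every literal occurrence of the old filename while keeping a .bak copy.

```shell
# Replace a model filename in a compose file or script, keeping a backup.
# Arguments: file to edit, old filename, new filename.
swap_model() {
  compose_file="$1"; old="$2"; new="$3"
  [ -n "$old" ] || return 1
  cp "$compose_file" "$compose_file.bak" || return 1
  # Literal (non-regex) replacement, so dots in filenames are not wildcards.
  awk -v old="$old" -v new="$new" '{
    line = $0; out = ""
    while ((i = index(line, old)) > 0) {
      out = out substr(line, 1, i - 1) new
      line = substr(line, i + length(old))
    }
    print out line
  }' "$compose_file.bak" > "$compose_file"
}

# Example (the new filename is a placeholder):
# swap_model docker-compose.yml Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf new-model.gguf
# docker compose up -d --force-recreate
```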

Backup Configuration

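# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
# Copy the extensions directory recursively (options before operands)
cp -r ~/.pi/agent/extensions ~/backup/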

License

This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.