# Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

## Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` | Qwopus3.6 coding variant with multimodal support |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)  
**Qwopus Model**: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)  
**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf  
**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda  
**API Endpoint**: http://localhost:8000/v1/chat/completions

## Architecture

```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi  ──→  llama-cpp Docker  ──→  GPU
Files       ←──┘  (API request with system prompt + tools)
```

The Docker containers serve as pure inference backends. File management, prompts, tools, and MCP servers are all handled at the Pi level.

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Start Qwopus3.6 coding variant (multimodal)
docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b
```

### Using Shell Scripts

#### Server Mode Scripts
```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```

#### CLI Mode Scripts
```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```

#### Embedding Server
```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: The Qwopus3.6 variant requires Docker Compose for startup due to multimodal support (mmproj file). Container name: `qwopus35b-moe-coding`.

**Note**: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use `docker rm -f <container_name>` to manually stop servers.
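
The stop-before-start behaviour described in the note above could be implemented along the following lines. This is a minimal sketch, not a copy of the project's scripts; `remove_container` is a hypothetical helper name.

```bash
# Remove a container by name so a new one can reuse that name.
# Errors (for example "no such container") are deliberately ignored.
remove_container() {
  docker rm -f "$1" >/dev/null 2>&1 || true
  echo "cleared container name: $1"
}

# Usage (container name taken from the Troubleshooting section):
#   remove_container qwen35b-moe-rag-longctx
```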

## Configuration Details

### Hardware Requirements
- **GPU**: 2x NVIDIA RTX 3090, or equivalent GPUs with 24GB+ VRAM each
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache

### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access
- **Qwopus3.6**: Uses 4 parallel slots (~2.5 GB of KV cache per slot)

### Context & Performance
- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 in the standard configurations (saves ~10GB of KV cache compared with 4); Qwopus3.6 uses 4
- **Batch size**: 2,048 for long context processing
- **Micro-batch size**: 512 in the standard configurations; Qwopus3.6 uses 1024 for SSM layer efficiency
- **KV cache**: q8_0 quantization for speed/quality balance
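
As a rough sketch, the GPU and context settings above could be expressed as llama-server flags as follows. The flag names come from llama.cpp's server CLI, but whether the project's compose files and scripts use exactly these spellings is an assumption, and `server_flags` is a hypothetical helper. Flash Attention is left out because its flag syntax differs between llama.cpp versions.

```bash
# Print the documented settings as llama-server flags, one per line.
# $1 = parallel slots (default 2), $2 = micro-batch size (default 512).
# Assumption: the real compose files may order or spell these differently.
server_flags() {
  printf '%s\n' \
    "--main-gpu 0" \
    "--tensor-split 0.5,0.5" \
    "-ngl 999" \
    "--ctx-size 262144" \
    "--n-predict 16384" \
    "--parallel ${1:-2}" \
    "--batch-size 2048" \
    "--ubatch-size ${2:-512}" \
    "--cache-type-k q8_0" \
    "--cache-type-v q8_0"
}

server_flags          # standard: 2 slots, micro-batch 512
server_flags 4 1024   # Qwopus3.6: 4 slots, micro-batch 1024
```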

### Sampling Parameters
| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
|-----------|----------|-------------|-----------|
| Temperature | 0.2 | 0.3 | 0.3 |
| Top-p | 0.95 | 0.95 | 0.95 |
| Top-k | 40 | 40 | 40 |
| Min-p | 0.01 | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 | 1.05 |
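
The table can be read as a set of server-side sampling defaults. The sketch below prints them as llama-server flags for each mode; `sampling_flags` is a hypothetical helper, and it is an assumption that the compose files set these values through command-line flags rather than per request.

```bash
# Print the sampling flags for one mode: rag, coding, or qwopus.
# Only the temperature differs between the modes in the table above.
sampling_flags() {
  case "$1" in
    rag)           temp=0.2 ;;
    coding|qwopus) temp=0.3 ;;
    *) echo "usage: sampling_flags rag|coding|qwopus" >&2; return 1 ;;
  esac
  echo "--temp $temp --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05"
}

sampling_flags rag
sampling_flags coding
```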

### Qwopus3.6 Specifics
- **Multimodal support**: Requires mmproj file (see docker-compose for configuration)
- **Parallel slots**: 4 (vs. 2 in standard), feasible because the KV cache needs only ~2.5 GB per slot
- **Micro-batch size**: 1024 (vs. 512), because the SSM layers process micro-batches of this size more efficiently
- **Container name**: `qwopus35b-moe-coding` (avoids conflict with standard coding container)
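
The slot figures above translate into a simple KV cache budget. The sketch below multiplies slots by per-slot size; `kv_cache_gb` is a hypothetical helper, and the per-slot values are the approximate numbers quoted in this README, not measurements.

```bash
# Estimate total KV cache in GB: slots * per-slot size.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
kv_cache_gb() {
  LC_ALL=C awk -v slots="$1" -v per_slot="$2" \
    'BEGIN { printf "%.1f\n", slots * per_slot }'
}

kv_cache_gb 4 2.5   # Qwopus3.6: 4 slots at ~2.5 GB each, prints 10.0
```

If the ~10GB saving quoted for the standard configurations scales linearly with slot count, each standard slot costs roughly 5 GB, so `kv_cache_gb 2 5` gives a comparable total.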

## API Usage

### Chat Completions

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "You are a helpful German-speaking assistant." },
      { "role": "user",   "content": "Explain quantum computing in 3 sentences." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```
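
With `"stream": true`, an OpenAI-compatible endpoint answers with server-sent events: each chunk arrives on a `data: ` line and the stream ends with `data: [DONE]`. The following sketch strips that framing so each JSON chunk lands on its own line; `sse_chunks` is a hypothetical helper name.

```bash
# Read an SSE stream on stdin and print one JSON chunk per line.
# Stops at the "data: [DONE]" sentinel and skips blank separator lines.
sse_chunks() {
  while IFS= read -r line; do
    case "$line" in
      "data: [DONE]"*) break ;;
      "data: "*) printf '%s\n' "${line#data: }" ;;
    esac
  done
}

# Usage (needs a running server; -N disables curl's output buffering):
#   curl -sN http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"messages":[{"role":"user","content":"Hi"}],"stream":true}' \
#     | sse_chunks
```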

### Health Check

```bash
curl -fs http://localhost:8000/
```
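
Loading a large model can take a while, so a start script may want to poll until the server answers. The sketch below wraps the health check above in a retry loop; `wait_for_server`, the 2-second interval, and the default timeout are illustrative assumptions.

```bash
# Poll a URL until it responds successfully or the timeout (seconds) expires.
# $1 = URL (default: the root URL used above), $2 = timeout (default 120).
wait_for_server() {
  url="${1:-http://localhost:8000/}"
  timeout="${2:-120}"
  waited=0
  while ! curl -fs --max-time 5 -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s: $url" >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "server ready: $url"
}

# Usage:
#   wait_for_server http://localhost:8000/ 300 && echo "safe to send requests"
```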

## Integration with Pi

### Files
Pi reads files using the `read` tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt

### Tools
Pi provides built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls go through the OpenAI-compatible function-calling API, which the server enables via the `--jinja` flag.
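
For reference, a function-calling request outside of Pi might look like the sketch below. The `tools` array follows the OpenAI chat-completions schema; the tool name `read_file`, its parameters, and the helper `tools_payload` are hypothetical and only illustrate the shape of the request.

```bash
# Print a chat-completions payload that offers the model one tool.
# Assumption: "read_file" is an illustrative tool, not one of Pi's definitions.
tools_payload() {
  cat <<'EOF'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "Show me the contents of README.md." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": { "type": "string", "description": "Path of the file to read." }
          },
          "required": ["path"]
        }
      }
    }
  ]
}
EOF
}

# Usage (needs a running server; -d @- reads the body from stdin):
#   tools_payload | curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @-
```

If the model decides to call the tool, an OpenAI-compatible response carries the call in `choices[0].message.tool_calls` instead of plain text content.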

### MCP Servers
Add to `settings.json`:
```json
"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:some-mcp-server"
]
```

## Troubleshooting

### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify that the model file exists in `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight

### Connection Refused
- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify container is running: `docker ps | grep qwen35b`

## Maintenance

### Update Model
1. Download the new GGUF file to the `HF_HOME` path
2. Update the `-m` parameter in `docker-compose.yml` or the shell script
3. Restart the container

### Backup Configuration
```bash
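# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
# Options go before the operands so this also works with non-GNU cp
cp -r ~/.pi/agent/extensions ~/backup/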
```

## License

This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.