# Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

## Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)  
**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf  
**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda  
**API Endpoint**: http://localhost:8000/v1/chat/completions

## Architecture

```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi  ──→  llama.cpp Docker  ──→  GPU
Files       ←──┘  (API request with system prompt + tools)
```

The Docker containers serve as pure inference backends. All file management, prompts, tools, and MCP servers are handled at the Pi level.

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b
```

### Using Shell Scripts

#### Server Mode Scripts
```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```

#### CLI Mode Scripts
```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```

#### Embedding Server
```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: Each shell script stops any existing container with the same name before starting a new one. To stop and remove a server manually, use `docker rm -f <container_name>`.

## Configuration Details

### Hardware Requirements
- **GPU**: NVIDIA RTX 3090 (2x) or equivalent with 24GB+ VRAM each
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache

### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access

### Context & Performance
- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
- **Batch size**: 2,048 for long context processing
- **KV cache**: q8_0 quantization, which roughly halves KV-cache memory relative to f16 at a small quality cost
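
The GPU and context settings above map onto `llama-server` command-line flags roughly as sketched below. This is an illustration under assumptions, not a copy of the project's compose files: the function name `llama_server_args` is hypothetical, and mapping "Max output" to `--n-predict` is a guess at how the limit is set.

```bash
# Hypothetical helper: print one llama-server argument per line for the
# dual-GPU, long-context setup described above.
llama_server_args() {
  printf '%s\n' \
    --n-gpu-layers 999 \
    --main-gpu 0 \
    --tensor-split 0.5,0.5 \
    --ctx-size 262144 \
    --n-predict 16384 \
    --parallel 2 \
    --batch-size 2048 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0
  # Flash Attention is left out on purpose: recent builds expect
  # "--flash-attn on", older ones a bare "-fa". Check "llama-server --help"
  # in the image before adding it.
}
```

In a bash script the output could be collected with `mapfile -t args < <(llama_server_args)` and appended to the container command.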

### Sampling Parameters
| Parameter | RAG Mode | Coding Mode |
|-----------|----------|-------------|
| Temperature | 0.2 | 0.3 |
| Top-p | 0.95 | 0.95 |
| Top-k | 40 | 40 |
| Min-p | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 |

## API Usage

### Chat Completions

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "You are a helpful German assistant." },
      { "role": "user",   "content": "Explain quantum computing in 3 sentences." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```

### Health Check

```bash
curl -fs http://localhost:8000/health
```
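
Loading a model of this size can take a while, so a script that starts the container may want to wait until the server answers. `llama-server` exposes a `/health` endpoint that returns an error status while the model is loading and 200 once it is ready. The polling loop below is a minimal sketch; the function name `wait_for_server` and its default attempt count and delay are assumptions.

```bash
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_server URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_server() {
  local url=$1 attempts=${2:-60} delay=${3:-5} i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors such as 503.
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}
```

A typical call would be `wait_for_server http://localhost:8000/health` right after `docker compose up -d`.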

## Integration with Pi

### Files
Pi reads files using the `read` tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md` — replaces complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md` — appended to end of system prompt

### Tools
Pi provides built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls go through the OpenAI-compatible function-calling API, which llama-server enables with the `--jinja` flag.
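
To illustrate what a function-calling request looks like at the API level, the sketch below prints a request body that offers one tool to the model. The tool `get_weather` and its parameters are invented for this example and are not part of Pi or this project; Pi normally builds such requests itself.

```bash
# Hypothetical example: request body in the OpenAI "tools" format.
# "get_weather" is an invented tool used only for illustration.
tool_request_body() {
  cat <<'EOF'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "What is the weather in Berlin?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": { "type": "string" } },
          "required": ["city"]
        }
      }
    }
  ]
}
EOF
}
```

The body can be sent with `tool_request_body | curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @-`. If the model decides to call the tool, the call appears under `choices[0].message.tool_calls` in the response.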

### MCP Servers
Add to `settings.json`:
```json
"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:some-mcp-server"
]
```

## Troubleshooting

### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify that the model file exists in `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight
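
Before changing any of these settings, it helps to see how much VRAM is actually free on each GPU. The sketch below formats the CSV output of an `nvidia-smi` query; the function name `vram_free` is an assumption.

```bash
# Hypothetical helper: read "index, used, total" CSV lines (values in MiB)
# on stdin and print the free memory per GPU.
vram_free() {
  awk -F', *' '{ printf "GPU %s: %d MiB free\n", $1, $3 - $2 }'
}
```

It is meant to be fed from `nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits | vram_free`.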

### Connection Refused
- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify container is running: `docker ps | grep qwen35b`
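
Several of the checks above can be bundled into one small script. This is a sketch only: the function names `check` and `diagnose` are hypothetical, and the `qwen35b` pattern and port 8000 are taken from the commands in this README.

```bash
# Hypothetical helper: run a command quietly and report whether it succeeded.
check() {
  local label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $label"
  else
    echo "FAIL $label"
  fi
}

# Run the troubleshooting checks in one go.
diagnose() {
  check "GPU visible to the driver" nvidia-smi
  check "container running"         sh -c 'docker ps | grep -q qwen35b'
  check "server answering"          curl -fs http://localhost:8000/health
}
```

Calling `diagnose` prints one `OK` or `FAIL` line per check.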

## Maintenance

### Update Model
1. Download the new GGUF file to the `HF_HOME` path
2. Update the `-m` parameter in the compose file or shell script to point at the new file
3. Restart container
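
Before restarting, a quick sanity check can confirm that the download is a GGUF file rather than, say, a saved error page. GGUF files begin with the four ASCII bytes `GGUF`. The function name `is_gguf` is an assumption, and the check does not detect a truncated download.

```bash
# Hypothetical helper: succeed if the file exists and starts with the
# 4-byte GGUF magic.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

It could be chained as `is_gguf /path/to/model.gguf && docker compose up -d --force-recreate`, where the path is a placeholder for the new file.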

### Backup Configuration
```bash
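# Create the backup directory if it does not exist yet
mkdir -p ~/backup
# Copy the prompt files and the extensions directory
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
cp -r ~/.pi/agent/extensions ~/backup/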
```

## License

This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.

Initial commit: Qwen3.6-MoE-35B-A3B server configuration and documentation (2026-05-11 15:01:09 +02:00)
