# Qwen3.6-MoE-35B-A3B Local Inference Server
Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).
## Overview
This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

- **Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
- **Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
- **Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
- **API Endpoint**: http://localhost:8000/v1/chat/completions
## Architecture
```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤ Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘      (API request with system prompt + tools)
```
The Docker containers serve as pure inference backends. File management, prompts, tools, and MCP servers are all handled by Pi, which sends each API request with the system prompt and tool definitions already assembled.
## Quick Start
### Using Docker Compose (Recommended)
```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate
# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
# Stop and remove container
docker compose rm -s -f qwen35b
```
### Using Shell Scripts
#### Server Mode Scripts
```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh
# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh
# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```
#### CLI Mode Scripts
```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh
# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```
#### Embedding Server
```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```
**Note**: Each shell script stops any existing container with the same name before starting a new one. To stop and remove a server manually, use `docker rm -f <container_name>`.
## Configuration Details
### Hardware Requirements
- **GPU**: 2x NVIDIA RTX 3090, or equivalent GPUs with 24GB+ VRAM each
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache
### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access
### Context & Performance
- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
- **Batch size**: 2,048 for long context processing
- **KV cache**: q8_0 quantization for speed/quality balance
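
The GPU and context settings above can be sketched as `llama-server` arguments. This is a minimal illustration, not a copy of the compose files: the mapping from each setting to a flag is inferred, the model path is a hypothetical placeholder, and `--flash-attn on` assumes a recent build (older builds take a bare `-fa`). The `per_slot_ctx` helper assumes the server divides the context evenly across parallel slots, which may not hold on builds that use a unified KV cache, so check it against the image in use.

```bash
#!/usr/bin/env bash
# Sketch: README settings expressed as llama-server arguments (inferred mapping).

build_server_args() {
  local model_path="$1"      # hypothetical path to the GGUF file inside the container
  local args=(
    -m "$model_path"
    -ngl 999                 # offload all layers to GPU
    --main-gpu 0             # primary GPU: device 0
    --tensor-split 0.5,0.5   # even split across both GPUs
    --flash-attn on          # assumption: recent build; older builds use a bare -fa
    --ctx-size 262144        # total context window
    --n-predict 16384        # max output tokens
    --parallel 2             # parallel slots
    --batch-size 2048
    --cache-type-k q8_0      # quantized KV cache (keys)
    --cache-type-v q8_0      # quantized KV cache (values)
  )
  printf '%s\n' "${args[@]}"  # one argument per line
}

# Context available to each slot, assuming an even split across slots.
per_slot_ctx() {
  local total_ctx="$1" slots="$2"
  echo $(( total_ctx / slots ))
}
```

Under the even-split assumption, `per_slot_ctx 262144 2` yields 131072 tokens per slot. The argument list can be loaded into an array with `mapfile -t args < <(build_server_args /models/model.gguf)` and passed on as `"${args[@]}"`.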
### Sampling Parameters
| Parameter | RAG Mode | Coding Mode |
|-----------|----------|-------------|
| Temperature | 0.2 | 0.3 |
| Top-p | 0.95 | 0.95 |
| Top-k | 40 | 40 |
| Min-p | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 |
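
The presets in the table can also be supplied per request instead of relying on server defaults. The sketch below emits the sampling fields as a JSON object for a given mode. It assumes the server accepts the llama.cpp-specific fields `top_k`, `min_p`, and `repeat_penalty` alongside the standard OpenAI ones; the function name and the mode names `rag` and `coding` are illustrative.

```bash
#!/usr/bin/env bash
# Sketch: sampling presets from the table as a JSON fragment.

sampling_json() {
  local mode="$1" temp
  case "$mode" in
    rag)    temp=0.2 ;;
    coding) temp=0.3 ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # Only the temperature differs between the two modes.
  printf '{"temperature": %s, "top_p": 0.95, "top_k": 40, "min_p": 0.01, "repeat_penalty": 1.05}\n' "$temp"
}

# Usage (requires jq and a running server):
#   jq -n --argjson s "$(sampling_json coding)" \
#     '{model: "qwen3.6-35b-a3b-moe", messages: [{role: "user", content: "Hello"}]} + $s' \
#   | curl -s http://localhost:8000/v1/chat/completions \
#       -H 'Content-Type: application/json' -d @-
```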
## API Usage
### Chat Completions
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "You are a helpful German assistant." },
      { "role": "user", "content": "Explain quantum computing in 3 sentences." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```
### Health Check
```bash
curl -fs http://localhost:8000/
```
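
Loading a 35B model can take a while, so scripts that depend on the server may want to wait until it is ready. The sketch below polls until the server answers with a successful status. It targets llama-server's `/health` endpoint, which returns an error status while the model is still loading; the function names, retry count, and delay are illustrative defaults rather than values from this project.

```bash
#!/usr/bin/env bash
# Sketch: wait until the inference server reports ready.

# probe: succeeds once the URL answers with an HTTP success status.
probe() {
  curl -fs -o /dev/null --max-time 2 "$1"
}

wait_for_server() {
  local url="${1:-http://localhost:8000/health}"
  local retries="${2:-60}" delay="${3:-5}"
  local i
  for (( i = 1; i <= retries; i++ )); do
    if probe "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "server not ready after $retries attempts" >&2
  return 1
}
```

For example, `wait_for_server && ./my_client.sh` runs a (hypothetical) client script only after the model has finished loading.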
## Integration with Pi
### Files
Pi reads files using the `read` tool and sends content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.
### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt
### Tools
Pi provides built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls use the OpenAI-compatible function-calling API, which llama-server enables through the `--jinja` flag.
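
To illustrate what a function-calling request looks like on the wire, the sketch below prints a request body in the OpenAI `tools` format. Pi normally builds these requests itself, so this is only useful for testing the server directly. The `list_files` tool and its parameters are hypothetical and not part of Pi or this project.

```bash
#!/usr/bin/env bash
# Sketch: a chat completion request that offers one (hypothetical) tool.

tool_request_body() {
  cat <<'JSON'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "List the files in the current directory." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List files in a directory (hypothetical tool)",
        "parameters": {
          "type": "object",
          "properties": {
            "path": { "type": "string", "description": "Directory to list" }
          },
          "required": ["path"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
JSON
}

# Usage (requires a running server started with --jinja):
#   tool_request_body | curl -s http://localhost:8000/v1/chat/completions \
#     -H 'Content-Type: application/json' -d @-
```

If the model decides to call the tool, the OpenAI response format carries the call under `choices[0].message.tool_calls` rather than in the message content.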
### MCP Servers
Add the `packages` key to `settings.json` (the last entry is a placeholder for any additional MCP server package):
```json
"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:irgendein-mcp-server"
]
```
## Troubleshooting
### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify the model file exists in `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`
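
Step 2 can be scripted as a quick check that a model directory actually contains GGUF files. This is a minimal sketch; the function name is illustrative, and the directory to check is whatever path the compose file or script mounts.

```bash
#!/usr/bin/env bash
# Sketch: list GGUF files in a directory; fail if there are none.

list_gguf() {
  local dir="$1" status=1 f
  for f in "$dir"/*.gguf; do
    [ -f "$f" ] || continue   # skips the unexpanded pattern when nothing matches
    printf '%s\n' "$f"
    status=0
  done
  return "$status"
}
```

For example, `list_gguf /path/to/models || echo "no GGUF files found"` prints the available model files or a warning.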
### GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight
### Connection Refused
- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify container is running: `docker ps | grep qwen35b`
## Maintenance
### Update Model
1. Download the new GGUF file to the `HF_HOME` path
2. Update the `-m` parameter in the compose file or shell script to point to the new file
3. Restart container
### Backup Configuration
```bash
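# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
# Options go before the operands for portability across cp implementations
cp -r ~/.pi/agent/extensions/ ~/backup/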
```
## License
This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.