Initial commit: Qwen3.6-MoE-35B-A3B server configuration and documentation
commit
b039061615
16 changed files with 1672 additions and 0 deletions
188
README.md
Normal file
@ -0,0 +1,188 @@
# Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

## Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

- **Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
- **Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
- **Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
- **API Endpoint**: http://localhost:8000/v1/chat/completions

## Architecture

```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤ Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘       (API request with system prompt + tools)
```

The Docker containers serve as pure inference backends. All file management, prompts, tools, and MCP servers are handled at the Pi level.

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b
```
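
If you switch between configurations often, a small lookup can save typing the long file names. The following is only a sketch: the function name `compose_file_for_mode` and the mode keywords are made up for illustration, and only the file names come from the table in the Overview.

```bash
# Hypothetical helper: map a short mode keyword to one of the compose files
# listed in the Overview table. The keywords are illustrative assumptions.
compose_file_for_mode() {
  case "$1" in
    rag)            echo "docker-compose_Qwen3.6_Tools_RAG_faehig.yml" ;;
    coding)         echo "docker-compose_Qwen3.6_Tools_coding.yml" ;;
    uncensored)     echo "docker-compose_Qwen3.6_Uncensored.yml" ;;
    uncensored-rag) echo "docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml" ;;
    *)              echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Example call (not executed here):
#   docker compose -f "$(compose_file_for_mode coding)" up -d --force-recreate
```

An unknown keyword makes the function return a non-zero status, so a wrapper script can stop before calling `docker compose` with an empty file name.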

### Using Shell Scripts

#### Server Mode Scripts
```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```

#### CLI Mode Scripts
```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```

#### Embedding Server
```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: All shell scripts automatically stop any existing container with the same name before starting a new one. To stop and remove a server manually, use `docker rm -f <container_name>`.
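
The stop-before-start behaviour described in the note can be sketched as follows. This is not taken from the scripts themselves: the function name `restart_clean` is hypothetical, and the real scripts may do it differently.

```bash
# Hypothetical helper: remove a container by name (if any), then run the
# start command passed as the remaining arguments.
restart_clean() {
  local name="$1"
  shift
  # Ignore failures such as "no such container".
  docker rm -f "$name" >/dev/null 2>&1 || true
  "$@"
}

# Example call (not executed here):
#   restart_clean qwen35b-moe-rag-longctx ./run_qwen35b_server_tools.sh
```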

## Configuration Details

### Hardware Requirements
- **GPU**: 2× NVIDIA RTX 3090 or equivalent, with 24 GB+ VRAM each
- **RAM**: 64 GB+ system RAM recommended
- **Storage**: 100 GB+ for model files and cache

### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (even split across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access
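
As a rough illustration, the GPU settings above correspond to llama.cpp server flags along these lines. This is a sketch, not a copy of the compose files. The array name `gpu_args` is made up, and the Flash Attention flag is left out on purpose because its syntax differs between llama.cpp versions.

```bash
# Assumed mapping of the GPU settings to llama.cpp flags (long forms).
gpu_args=(
  --n-gpu-layers 999      # offload all layers, same as -ngl 999
  --main-gpu 0            # primary GPU: device 0
  --tensor-split 0.5,0.5  # even split across both GPUs
)

# Print the flags as they would appear on the command line.
printf '%s\n' "${gpu_args[*]}"
```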

### Context & Performance
- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 (saves ~10 GB of KV cache compared with 4)
- **Batch size**: 2,048 for long-context processing
- **KV cache**: q8_0 quantization for a speed/quality balance
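
The sketch below shows how these values might be passed to the llama.cpp server and what they imply per slot. It assumes a llama.cpp build that divides the total context evenly across parallel slots, that the maximum output is set with `--n-predict`, and that q8_0 applies to both the K and V caches. None of this is confirmed by the compose files.

```bash
ctx_total=262144   # context window from the list above
slots=2            # parallel slots

# With an even split, each slot gets half of the total context.
echo "per-slot context: $(( ctx_total / slots )) tokens"   # prints 131072

# Assumed flag mapping for the remaining settings.
ctx_args=(
  --ctx-size "$ctx_total"
  --parallel "$slots"
  --batch-size 2048
  --n-predict 16384
  --cache-type-k q8_0
  --cache-type-v q8_0
)
```

Under the same assumption, dropping to one slot would give that slot the full 262,144 tokens.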

### Sampling Parameters
| Parameter | RAG Mode | Coding Mode |
|-----------|----------|-------------|
| Temperature | 0.2 | 0.3 |
| Top-p | 0.95 | 0.95 |
| Top-k | 40 | 40 |
| Min-p | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 |
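
The two modes differ only in temperature. As a sketch, the table can be expressed as llama.cpp sampling flags. The function name `sampling_args` and the mode keywords are illustrative assumptions, and the compose files may set these values in another way.

```bash
# Hypothetical helper: print the sampling flags for a given mode.
sampling_args() {
  local temp
  case "$1" in
    rag)    temp=0.2 ;;
    coding) temp=0.3 ;;
    *)      echo "unknown mode: $1" >&2; return 1 ;;
  esac
  echo "--temp $temp --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05"
}

sampling_args rag
```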

## API Usage

### Chat Completions

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "You are a helpful German assistant." },
      { "role": "user", "content": "Explain quantum computing in 3 sentences." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```

### Health Check

```bash
curl -fs http://localhost:8000/
```
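
Loading a model of this size can take a while, so scripts may want to wait until the server is ready. The llama.cpp server also exposes a `/health` endpoint that returns an error status while the model is still loading. The sketch below polls it. The function name `wait_for_server` and the default retry values are assumptions.

```bash
# Hypothetical helper: poll a URL until curl succeeds or the attempts run out.
# Arguments: URL, number of attempts, delay between attempts in seconds.
wait_for_server() {
  local url="${1:-http://localhost:8000/health}"
  local tries="${2:-60}"
  local delay="${3:-5}"
  local i
  for (( i = 1; i <= tries; i++ )); do
    # -f makes curl fail on HTTP error codes such as 503.
    if curl -fs -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "not ready after $tries attempts" >&2
  return 1
}

# Example call (not executed here):
#   wait_for_server http://localhost:8000/health 60 5
```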

## Integration with Pi

### Files
Pi reads files with the `read` tool and sends their content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt

### Tools
Built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. Tool calls use the OpenAI-style function-calling API, which the llama.cpp server enables via the `--jinja` flag.
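
To show what a function-calling request looks like on the wire, here is a minimal sketch. The tool `list_files` and its schema are invented for illustration and are not one of Pi's tools. Only the overall request shape follows the OpenAI-style chat completions format.

```bash
# Request body with one hypothetical tool definition.
payload=$(cat <<'JSON'
{
  "model": "qwen3.6-35b-a3b-moe",
  "messages": [
    { "role": "user", "content": "Which files are in /tmp?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "list_files",
        "description": "List the files in a directory (hypothetical tool).",
        "parameters": {
          "type": "object",
          "properties": { "path": { "type": "string" } },
          "required": ["path"]
        }
      }
    }
  ]
}
JSON
)

# Defined but not called here; it needs a running server.
send_tool_request() {
  curl -s "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$payload"
}
```

If the model decides to call the tool, the response is expected to carry a `tool_calls` entry instead of plain text. Running the tool and sending the result back is the client's job, which in this setup means Pi.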

### MCP Servers
Add to `settings.json` (`npm:irgendein-mcp-server` is a placeholder for any additional MCP server package):
```json
"packages": [
  "npm:pi-llama-cpp",
  "npm:@modelcontextprotocol/server-filesystem",
  "npm:irgendein-mcp-server"
]
```

## Troubleshooting

### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify that the model file exists under `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use the uncensored variant if VRAM is tight

### Connection Refused
- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify the container is running: `docker ps | grep qwen35b`

## Maintenance

### Update Model
1. Download the new GGUF file to the `HF_HOME` path
2. Update the `-m` parameter in the compose file or shell script
3. Restart the container
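
Before restarting, it can help to confirm that the download is really a GGUF file, since an interrupted or wrong download would otherwise only show up in the container logs. GGUF files begin with the four ASCII bytes `GGUF`. The check below is a sketch, and the function name `is_gguf` is hypothetical.

```bash
# Hypothetical helper: succeed if the file starts with the GGUF magic bytes.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example call (not executed here):
#   is_gguf /path/to/new-model.gguf && echo "looks like a GGUF file"
```

This only inspects the header, so it does not detect a truncated file. Comparing a checksum against the one published with the model is the more thorough check.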

### Backup Configuration
```bash
# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
# Options go before the operands so the command also works with non-GNU cp
cp -r ~/.pi/agent/extensions ~/backup/
```

## License

This project uses llama.cpp (MIT License) and the Qwen3.6-MoE model. Model usage is subject to the original model license terms.