Initial commit: Qwen3.6-MoE-35B-A3B server configuration and documentation
commit
b039061615
16 changed files with 1672 additions and 0 deletions
76
.gitignore
vendored
Normal file
@@ -0,0 +1,76 @@
# Environment files
.env
.env.local
.env.*.local

# Docker volumes and data
docker/volumes/
docker/data/
docker/tmp/

# Node modules (if used)
node_modules/
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Python virtual environments
venv/
env/
.env/
.venv/
*.egg-info/
dist/
build/

# IDE files
.vscode/
.idea/
*.swp
*.swo
*~

# OS files
.DS_Store
Thumbs.db
ehthumbs.db
Desktop.ini
$RECYCLE.BIN/

# Logs
logs/
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# Cache and temporary files
.cache/
.tmp/
temp/
tmp/

# Docker build cache
.docker/

# Local configuration overrides
docker-compose.override.yml

# Hugging Face cache (if downloaded locally)
.huggingface/

# Model files (large binary files)
*.gguf
*.bin
*.safetensors

# Backup files
*.bak
*.backup
*~

# Credentials and sensitive data
credentials/
secrets/
*.pem
*.key
339
BEDIENUNGSANLEITUNG.md
Normal file
@@ -0,0 +1,339 @@
# Operating Manual for the Qwen3.6-MoE-35B-A3B Server

This manual describes how to install, configure, and operate the local Qwen3.6-MoE-35B-A3B inference servers with llama.cpp.

## Table of Contents
1. [System Requirements](#system-requirements)
2. [Installation and Startup](#installation-and-startup)
3. [Server Management](#server-management)
4. [Configuration and Parameters](#configuration-and-parameters)
5. [Integration with Pi](#integration-with-pi)
6. [API Usage](#api-usage)
7. [Troubleshooting](#troubleshooting)

## System Requirements

### Hardware
- **GPU**: 2x NVIDIA RTX 3090 or equivalent, each with 24GB+ VRAM
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache
- **NVIDIA driver**: version 535 or newer with CUDA 12.x

### Software
- Docker Engine (version 20.10+)
- Docker Compose (version 2.0+)
- NVIDIA Container Toolkit
- curl or wget for health checks

## Installation and Startup

### Checking Prerequisites
```bash
# Check GPU availability
nvidia-smi

# Check the Docker versions
docker --version
docker compose version

# Create the directory structure
mkdir -p ~/llama-server
cd ~/llama-server
```

### Starting the Server

#### Method 1: Docker Compose (Recommended)
```bash
# Change to the project directory
cd ~/llama-server

# Run only ONE of the following commands; all servers bind port 8000.

# Start the RAG-optimized server (default)
docker compose up -d --force-recreate

# Start the coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Start the uncensored variant
docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d --force-recreate
```

#### Method 2: Shell Scripts
```bash
# Server mode (background service)
./run_qwen35b_server_tools.sh                    # coding-optimized
./run_qwen35b_server_uncensored_rag_longctx.sh   # uncensored + RAG
./run_qwen35b_server_uncensored.sh               # uncensored (no RAG)

# CLI mode (command line)
./run_qwen35b_cli_tools_rag_longctx.sh           # CLI with RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh      # CLI, uncensored + RAG

# Embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: All shell scripts automatically stop any existing container with the same name before starting.

## Server Management

### Important Rule
> **Only one server can run on port 8000 at a time!**

### Container Names and Configurations

| Container name | Model | Configuration file |
|----------------|-------|--------------------|
| qwen35b-moe-coding | Carnice | docker-compose_Qwen3.6_Tools_coding.yml |
| qwen35b-moe-tools | Carnice | docker-compose_Qwen3.6_Tools.yml |
| qwen35b-moe-rag-longctx | Carnice | docker-compose_Qwen3.6_Tools_RAG_faehig.yml |
| qwen35b-moe-uncensored | Uncensored | docker-compose_Qwen3.6_Uncensored.yml |
| qwen35b-moe-uncensored-rag | Uncensored | docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml |
| qwen35b-moe-uncensored-rag-longctx | Uncensored | run_qwen35b_server_uncensored_rag_longctx.sh |

### Stopping and Starting Servers

#### Stopping a Container
```bash
# Stop and remove by container name
docker rm -f qwen35b-moe-coding

# Or via Docker Compose
cd ~/llama-server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml down

# List all running containers
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

#### Switching Servers
```bash
cd ~/llama-server

# Stop the current server
docker rm -f qwen35b-moe-coding

# Start a different server
docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d
```

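The container table earlier in this section can also be encoded as a small lookup, so that a switch script does not need hard-coded file names. This is only a sketch: the name pairs are copied from the table, while the function name `compose_file_for` is hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helper: map a container name to the file that defines it.
# The pairs are copied from the container table in this manual.
compose_file_for() {
  case "$1" in
    qwen35b-moe-coding)                 echo "docker-compose_Qwen3.6_Tools_coding.yml" ;;
    qwen35b-moe-tools)                  echo "docker-compose_Qwen3.6_Tools.yml" ;;
    qwen35b-moe-rag-longctx)            echo "docker-compose_Qwen3.6_Tools_RAG_faehig.yml" ;;
    qwen35b-moe-uncensored)             echo "docker-compose_Qwen3.6_Uncensored.yml" ;;
    qwen35b-moe-uncensored-rag)         echo "docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml" ;;
    qwen35b-moe-uncensored-rag-longctx) echo "run_qwen35b_server_uncensored_rag_longctx.sh" ;;
    *) echo "unknown container: $1" >&2; return 1 ;;
  esac
}

# Example (commented out so that sourcing this file has no side effects):
#   docker rm -f qwen35b-moe-coding
#   docker compose -f "$(compose_file_for qwen35b-moe-uncensored)" up -d
```

Note that the last table entry maps to a shell script rather than a Compose file, so a real switch script would have to treat that case separately.
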
### Health Check and Status
```bash
# Check the server status
curl -fs http://localhost:8000/

# Show the container logs
docker logs qwen35b-moe-rag-longctx

# Check the container health status
docker inspect --format='{{.State.Health.Status}}' qwen35b-moe-rag-longctx
```

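Because the model takes a while to load, a single `curl` right after startup will usually fail. The following sketch retries a probe command until it succeeds. The helper name `wait_until` and the chosen attempt count and delay are assumptions for illustration; the probed URL is the one used above.

```shell
#!/bin/sh
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0          # probe succeeded
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1              # all attempts failed
}

# Example (commented out): up to 24 probes, 5 seconds apart
#   wait_until 24 5 curl -fs http://localhost:8000/ && echo "server is up"
```

Passing the probe as arguments keeps the loop independent of `curl`, so the same helper could wrap the `docker inspect` health query as well.
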
## Configuration and Parameters

### Hardware Configuration
```yaml
GPU:
  Main GPU: "0"            # first 3090
  Tensor split: "0.5,0.5"  # symmetric
  All layers on GPU: -ngl 999
  Flash Attention: -fa on

KV cache:
  Type: q8_0               # K and V
  Unified cache: --kv-unified
```

### Context and Performance Parameters
| Parameter | Value | Description |
|-----------|-------|-------------|
| Context window | 262,144 (256k) | For long RAG contexts |
| Max. output | 16,384 tokens | Prevents text loops |
| Parallel slots | 2 | Saves ~10GB of KV cache |
| Batch size | 2,048 | For long contexts |
| Ubatch size | 512 | Matched to the batch size |

### Sampling Parameters

#### RAG Mode (Default)
```yaml
temperature: 0.2       # lower, for factually faithful answers
top-p: 0.95            # Qwen recommendation
top-k: 40              # Qwen recommendation
min-p: 0.01            # stabilizes sampling
repeat-penalty: 1.05   # prevents repetition
```

#### Coding Mode
```yaml
temperature: 0.3       # compromise between creativity and precision
top-p: 0.95
top-k: 40
min-p: 0.01
repeat-penalty: 1.05
```

### Runtime Parameters (No Restart)
These parameters can be overridden per API request:
- `temperature`
- `top_p`
- `top_k`
- `min_p`
- `repeat_penalty`
- `max_tokens`

**Example**:
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "temperature": 0.0,
    "top_k": 20,
    "max_tokens": 512,
    ...
  }'
```

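To try different sampling values quickly, the request body can be generated from arguments instead of being edited by hand. This is a minimal sketch; the helper names `json_escape` and `build_chat_payload` are hypothetical, and the escaping only covers backslashes and double quotes, which is enough for simple single-line prompts.

```shell
#!/bin/sh
# Escape backslashes and double quotes so the text is safe inside a JSON string.
# Assumption: the prompt is a single line without control characters.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a chat-completions body that overrides three sampling parameters.
# $1 = user prompt, $2 = temperature, $3 = top_k, $4 = max_tokens
build_chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"temperature":%s,"top_k":%s,"max_tokens":%s}\n' \
    "$(json_escape "$1")" "$2" "$3" "$4"
}

# Example (commented out):
#   curl -s -X POST "http://localhost:8000/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d "$(build_chat_payload 'Hello' 0.0 20 512)"
```

Any parameter left out of the body falls back to the server default set at startup.
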
### Parameters That Require a Restart
| Parameter | Reason |
|-----------|--------|
| `-c` (context) | The KV cache is allocated at startup |
| `--parallel` | The number of KV cache slots is fixed |
| `-ngl`, `--tensor-split` | The model is loaded onto the GPUs at startup |
| `--kv-unified`, `--cache-type-*` | The cache structure cannot be changed |
| `--batch-size`, `--ubatch-size` | Internal buffer allocation |
| Model file | Obvious |

## Integration with Pi

### Architecture
```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘  (API request with system prompt + tools)
```

### 1. Passing Files
Pi reads files with the `read` tool and sends their content as text in the prompt.

**Automatic loading**:
- AGENTS.md or project-specific context files at session start

### 2. Configuring Prompts
```bash
~/.pi/agent/SYSTEM.md          # replaces the entire system prompt
~/.pi/agent/APPEND_SYSTEM.md   # appended to the end
```

### 3. Using Tools
**Built-in tools**: read, write, edit, bash

**Custom tools**: registered as Pi extensions in `~/.pi/agent/extensions/`. The model sees the tool definitions in the system prompt and calls them through the OpenAI function-calling API (which is why `--jinja` matters).

### 4. Setting Up MCP Servers
In `settings.json`:
```json
{
  "packages": [
    "npm:pi-llama-cpp",
    "npm:@modelcontextprotocol/server-filesystem",
    "npm:irgendein-mcp-server"
  ]
}
```

The MCP server runs as a process alongside pi, not inside the llama-cpp container.

**Note**: llama.cpp has a `--system-prompt` flag, but it is less flexible than AGENTS.md and conflicts with pi's own system prompt.

## API Usage

### Chat Completions
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user", "content": "Erkläre Quantencomputing in 3 Sätzen." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```

### Enabling Streaming
```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "user", "content": "Schreibe eine kurze Geschichte." }
    ],
    "stream": true
  }'
```

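With `"stream": true`, the response arrives in pieces rather than as one JSON document. The sketch below assumes the server follows the OpenAI streaming convention, where each chunk is a line starting with `data: ` and the stream ends with `data: [DONE]`; the helper name `sse_payloads` is hypothetical.

```shell
#!/bin/sh
# Read a server-sent-event stream on stdin and print only the JSON chunks.
# Assumption: chunks look like "data: {...}" and the stream ends with "data: [DONE]".
sse_payloads() {
  tr -d '\r' | while IFS= read -r line; do
    case "$line" in
      "data: [DONE]") break ;;                           # end-of-stream marker
      "data: "*) printf '%s\n' "${line#data: }" ;;       # strip the prefix
    esac                                                 # blank lines are ignored
  done
}

# Example (commented out); -N turns off curl's output buffering:
#   curl -sN -X POST "http://localhost:8000/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -d '{"messages":[{"role":"user","content":"Hi"}],"stream":true}' | sse_payloads
```

Each printed line is one JSON chunk that can be handed to a JSON tool for further processing.
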
## Troubleshooting

### Server Does Not Respond
1. **Check GPU availability**: `nvidia-smi`
2. **Check that the model file exists**: `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. **Check the container logs**: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Problems
- Reduce the parallel slots from 2 to 1
- Reduce the batch size from 2048 to 1024
- Use the uncensored variant (lower VRAM requirement)

### Connection Errors
- **Port 8000 in use**: check with `lsof -i :8000`
- **Firewall**: review the firewall settings
- **Container running**: check with `docker ps | grep qwen35b`

### Container Does Not Start
1. **GPU access**: install the NVIDIA Container Toolkit
2. **Memory**: is enough VRAM available?
3. **Port conflict**: stop the other server

### Model File Not Found
```bash
# Check the path
ls -la /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/

# Download the model if necessary
huggingface-cli download <model-name> --local-dir ./models
```

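The path check can be wrapped in a small pre-flight function that distinguishes the common failure cases before a container is started. This is a sketch under assumptions: the function name `check_model_file` and its return codes are hypothetical.

```shell
#!/bin/sh
# Pre-flight check for a model file on the host.
# Returns 0 if usable, 1 if missing, 2 if unreadable, 3 if not a .gguf file.
check_model_file() {
  if [ ! -e "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  if [ ! -r "$1" ]; then
    echo "not readable: $1"
    return 2
  fi
  case "$1" in
    *.gguf) echo "ok: $1"; return 0 ;;
    *)      echo "not a .gguf file: $1"; return 3 ;;
  esac
}

# Example (commented out), using the model name from the Compose files:
#   check_model_file /home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf
```

The check only looks at the file name and permissions; it does not validate the GGUF contents.
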
## Maintenance and Backup

### Updating the Model
1. Download the new GGUF file to the HF_HOME path
2. Update the `-m` parameter in docker-compose.yml or the shell script
3. Restart the container

### Backing Up the Configuration
```bash
# Create the backup directory if it does not exist yet
mkdir -p ~/backup

# Back up the system prompts
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/

# Back up the extensions
cp -r ~/.pi/agent/extensions/ ~/backup/

# Back up the Docker configurations
cp ~/llama-server/docker-compose*.yml ~/backup/
```

### Regular Maintenance
- **Weekly**: check the container logs
- **Monthly**: update the GPU drivers
- **On new releases**: update Docker and the NVIDIA drivers

## License

This project uses llama.cpp (MIT) and the Qwen3.6-MoE model. Use of the model is subject to the license terms of the original model.
162
FAQs.md
Normal file
@@ -0,0 +1,162 @@
# FAQ - Frequently Asked Questions

## Server Management

### How do I stop and start one of these llama servers?

⚠️ **Important**: Because all containers bind port 8000, only one can run at a time.

#### Stopping a server

```bash
# Stop a running container by name (regardless of how it was started)
docker rm -f qwen35b-moe-coding

# Or via Docker Compose
cd ~/llama-server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml down

# Show which containers are currently running
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
```

#### Starting a server

**Via Docker Compose (recommended; includes a healthcheck and `restart: unless-stopped`):**
```bash
cd ~/llama-server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d
```

**Via shell script (with readiness check and test request):**
```bash
cd ~/llama-server
bash run_qwen35b_server_tools.sh
```

#### Switching between two servers

```bash
cd ~/llama-server

# Stop the current one
docker rm -f qwen35b-moe-coding

# Start the other one
docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d
```

### Overview: container names ↔ files

| Container name | Model | Configuration file |
|----------------|-------|--------------------|
| `qwen35b-moe-coding` | Carnice | `docker-compose_Qwen3.6_Tools_coding.yml` |
| `qwen35b-moe-tools` | Carnice | `docker-compose_Qwen3.6_Tools.yml` |
| `qwen35b-moe-rag-longctx` | Carnice | `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` |
| `qwen35b-moe-uncensored` | Uncensored | `docker-compose_Qwen3.6_Uncensored.yml` |
| `qwen35b-moe-uncensored-rag` | Uncensored | `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` |
| `qwen35b-moe-uncensored-rag-longctx` | Uncensored | `run_qwen35b_server_uncensored_rag_longctx.sh` |

---

### How do I shut down a running Docker server, then create and start a new one?

#### Stop and remove (example):
```bash
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml rm -s -f qwen35b
```

#### Create and start (example):
```bash
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
```

---

## Integration with Pi

### How can I pass files, prompts, tools, or MCP servers to these Docker containers with Pi?

This is a question of architecture. The Docker containers are pure inference backends: they receive API requests and return text. Everything else (files, prompts, tools, MCP) is managed at the pi level:

```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘  (API request with system prompt + tools)
```

#### The four routes in detail:

**1. Files → via the pi context**
Pi reads files with the `read` tool and sends their content as text in the prompt. For automatic loading, use AGENTS.md or project-specific context files, which pi reads at session start.

**2. Prompts → AGENTS.md / SYSTEM.md**
Already set up. In addition, there are:
```bash
~/.pi/agent/SYSTEM.md          # replaces the entire system prompt
~/.pi/agent/APPEND_SYSTEM.md   # appended to the end
```

**3. Tools**
Pi has built-in tools (read, write, edit, bash). Custom tools are registered as Pi extensions in the `~/.pi/agent/extensions/` directory; one is already there (fact-checker). The model sees the tool definitions in the system prompt and calls them through the OpenAI function-calling API (which is why `--jinja` matters).

**4. MCP servers → Pi packages**
Pi supports MCP servers as packages. `npm:pi-llama-cpp` is already listed in your `settings.json`. Further MCP servers are added the same way:
```json
{
  "packages": [
    "npm:pi-llama-cpp",
    "npm:@modelcontextprotocol/server-filesystem",
    "npm:irgendein-mcp-server"
  ]
}
```
The MCP server then runs as a process alongside pi, not inside the llama-cpp container.

> **Only exception**: a static system prompt directly in the container
> llama.cpp has a `--system-prompt` flag that bakes in a fixed prompt at server start. It is less flexible than AGENTS.md and conflicts with pi's own system prompt, so it is not recommended.

---

## Server Settings

### Can I change the server settings while the server is running?

**Mostly no.** llama.cpp reads all parameters at startup and allocates the KV cache, the GPU layer distribution, and the context window at fixed sizes. Infrastructure changes require a restart.

#### What can be changed at runtime

**Sampling parameters can be overridden per API request:**
Pi (and any other API client) can override `temperature`, `top_p`, `top_k`, `repeat_penalty`, and `max_tokens` in every single request, independent of the server default:

```http
POST /v1/chat/completions
{
  "temperature": 0.0,
  "top_k": 20,
  "max_tokens": 512,
  ...
}
```

The server default applies only when the request omits the parameter.

**`/props` endpoint**: llama.cpp has an API for changing a few server properties at runtime (e.g. `default_generation_settings`), but it is unstable and pi does not use it.

#### What requires a restart

| Parameter | Reason |
|-----------|--------|
| `-c` (context) | The KV cache is allocated at startup |
| `--parallel` | The number of KV cache slots is fixed |
| `-ngl`, `--tensor-split` | The model is loaded onto the GPUs at startup |
| `--kv-unified`, `--cache-type-*` | The cache structure cannot be changed after loading |
| `--batch-size`, `--ubatch-size` | Internal buffer allocation |
| Model file | Obvious |

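The split between per-request settings and startup flags can be written down as a lookup, which is handy when scripting experiments. A minimal sketch: the function name `needs_restart` and its return codes are hypothetical, the flags come from the table above, and `min_p` is included because the operating manual lists it among the per-request parameters.

```shell
#!/bin/sh
# Classify a setting: returns 0 if changing it needs a server restart,
# 1 if it can be overridden per API request, 2 if the setting is unknown.
needs_restart() {
  case "$1" in
    temperature|top_p|top_k|min_p|repeat_penalty|max_tokens)
      return 1 ;;   # request-level sampling parameters
    -c|--parallel|-ngl|--tensor-split|--kv-unified|--cache-type-*|--batch-size|--ubatch-size|-m)
      return 0 ;;   # fixed when the server starts
    *)
      echo "unknown setting: $1" >&2
      return 2 ;;
  esac
}

# Example (commented out):
#   if needs_restart --parallel; then echo "edit the file, then restart"; fi
```
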
#### Conclusion

To experiment with sampling parameters (temp, top_k, etc.) you do not need a restart; you can test them directly in the pi prompt or via an API call. For everything else: stop the server, edit the file, and start it again.
188
README.md
Normal file
@@ -0,0 +1,188 @@
# Qwen3.6-MoE-35B-A3B Local Inference Server

Local deployment of the **Carnice-Qwen3.6-MoE-35B-A3B** model using llama.cpp with GPU acceleration, optimized for different use cases (coding, RAG, uncensored).

## Overview

This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B model, running on NVIDIA GPUs via llama.cpp. Multiple configurations are available for different workflows:

| Configuration | Description |
|---------------|-------------|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |

**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
**API Endpoint**: http://localhost:8000/v1/chat/completions

## Architecture

```
MCP servers ←──┐
Extensions  ←──┤
AGENTS.md   ←──┤  Pi ──→ llama-cpp Docker ──→ GPU
Files       ←──┘  (API request with system prompt + tools)
```

The Docker containers serve as pure inference backends. All file management, prompts, tools, and MCP servers are handled at the pi level.

## Quick Start

### Using Docker Compose (Recommended)

```bash
# Start RAG-optimized server (default)
docker compose up -d --force-recreate

# Start coding-optimized server
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate

# Stop and remove container
docker compose rm -s -f qwen35b
```

### Using Shell Scripts

#### Server Mode Scripts
```bash
# Start tools server (coding-optimized)
./run_qwen35b_server_tools.sh

# Start RAG-optimized server (uncensored)
./run_qwen35b_server_uncensored_rag_longctx.sh

# Start uncensored server (no RAG)
./run_qwen35b_server_uncensored.sh
```

#### CLI Mode Scripts
```bash
# Start CLI mode for RAG
./run_qwen35b_cli_tools_rag_longctx.sh

# Start CLI mode for uncensored RAG
./run_qwen35b_cli_uncensored_rag_longctx.sh
```

#### Embedding Server
```bash
# Start BGE-M3 embedding server
./run_bge_m3_embedding_server.sh
```

**Note**: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use `docker rm -f <container_name>` to manually stop servers.

## Configuration Details

### Hardware Requirements
- **GPU**: NVIDIA RTX 3090 (2x) or equivalent with 24GB+ VRAM each
- **RAM**: 64GB+ system RAM recommended
- **Storage**: 100GB+ for model files and cache

### GPU Setup
- Primary GPU: device 0 (first 3090)
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
- All layers offloaded to GPU (`-ngl 999`)
- Flash Attention enabled for optimized memory access

### Context & Performance
- **Context window**: 262,144 tokens (256k)
- **Max output**: 16,384 tokens
- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
- **Batch size**: 2,048 for long context processing
- **KV cache**: q8_0 quantization for speed/quality balance

### Sampling Parameters
| Parameter | RAG Mode | Coding Mode |
|-----------|----------|-------------|
| Temperature | 0.2 | 0.3 |
| Top-p | 0.95 | 0.95 |
| Top-k | 40 | 40 |
| Min-p | 0.01 | 0.01 |
| Repeat penalty | 1.05 | 1.05 |

## API Usage

### Chat Completions

```bash
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user", "content": "Erkläre Quantencomputing in 3 Sätzen." }
    ],
    "max_tokens": 1024,
    "temperature": 0.2,
    "stream": false
  }'
```

### Health Check

```bash
curl -fs http://localhost:8000/
```

## Integration with Pi

### Files
Pi reads files using the `read` tool and sends their content as prompt text. For automatic loading, use AGENTS.md or project-specific context files.

### Prompts
Configure via:
- `~/.pi/agent/SYSTEM.md`: replaces the complete system prompt
- `~/.pi/agent/APPEND_SYSTEM.md`: appended to the end of the system prompt

### Tools
Built-in tools (read, write, edit, bash) plus custom extensions in `~/.pi/agent/extensions/`. The model calls them through the OpenAI function-calling API, which is why the servers are started with the `--jinja` flag.

### MCP Servers
Add to `settings.json`:
```json
{
  "packages": [
    "npm:pi-llama-cpp",
    "npm:@modelcontextprotocol/server-filesystem",
    "npm:irgendein-mcp-server"
  ]
}
```

## Troubleshooting

### Server Not Responding
1. Check GPU availability: `nvidia-smi`
2. Verify model file exists: `/home/dschlueter/nvme2n1p7_home/huggingface/models/qwen3/`
3. Check container logs: `docker logs qwen35b-moe-rag-longctx`

### GPU Memory Issues
- Reduce parallel slots from 2 to 1
- Lower batch size from 2048 to 1024
- Use uncensored variant if VRAM is tight

### Connection Refused
- Ensure port 8000 is not in use: `lsof -i :8000`
- Check firewall settings
- Verify container is running: `docker ps | grep qwen35b`

## Maintenance

### Update Model
1. Download the new GGUF file to the HF_HOME path
2. Update the `-m` parameter in docker-compose.yml or the shell script
3. Restart the container

### Backup Configuration
```bash
# Create the backup directory if it does not exist yet
mkdir -p ~/backup
cp ~/.pi/agent/SYSTEM.md ~/backup/
cp ~/.pi/agent/APPEND_SYSTEM.md ~/backup/
cp -r ~/.pi/agent/extensions/ ~/backup/
```

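The backup steps can be collected in one function that creates the destination first and skips paths that do not exist. A minimal sketch; the function name `backup_config` and the dated destination directory in the example are assumptions, not part of the repository.

```shell
#!/bin/sh
# Copy files and directories into a backup directory.
# $1 = destination directory, remaining arguments = paths to back up.
backup_config() {
  dest=$1
  shift
  mkdir -p "$dest" || return 1
  for src in "$@"; do
    if [ -e "$src" ]; then
      cp -R "$src" "$dest"/ || return 1   # -R also handles directories
    else
      echo "skipping missing path: $src" >&2
    fi
  done
}

# Example (commented out); no trailing slash on the directory, so that
# GNU and BSD cp both copy the directory itself:
#   backup_config ~/backup/"$(date +%Y%m%d)" \
#     ~/.pi/agent/SYSTEM.md ~/.pi/agent/APPEND_SYSTEM.md ~/.pi/agent/extensions
```
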
## License

This project uses llama.cpp (MIT) and the Qwen3.6-MoE model. Model usage is subject to the original model's license terms.
1
docker-compose.yml
Symbolic link
@@ -0,0 +1 @@
docker-compose_Qwen3.6_Tools_RAG_faehig.yml
92
docker-compose_Qwen3.6_Tools.yml
Normal file
@@ -0,0 +1,92 @@
services:
  qwen35b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: qwen35b-moe-tools
    restart: unless-stopped

    ports:
      - "8000:8000"

    environment:
      HF_HOME: /hf_home
      NVIDIA_VISIBLE_DEVICES: "1,2"  # on the host: 3090s = 1,2; T600 = 0

    volumes:
      - /home/dschlueter/nvme2n1p7_home/huggingface:/hf_home:ro

    command:
      - -m
      - /hf_home/models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf

      # Context & output

      - -c
      - "262144"  # 256k: ideal for large contexts
      - -n
      - "16384"   # 16k: the cap prevents runaway generation loops

      # Sampler

      - --temp
      - "0.3"     # compromise: low enough for edit-tool precision, varied enough for creative coding
      - --top-p
      - "0.95"    # Qwen recommendation
      - --top-k
      - "40"      # Qwen recommendation
      - --min-p
      - "0.01"    # stabilizes the sampling distribution
      - --repeat-penalty
      - "1.05"    # minimal: prevents repetition loops, barely affects the edit tool

      # GPU / multi-GPU setup

      - --main-gpu
      - "0"        # first 3090 as main GPU inside the container
      - --tensor-split
      - "0.5,0.5"  # symmetric: both 3090s have 24 GB VRAM each
      - -ngl
      - "999"      # offload all layers to the GPU
      - -fa
      - "on"       # Flash Attention: optimized memory access and matmul

      # KV cache

      - --kv-unified
      - --cache-type-k
      - q8_0       # good speed/quality trade-off
      - --cache-type-v
      - q8_0

      # Batching & parallelism

      - --batch-size
      - "2048"     # large prompt batch: faster processing of long contexts
      - --ubatch-size
      - "512"      # matched to batch-size
      - --parallel
      - "2"        # 2 parallel slots for single-user use: saves ~10 GB of KV cache
      - --cont-batching  # enable continuous batching

      # Server

      - --jinja
      - --no-context-shift
      - --host
      - 0.0.0.0
      - --port
      - "8000"

    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8000/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s  # the 35B model takes longer to load

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2"]  # both 3090s (T600 = 0, not used)
              capabilities: [gpu]
92
docker-compose_Qwen3.6_Tools_RAG_faehig.yml
Normal file
@ -0,0 +1,92 @@
services:
  qwen35b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: qwen35b-moe-rag-longctx
    restart: unless-stopped

    ports:
      - "8000:8000"

    environment:
      HF_HOME: /hf_home
      NVIDIA_VISIBLE_DEVICES: "1,2"  # host GPU indices: 3090s = 1,2; T600 = 0

    volumes:
      - /home/dschlueter/nvme2n1p7_home/huggingface:/hf_home:ro

    command:
      - -m
      - /hf_home/models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf

      # Context & output

      - -c
      - "262144"  # 256k: ideal for RAG with long retrieval contexts
      - -n
      - "16384"  # 16k: the cap prevents runaway generation loops

      # Sampler

      - --temp
      - "0.2"  # lower than Tools_coding: RAG needs factual, precise answers
      - --top-p
      - "0.95"  # Qwen recommendation
      - --top-k
      - "40"  # Qwen recommendation (0 = disabled would be too diffuse)
      - --min-p
      - "0.01"  # drops tokens below 1% of the top token's probability, which stabilizes sampling
      - --repeat-penalty
      - "1.05"  # minimal: prevents repetition loops

      # GPU / multi-GPU setup

      - --main-gpu
      - "0"  # first 3090 is the main GPU inside the container
      - --tensor-split
      - "0.5,0.5"  # symmetric: both 3090s have 24 GB VRAM each
      - -ngl
      - "999"  # offload all layers to the GPUs
      - -fa
      - "on"  # Flash Attention: optimized memory access and matmul

      # KV cache

      - --kv-unified
      - --cache-type-k
      - q8_0  # good speed/quality trade-off
      - --cache-type-v
      - q8_0

      # Batching & parallelism

      - --batch-size
      - "2048"  # large prompt batch: faster processing of long RAG contexts
      - --ubatch-size
      - "512"  # matched to batch-size
      - --parallel
      - "2"  # 2 parallel slots for single-user use: saves ~10 GB of KV cache
      - --cont-batching  # enable continuous batching

      # Server

      - --jinja
      - --no-context-shift
      - --host
      - 0.0.0.0
      - --port
      - "8000"

    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8000/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2"]  # both 3090s (T600 = 0, unused)
              capabilities: [gpu]
91
docker-compose_Qwen3.6_Tools_coding.yml
Normal file
@ -0,0 +1,91 @@
services:
  qwen35b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: qwen35b-moe-coding
    restart: unless-stopped

    ports:
      - "8000:8000"

    environment:
      HF_HOME: /hf_home
      NVIDIA_VISIBLE_DEVICES: "1,2"  # host GPU indices: 3090s = 1,2; T600 = 0

    volumes:
      - /home/dschlueter/nvme2n1p7_home/huggingface:/hf_home:ro

    command:
      - -m
      - /hf_home/models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf

      # Context & output

      - -c
      - "262144"  # 256k: ideal for large code projects with many files in context
      - -n
      - "16384"  # 16k: enough for complex classes, whole files, long explanations

      # Sampler

      - --temp
      - "0.3"  # compromise: low enough for precise edit-tool calls, varied enough for creative coding
      - --top-p
      - "0.95"  # Qwen recommendation
      - --top-k
      - "40"  # Qwen recommendation
      - --min-p
      - "0.01"  # drops tokens below 1% of the top token's probability, which stabilizes sampling
      - --repeat-penalty
      - "1.05"  # minimal: prevents repetition loops, barely affects edit-tool output

      # GPU / multi-GPU setup
      - --main-gpu
      - "0"  # first 3090 is the main GPU inside the container
      - --tensor-split
      - "0.5,0.5"  # symmetric: both 3090s have 24 GB VRAM each
      - -ngl
      - "999"  # offload all layers to the GPUs
      - -fa
      - "on"  # Flash Attention: optimized memory access and matmul

      # KV cache
      - --kv-unified
      - --cache-type-k
      - q8_0  # good speed/quality trade-off
      - --cache-type-v
      - q8_0

      # Batching & parallelism
      - --batch-size
      - "2048"  # large prompt batch: faster processing of long file contexts
      - --ubatch-size
      - "512"  # matched to batch-size
      - --parallel
      - "2"  # 2 parallel slots for single-user use: saves ~10 GB of KV cache vs. 4
      - --cont-batching  # enable continuous batching


      # Server

      - --jinja
      - --no-context-shift
      - --host
      - 0.0.0.0
      - --port
      - "8000"

    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8000/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2"]
              capabilities: [gpu]
92
docker-compose_Qwen3.6_Uncensored.yml
Normal file
@ -0,0 +1,92 @@
services:
  qwen35b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: qwen35b-moe-uncensored
    restart: unless-stopped

    ports:
      - "8000:8000"

    environment:
      HF_HOME: /hf_home
      NVIDIA_VISIBLE_DEVICES: "1,2"  # host GPU indices: 3090s = 1,2; T600 = 0

    volumes:
      - /home/dschlueter/nvme2n1p7_home/huggingface:/hf_home:ro

    command:
      - -m
      - /hf_home/models/qwen3/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf

      # Context & output

      - -c
      - "262144"  # 256k: ideal for large contexts
      - -n
      - "16384"  # 16k: the cap prevents runaway generation loops

      # Sampler

      - --temp
      - "0.6"  # higher than Tools_coding: uncensored model for more creative tasks
      - --top-p
      - "0.95"  # Qwen recommendation
      - --top-k
      - "40"  # Qwen recommendation
      - --min-p
      - "0.01"  # drops tokens below 1% of the top token's probability, which stabilizes sampling
      - --repeat-penalty
      - "1.05"  # minimal: prevents repetition loops

      # GPU / multi-GPU setup

      - --main-gpu
      - "0"  # first 3090 is the main GPU inside the container
      - --tensor-split
      - "0.5,0.5"  # symmetric: both 3090s have 24 GB VRAM each
      - -ngl
      - "999"  # offload all layers to the GPUs
      - -fa
      - "on"  # Flash Attention: optimized memory access and matmul

      # KV cache

      - --kv-unified
      - --cache-type-k
      - q8_0  # good speed/quality trade-off
      - --cache-type-v
      - q8_0

      # Batching & parallelism

      - --batch-size
      - "2048"  # large prompt batch: faster processing of long contexts
      - --ubatch-size
      - "512"  # matched to batch-size
      - --parallel
      - "2"  # 2 parallel slots for single-user use: saves ~10 GB of KV cache
      - --cont-batching  # enable continuous batching

      # Server

      - --jinja
      - --no-context-shift
      - --host
      - 0.0.0.0
      - --port
      - "8000"

    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8000/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s  # the 35B model takes longer to load

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2"]  # both 3090s (T600 = 0, unused)
              capabilities: [gpu]
92
docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml
Normal file
@ -0,0 +1,92 @@
services:
  qwen35b:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda
    container_name: qwen35b-moe-uncensored-rag  # unique name, no conflict with RAG_faehig
    restart: unless-stopped

    ports:
      - "8000:8000"

    environment:
      HF_HOME: /hf_home
      NVIDIA_VISIBLE_DEVICES: "1,2"  # host GPU indices: 3090s = 1,2; T600 = 0

    volumes:
      - /home/dschlueter/nvme2n1p7_home/huggingface:/hf_home:ro

    command:
      - -m
      - /hf_home/models/qwen3/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf

      # Context & output

      - -c
      - "262144"  # 256k: ideal for RAG with long retrieval contexts
      - -n
      - "16384"  # 16k: the cap prevents runaway generation loops

      # Sampler

      - --temp
      - "0.2"  # low: RAG needs factual, precise answers
      - --top-p
      - "0.95"  # Qwen recommendation
      - --top-k
      - "40"  # Qwen recommendation (0 = disabled would be too diffuse)
      - --min-p
      - "0.01"  # drops tokens below 1% of the top token's probability, which stabilizes sampling
      - --repeat-penalty
      - "1.05"  # minimal: prevents repetition loops

      # GPU / multi-GPU setup

      - --main-gpu
      - "0"  # first 3090 is the main GPU inside the container
      - --tensor-split
      - "0.5,0.5"  # symmetric: both 3090s have 24 GB VRAM each
      - -ngl
      - "999"  # offload all layers to the GPUs
      - -fa
      - "on"  # Flash Attention: optimized memory access and matmul

      # KV cache

      - --kv-unified
      - --cache-type-k
      - q8_0  # good speed/quality trade-off
      - --cache-type-v
      - q8_0

      # Batching & parallelism

      - --batch-size
      - "2048"  # large prompt batch: faster processing of long RAG contexts
      - --ubatch-size
      - "512"  # matched to batch-size
      - --parallel
      - "2"  # 2 parallel slots for single-user use: saves ~10 GB of KV cache
      - --cont-batching  # enable continuous batching

      # Server

      - --jinja
      - --no-context-shift
      - --host
      - 0.0.0.0
      - --port
      - "8000"

    healthcheck:
      test: ["CMD-SHELL", "curl -fs http://localhost:8000/ || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 120s

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1", "2"]  # both 3090s (T600 = 0, unused)
              capabilities: [gpu]
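The five compose profiles share almost all flags and differ mainly in model file, container name, and `--temp`. A quick way to compare them is to read the list item that follows a given flag in the `command:` list. The sketch below is only an illustration: `flag_value` is a hypothetical helper, and it assumes each flag and its value sit on consecutive `- ...` lines, as they do in these files.

```shell
#!/usr/bin/env bash
# Sketch: print the value that follows a flag in a compose "command:" list.
# flag_value is a hypothetical helper, not part of the repository.
flag_value() {
  local flag="$1" file="$2"
  awk -v flag="$flag" '
    found {
      sub(/^[[:space:]]*-[[:space:]]*/, "")   # strip the YAML list dash
      sub(/[[:space:]]*#.*$/, "")             # strip a trailing comment
      sub(/[[:space:]]+$/, "")                # strip trailing whitespace
      gsub(/"/, "")                           # strip quotes
      print
      exit
    }
    $1 == "-" && $2 == flag { found = 1 }     # next line holds the value
  ' "$file"
}
```

Run over all profiles, for example with `for f in docker-compose_Qwen3.6_*.yml; do printf '%s\t%s\n' "$f" "$(flag_value --temp "$f")"; done`, this would list the temperature per profile. It is a line-based shortcut, not a YAML parser, so it would break on a layout that puts the flag and its value on one line.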
81
run_bge_m3_embedding_server.sh
Executable file
@ -0,0 +1,81 @@
#!/usr/bin/env bash
set -euo pipefail

# Configuration
HF_HOME="${HF_HOME:-/home/dschlueter/nvme2n1p7_home/huggingface}"
MODEL_REL_PATH="models/embeddings/bge-m3-q8_0.gguf"
IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CONTAINER_NAME="qwen-embeddings"
HOST_PORT=8001
CONTAINER_PORT=8001

echo "[*] Using HF_HOME = $HF_HOME"
if [ ! -f "$HF_HOME/$MODEL_REL_PATH" ]; then
  echo "[!] Embedding model file not found: $HF_HOME/$MODEL_REL_PATH" >&2
  exit 1
fi

# Optional: remove an existing container with the same name
if docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}\$"; then
  echo "[*] Removing existing container $CONTAINER_NAME ..."
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

echo "[*] Starting llama.cpp embedding server container ($IMAGE) ..."

docker run -d --gpus '"device=0"' \
  --name "$CONTAINER_NAME" \
  -e HF_HOME="/hf_home" \
  -v "$HF_HOME:/hf_home:ro" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  "$IMAGE" \
  --embedding \
  -m "/hf_home/${MODEL_REL_PATH}" \
  -c 8192 \
  -ngl 999 \
  -fa on \
  --batch-size 1024 \
  --ubatch-size 512 \
  --host 0.0.0.0 \
  --port "$CONTAINER_PORT"

echo "[*] Container started: $CONTAINER_NAME"
echo "[*] Waiting for HTTP port ${HOST_PORT} to respond ..."

READY=0
for i in {1..60}; do
  if curl -s "http://localhost:${HOST_PORT}/" >/dev/null 2>&1; then
    echo "[*] Server is responding at http://localhost:${HOST_PORT}/"
    READY=1
    break
  fi
  echo "[*] Waiting (${i}/60) ..."
  sleep 2
done

if [ "$READY" -ne 1 ]; then
  echo "[!] Embedding server did not become reachable in time." >&2
  echo "[*] Last container logs:"
  docker logs --tail 200 "$CONTAINER_NAME" || true
  exit 1
fi

sleep 3

echo "[*] Sending test embedding request to /v1/embeddings ..."

RESPONSE="$(curl -s -X POST "http://localhost:${HOST_PORT}/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-m3-q8_0",
    "input": "Dies ist ein kurzer Testtext für den Embedding-Server."
  }')"

echo
echo "[*] Response from server:"
echo "$RESPONSE"

echo
echo "[*] To stop the server:"
echo "    docker rm -f $CONTAINER_NAME"

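The cleanup step in this script depends on matching the container name exactly: a regex with a mis-escaped `$` anchor matches nothing, and an unanchored one would also match names such as `qwen-embeddings-old`. A minimal sketch of an exact, fixed-string match follows. `name_listed` is a hypothetical helper that reads candidate names from standard input; it is an illustration, not the script's own code.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if NAME appears as a complete line on stdin.
# name_listed is a hypothetical helper, not part of the repository.
name_listed() {
  # -q: quiet, -x: whole-line match, -F: fixed string (no regex escaping needed)
  grep -qxF -- "$1"
}
```

It would be used as `docker ps -a --format '{{.Names}}' | name_listed "$CONTAINER_NAME"`. Since `-F` treats the name literally, no `^` or `$` anchors have to survive shell quoting. An alternative that avoids the pipe altogether would be checking the exit status of `docker container inspect "$CONTAINER_NAME"`.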
25
run_qwen35b_cli_tools_rag_longctx.sh
Executable file
@ -0,0 +1,25 @@
#!/usr/bin/env bash
docker run --rm -it \
  --gpus '"device=1,2"' \
  -p 8000:8000 \
  -v "${HF_HOME:?HF_HOME must be set}/models/qwen3:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf \
  -c 262144 \
  -n 16384 \
  --jinja \
  --no-context-shift \
  --temp 0.2 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5 \
  -ngl 999 \
  -fa on \
  --kv-unified \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 2 \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8000
25
run_qwen35b_cli_uncensored_rag_longctx.sh
Executable file
@ -0,0 +1,25 @@
#!/usr/bin/env bash
docker run --rm -it \
  --gpus '"device=1,2"' \
  -p 8000:8000 \
  -v "${HF_HOME:?HF_HOME must be set}/models/qwen3:/models" \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf \
  -c 262144 \
  -n 16384 \
  --jinja \
  --no-context-shift \
  --temp 0.2 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5 \
  -ngl 999 \
  -fa on \
  --kv-unified \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 2 \
  --cont-batching \
  --host 0.0.0.0 \
  --port 8000
89
run_qwen35b_server_tools.sh
Executable file
@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -euo pipefail

# Configuration
HF_HOME="${HF_HOME:-/home/dschlueter/nvme2n1p7_home/huggingface}"
MODEL_REL_PATH="models/qwen3/Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf"
IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CONTAINER_NAME="qwen35b-moe-tools"
HOST_PORT=8000
CONTAINER_PORT=8000

echo "[*] Using HF_HOME = $HF_HOME"
if [ ! -f "$HF_HOME/$MODEL_REL_PATH" ]; then
  echo "[!] Model file not found: $HF_HOME/$MODEL_REL_PATH" >&2
  exit 1
fi

# Optional: remove an existing container with the same name
if docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}\$"; then
  echo "[*] Removing existing container $CONTAINER_NAME ..."
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

echo "[*] Starting llama.cpp server container ($IMAGE) ..."

docker run -d \
  --gpus '"device=1,2"' \
  --name "$CONTAINER_NAME" \
  --restart unless-stopped \
  -e HF_HOME="/hf_home" \
  -v "$HF_HOME:/hf_home:ro" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  "$IMAGE" \
  -m "/hf_home/${MODEL_REL_PATH}" \
  -c 262144 \
  -n 16384 \
  --temp 0.3 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5 \
  -ngl 999 \
  -fa on \
  --kv-unified \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 2 \
  --cont-batching \
  --jinja \
  --no-context-shift \
  --host 0.0.0.0 \
  --port "$CONTAINER_PORT"

echo "[*] Container started: $CONTAINER_NAME"
echo "[*] Waiting for HTTP port ${HOST_PORT} to respond ..."

for i in {1..60}; do
  if curl -s "http://localhost:${HOST_PORT}/" >/dev/null 2>&1; then
    echo "[*] Server is responding at http://localhost:${HOST_PORT}/"
    break
  fi
  echo "[*] Waiting (${i}/60) ..."
  sleep 2
done

sleep 5

echo "[*] Sending test chat request to /v1/chat/completions ..."

RESPONSE="$(curl -s -X POST "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user", "content": "Gib eine sehr kurze Selbstdiagnose deiner Fähigkeiten." }
    ],
    "max_tokens": 64,
    "temperature": 0.3,
    "stream": false
  }')"

echo
echo "[*] Response from server:"
echo "$RESPONSE"

echo
echo "[*] To stop the server:"
echo "    docker rm -f $CONTAINER_NAME"
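The polling loop in this script moves on to the test request even when the server never answered within the 60 attempts. A minimal sketch of a retry helper that reports a timeout through its exit status is shown here. `wait_until` is a hypothetical name, and the `/health` endpoint in the usage note is an assumption about the llama.cpp server image in use.

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or the attempts run out.
# wait_until is a hypothetical helper, not part of the repository.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0            # command succeeded
    fi
    sleep "$delay"        # wait before the next attempt
  done
  return 1                # all attempts failed
}
```

In the script this could replace the `for` loop, for example `wait_until 60 2 curl -fs -o /dev/null "http://localhost:${HOST_PORT}/health" || { docker logs --tail 200 "$CONTAINER_NAME"; exit 1; }`. With `-f`, curl fails on HTTP error codes, so a server that is still loading the model would count as not ready.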
89
run_qwen35b_server_uncensored.sh
Executable file
@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -euo pipefail

# Configuration
HF_HOME="${HF_HOME:-/home/dschlueter/nvme2n1p7_home/huggingface}"
MODEL_REL_PATH="models/qwen3/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf"
IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CONTAINER_NAME="qwen35b-moe-uncensored"
HOST_PORT=8000
CONTAINER_PORT=8000

echo "[*] Using HF_HOME = $HF_HOME"
if [ ! -f "$HF_HOME/$MODEL_REL_PATH" ]; then
  echo "[!] Model file not found: $HF_HOME/$MODEL_REL_PATH" >&2
  exit 1
fi

# Optional: remove an existing container with the same name
if docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}\$"; then
  echo "[*] Removing existing container $CONTAINER_NAME ..."
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

echo "[*] Starting llama.cpp server container ($IMAGE) ..."

docker run -d \
  --gpus '"device=1,2"' \
  --name "$CONTAINER_NAME" \
  --restart unless-stopped \
  -e HF_HOME="/hf_home" \
  -v "$HF_HOME:/hf_home:ro" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  "$IMAGE" \
  -m "/hf_home/${MODEL_REL_PATH}" \
  -c 262144 \
  -n 16384 \
  --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.05 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5 \
  -ngl 999 \
  -fa on \
  --kv-unified \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 2 \
  --cont-batching \
  --jinja \
  --no-context-shift \
  --host 0.0.0.0 \
  --port "$CONTAINER_PORT"

echo "[*] Container started: $CONTAINER_NAME"
echo "[*] Waiting for HTTP port ${HOST_PORT} to respond ..."

for i in {1..60}; do
  if curl -s "http://localhost:${HOST_PORT}/" >/dev/null 2>&1; then
    echo "[*] Server is responding at http://localhost:${HOST_PORT}/"
    break
  fi
  echo "[*] Waiting (${i}/60) ..."
  sleep 2
done

sleep 5

echo "[*] Sending test chat request to /v1/chat/completions ..."

RESPONSE="$(curl -s -X POST "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
      { "role": "user", "content": "Gib eine sehr kurze Selbstdiagnose deiner Fähigkeiten." }
    ],
    "max_tokens": 64,
    "temperature": 0.6,
    "stream": false
  }')"

echo
echo "[*] Response from server:"
echo "$RESPONSE"

echo
echo "[*] To stop the server:"
echo "    docker rm -f $CONTAINER_NAME"
138
run_qwen35b_server_uncensored_rag_longctx.sh
Executable file
@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -euo pipefail

# Configuration
HF_HOME="${HF_HOME:-/home/dschlueter/nvme2n1p7_home/huggingface}"
MODEL_REL_PATH="models/qwen3/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf"
IMAGE="ghcr.io/ggml-org/llama.cpp:server-cuda"
CONTAINER_NAME="qwen35b-moe-uncensored-rag-longctx"
HOST_PORT=8000
CONTAINER_PORT=8000

echo "[*] Using HF_HOME = $HF_HOME"
if [ ! -f "$HF_HOME/$MODEL_REL_PATH" ]; then
  echo "[!] Model file not found: $HF_HOME/$MODEL_REL_PATH" >&2
  exit 1
fi

# Optional: remove an existing container with the same name
if docker ps -a --format '{{.Names}}' | grep -q "^${CONTAINER_NAME}\$"; then
  echo "[*] Removing existing container $CONTAINER_NAME ..."
  docker rm -f "$CONTAINER_NAME" >/dev/null 2>&1 || true
fi

echo "[*] Starting llama.cpp server container ($IMAGE) ..."
echo "[*] Mode: uncensored, RAG-capable, long context"

docker run -d \
  --gpus '"device=1,2"' \
  --name "$CONTAINER_NAME" \
  --restart unless-stopped \
  -e HF_HOME="/hf_home" \
  -v "$HF_HOME:/hf_home:ro" \
  -p "${HOST_PORT}:${CONTAINER_PORT}" \
  "$IMAGE" \
  -m "/hf_home/${MODEL_REL_PATH}" \
  -c 262144 \
  -n 16384 \
  --jinja \
  --no-context-shift \
  --temp 0.2 \
  --top-p 0.95 \
  --top-k 40 \
  --min-p 0.01 \
  --repeat-penalty 1.05 \
  --main-gpu 0 \
  --tensor-split 0.5,0.5 \
  -ngl 999 \
  -fa on \
  --kv-unified \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --batch-size 2048 \
  --ubatch-size 512 \
  --parallel 2 \
  --cont-batching \
  --host 0.0.0.0 \
  --port "$CONTAINER_PORT"

echo "[*] Container started: $CONTAINER_NAME"
echo "[*] Waiting for HTTP port ${HOST_PORT} to respond ..."

HTTP_READY=0
for i in {1..90}; do
  if curl -s "http://localhost:${HOST_PORT}/" >/dev/null 2>&1; then
    echo "[*] Server is responding at http://localhost:${HOST_PORT}/"
    HTTP_READY=1
    break
  fi
  echo "[*] Waiting (${i}/90) for HTTP ..."
  sleep 2
done

if [ "$HTTP_READY" -ne 1 ]; then
  echo "[!] HTTP server did not become reachable in time." >&2
  echo "[*] Last container logs:"
  docker logs --tail 200 "$CONTAINER_NAME" || true
  exit 1
fi

echo "[*] Waiting until the model has actually finished loading ..."

MODEL_READY=0
for i in {1..180}; do
  HTTP_CODE="$(curl -s -o "/tmp/${CONTAINER_NAME}_ready.json" -w "%{http_code}" \
    -X POST "http://localhost:${HOST_PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen3.6-35b-a3b-moe-rag-longctx",
      "messages": [
        { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent." },
        { "role": "user", "content": "Antworte nur mit dem Wort: bereit" }
      ],
      "max_tokens": 8,
      "temperature": 0.0,
      "stream": false
    }' || true)"

  BODY="$(cat "/tmp/${CONTAINER_NAME}_ready.json" 2>/dev/null || true)"

  if [ "$HTTP_CODE" = "200" ]; then
    echo "[*] Model is loaded and responding."
    MODEL_READY=1
    break
  fi

  echo "[*] Waiting (${i}/180) for model ... HTTP ${HTTP_CODE} - ${BODY}"
  sleep 5
done

if [ "$MODEL_READY" -ne 1 ]; then
  echo "[!] Model did not become ready in time." >&2
  echo "[*] Last container logs:"
  docker logs --tail 200 "$CONTAINER_NAME" || true
  exit 1
fi

echo "[*] Sending final test chat request to /v1/chat/completions ..."

RESPONSE="$(curl -s -X POST "http://localhost:${HOST_PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-a3b-moe-rag-longctx",
    "messages": [
      { "role": "system", "content": "Du bist ein hilfreicher deutscher Assistent für RAG-gestützte Wissensarbeit." },
      { "role": "user", "content": "Antworte in einem Satz: Der Server für sehr langen Kontext ist betriebsbereit." }
    ],
    "max_tokens": 64,
    "temperature": 0.2,
    "stream": false
  }')"

echo
echo "[*] Response from server:"
echo "$RESPONSE"

echo
echo "[*] To stop the server:"
echo "    docker rm -f $CONTAINER_NAME"