diff --git a/BEDIENUNGSANLEITUNG.md b/BEDIENUNGSANLEITUNG.md
index cbbad76..6df98d2 100644
--- a/BEDIENUNGSANLEITUNG.md
+++ b/BEDIENUNGSANLEITUNG.md
@@ -73,6 +73,8 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d --force-recreate
 ./run_bge_m3_embedding_server.sh
 ```
 
+**Hinweis**: Die Qwopus3.6-Variante wird über Docker Compose gestartet, da sie multimodale Unterstützung benötigt (mmproj-Datei). Container-Name: `qwopus35b-moe-coding`.
+
 **Hinweis**: Alle Shell-Skripte stoppen automatisch existierende Container gleichen Namens vor dem Start.
 
 ## Server-Verwaltung
@@ -87,6 +89,7 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d --force-recreate
 | qwen35b-moe-coding | Carnice | docker-compose_Qwen3.6_Tools_coding.yml |
 | qwen35b-moe-tools | Carnice | docker-compose_Qwen3.6_Tools.yml |
 | qwen35b-moe-rag-longctx | Carnice | docker-compose_Qwen3.6_Tools_RAG_faehig.yml |
+| qwopus35b-moe-coding | Qwopus3.6 | docker-compose_Qwen3.6_Qwopus3.6_coding.yml |
 | qwen35b-moe-uncensored | Uncensored | docker-compose_Qwen3.6_Uncensored.yml |
 | qwen35b-moe-uncensored-rag | Uncensored | docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml |
 | qwen35b-moe-uncensored-rag-longctx | Uncensored | run_qwen35b_server_uncensored_rag_longctx.sh |
@@ -144,6 +147,11 @@ KV-Cache:
 Unified Cache: --kv-unified
 ```
 
+**Qwopus3.6-Spezifikationen:**
+- **Parallel-Slots**: 4 (statt 2) — KV-Cache ~2.5 GB/Slot, 4 Slots machbar
+- **Micro-Batch-Größe**: 1024 (statt 512) — SSM-Layer verarbeitet Micro-Batches effizienter
+- **Multimodale Unterstützung**: Erfordert mmproj-Datei (siehe docker-compose für Konfiguration)
+
 ### Kontext- und Performance-Parameter
 | Parameter | Wert | Beschreibung |
 |-----------|------|--------------|
@@ -173,6 +181,21 @@ min-p: 0.01
 repeat-penalty: 1.05
 ```
 
+#### Qwopus3.6-Modus
+```yaml
+temperature: 0.3 # Kompromiss für Kreativität und Präzision
+top-p: 0.95
+top-k: 40
+min-p: 0.01
+repeat-penalty: 1.05
+```
+
+**Qwopus3.6-Spezifikationen:**
+- **Multimodale Unterstützung**: Erfordert mmproj-Datei (siehe docker-compose für Konfiguration)
+- **Parallel-Slots**: 4 (statt 2) — KV-Cache ~2.5 GB/Slot, 4 Slots machbar
+- **Micro-Batch-Größe**: 1024 (statt 512) — SSM-Layer verarbeitet Micro-Batches effizienter
+- **Container-Name**: `qwopus35b-moe-coding` (vermeidet Konflikt mit Standard-Coding-Container)
+
 ### Laufzeit-Parameter (ohne Neustart)
 Diese Parameter können pro API-Request überschrieben werden:
 - `temperature`
diff --git a/FAQs.md b/FAQs.md
index caec3d5..4ebb89d 100644
--- a/FAQs.md
+++ b/FAQs.md
@@ -53,6 +53,7 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d
 | `qwen35b-moe-coding` | Carnice | `docker-compose_Qwen3.6_Tools_coding.yml` |
 | `qwen35b-moe-tools` | Carnice | `docker-compose_Qwen3.6_Tools.yml` |
 | `qwen35b-moe-rag-longctx` | Carnice | `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` |
+| `qwopus35b-moe-coding` | Qwopus3.6 | `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` |
 | `qwen35b-moe-uncensored` | Uncensored | `docker-compose_Qwen3.6_Uncensored.yml` |
 | `qwen35b-moe-uncensored-rag` | Uncensored | `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` |
 | `qwen35b-moe-uncensored-rag-longctx` | Uncensored | `run_qwen35b_server_uncensored_rag_longctx.sh` |
diff --git a/README.md b/README.md
index b60f982..a6ea1b1 100644
--- a/README.md
+++ b/README.md
@@ -10,10 +10,12 @@ This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B
 |---------------|-------------|
 | `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
 | `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
+| `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` | Qwopus3.6 coding variant with multimodal support |
 | `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
 | `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |
 
 **Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
+**Qwopus Model**: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)
 **Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
 **Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
 **API Endpoint**: http://localhost:8000/v1/chat/completions
@@ -40,6 +42,9 @@ docker compose up -d --force-recreate
 # Start coding-optimized server
 docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
 
+# Start Qwopus3.6 coding variant (multimodal)
+docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate
+
 # Stop and remove container
 docker compose rm -s -f qwen35b
 ```
@@ -73,6 +78,8 @@ docker compose rm -s -f qwen35b
 ./run_bge_m3_embedding_server.sh
 ```
 
+**Note**: The Qwopus3.6 variant requires Docker Compose for startup due to multimodal support (mmproj file). Container name: `qwopus35b-moe-coding`.
+
 **Note**: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use `docker rm -f ` to manually stop servers.
 
 ## Configuration Details
@@ -87,22 +94,30 @@ docker compose rm -s -f qwen35b
 - Tensor split: 0.5,0.5 (symmetric across both GPUs)
 - All layers offloaded to GPU (`-ngl 999`)
 - Flash Attention enabled for optimized memory access
+- **Qwopus3.6**: Uses 4 parallel slots (~2.5 GB KV-Cache per slot)
 
 ### Context & Performance
 - **Context window**: 262,144 tokens (256k)
 - **Max output**: 16,384 tokens
-- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
+- **Parallel slots**: 2 (saves ~10GB KV cache vs 4) — standard; Qwopus3.6 uses 4 slots
 - **Batch size**: 2,048 for long context processing
+- **Micro-batch size**: 512 (standard); Qwopus3.6 uses 1024 for SSM-Layer efficiency
 - **KV cache**: q8_0 quantization for speed/quality balance
 
 ### Sampling Parameters
-| Parameter | RAG Mode | Coding Mode |
-|-----------|----------|-------------|
-| Temperature | 0.2 | 0.3 |
-| Top-p | 0.95 | 0.95 |
-| Top-k | 40 | 40 |
-| Min-p | 0.01 | 0.01 |
-| Repeat penalty | 1.05 | 1.05 |
+| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
+|-----------|----------|-------------|-----------|
+| Temperature | 0.2 | 0.3 | 0.3 |
+| Top-p | 0.95 | 0.95 | 0.95 |
+| Top-k | 40 | 40 | 40 |
+| Min-p | 0.01 | 0.01 | 0.01 |
+| Repeat penalty | 1.05 | 1.05 | 1.05 |
+
+### Qwopus3.6 Specifics
+- **Multimodal support**: Requires mmproj file (see docker-compose for configuration)
+- **Parallel slots**: 4 (vs 2 in standard) — KV-Cache ~2.5 GB/Slot, 4 slots feasible
+- **Micro-batch size**: 1024 (vs 512) — SSM-Layer processes micro-batches more efficiently
+- **Container name**: `qwopus35b-moe-coding` (avoids conflict with standard coding container)
 
 ## API Usage
 