Update documentation: Add Qwopus3.6 coding variant with multimodal support
This commit is contained in:
parent
260cb22740
commit
3e60a072b4
3 changed files with 47 additions and 8 deletions
|
|
@ -73,6 +73,8 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d --force-recreate
|
||||||
./run_bge_m3_embedding_server.sh
|
./run_bge_m3_embedding_server.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Hinweis**: Die Qwopus3.6-Variante wird über Docker Compose gestartet, da sie multimodale Unterstützung benötigt (mmproj-Datei). Container-Name: `qwopus35b-moe-coding`.
|
||||||
|
|
||||||
**Hinweis**: Alle Shell-Skripte stoppen automatisch existierende Container gleichen Namens vor dem Start.
|
**Hinweis**: Alle Shell-Skripte stoppen automatisch existierende Container gleichen Namens vor dem Start.
|
||||||
|
|
||||||
## Server-Verwaltung
|
## Server-Verwaltung
|
||||||
|
|
@ -87,6 +89,7 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d --force-recreate
|
||||||
| qwen35b-moe-coding | Carnice | docker-compose_Qwen3.6_Tools_coding.yml |
|
| qwen35b-moe-coding | Carnice | docker-compose_Qwen3.6_Tools_coding.yml |
|
||||||
| qwen35b-moe-tools | Carnice | docker-compose_Qwen3.6_Tools.yml |
|
| qwen35b-moe-tools | Carnice | docker-compose_Qwen3.6_Tools.yml |
|
||||||
| qwen35b-moe-rag-longctx | Carnice | docker-compose_Qwen3.6_Tools_RAG_faehig.yml |
|
| qwen35b-moe-rag-longctx | Carnice | docker-compose_Qwen3.6_Tools_RAG_faehig.yml |
|
||||||
|
| qwopus35b-moe-coding | Qwopus3.6 | docker-compose_Qwen3.6_Qwopus3.6_coding.yml |
|
||||||
| qwen35b-moe-uncensored | Uncensored | docker-compose_Qwen3.6_Uncensored.yml |
|
| qwen35b-moe-uncensored | Uncensored | docker-compose_Qwen3.6_Uncensored.yml |
|
||||||
| qwen35b-moe-uncensored-rag | Uncensored | docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml |
|
| qwen35b-moe-uncensored-rag | Uncensored | docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml |
|
||||||
| qwen35b-moe-uncensored-rag-longctx | Uncensored | run_qwen35b_server_uncensored_rag_longctx.sh |
|
| qwen35b-moe-uncensored-rag-longctx | Uncensored | run_qwen35b_server_uncensored_rag_longctx.sh |
|
||||||
|
|
@ -144,6 +147,11 @@ KV-Cache:
|
||||||
Unified Cache: --kv-unified
|
Unified Cache: --kv-unified
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Qwopus3.6-Spezifikationen:**
|
||||||
|
- **Parallel-Slots**: 4 (statt 2) — KV-Cache ~2.5 GB/Slot, 4 Slots machbar
|
||||||
|
- **Micro-Batch-Größe**: 1024 (statt 512) — SSM-Layer verarbeitet Micro-Batches effizienter
|
||||||
|
- **Multimodale Unterstützung**: Erfordert mmproj-Datei (siehe docker-compose für Konfiguration)
|
||||||
|
|
||||||
### Kontext- und Performance-Parameter
|
### Kontext- und Performance-Parameter
|
||||||
| Parameter | Wert | Beschreibung |
|
| Parameter | Wert | Beschreibung |
|
||||||
|-----------|------|--------------|
|
|-----------|------|--------------|
|
||||||
|
|
@ -173,6 +181,21 @@ min-p: 0.01
|
||||||
repeat-penalty: 1.05
|
repeat-penalty: 1.05
|
||||||
```
|
```
|
||||||
|
|
||||||
|
#### Qwopus3.6-Modus
|
||||||
|
```yaml
|
||||||
|
temperature: 0.3 # Kompromiss für Kreativität und Präzision
|
||||||
|
top-p: 0.95
|
||||||
|
top-k: 40
|
||||||
|
min-p: 0.01
|
||||||
|
repeat-penalty: 1.05
|
||||||
|
```
|
||||||
|
|
||||||
|
**Qwopus3.6-Spezifikationen:**
|
||||||
|
- **Multimodale Unterstützung**: Erfordert mmproj-Datei (siehe docker-compose für Konfiguration)
|
||||||
|
- **Parallel-Slots**: 4 (statt 2) — KV-Cache ~2.5 GB/Slot, 4 Slots machbar
|
||||||
|
- **Micro-Batch-Größe**: 1024 (statt 512) — SSM-Layer verarbeitet Micro-Batches effizienter
|
||||||
|
- **Container-Name**: `qwopus35b-moe-coding` (vermeidet Konflikt mit Standard-Coding-Container)
|
||||||
|
|
||||||
### Laufzeit-Parameter (ohne Neustart)
|
### Laufzeit-Parameter (ohne Neustart)
|
||||||
Diese Parameter können pro API-Request überschrieben werden:
|
Diese Parameter können pro API-Request überschrieben werden:
|
||||||
- `temperature`
|
- `temperature`
|
||||||
|
|
|
||||||
1
FAQs.md
1
FAQs.md
|
|
@ -53,6 +53,7 @@ docker compose -f docker-compose_Qwen3.6_Uncensored.yml up -d
|
||||||
| `qwen35b-moe-coding` | Carnice | `docker-compose_Qwen3.6_Tools_coding.yml` |
|
| `qwen35b-moe-coding` | Carnice | `docker-compose_Qwen3.6_Tools_coding.yml` |
|
||||||
| `qwen35b-moe-tools` | Carnice | `docker-compose_Qwen3.6_Tools.yml` |
|
| `qwen35b-moe-tools` | Carnice | `docker-compose_Qwen3.6_Tools.yml` |
|
||||||
| `qwen35b-moe-rag-longctx` | Carnice | `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` |
|
| `qwen35b-moe-rag-longctx` | Carnice | `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` |
|
||||||
|
| `qwopus35b-moe-coding` | Qwopus3.6 | `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` |
|
||||||
| `qwen35b-moe-uncensored` | Uncensored | `docker-compose_Qwen3.6_Uncensored.yml` |
|
| `qwen35b-moe-uncensored` | Uncensored | `docker-compose_Qwen3.6_Uncensored.yml` |
|
||||||
| `qwen35b-moe-uncensored-rag` | Uncensored | `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` |
|
| `qwen35b-moe-uncensored-rag` | Uncensored | `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` |
|
||||||
| `qwen35b-moe-uncensored-rag-longctx` | Uncensored | `run_qwen35b_server_uncensored_rag_longctx.sh` |
|
| `qwen35b-moe-uncensored-rag-longctx` | Uncensored | `run_qwen35b_server_uncensored_rag_longctx.sh` |
|
||||||
|
|
|
||||||
31
README.md
31
README.md
|
|
@ -10,10 +10,12 @@ This project provides Docker-based inference servers for the Qwen3.6-MoE-35B-A3B
|
||||||
|---------------|-------------|
|
|---------------|-------------|
|
||||||
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
|
| `docker-compose_Qwen3.6_Tools_RAG_faehig.yml` | RAG-optimized with long context support (default) |
|
||||||
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
|
| `docker-compose_Qwen3.6_Tools_coding.yml` | Coding-focused with tuned sampling parameters |
|
||||||
|
| `docker-compose_Qwen3.6_Qwopus3.6_coding.yml` | Qwopus3.6 coding variant with multimodal support |
|
||||||
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
|
| `docker-compose_Qwen3.6_Uncensored.yml` | Uncensored variant for unrestricted use |
|
||||||
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |
|
| `docker-compose_Qwen3.6_Uncensored_RAG_faehig.yml` | Uncensored + RAG support |
|
||||||
|
|
||||||
**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
|
**Model**: Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_M.gguf (standard)
|
||||||
|
**Qwopus Model**: Qwopus3.6-35B-A3B-v1-Q4_K_M.gguf (multimodal, requires mmproj)
|
||||||
**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
|
**Uncensored Model**: Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q4_K_M.gguf
|
||||||
**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
|
**Image**: ghcr.io/ggml-org/llama.cpp:server-cuda
|
||||||
**API Endpoint**: http://localhost:8000/v1/chat/completions
|
**API Endpoint**: http://localhost:8000/v1/chat/completions
|
||||||
|
|
@ -40,6 +42,9 @@ docker compose up -d --force-recreate
|
||||||
# Start coding-optimized server
|
# Start coding-optimized server
|
||||||
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
|
docker compose -f docker-compose_Qwen3.6_Tools_coding.yml up -d --force-recreate
|
||||||
|
|
||||||
|
# Start Qwopus3.6 coding variant (multimodal)
|
||||||
|
docker compose -f docker-compose_Qwen3.6_Qwopus3.6_coding.yml up -d --force-recreate
|
||||||
|
|
||||||
# Stop and remove container
|
# Stop and remove container
|
||||||
docker compose rm -s -f qwen35b
|
docker compose rm -s -f qwen35b
|
||||||
```
|
```
|
||||||
|
|
@ -73,6 +78,8 @@ docker compose rm -s -f qwen35b
|
||||||
./run_bge_m3_embedding_server.sh
|
./run_bge_m3_embedding_server.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
|
**Note**: The Qwopus3.6 variant requires Docker Compose for startup due to multimodal support (mmproj file). Container name: `qwopus35b-moe-coding`.
|
||||||
|
|
||||||
**Note**: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use `docker rm -f <container_name>` to manually stop servers.
|
**Note**: All shell scripts automatically stop any existing containers with the same name before starting new ones. Use `docker rm -f <container_name>` to manually stop servers.
|
||||||
|
|
||||||
## Configuration Details
|
## Configuration Details
|
||||||
|
|
@ -87,22 +94,30 @@ docker compose rm -s -f qwen35b
|
||||||
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
|
- Tensor split: 0.5,0.5 (symmetric across both GPUs)
|
||||||
- All layers offloaded to GPU (`-ngl 999`)
|
- All layers offloaded to GPU (`-ngl 999`)
|
||||||
- Flash Attention enabled for optimized memory access
|
- Flash Attention enabled for optimized memory access
|
||||||
|
- **Qwopus3.6**: Uses 4 parallel slots (~2.5 GB KV-Cache per slot)
|
||||||
|
|
||||||
### Context & Performance
|
### Context & Performance
|
||||||
- **Context window**: 262,144 tokens (256k)
|
- **Context window**: 262,144 tokens (256k)
|
||||||
- **Max output**: 16,384 tokens
|
- **Max output**: 16,384 tokens
|
||||||
- **Parallel slots**: 2 (saves ~10GB KV cache vs 4)
|
- **Parallel slots**: 2 (saves ~10GB KV cache vs 4) — standard; Qwopus3.6 uses 4 slots
|
||||||
- **Batch size**: 2,048 for long context processing
|
- **Batch size**: 2,048 for long context processing
|
||||||
|
- **Micro-batch size**: 512 (standard); Qwopus3.6 uses 1024 for SSM-Layer efficiency
|
||||||
- **KV cache**: q8_0 quantization for speed/quality balance
|
- **KV cache**: q8_0 quantization for speed/quality balance
|
||||||
|
|
||||||
### Sampling Parameters
|
### Sampling Parameters
|
||||||
| Parameter | RAG Mode | Coding Mode |
|
| Parameter | RAG Mode | Coding Mode | Qwopus3.6 |
|
||||||
|-----------|----------|-------------|
|
|-----------|----------|-------------|-----------|
|
||||||
| Temperature | 0.2 | 0.3 |
|
| Temperature | 0.2 | 0.3 | 0.3 |
|
||||||
| Top-p | 0.95 | 0.95 |
|
| Top-p | 0.95 | 0.95 | 0.95 |
|
||||||
| Top-k | 40 | 40 |
|
| Top-k | 40 | 40 | 40 |
|
||||||
| Min-p | 0.01 | 0.01 |
|
| Min-p | 0.01 | 0.01 | 0.01 |
|
||||||
| Repeat penalty | 1.05 | 1.05 |
|
| Repeat penalty | 1.05 | 1.05 | 1.05 |
|
||||||
|
|
||||||
|
### Qwopus3.6 Specifics
|
||||||
|
- **Multimodal support**: Requires mmproj file (see docker-compose for configuration)
|
||||||
|
- **Parallel slots**: 4 (vs 2 in standard) — KV-Cache ~2.5 GB/Slot, 4 slots feasible
|
||||||
|
- **Micro-batch size**: 1024 (vs 512) — SSM-Layer processes micro-batches more efficiently
|
||||||
|
- **Container name**: `qwopus35b-moe-coding` (avoids conflict with standard coding container)
|
||||||
|
|
||||||
## API Usage
|
## API Usage
|
||||||
|
|
||||||
|
|
|
||||||
Loading…
Add table
Add a link
Reference in a new issue