Extension: stop mechanism, REST service, and MCP adapter

- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming, and synthesize_streaming check the event before each chunk; new --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread; endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools: speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: documented architecture and start commands
- .gitignore: excluded the Ideen/ directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:46:43 +02:00 · bcf6374c29
commit bcf6374c29
parent bed29fb1c8
6 changed files with 563 additions and 3 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,86 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Running the CLI
+
+```bash
+conda activate chatterbox
+
+# Read German text aloud from a file
+python chatterbox_cli_v4.py --lang de --input text.txt
+
+# With voice cloning
+python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt
+
+# Pass text directly (English)
+python chatterbox_cli_v4.py --lang en --text "Hello world"
+
+# Save only, no playback
+python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt
+
+# Adjust speed (pitch-preserving, requires rubberband-cli)
+python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt
+
+# Streaming mode (experimental, lower latency, may sound choppy)
+python chatterbox_cli_v4.py --lang de --stream --input text.txt
+
+# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
+python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
+```
+
+No build step, no test suite, no linter configuration — this is a single-file script.
+
+## Architecture
+
+Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:
+
+**Text input → normalization → chunking → TTS generation → audio output**
+
+### Text normalization (`preprocess_tts_text`)
+Applied per chunk before synthesis. Order matters:
+1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
+2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
+3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
+4. Year normalization (2026 → "zweitausendsechsundzwanzig")
+5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)
+
+`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").
+
+### Text chunking
+Three modes (chosen by CLI flags):
+- **sentence_mode** (default): `split_into_sentences()` — one sentence per TTS call, lowest latency to first audio
+- **conversation_mode**: `split_for_conversation()` — first chunk is small (`--first-chunk-len`, default 80 chars), rest up to `--len` (400)
+- **plain**: `split_long_text()` — paragraph-aware chunking up to `--len`
+
+`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.
+
+### Model loading (`load_model`)
+- `--lang en` → `ChatterboxTTS` (monolingual, always available)
+- Other languages → `ChatterboxMultilingualTTS` (requires multilingual package; `HAS_MULTILINGUAL` flag guards import)
+- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
+- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
+- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` attention weights, breaking the `AlignmentStreamAnalyzer` hook
+
+### Audio output (`PlaybackWorker`)
+- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
+- Internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
+- If `--speed` is not 1.0: pyrubberband (R3 engine, `--fine` flag) time-stretches the audio without changing pitch before resampling
+- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
+- `PlaybackWorker.stop()` sends `None` sentinel into the queue and joins the thread
+
+### Two synthesis paths
+- **`synthesize_non_streaming`**: generates each chunk fully, feeds finished tensors to `PlaybackWorker`, concatenates all wavs for `--save`
+- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in docs
+
+## Planned extensions (Ideen/)
+
+The `Ideen/` folder documents a planned **REST/MCP bridge**:
+- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
+- `mcp_adapter.py`: thin MCP wrapper calling the REST API
+- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly
+
+Key gaps to address before building the service:
+1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. A `threading.Event`-based cancel token threaded through `synthesize_non_streaming` is the planned approach.
+2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
+3. **Status object**: progress is `print()`-based; a service needs structured state.
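
Note (not part of the diff): gaps 1 and 2 above could be approached roughly as follows. This is a minimal sketch, not the code from this commit. Only `STOP_REQUESTED`, `request_stop`, and `clear_stop` are names from the commit message; `fake_generate`, `synthesize_chunks`, `get_model`, and the `loader` parameter are hypothetical stand-ins for the real synthesis loop and `load_model()`.

```python
import threading
import time

# Names from the commit message; everything else below is illustrative.
STOP_REQUESTED = threading.Event()


def request_stop() -> None:
    """Ask running synthesis loops to stop at the next chunk boundary."""
    STOP_REQUESTED.set()


def clear_stop() -> None:
    """Reset the flag before a new job starts."""
    STOP_REQUESTED.clear()


def fake_generate(chunk: str) -> str:
    """Hypothetical placeholder for a blocking model.generate() call."""
    time.sleep(0.01)
    return chunk.upper()


def synthesize_chunks(chunks: list[str]) -> list[str]:
    """Generate chunk by chunk, checking the event before each one."""
    done = []
    for chunk in chunks:
        if STOP_REQUESTED.is_set():
            break  # cooperative: a chunk already in progress still finishes
        done.append(fake_generate(chunk))
    return done


# Hypothetical per-language cache so a service loads each model only once.
_MODEL_CACHE: dict[str, object] = {}
_MODEL_LOCK = threading.Lock()


def get_model(lang: str, loader) -> object:
    """Return the cached model for `lang`, calling `loader` on first use."""
    with _MODEL_LOCK:
        if lang not in _MODEL_CACHE:
            _MODEL_CACHE[lang] = loader(lang)
        return _MODEL_CACHE[lang]
```

In this sketch, calling `request_stop()` from another thread (for example a `POST /stop` handler) ends the loop at the next chunk boundary. It does not interrupt a generation call that is already running, which matches the limitation described in gap 1.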