chatterbox-tts-cli/CLAUDE.md
dschlueter bcf6374c29 Erweiterung: Stop-Mechanismus, REST-Service und MCP-Adapter
- chatterbox_cli_v4.py: kooperativer Stop-Mechanismus via threading.Event
  (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming
  und synthesize_streaming prüfen das Event vor jedem Chunk; --stop CLI-Flag
- tts_service.py: FastAPI-Service mit Modell-Caching, Job-Queue und Worker-Thread;
  Endpunkte: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP-Adapter (stdio/streamable-http) über tts_service; Tools:
  speak, stop, get_status, list_voices
- requirements.txt: fastapi, uvicorn, httpx, mcp ergänzt
- CLAUDE.md: Architektur und Startbefehle dokumentiert
- .gitignore: Ideen/-Verzeichnis ausgeschlossen

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:46:43 +02:00

86 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Running the CLI
```bash
conda activate chatterbox
# Deutschen Text aus Datei vorlesen
python chatterbox_cli_v4.py --lang de --input text.txt
# Mit Voice Cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt
# Text direkt übergeben (Englisch)
python chatterbox_cli_v4.py --lang en --text "Hello world"
# Nur speichern, kein Playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt
# Geschwindigkeit anpassen (pitch-erhaltend, erfordert rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt
# Streaming-Modus (experimentell, niedrigere Latenz, kann abgehackt klingen)
python chatterbox_cli_v4.py --lang de --stream --input text.txt
# Aussprache-Wörterbuch (JSON: {"Eigenname": "Lautschrift"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```
No build step, no test suite, no linter configuration — this is a single-file script.
## Architecture
Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:
**Text input → normalization → chunking → TTS generation → audio output**
### Text normalization (`preprocess_tts_text`)
Applied per chunk before synthesis. Order matters:
1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)
`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").
### Text chunking
Three modes (chosen by CLI flags):
- **sentence_mode** (default): `split_into_sentences()` — one sentence per TTS call, lowest latency to first audio
- **conversation_mode**: `split_for_conversation()` — first chunk is small (`--first-chunk-len`, default 80 chars), rest up to `--len` (400)
- **plain**: `split_long_text()` — paragraph-aware chunking up to `--len`
`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.
### Model loading (`load_model`)
- `--lang en``ChatterboxTTS` (mono, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires multilingual package; `HAS_MULTILINGUAL` flag guards import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~23 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` attention weights, breaking the `AlignmentStreamAnalyzer` hook
### Audio output (`PlaybackWorker`)
- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- Internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
- If `--speed != 1.0`: pyrubberband R3-Engine (`--fine` flag) stretches time without pitch change before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends `None` sentinel into the queue and joins the thread
### Two synthesis paths
- **`synthesize_non_streaming`**: generates each chunk fully, feeds finished tensors to `PlaybackWorker`, concatenates all wavs for `--save`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in docs
## Planned extensions (Ideen/)
The `Ideen/` folder documents a planned **REST/MCP bridge**:
- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly
Key gaps to address before building the service:
1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. A `threading.Event`-based cancel token threaded through `synthesize_non_streaming` is the planned approach.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is `print()`-based; a service needs structured state.