- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming, and synthesize_streaming check the event before each chunk; --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread; endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools: speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: documented architecture and start commands
- .gitignore: excluded the Ideen/ directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
86 lines
4.5 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Running the CLI

```bash
conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"ProperName": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```

There is no build step, test suite, or linter configuration; this is a single-file script.

## Architecture

Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:

**Text input → normalization → chunking → TTS generation → audio output**

### Text normalization (`preprocess_tts_text`)

Applied per chunk before synthesis. Order matters:

1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)

`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").

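Why the order matters can be illustrated with a minimal sketch. The tables and the `normalize` helper below are hypothetical stand-ins, not the script's actual code, and only steps 1, 2, and 5 are shown (number-to-word conversion for times and years is left out).

```python
import re

# Hypothetical stand-ins for the script's tables (assumptions for illustration).
PRONUNCIATION = {"Xi": "Schi", "NASA": "Nasa"}
UNITS = {"km/h": "Kilometer pro Stunde"}
NON_SPELLED = {"NATO"}
LETTER_NAMES = {"A": "Ah", "R": "Er", "D": "De"}  # deliberately incomplete


def spell_acronym(match: re.Match) -> str:
    """Spell an all-caps word letter by letter unless it is exempt."""
    word = match.group(0)
    if word in NON_SPELLED:
        return word
    return " ".join(LETTER_NAMES.get(ch, ch) for ch in word)


def normalize(text: str) -> str:
    # Step 1: pronunciation dict first, so "NASA" becomes "Nasa"
    # and is no longer seen as an acronym in the last step.
    for name, spoken in PRONUNCIATION.items():
        text = re.sub(rf"\b{re.escape(name)}\b", spoken, text)
    # Step 2: expand units that follow a digit.
    for unit, spoken in UNITS.items():
        text = re.sub(rf"(\d)\s*{re.escape(unit)}", rf"\1 {spoken}", text)
    # Step 5: spell out whatever all-caps words remain.
    return re.sub(r"\b[A-Z]{2,}\b", spell_acronym, text)
```

If the acronym step ran first in this sketch, "NASA" would be spelled out letter by letter before the pronunciation entry could match it.
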
### Text chunking

Three modes (chosen by CLI flags):

- **sentence_mode** (default): `split_into_sentences()` makes one TTS call per sentence, giving the lowest latency to first audio
- **conversation_mode**: `split_for_conversation()` keeps the first chunk small (`--first-chunk-len`, default 80 chars); the remaining chunks run up to `--len` (default 400)
- **plain**: `split_long_text()` does paragraph-aware chunking up to `--len`

`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.

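The idea behind conversation mode (a short first chunk so audio starts quickly, then larger chunks) could be sketched as follows. This is an illustration only: `chunk_for_conversation` and `NAIVE_SENTENCE_END` are hypothetical names, and the naive regex does not cover the edge cases that `SENTENCE_END_RE` handles.

```python
import re

# Naive sentence boundary: whitespace after ., ! or ? (assumption for this sketch).
NAIVE_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_for_conversation(text: str, first_len: int = 80, max_len: int = 400) -> list[str]:
    """Greedily pack sentences: a short first chunk, then chunks up to max_len."""
    sentences = [s for s in NAIVE_SENTENCE_END.split(text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Only the first chunk uses the smaller limit.
        limit = first_len if not chunks else max_len
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than the limit is kept whole in this sketch rather than split mid-sentence.
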
### Model loading (`load_model`)

- `--lang en` → `ChatterboxTTS` (monolingual, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires the multilingual package; the `HAS_MULTILINGUAL` flag guards the import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` attention weights, which breaks the `AlignmentStreamAnalyzer` hook

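The guarded import and the language-based selection might look roughly like this. It is a sketch under assumptions: the import path `chatterbox.mtl_tts` is assumed, `select_model_class` is a hypothetical helper, and class names are returned as strings so the snippet runs without any model installed.

```python
# Guarded optional import; the module path is an assumption for this sketch.
try:
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # noqa: F401
    HAS_MULTILINGUAL = True
except ImportError:
    HAS_MULTILINGUAL = False


def select_model_class(lang: str, has_multilingual: bool = HAS_MULTILINGUAL) -> str:
    """Return the name of the model class that would be used for `lang`."""
    if lang == "en":
        return "ChatterboxTTS"
    if not has_multilingual:
        raise RuntimeError(
            f"Language {lang!r} needs the multilingual package, which is not installed"
        )
    return "ChatterboxMultilingualTTS"
```

Failing early with a clear error keeps a missing optional dependency from surfacing later as an obscure `NameError`.
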
### Audio output (`PlaybackWorker`)

- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- An internal producer thread converts Torch tensors into `CALLBACK_BLOCK`-sized (2048 samples) NumPy arrays
- If `--speed` is not 1.0, pyrubberband with the R3 engine (`--fine` flag) time-stretches the audio without changing pitch, before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends a `None` sentinel into the queue and joins the thread

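The sentinel-and-join shutdown pattern can be sketched without any audio dependencies. `SentinelWorker` and its `handle` callback are hypothetical; the real `PlaybackWorker` additionally resamples and feeds a `sounddevice` callback.

```python
import queue
import threading


class SentinelWorker:
    """Consumes items on a background thread until it receives None."""

    def __init__(self, handle):
        self._queue: queue.Queue = queue.Queue()
        self._handle = handle  # called once per queued item
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, item) -> None:
        self._queue.put(item)

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more work will arrive
                break
            self._handle(item)

    def stop(self) -> None:
        # Queue the sentinel behind pending items, then wait for the thread.
        self._queue.put(None)
        self._thread.join()
```

In this sketch, items queued before the sentinel are still processed, so `stop()` returns only after pending work is done.
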
### Two synthesis paths

- **`synthesize_non_streaming`**: generates each chunk fully, feeds finished tensors to `PlaybackWorker`, concatenates all wavs for `--save`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in docs

## Planned extensions (Ideen/)

The `Ideen/` folder documents a planned **REST/MCP bridge**:

- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly

Key gaps to address before building the service:

1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. The planned approach is a `threading.Event`-based cancel token passed through `synthesize_non_streaming`.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is `print()`-based; a service needs structured state.

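Gaps 1 and 3 could be addressed together along these lines. This is a minimal sketch, not the planned implementation: `JobStatus`, `synthesize_chunks`, and the `generate` callable are hypothetical names standing in for the real synthesis loop.

```python
import threading
from dataclasses import dataclass


@dataclass
class JobStatus:
    """Structured progress state instead of print() output."""
    total_chunks: int = 0
    done_chunks: int = 0
    state: str = "idle"  # "idle", "running", "stopped", or "finished"


def synthesize_chunks(chunks, generate, stop_event: threading.Event, status: JobStatus):
    """Call generate() per chunk, checking the cancel token between chunks."""
    status.total_chunks = len(chunks)
    status.done_chunks = 0
    status.state = "running"
    results = []
    for chunk in chunks:
        if stop_event.is_set():
            status.state = "stopped"
            return results
        # A generate() call that is already running still completes;
        # cancellation only takes effect at the next chunk boundary.
        results.append(generate(chunk))
        status.done_chunks += 1
    status.state = "finished"
    return results
```

Because the check happens only between chunks, stop latency is bounded by the duration of one `generate()` call, which is one reason to prefer small chunks in a service.
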