# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Running the CLI

```bash
conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```

No build step, no test suite, no linter configuration: this is a single-file script.

## Architecture

Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:

**Text input → normalization → chunking → TTS generation → audio output**

### Text normalization (`preprocess_tts_text`)

Applied per chunk before synthesis. Order matters:

1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)

`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").

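
The ordering rule above can be illustrated with a minimal sketch. This is not the code in `chatterbox_cli_v4.py`: the function names, the `UNITS` and `LETTER_NAMES` tables, the regexes, and the `NATO` entry are all assumptions made for illustration, and the time and year steps are left out because they need number-to-words conversion. Only the `Xi`, `ARD`, and `km/h` examples come from the list above.

```python
import re

# Illustrative tables only; the real ones live in chatterbox_cli_v4.py.
DEFAULT_PRONUNCIATION = {"Xi": "Schi"}             # example from the text above
NON_SPELLED_ACRONYMS = {"NATO"}                    # assumed entry
LETTER_NAMES = {"A": "Ah", "D": "De", "R": "Er"}   # just enough for "ARD"
UNITS = {"km/h": "Kilometer pro Stunde"}


def apply_pronunciation(text, table):
    """Step 1: replace whole words found in the pronunciation table."""
    for word, spoken in table.items():
        text = re.sub(rf"\b{re.escape(word)}\b", lambda m, s=spoken: s, text)
    return text


def normalize_units(text):
    """Step 2: expand a unit that directly follows a number."""
    for unit, spoken in UNITS.items():
        pattern = rf"(\d+)\s*{re.escape(unit)}(?!\w)"
        text = re.sub(pattern, lambda m, s=spoken: f"{m.group(1)} {s}", text)
    return text


def spell_acronyms(text):
    """Step 5: spell out all-caps words unless they are on the skip list."""
    def spell(match):
        word = match.group(0)
        if word in NON_SPELLED_ACRONYMS:
            return word
        # Letters missing from the small demo table are kept unchanged.
        return " ".join(LETTER_NAMES.get(ch, ch) for ch in word)

    return re.sub(r"\b[A-ZÄÖÜ]{2,5}\b", spell, text)


def sketch_normalize(text, pronunciation=None):
    """Run the steps in the documented order: dict, units, acronyms."""
    table = DEFAULT_PRONUNCIATION if pronunciation is None else pronunciation
    text = apply_pronunciation(text, table)
    text = normalize_units(text)
    return spell_acronyms(text)


if __name__ == "__main__":
    print(sketch_normalize("Die ARD meldet: Xi fuhr 120 km/h."))
```

Running the dict step first means an all-caps name with a dictionary entry is rewritten before the acronym step sees it. With a hypothetical table `{"NASA": "Nasa"}`, `sketch_normalize("NASA", {"NASA": "Nasa"})` yields the spoken form, whereas calling `spell_acronyms("NASA")` alone would spell it letter by letter.
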
### Text chunking

Three modes (chosen by CLI flags):

- **sentence_mode** (default): `split_into_sentences()` emits one sentence per TTS call, giving the lowest latency to first audio
- **conversation_mode**: `split_for_conversation()` keeps the first chunk small (`--first-chunk-len`, default 80 chars), rest up to `--len` (400)
- **plain**: `split_long_text()` does paragraph-aware chunking up to `--len`

`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.

### Model loading (`load_model`)

- `--lang en` → `ChatterboxTTS` (mono, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires multilingual package; `HAS_MULTILINGUAL` flag guards import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` attention weights, breaking the `AlignmentStreamAnalyzer` hook

### Audio output (`PlaybackWorker`)

- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- Internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
- If `--speed != 1.0`: pyrubberband R3-Engine (`--fine` flag) stretches time without pitch change before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends a `None` sentinel into the queue and joins the thread

### Two synthesis paths

- **`synthesize_non_streaming`**: generates each chunk fully, feeds finished tensors to `PlaybackWorker`, concatenates all wavs for `--save`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in docs

## Planned extensions (Ideen/)

The `Ideen/` folder documents a planned **REST/MCP bridge**:

- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly

Key gaps to address before building the service:

1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. A `threading.Event`-based cancel token threaded through `synthesize_non_streaming` is the planned approach.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is `print()`-based; a service needs structured state.
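
Gaps 1 and 2 could be approached roughly as follows. This is a sketch under assumptions, not the planned implementation: `get_model`, `synthesize_cancellable`, and the `loader` and `generate` callables are hypothetical names, and the real signatures of `load_model()` and `synthesize_non_streaming()` are not shown in this file. The cancel token is checked only between chunks, since a running `model.generate()` call cannot be interrupted.

```python
import threading

# Hypothetical per-language model cache (gap 2).
_models = {}
_models_lock = threading.Lock()


def get_model(lang, loader):
    """Return one shared model per language, loading it at most once.

    `loader` stands in for the script's load_model(); its actual
    signature is an assumption here.
    """
    with _models_lock:
        if lang not in _models:
            _models[lang] = loader(lang)
        return _models[lang]


def synthesize_cancellable(chunks, generate, cancel):
    """Generate chunk by chunk, checking the cancel token between calls (gap 1).

    `generate` stands in for a blocking per-chunk TTS call. A call that is
    already running still finishes; cancellation takes effect at the next
    chunk boundary.
    """
    wavs = []
    for chunk in chunks:
        if cancel.is_set():
            break
        wavs.append(generate(chunk))
    return wavs


if __name__ == "__main__":
    cancel = threading.Event()

    def fake_generate(chunk):
        # Simulate a stop request arriving while the second chunk is generated.
        if chunk == "two":
            cancel.set()
        return chunk.upper()

    print(synthesize_cancellable(["one", "two", "three"], fake_generate, cancel))
```

In a service, a `POST /stop` handler would set the event and then call `PlaybackWorker.stop()`, so that pending audio is dropped and no further chunks are generated. Holding the lock while loading keeps the cache a true singleton, at the cost of blocking other languages during a load.
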