
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Running the CLI

```bash
conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```

No build step, no test suite, no linter configuration — this is a single-file script.

## Architecture

Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:

**Text input → normalization → chunking → TTS generation → audio output**

### Text normalization (`preprocess_tts_text`)
Applied to each chunk before synthesis. The steps run in this order, and the order matters (examples show German output):
1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)

`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").
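
Why the dictionary pass has to run before acronym spelling can be shown with a minimal sketch. It covers only steps 1 and 5 and is not the code from `chatterbox_cli_v4.py`: the tables, the `IKEA` and `NATO` entries, and the function names are hypothetical stand-ins.

```python
import re

# Hypothetical stand-ins; the real tables live in chatterbox_cli_v4.py.
PRONUNCIATION = {"Xi": "Schi", "IKEA": "Ikea"}  # "IKEA" is an invented entry
NON_SPELLED = {"NATO"}                          # invented exemption
LETTER_NAMES = {"A": "Ah", "D": "De", "R": "Er"}  # only the letters this demo needs

ACRONYM_RE = re.compile(r"\b[A-ZÄÖÜ]{2,}\b")


def apply_pronunciation(text: str, table: dict[str, str]) -> str:
    """Replace whole-word dictionary entries, longest keys first."""
    for key in sorted(table, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(key)}\b", table[key], text)
    return text


def spell_acronyms(text: str) -> str:
    """Spell all-caps words letter by letter unless they are exempt."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if word in NON_SPELLED:
            return word
        return " ".join(LETTER_NAMES.get(ch, ch) for ch in word)

    return ACRONYM_RE.sub(repl, text)


def normalize(text: str) -> str:
    # Dictionary first, so names are rewritten before the acronym pass sees them.
    return spell_acronyms(apply_pronunciation(text, PRONUNCIATION))
```

With these stand-in tables, `normalize("IKEA und ARD")` gives `"Ikea und Ah Er De"`. Running the acronym pass first would spell out `IKEA` letter by letter before the dictionary could match it.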

### Text chunking
Three modes (chosen by CLI flags):
- **sentence_mode** (default): `split_into_sentences()` produces one sentence per TTS call, giving the lowest latency to first audio
- **conversation_mode**: `split_for_conversation()` keeps the first chunk small (`--first-chunk-len`, default 80 chars) and fills the rest up to `--len` (default 400 chars)
- **plain**: `split_long_text()` does paragraph-aware chunking up to `--len`

`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.
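
The idea behind conversation mode, a small first chunk followed by larger ones, can be sketched as a greedy packer. This is an illustration under simplifying assumptions, not the script's `split_for_conversation()`. `pack_chunks` and `SIMPLE_SENTENCE_RE` are hypothetical names, the splitter ignores the edge cases that `SENTENCE_END_RE` handles, and a single sentence longer than the limit is kept whole.

```python
import re

# Simplified splitter: breaks after ., ! or ? followed by whitespace.
# The script's SENTENCE_END_RE covers more cases (ordinals, ellipses, CJK).
SIMPLE_SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")


def pack_chunks(text: str, first_len: int = 80, max_len: int = 400) -> list[str]:
    """Greedily pack sentences into chunks; the first chunk has a smaller limit."""
    sentences = [s for s in SIMPLE_SENTENCE_RE.split(text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Until the first chunk is closed, the smaller limit applies.
        limit = first_len if not chunks else max_len
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A short first chunk means the first TTS call returns quickly, while the larger later chunks reduce the number of model calls for the rest of the text.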

### Model loading (`load_model`)
- `--lang en` → `ChatterboxTTS` (monolingual English model, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires multilingual package; `HAS_MULTILINGUAL` flag guards import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because the SDPA (scaled dot-product attention) path returns `None` for the attention weights, which breaks the `AlignmentStreamAnalyzer` hook

### Audio output (`PlaybackWorker`)
- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- Internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
- If `--speed` is not 1.0: pyrubberband (R3 engine, `--fine` flag) time-stretches the audio without changing pitch before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends a `None` sentinel into the queue and joins the thread
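
The queue-plus-sentinel shutdown and the fixed-size block splitting described above can be sketched with the standard library alone. `SketchWorker` and `split_blocks` are hypothetical names. The sketch collects blocks in a list instead of feeding a `sounddevice` callback, uses plain lists rather than tensors, and leaves out resampling and time stretching. Zero-padding the last block is an assumption, not something the source states.

```python
import queue
import threading

BLOCK = 2048  # mirrors the CALLBACK_BLOCK size mentioned above


def split_blocks(samples: list[float], block: int = BLOCK) -> list[list[float]]:
    """Cut samples into equal-sized blocks, zero-padding the last one."""
    blocks = []
    for start in range(0, len(samples), block):
        piece = samples[start:start + block]
        blocks.append(piece + [0.0] * (block - len(piece)))
    return blocks


class SketchWorker:
    """Hypothetical stand-in for PlaybackWorker; stores blocks instead of playing them."""

    def __init__(self) -> None:
        self._chunks: queue.Queue = queue.Queue()
        self.played: list[list[float]] = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, samples: list[float]) -> None:
        self._chunks.put(samples)

    def _run(self) -> None:
        while True:
            item = self._chunks.get()
            if item is None:  # sentinel: no more audio will arrive
                break
            self.played.extend(split_blocks(item))

    def stop(self) -> None:
        # Everything queued before the sentinel is still processed.
        self._chunks.put(None)
        self._thread.join()
```

Because the sentinel travels through the same queue as the audio, `stop()` in this sketch finishes all pending chunks before the thread exits.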

### Two synthesis paths
- **`synthesize_non_streaming`**: generates each chunk fully, feeds the finished tensors to `PlaybackWorker`, and concatenates all wavs for saving via `--output`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size` and sends each yielded audio sub-chunk directly to `PlaybackWorker`; experimental and may sound choppy
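
The difference between the two paths comes down to when audio reaches the player. The sketch below uses a hypothetical `FakeModel` with list-based audio; the real model returns Torch tensors, and the exact signatures of `generate()` and `generate_stream()` are assumptions here.

```python
from typing import Callable, Iterable, Iterator

Audio = list[float]


class FakeModel:
    """Hypothetical stand-in for the TTS model."""

    def generate(self, text: str) -> Audio:
        # Four dummy samples whose value encodes the text length.
        return [float(len(text))] * 4

    def generate_stream(self, text: str, chunk_size: int = 2) -> Iterator[Audio]:
        full = self.generate(text)
        for start in range(0, len(full), chunk_size):
            yield full[start:start + chunk_size]


def non_streaming(model: FakeModel, chunks: Iterable[str],
                  play: Callable[[Audio], None]) -> Audio:
    saved: Audio = []
    for text in chunks:
        wav = model.generate(text)  # blocks until the whole chunk is done
        play(wav)
        saved.extend(wav)           # kept so the result can be written to disk
    return saved


def streaming(model: FakeModel, chunks: Iterable[str],
              play: Callable[[Audio], None]) -> None:
    for text in chunks:
        for piece in model.generate_stream(text, chunk_size=2):
            play(piece)             # playback can start before the chunk is finished
```

In the non-streaming path the player receives one piece per text chunk; in the streaming path it receives several smaller pieces, which lowers latency but leaves less room to smooth over gaps.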

## Planned extensions (Ideen/)

The `Ideen/` ("ideas") folder documents a planned **REST/MCP bridge**:
- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly

Key gaps to address before building the service:
1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. A `threading.Event`-based cancel token threaded through `synthesize_non_streaming` is the planned approach.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is `print()`-based; a service needs structured state.
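
One possible shape for these three pieces, sketched under stated assumptions: `SynthesisStatus`, `get_model`, and `synthesize` are hypothetical names, `get_model` returns a placeholder object where a real service would call `load_model()`, and the cancel check runs only between chunks, so a `generate()` call already in progress still runs to completion.

```python
import threading
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable


@dataclass
class SynthesisStatus:
    """Structured progress a service could expose instead of print() output."""
    total: int = 0
    done: int = 0
    state: str = "idle"  # idle | running | finished | cancelled


@lru_cache(maxsize=None)
def get_model(lang: str) -> object:
    """Per-language singleton; the body is a placeholder for a real load_model() call."""
    return object()


def synthesize(chunks: list[str], generate: Callable[[str], list[float]],
               cancel: threading.Event, status: SynthesisStatus) -> list[float]:
    """Generate chunk by chunk, checking the cancel token before each call."""
    status.total, status.done, status.state = len(chunks), 0, "running"
    audio: list[float] = []
    for text in chunks:
        if cancel.is_set():  # checked between chunks only
            status.state = "cancelled"
            return audio
        audio.extend(generate(text))
        status.done += 1
    status.state = "finished"
    return audio
```

A `POST /stop` handler would then call `cancel.set()`, and a status endpoint could serialize the `SynthesisStatus` instance. With sentence-sized chunks, the worst-case delay before a stop takes effect is roughly one sentence of generation time.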