# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Running the CLI

```bash
conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```

There is no build step, test suite, or linter configuration; the project is a single-file script.

## Architecture

Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:

**Text input → normalization → chunking → TTS generation → audio output**

### Text normalization (`preprocess_tts_text`)
Applied per chunk before synthesis. Order matters:
1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)

`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").
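
The ordering above can be illustrated with a minimal sketch. This is not the code in `chatterbox_cli_v4.py`: the function names, the small lookup tables, and the regular expressions are all assumptions made for illustration, and the time and year steps are omitted because they need a number-to-words routine. It only shows why running pronunciation substitutions before acronym spelling matters.

```python
import re

# Hypothetical stand-ins; the real tables live in chatterbox_cli_v4.py.
PRONUNCIATION = {"Xi": "Schi", "IKEA": "Ikea"}
NON_SPELLED_ACRONYMS = {"NATO"}
LETTER_NAMES = {"A": "Ah", "R": "Er", "D": "De"}
UNITS = {"km/h": "Kilometer pro Stunde"}


def apply_pronunciation(text):
    """Step 1: replace whole words found in the pronunciation table."""
    for name, spoken in PRONUNCIATION.items():
        text = re.sub(rf"\b{re.escape(name)}\b", spoken, text)
    return text


def normalize_units(text):
    """Step 2: expand a unit that follows a digit and a space."""
    for unit, spoken in UNITS.items():
        text = re.sub(r"(?<=\d )" + re.escape(unit) + r"(?!\w)", spoken, text)
    return text


def spell_acronyms(text):
    """Step 5: spell out all-caps words unless they are on the skip list."""
    def spell(match):
        word = match.group(0)
        if word in NON_SPELLED_ACRONYMS:
            return word
        return " ".join(LETTER_NAMES.get(ch, ch) for ch in word)

    return re.sub(r"\b[A-Z]{2,}\b", spell, text)


# Order matters: pronunciation runs first so "IKEA" is never spelled out.
STEPS = (apply_pronunciation, normalize_units, spell_acronyms)


def preprocess_sketch(text):
    for step in STEPS:
        text = step(text)
    return text
```

With these tables, `preprocess_sketch("IKEA und NATO")` leaves both words unspelled: the first because the pronunciation step already rewrote it, the second because it is on the skip list.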

### Text chunking
Three modes, selected by CLI flags:
- **sentence_mode** (default): `split_into_sentences()`, one sentence per TTS call, lowest latency to first audio
- **conversation_mode**: `split_for_conversation()`, where the first chunk is small (`--first-chunk-len`, default 80 chars) and the rest are up to `--len` (400)
- **plain**: `split_long_text()`, paragraph-aware chunking up to `--len`

`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.
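
The conversation-mode idea (a short first chunk for fast first audio, then larger chunks) could look roughly like the sketch below. It is an assumption about the approach, not the actual `split_for_conversation()`: the naive sentence regex, the greedy packing, and the helper names are hypothetical, and a single sentence longer than the limit is left oversized here.

```python
import re


def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace.
    # The real SENTENCE_END_RE covers far more edge cases.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(sentences, limit):
    """Greedily join sentences into chunks of at most `limit` characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_for_conversation_sketch(text, first_len=80, max_len=400):
    sentences = split_sentences(text)
    if not sentences:
        return []
    # First chunk: always one sentence, plus more while it stays short.
    first, i = sentences[0], 1
    while i < len(sentences) and len(first) + 1 + len(sentences[i]) <= first_len:
        first += " " + sentences[i]
        i += 1
    return [first] + pack(sentences[i:], max_len)
```

The defaults mirror the values named above (80 and 400 characters).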

### Model loading (`load_model`)
- `--lang en` → `ChatterboxTTS` (monolingual, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires the multilingual package; the `HAS_MULTILINGUAL` flag guards the import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` for the attention weights, which breaks the `AlignmentStreamAnalyzer` hook
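
The guarded-import pattern behind `HAS_MULTILINGUAL` can be sketched as follows. The import path `chatterbox.mtl_tts` and the helper `select_model_class` are assumptions for illustration; the script's actual `load_model` may be organized differently. The helper returns class names as strings so the sketch runs without the TTS packages installed.

```python
try:
    # Assumed import path; adjust to the installed package layout.
    from chatterbox.mtl_tts import ChatterboxMultilingualTTS  # noqa: F401
    HAS_MULTILINGUAL = True
except ImportError:
    HAS_MULTILINGUAL = False


def select_model_class(lang, has_multilingual=HAS_MULTILINGUAL):
    """Return the name of the model class to use for `lang` (hypothetical helper)."""
    if lang == "en":
        return "ChatterboxTTS"
    if not has_multilingual:
        raise RuntimeError(
            f"Language {lang!r} needs the multilingual package, which is not installed."
        )
    return "ChatterboxMultilingualTTS"
```

Failing early with a clear message is preferable to an `ImportError` surfacing deep inside synthesis.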

### Audio output (`PlaybackWorker`)
- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- An internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
- If `--speed` is not 1.0: the pyrubberband R3 engine (`--fine` flag) time-stretches the audio without changing pitch, before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends a `None` sentinel into the queue and joins the thread
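
The producer thread and `None` sentinel described above follow a common queue pattern, sketched here with the standard library only. `PlaybackWorkerSketch`, `to_blocks`, and the zero-padding of the last block are assumptions; the real class works on Torch tensors and numpy arrays and feeds a `sounddevice` callback, none of which is reproduced here.

```python
import queue
import threading

CALLBACK_BLOCK = 2048  # block size named in the text


def to_blocks(samples, block=CALLBACK_BLOCK):
    """Slice samples into fixed-size blocks, zero-padding the last one."""
    blocks = []
    for start in range(0, len(samples), block):
        piece = list(samples[start:start + block])
        piece.extend([0.0] * (block - len(piece)))
        blocks.append(piece)
    return blocks


class PlaybackWorkerSketch:
    def __init__(self):
        self.chunks = queue.Queue()  # generated audio chunks arrive here
        self.blocks = queue.Queue()  # an audio callback would read from here
        self.thread = threading.Thread(target=self._produce, daemon=True)
        self.thread.start()

    def _produce(self):
        while True:
            chunk = self.chunks.get()
            if chunk is None:  # sentinel: no more audio will arrive
                break
            for block in to_blocks(chunk):
                self.blocks.put(block)

    def feed(self, chunk):
        self.chunks.put(chunk)

    def stop(self):
        self.chunks.put(None)  # wake the producer and let it exit
        self.thread.join()
```

Because the sentinel travels through the same queue as the audio, everything fed before `stop()` is still processed before the thread exits.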

### Two synthesis paths
- **`synthesize_non_streaming`**: generates each chunk fully, feeds the finished tensors to `PlaybackWorker`, and concatenates all wavs for `--save`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in the docs
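
The difference between the two paths can be shown with a stand-in model. `FakeModel`, its method signatures, and the `play` callback are hypothetical; the real functions take more arguments and hand Torch tensors to `PlaybackWorker`.

```python
class FakeModel:
    """Hypothetical stand-in that emits one fake sample per character."""

    def generate(self, text):
        return [float(len(text))] * len(text)

    def generate_stream(self, text, chunk_size):
        audio = self.generate(text)
        for start in range(0, len(audio), chunk_size):
            yield audio[start:start + chunk_size]


def synthesize_non_streaming_sketch(model, chunks, play):
    """Play each finished chunk and return all audio joined for saving."""
    saved = []
    for text in chunks:
        wav = model.generate(text)  # blocks until the whole chunk is ready
        play(wav)
        saved.extend(wav)
    return saved


def synthesize_streaming_sketch(model, chunks, play, chunk_size=4):
    """Play each sub-chunk as soon as the model yields it."""
    for text in chunks:
        for piece in model.generate_stream(text, chunk_size):
            play(piece)
```

In the non-streaming path playback starts only after a full chunk is generated, while the streaming path trades that wait for many small pieces, which is consistent with the choppier sound noted for `--stream`.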

## Planned extensions (Ideen/)

The `Ideen/` folder documents a planned **REST/MCP bridge**:
- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly

Key gaps to address before building the service:
1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. The planned approach is a `threading.Event`-based cancel token passed through `synthesize_non_streaming`.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is reported via `print()`; a service needs structured state.
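
One possible shape for the three gaps is sketched below. Nothing here exists in the repository yet: `SynthesisStatus`, `get_model`, `synthesize_with_cancel`, and the `loader` and `generate` callables are hypothetical names. Note that the cancel token is only checked between chunks, so a `generate()` call that is already running still finishes.

```python
import threading
from dataclasses import dataclass


@dataclass
class SynthesisStatus:
    """Structured progress state that could replace print()-based output."""
    total_chunks: int = 0
    done_chunks: int = 0
    state: str = "idle"  # "idle", "running", "cancelled" or "finished"


_models = {}
_models_lock = threading.Lock()


def get_model(lang, loader):
    """Per-language singleton: call `loader` at most once per language."""
    with _models_lock:
        if lang not in _models:
            _models[lang] = loader(lang)
        return _models[lang]


def synthesize_with_cancel(generate, chunks, cancel, status):
    """Generate chunk by chunk, checking the cancel token before each one."""
    status.total_chunks = len(chunks)
    status.done_chunks = 0
    status.state = "running"
    results = []
    for text in chunks:
        if cancel.is_set():
            status.state = "cancelled"
            return results
        results.append(generate(text))
        status.done_chunks += 1
    status.state = "finished"
    return results
```

A `POST /stop` handler would then call `cancel.set()` in addition to stopping playback, and `GET /health` could serialize the status object.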