- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming, and synthesize_streaming check the event before each chunk; --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread; endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools: speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: documented architecture and start commands
- .gitignore: excluded the Ideen/ directory
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Running the CLI
conda activate chatterbox
# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt
# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt
# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"
# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt
# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt
# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt
# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
No build step, no test suite, no linter configuration — this is a single-file script.
Architecture
Everything lives in chatterbox_cli_v4.py. The processing pipeline is:
Text input → normalization → chunking → TTS generation → audio output
Text normalization (preprocess_tts_text)
Applied per chunk before synthesis. Order matters:
- Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
- Unit normalization (120 km/h → "120 Kilometer pro Stunde")
- Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
- Year normalization (2026 → "zweitausendsechsundzwanzig")
- Acronym spelling (ARD → "Ah Er De"; skips entries in NON_SPELLED_ACRONYMS)
DEFAULT_PRONUNCIATION_DE contains built-in German phonetic approximations (e.g. Xi → "Schi").
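Why the order matters can be illustrated with a minimal sketch. This is not the code from preprocess_tts_text: the tables below (PRONUNCIATION, the NATO entry, LETTER_NAMES_DE) and the helper names are hypothetical stand-ins, and only two of the five steps are shown. The point is that a pronunciation-dict entry must be substituted before the acronym speller sees the text, otherwise an all-caps proper name gets spelled letter by letter.

```python
import re

# Hypothetical tables for illustration; the real ones live in chatterbox_cli_v4.py.
PRONUNCIATION = {"Xi": "Schi", "UNESCO": "Unesko"}
NON_SPELLED_ACRONYMS = {"NATO"}
LETTER_NAMES_DE = {
    "A": "Ah", "B": "Be", "C": "Ce", "D": "De", "E": "Eh", "F": "Eff",
    "G": "Ge", "H": "Ha", "I": "Ih", "J": "Jott", "K": "Ka", "L": "Ell",
    "M": "Emm", "N": "Enn", "O": "Oh", "P": "Pe", "Q": "Ku", "R": "Er",
    "S": "Ess", "T": "Te", "U": "Uh", "V": "Vau", "W": "We", "X": "Ix",
    "Y": "Ypsilon", "Z": "Zett",
}


def apply_pronunciation(text, table):
    """Replace whole words that have an entry in the pronunciation table."""
    for word, spoken in table.items():
        text = re.sub(rf"\b{re.escape(word)}\b", lambda _m: spoken, text)
    return text


def spell_acronyms(text):
    """Spell out runs of two or more capital letters unless they are exempt."""
    def repl(match):
        word = match.group(0)
        if word in NON_SPELLED_ACRONYMS:
            return word
        return " ".join(LETTER_NAMES_DE[c] for c in word)
    return re.sub(r"\b[A-Z]{2,}\b", repl, text)


def normalize(text):
    # Pronunciation dict first, acronym spelling second.
    return spell_acronyms(apply_pronunciation(text, PRONUNCIATION))
```

With this order, `normalize("Die ARD berichtet über Xi und die UNESCO.")` yields `"Die Ah Er De berichtet über Schi und die Unesko."`. Swapping the two steps would spell UNESCO letter by letter before the dictionary entry could match.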
Text chunking
Three modes (chosen by CLI flags):
- sentence_mode (default): split_into_sentences() emits one sentence per TTS call, giving the lowest latency to first audio
- conversation_mode: split_for_conversation() keeps the first chunk small (--first-chunk-len, default 80 chars) and lets the rest grow up to --len (400)
- plain: split_long_text() does paragraph-aware chunking up to --len
SENTENCE_END_RE handles edge cases like ordinal numbers, ellipses, and CJK punctuation. SEPARATOR_LINE_RE silently drops lines like --- Ende ---.
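The conversation-mode idea (a short first chunk so audio starts quickly, then larger chunks) could look roughly like the sketch below. It is an assumption-laden illustration, not the script's split_for_conversation(): the sentence splitter is deliberately naive and ignores the edge cases SENTENCE_END_RE handles, and the helper names are hypothetical. Only the defaults 80 and 400 come from the description above.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(sentences, limit):
    """Greedily join sentences into chunks of at most `limit` characters.

    A single sentence longer than `limit` is kept whole.
    """
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def split_for_conversation_sketch(text, first_len=80, max_len=400):
    """Small first chunk for fast time-to-first-audio, larger chunks after."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    first = sentences[0]
    i = 1
    # Extend the first chunk only while it stays within first_len.
    while i < len(sentences) and len(first) + 1 + len(sentences[i]) <= first_len:
        first += " " + sentences[i]
        i += 1
    return [first] + pack(sentences[i:], max_len)
```

The trade-off is the usual one: smaller chunks reach the speaker sooner, while larger chunks give the model more context per call.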
Model loading (load_model)
- --lang en → ChatterboxTTS (mono, always available)
- Other languages → ChatterboxMultilingualTTS (requires multilingual package; HAS_MULTILINGUAL flag guards import)
- --t3-model v3 (default) or v2 selects the multilingual T3 checkpoint
- Models are downloaded to ~/.cache/huggingface/ on first use (~2–3 GB)
- Critical: attn_implementation = "eager" is forced at import time because SDPA returns None attention weights, breaking the AlignmentStreamAnalyzer hook
Audio output (PlaybackWorker)
- Uses sounddevice.OutputStream with a callback at 48 kHz (PipeWire/PulseAudio standard)
- Internal producer thread converts Torch tensors → CALLBACK_BLOCK-sized (2048 samples) numpy arrays
- If --speed != 1.0: pyrubberband R3-Engine (--fine flag) stretches time without pitch change before resampling
- Resampling: torchaudio.functional.resample(chunk, model_sr, 48000)
- PlaybackWorker.stop() sends None sentinel into the queue and joins the thread
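The producer-thread and sentinel pattern described above can be sketched with the standard library alone. This is a simplified, hypothetical stand-in (BlockProducer is not a class in the script): plain Python lists replace tensors and numpy arrays, there is no audio device, and the last block is zero-padded, which may differ from how the real PlaybackWorker treats a trailing remainder.

```python
import queue
import threading

CALLBACK_BLOCK = 2048  # block size named in the text


class BlockProducer:
    """Hypothetical stand-in for the producer side of PlaybackWorker."""

    def __init__(self, block=CALLBACK_BLOCK):
        self.block = block
        self.inbox = queue.Queue()
        # Finished blocks; a real worker would hand these to the audio callback.
        self.blocks = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, samples):
        """Queue one synthesized chunk (any sequence of floats)."""
        self.inbox.put(list(samples))

    def _run(self):
        while True:
            samples = self.inbox.get()
            if samples is None:  # sentinel: no more audio will arrive
                break
            for start in range(0, len(samples), self.block):
                piece = samples[start:start + self.block]
                # Zero-pad a short final block (a simplification).
                piece += [0.0] * (self.block - len(piece))
                self.blocks.append(piece)

    def stop(self):
        """Send the None sentinel and wait for the producer thread to finish."""
        self.inbox.put(None)
        self.thread.join()
```

Because the sentinel travels through the same queue as the audio, everything submitted before stop() is still processed in order before the thread exits.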
Two synthesis paths
- synthesize_non_streaming: generates each chunk fully, feeds finished tensors to PlaybackWorker, concatenates all wavs for --save
- synthesize_streaming: calls model.generate_stream() with chunk_size; each yielded audio sub-chunk goes directly to PlaybackWorker; marked experimental in docs
Planned extensions (Ideen/)
The Ideen/ folder documents a planned REST/MCP bridge:
- tts_service.py (FastAPI): POST /speak, POST /stop, GET /health, GET /voices
- mcp_adapter.py: thin MCP wrapper calling the REST API
- chatterbox_backend.py: imports chatterbox_cli_v4.py via importlib and calls synthesize_non_streaming() directly
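Loading the CLI script as a module via importlib, as planned for chatterbox_backend.py, might look like the following sketch. The helper name load_module_from_path is hypothetical, and nothing here assumes a particular signature for synthesize_non_streaming().

```python
import importlib.util
from pathlib import Path


def load_module_from_path(path, name="chatterbox_cli_v4"):
    """Load a Python source file as a module without it being on sys.path."""
    spec = importlib.util.spec_from_file_location(name, Path(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    # Runs the file's top-level code, including any import-time side effects.
    spec.loader.exec_module(module)
    return module
```

Since exec_module() executes the script's top-level code, this only works cleanly if the CLI entry point sits behind an `if __name__ == "__main__":` guard. Import-time settings such as the forced attn_implementation would then apply to the backend as well.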
Key gaps to address before building the service:
- Stop/interrupt: PlaybackWorker.stop() drains the audio queue, but a blocking model.generate() call cannot be interrupted mid-run. A threading.Event-based cancel token threaded through synthesize_non_streaming is the planned approach.
- Model caching: load_model() reloads from disk on every call; a service needs a per-language singleton.
- Status object: progress is print()-based; a service needs structured state.
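The three gaps can be sketched together in a few lines. The names STOP_REQUESTED, request_stop and clear_stop follow the commit message. Everything else (JobStatus, get_model, synthesize_chunks, and the `loader` and `generate` callables) is hypothetical and stands in for the real model code. The stop check is cooperative: it runs between chunks and does not interrupt a generate() call that is already in progress.

```python
import threading
from dataclasses import dataclass

STOP_REQUESTED = threading.Event()  # cancel token shared by all synthesis paths


def request_stop():
    STOP_REQUESTED.set()


def clear_stop():
    STOP_REQUESTED.clear()


@dataclass
class JobStatus:
    """Hypothetical structured replacement for print()-based progress."""
    total_chunks: int = 0
    done_chunks: int = 0
    state: str = "idle"  # "idle", "running", "stopped" or "finished"


_MODEL_CACHE = {}
_MODEL_LOCK = threading.Lock()


def get_model(lang, loader):
    """Return one cached model per language; `loader` stands in for load_model()."""
    with _MODEL_LOCK:
        if lang not in _MODEL_CACHE:
            _MODEL_CACHE[lang] = loader(lang)
        return _MODEL_CACHE[lang]


def synthesize_chunks(chunks, generate, status):
    """Generate chunk by chunk, checking the stop event before each one."""
    status.total_chunks = len(chunks)
    status.done_chunks = 0
    status.state = "running"
    results = []
    for chunk in chunks:
        if STOP_REQUESTED.is_set():
            status.state = "stopped"
            return results
        results.append(generate(chunk))
        status.done_chunks += 1
    status.state = "finished"
    return results
```

A service would typically call clear_stop() before starting a new job, so that a stale stop request does not cancel it immediately, and expose the JobStatus fields through its status endpoint.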