chatterbox-tts-cli/CLAUDE.md
dschlueter bcf6374c29 Extension: stop mechanism, REST service, and MCP adapter
- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event
  (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming
  and synthesize_streaming check the event before each chunk; --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread;
  endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools:
  speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: documented architecture and start commands
- .gitignore: excluded the Ideen/ directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 09:46:43 +02:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Running the CLI

conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt

No build step, no test suite, no linter configuration — this is a single-file script.

Architecture

Everything lives in chatterbox_cli_v4.py. The processing pipeline is:

Text input → normalization → chunking → TTS generation → audio output

Text normalization (preprocess_tts_text)

Applied per chunk before synthesis. Order matters:

  1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
  2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
  3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
  4. Year normalization (2026 → "zweitausendsechsundzwanzig")
  5. Acronym spelling (ARD → "Ah Er De"; skips entries in NON_SPELLED_ACRONYMS)

DEFAULT_PRONUNCIATION_DE contains built-in German phonetic approximations (e.g. Xi → "Schi").
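
Why the order matters can be shown with a minimal sketch of the first and last steps. Everything here is an assumption for illustration: the helper names (`apply_pronunciation`, `spell_acronyms`, `normalize_sketch`), the tiny tables, and the `UNESCO` entry are hypothetical and do not reproduce the script's actual implementation.

```python
import re

# Hypothetical miniature tables; the real script's tables are larger.
PRONUNCIATION = {"Xi": "Schi", "UNESCO": "Unesko"}
NON_SPELLED_ACRONYMS = {"NATO"}
LETTER_NAMES_DE = {"A": "Ah", "D": "De", "R": "Er"}


def apply_pronunciation(text: str) -> str:
    """Step 1: replace whole words found in the pronunciation dict."""
    for word, spoken in PRONUNCIATION.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
    return text


def spell_acronyms(text: str) -> str:
    """Step 5: spell out all-caps words letter by letter, unless excluded."""
    def spell(match: re.Match) -> str:
        word = match.group(0)
        if word in NON_SPELLED_ACRONYMS:
            return word  # pronounced as a word, not letter by letter
        return " ".join(LETTER_NAMES_DE.get(ch, ch) for ch in word)

    return re.sub(r"\b[A-ZÄÖÜ]{2,}\b", spell, text)


def normalize_sketch(text: str) -> str:
    # Dict substitution runs first, so "UNESCO" is already "Unesko"
    # by the time the acronym step looks for all-caps words.
    return spell_acronyms(apply_pronunciation(text))
```

If the two steps ran in the opposite order, the acronym step would spell `UNESCO` letter by letter before the dictionary entry could match.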

Text chunking

Three modes (chosen by CLI flags):

  • sentence_mode (default): split_into_sentences() — one sentence per TTS call, lowest latency to first audio
  • conversation_mode: split_for_conversation() — first chunk is small (--first-chunk-len, default 80 chars), rest up to --len (400)
  • plain: split_long_text() — paragraph-aware chunking up to --len

SENTENCE_END_RE handles edge cases like ordinal numbers, ellipses, and CJK punctuation. SEPARATOR_LINE_RE silently drops lines like --- Ende ---.
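
The idea behind conversation_mode (a short first chunk for fast first audio, larger chunks afterwards) can be sketched as greedy sentence packing. This is an assumption-laden illustration: `split_for_conversation_sketch` is a hypothetical stand-in, its sentence regex is far simpler than SENTENCE_END_RE, and a single sentence longer than the limit is kept whole rather than split.

```python
import re


def split_for_conversation_sketch(text, first_chunk_len=80, max_len=400):
    """Pack sentences greedily: small first chunk, larger later chunks."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Only the very first chunk uses the small limit.
        limit = first_chunk_len if not chunks else max_len
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With a small first chunk, the first TTS call returns sooner, and the longer chunks that follow can be generated while earlier audio is still playing.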

Model loading (load_model)

  • --lang en → ChatterboxTTS (monolingual, always available)
  • Other languages → ChatterboxMultilingualTTS (requires multilingual package; HAS_MULTILINGUAL flag guards import)
  • --t3-model v3 (default) or v2 selects the multilingual T3 checkpoint
  • Models are downloaded to ~/.cache/huggingface/ on first use (~23 GB)
  • Critical: attn_implementation = "eager" is forced at import time because SDPA returns None attention weights, breaking the AlignmentStreamAnalyzer hook
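
The guarded import behind HAS_MULTILINGUAL and the language-based choice might look roughly like the sketch below. `optional_import` and `pick_model_class_name` are hypothetical helpers, and the function returns class names as strings so that the sketch runs without the chatterbox package installed.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def pick_model_class_name(lang, has_multilingual):
    """Mirror the selection rule described above (names only, no loading)."""
    if lang == "en":
        return "ChatterboxTTS"
    if not has_multilingual:
        raise RuntimeError(
            f"--lang {lang} needs the multilingual package, which is missing"
        )
    return "ChatterboxMultilingualTTS"
```

Failing early with a clear message when the multilingual package is missing is one reasonable design; the actual script may handle this case differently.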

Audio output (PlaybackWorker)

  • Uses sounddevice.OutputStream with a callback at 48 kHz (PipeWire/PulseAudio standard)
  • Internal producer thread converts Torch tensors → CALLBACK_BLOCK-sized (2048 samples) numpy arrays
  • If --speed != 1.0: pyrubberband R3-Engine (--fine flag) stretches time without pitch change before resampling
  • Resampling: torchaudio.functional.resample(chunk, model_sr, 48000)
  • PlaybackWorker.stop() sends None sentinel into the queue and joins the thread
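
The producer-thread pattern with a None sentinel can be sketched using only the standard library. `BlockProducer` is a hypothetical stand-in for the producer side of PlaybackWorker: plain lists replace Torch tensors and numpy arrays, there is no sounddevice stream, and zero-padding the final partial block is an assumption, not confirmed behavior.

```python
import queue
import threading

CALLBACK_BLOCK = 2048  # block size named in the text


class BlockProducer:
    """Cuts incoming audio chunks into fixed-size blocks on a worker thread."""

    def __init__(self):
        self.inbox = queue.Queue()   # whole chunks in
        self.blocks = queue.Queue()  # fixed-size blocks out
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def feed(self, samples):
        self.inbox.put(list(samples))

    def stop(self):
        self.inbox.put(None)  # sentinel: no more audio will arrive
        self.thread.join()

    def _run(self):
        pending = []
        while True:
            chunk = self.inbox.get()
            if chunk is None:
                break
            pending.extend(chunk)
            while len(pending) >= CALLBACK_BLOCK:
                self.blocks.put(pending[:CALLBACK_BLOCK])
                del pending[:CALLBACK_BLOCK]
        if pending:
            # Assumption: pad the last partial block with silence.
            pending.extend([0.0] * (CALLBACK_BLOCK - len(pending)))
            self.blocks.put(pending)
```

Because the sentinel travels through the same queue as the audio, everything fed before `stop()` is processed before the thread exits.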

Two synthesis paths

  • synthesize_non_streaming: generates each chunk fully, feeds finished tensors to PlaybackWorker, concatenates all wavs for --save
  • synthesize_streaming: calls model.generate_stream() with chunk_size; each yielded audio sub-chunk goes directly to PlaybackWorker; marked experimental in docs
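
The difference between the two paths can be sketched with a fake model. `FakeModel`, the `play` callback, and both `*_sketch` functions are hypothetical; only the method name `generate_stream` and its `chunk_size` parameter come from the description above, and the exact signatures are assumed.

```python
class FakeModel:
    """Stand-in model: one 'sample' per input character."""

    def generate(self, text):
        return [float(ord(c)) for c in text]

    def generate_stream(self, text, chunk_size=4):
        wav = self.generate(text)
        for i in range(0, len(wav), chunk_size):
            yield wav[i:i + chunk_size]


def synthesize_non_streaming_sketch(model, chunks, play):
    """Generate each text chunk fully, play it, and keep it for saving."""
    wavs = []
    for chunk in chunks:
        wav = model.generate(chunk)
        play(wav)
        wavs.append(wav)
    # Concatenate all chunk audio, as done for the saved file.
    return [sample for wav in wavs for sample in wav]


def synthesize_streaming_sketch(model, chunks, play, chunk_size=4):
    """Hand every yielded sub-chunk to playback as soon as it exists."""
    for chunk in chunks:
        for sub_chunk in model.generate_stream(chunk, chunk_size=chunk_size):
            play(sub_chunk)
```

In the streaming path, playback can begin after the first sub-chunk instead of after the whole text chunk, which is where the lower latency comes from.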

Planned extensions (Ideen/)

The Ideen/ folder documents a planned REST/MCP bridge:

  • tts_service.py (FastAPI): POST /speak, POST /stop, GET /health, GET /voices
  • mcp_adapter.py: thin MCP wrapper calling the REST API
  • chatterbox_backend.py: imports chatterbox_cli_v4.py via importlib and calls synthesize_non_streaming() directly

Key gaps to address before building the service:

  1. Stop/interrupt: PlaybackWorker.stop() drains the audio queue, but a blocking model.generate() call cannot be interrupted mid-run. A threading.Event-based cancel token threaded through synthesize_non_streaming is the planned approach.
  2. Model caching: load_model() reloads from disk on every call; a service needs a per-language singleton.
  3. Status object: progress is print()-based; a service needs structured state.
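
The three gaps could be closed along the following lines. This is a minimal sketch under stated assumptions, not the service's actual code: `JobStatus`, `get_model`, `_load_model_stub`, and `synthesize_sketch` are hypothetical names, and the stub loader stands in for the real `load_model()`.

```python
import threading
from dataclasses import dataclass
from functools import lru_cache

LOAD_CALLS = []  # records how often the (stubbed) expensive load runs


@dataclass
class JobStatus:
    """Structured state instead of print()-based progress."""
    state: str = "idle"  # idle | running | stopped | done
    chunks_done: int = 0
    chunks_total: int = 0


def _load_model_stub(lang):
    # Hypothetical stand-in for the real, slow load_model().
    LOAD_CALLS.append(lang)
    return lambda text: f"<audio {lang}: {text}>"


@lru_cache(maxsize=None)
def get_model(lang):
    """Per-language singleton: the load happens once per language."""
    return _load_model_stub(lang)


def synthesize_sketch(chunks, lang, status, stop_event):
    """Check the cancel token before each chunk (cooperative stop)."""
    model = get_model(lang)
    status.state = "running"
    status.chunks_total, status.chunks_done = len(chunks), 0
    out = []
    for chunk in chunks:
        if stop_event.is_set():
            status.state = "stopped"
            return out
        out.append(model(chunk))
        status.chunks_done += 1
    status.state = "done"
    return out
```

The stop is cooperative: a generate call that is already running finishes first, so the worst-case stop latency is the generation time of one chunk.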