dschlueter bcf6374c29 Extension: stop mechanism, REST service, and MCP adapter

- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event
  (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming
  and synthesize_streaming check the event before each chunk; --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread;
  endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools:
  speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: documented architecture and start commands
- .gitignore: excluded the Ideen/ directory

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-16 09:46:43 +02:00

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Running the CLI

conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"ProperName": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt

There is no build step, test suite, or linter configuration; the CLI is a single-file script.

Architecture

Everything lives in chatterbox_cli_v4.py. The processing pipeline is:

Text input → normalization → chunking → TTS generation → audio output

Text normalization (`preprocess_tts_text`)

Applied per chunk before synthesis. Order matters:

1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in NON_SPELLED_ACRONYMS)

DEFAULT_PRONUNCIATION_DE contains built-in German phonetic approximations (e.g. Xi → "Schi").
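
The ordering can be illustrated with a minimal sketch. It is not the script's actual `preprocess_tts_text`: the tables below are tiny hypothetical stand-ins, the helper names are invented for illustration, and the time and year steps are left out because they need a number-to-words routine.

```python
import re

# Hypothetical stand-ins for the script's tables (assumptions, not the real contents).
PRONUNCIATION = {"Xi": "Schi"}
UNITS = {"km/h": "Kilometer pro Stunde"}
NON_SPELLED_ACRONYMS = {"NATO"}
LETTER_NAMES = {"A": "Ah", "D": "De", "R": "Er"}  # deliberately partial


def apply_pronunciation(text: str) -> str:
    """Replace whole words found in the pronunciation table."""
    for word, spoken in PRONUNCIATION.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
    return text


def normalize_units(text: str) -> str:
    """Expand a unit that directly follows a number."""
    for unit, spoken in UNITS.items():
        text = re.sub(rf"(\d+)\s*{re.escape(unit)}(?!\w)", rf"\1 {spoken}", text)
    return text


def spell_acronyms(text: str) -> str:
    """Spell all-caps words letter by letter unless they are exempt."""
    def spell(match: re.Match) -> str:
        word = match.group(0)
        if word in NON_SPELLED_ACRONYMS or any(c not in LETTER_NAMES for c in word):
            return word  # exempt, or contains a letter this sketch does not know
        return " ".join(LETTER_NAMES[c] for c in word)

    return re.sub(r"\b[A-Z]{2,}\b", spell, text)


def preprocess_sketch(text: str) -> str:
    """Run the steps in a fixed order: pronunciation first, acronyms last."""
    for step in (apply_pronunciation, normalize_units, spell_acronyms):
        text = step(text)
    return text
```

Running the pronunciation step first means a dictionary entry that happens to be all caps is replaced as a word before the acronym step could spell it out.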

Text chunking

Three modes (chosen by CLI flags):

sentence_mode (default): split_into_sentences(), one sentence per TTS call, lowest latency to first audio
conversation_mode: split_for_conversation(), where the first chunk is small (--first-chunk-len, default 80 characters) and later chunks grow up to --len (default 400)
plain: split_long_text(), paragraph-aware chunking up to --len

SENTENCE_END_RE handles edge cases like ordinal numbers, ellipses, and CJK punctuation. SEPARATOR_LINE_RE silently drops lines like --- Ende ---.
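
As a rough illustration of the conversation mode, the sketch below packs sentences greedily with a small limit for the first chunk and a larger one afterwards. The function name, the simple sentence regex, and the decision to keep an over-long sentence whole are assumptions; the script's own `split_for_conversation()` and `SENTENCE_END_RE` handle more cases.

```python
import re


def split_for_conversation_sketch(text: str, first_chunk_len: int = 80, max_len: int = 400) -> list[str]:
    """Greedy sentence packing: a short first chunk, then chunks up to max_len."""
    # Naive sentence split on ., ! or ? followed by whitespace (assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # The small limit applies only while the first chunk is being built.
        limit = first_chunk_len if not chunks else max_len
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > limit:
            chunks.append(current)
            current = sentence
        else:
            # A single sentence longer than the limit is kept whole.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The short first chunk lets playback start while the longer chunks are still being synthesized.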

Model loading (`load_model`)

--lang en → ChatterboxTTS (monolingual, always available)
Other languages → ChatterboxMultilingualTTS (requires multilingual package; HAS_MULTILINGUAL flag guards import)
--t3-model v3 (default) or v2 selects the multilingual T3 checkpoint
Models are downloaded to ~/.cache/huggingface/ on first use (~2–3 GB)
Critical: attn_implementation = "eager" is forced at import time, because the SDPA attention implementation returns None for attention weights, which breaks the AlignmentStreamAnalyzer hook

Audio output (`PlaybackWorker`)

Uses sounddevice.OutputStream with a callback at 48 kHz (PipeWire/PulseAudio standard)
Internal producer thread converts Torch tensors → CALLBACK_BLOCK-sized (2048 samples) numpy arrays
If --speed != 1.0, pyrubberband (R3 engine, --fine flag) time-stretches the audio without changing pitch, before resampling
Resampling: torchaudio.functional.resample(chunk, model_sr, 48000)
PlaybackWorker.stop() sends None sentinel into the queue and joins the thread
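
The producer side can be sketched with the standard library alone. This is a simplified, hypothetical stand-in rather than the real PlaybackWorker: plain Python lists replace Torch tensors and numpy arrays, there is no sounddevice callback, and zero-padding the final partial block is an assumption.

```python
import queue
import threading

CALLBACK_BLOCK = 2048  # block size named in the text


class BlockProducer:
    """Hypothetical stand-in for the producer thread inside PlaybackWorker."""

    def __init__(self) -> None:
        self.inbox: queue.Queue = queue.Queue()   # whole chunks from synthesis
        self.blocks: queue.Queue = queue.Queue()  # fixed-size blocks for the audio callback
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def feed(self, samples) -> None:
        """Queue one synthesized chunk (any iterable of float samples)."""
        self.inbox.put(list(samples))

    def stop(self) -> None:
        """Send the None sentinel, then wait for the producer thread to finish."""
        self.inbox.put(None)
        self._thread.join()

    def _run(self) -> None:
        pending: list[float] = []
        while True:
            chunk = self.inbox.get()
            if chunk is None:  # sentinel: no more audio will arrive
                break
            pending.extend(chunk)
            # Emit as many full blocks as the buffered samples allow.
            while len(pending) >= CALLBACK_BLOCK:
                self.blocks.put(pending[:CALLBACK_BLOCK])
                pending = pending[CALLBACK_BLOCK:]
        if pending:
            # Pad the last partial block with silence (assumption).
            pending.extend([0.0] * (CALLBACK_BLOCK - len(pending)))
            self.blocks.put(pending)
```

Because the sentinel travels through the same FIFO queue as the audio, chunks fed before `stop()` are still converted to blocks in this sketch; whether the real worker discards them instead is not stated here.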

Two synthesis paths

synthesize_non_streaming: generates each chunk fully, feeds finished tensors to PlaybackWorker, concatenates all wavs for --save
synthesize_streaming: calls model.generate_stream() with chunk_size; each yielded audio sub-chunk goes directly to PlaybackWorker; marked experimental in docs

Planned extensions (Ideen/)

The Ideen/ folder documents a planned REST/MCP bridge:

tts_service.py (FastAPI): POST /speak, POST /stop, GET /health, GET /voices
mcp_adapter.py: thin MCP wrapper calling the REST API
chatterbox_backend.py: imports chatterbox_cli_v4.py via importlib and calls synthesize_non_streaming() directly

Key gaps to address before building the service:

Stop/interrupt: PlaybackWorker.stop() drains the audio queue, but a blocking model.generate() call cannot be interrupted mid-run. A threading.Event-based cancel token threaded through synthesize_non_streaming is the planned approach.
Model caching: load_model() reloads from disk on every call; a service needs a per-language singleton.
Status object: progress is print()-based; a service needs structured state.
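
A minimal sketch of how the three gaps could be closed together. The names STOP_REQUESTED, request_stop, and clear_stop follow the commit message; everything else (JobStatus, get_model, synthesize_with_cancel, the `generate` callback, and the placeholder loader) is hypothetical and only illustrates the pattern.

```python
import threading
from dataclasses import dataclass
from functools import lru_cache

STOP_REQUESTED = threading.Event()  # cooperative cancel token


def request_stop() -> None:
    STOP_REQUESTED.set()


def clear_stop() -> None:
    STOP_REQUESTED.clear()


@dataclass
class JobStatus:
    """Structured progress instead of print() output (hypothetical fields)."""
    total_chunks: int = 0
    done_chunks: int = 0
    state: str = "idle"  # "idle", "running", "stopped", or "finished"


@lru_cache(maxsize=None)
def get_model(lang: str):
    """Per-language singleton; the dict is a placeholder for a real load_model() call."""
    return {"lang": lang}


def synthesize_with_cancel(chunks, generate, status: JobStatus) -> list:
    """Generate chunk by chunk, checking the stop event before each one."""
    status.total_chunks = len(chunks)
    status.done_chunks = 0
    status.state = "running"
    results = []
    for chunk in chunks:
        if STOP_REQUESTED.is_set():
            status.state = "stopped"
            return results
        # A generate() call already in flight still runs to completion.
        results.append(generate(chunk))
        status.done_chunks += 1
    status.state = "finished"
    return results
```

The check happens only between chunks, so the stop latency is bounded by the time one chunk takes to generate; shorter chunks make the stop more responsive.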
