Extension: stop mechanism, REST service, and MCP adapter

- chatterbox_cli_v4.py: cooperative stop mechanism via threading.Event (STOP_REQUESTED, request_stop, clear_stop); PlaybackWorker, synthesize_non_streaming and synthesize_streaming check the event before each chunk; --stop CLI flag
- tts_service.py: FastAPI service with model caching, job queue, and worker thread; endpoints: POST /speak, POST /stop, GET /health, GET /status, GET /voices
- mcp_adapter.py: MCP adapter (stdio/streamable-http) on top of tts_service; tools: speak, stop, get_status, list_voices
- requirements.txt: added fastapi, uvicorn, httpx, mcp
- CLAUDE.md: architecture and start commands documented
- .gitignore: Ideen/ directory excluded

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Parent: bed29fb1c8
Commit: bcf6374c29
6 changed files with 563 additions and 3 deletions

.gitignore (vendored), 4 changes

@@ -25,3 +25,7 @@ env/
 
 # Claude Code
 .claude/
+
+# Ideas
+Ideen/
+

CLAUDE.md (new file), 86 lines

@@ -0,0 +1,86 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Running the CLI

```bash
conda activate chatterbox

# Read German text aloud from a file
python chatterbox_cli_v4.py --lang de --input text.txt

# With voice cloning
python chatterbox_cli_v4.py --lang de --voice my_voice.wav --input text.txt

# Pass text directly (English)
python chatterbox_cli_v4.py --lang en --text "Hello world"

# Save only, no playback
python chatterbox_cli_v4.py --lang de --no-play --output ausgabe.wav --input text.txt

# Adjust speed (pitch-preserving, requires rubberband-cli)
python chatterbox_cli_v4.py --lang de --speed 0.85 --input text.txt

# Streaming mode (experimental, lower latency, may sound choppy)
python chatterbox_cli_v4.py --lang de --stream --input text.txt

# Pronunciation dictionary (JSON: {"proper name": "phonetic spelling"})
python chatterbox_cli_v4.py --lang de --pronunciation-dict aussprache.json --input text.txt
```

No build step, no test suite, no linter configuration: this is a single-file script.

## Architecture

Everything lives in `chatterbox_cli_v4.py`. The processing pipeline is:

**Text input → normalization → chunking → TTS generation → audio output**

### Text normalization (`preprocess_tts_text`)
Applied per chunk before synthesis. Order matters:
1. Pronunciation dict substitutions (before acronym expansion, so proper names are caught first)
2. Unit normalization (120 km/h → "120 Kilometer pro Stunde")
3. Time normalization (14:58 → "vierzehn Uhr achtundfünfzig")
4. Year normalization (2026 → "zweitausendsechsundzwanzig")
5. Acronym spelling (ARD → "Ah Er De"; skips entries in `NON_SPELLED_ACRONYMS`)

`DEFAULT_PRONUNCIATION_DE` contains built-in German phonetic approximations (e.g. Xi → "Schi").
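
Why the order matters can be illustrated with a small sketch. The helper names (`apply_pronunciation`, `spell_acronyms`), the letter table, and the sample dictionary entry are hypothetical; they only stand in for the corresponding steps of `preprocess_tts_text`, whose implementation is not shown here.

```python
import re

# Hypothetical German letter names; the table in the real script may differ.
LETTER_NAMES = {"A": "Ah", "R": "Er", "D": "De", "X": "Iks", "I": "Ih"}


def apply_pronunciation(text: str, pron: dict) -> str:
    """Replace whole-word dictionary entries with their phonetic spelling."""
    for word, spoken in pron.items():
        text = re.sub(rf"\b{re.escape(word)}\b", spoken, text)
    return text


def spell_acronyms(text: str) -> str:
    """Spell out runs of two or more capital letters, letter by letter."""
    def spell(match: re.Match) -> str:
        return " ".join(LETTER_NAMES.get(ch, ch) for ch in match.group(0))
    return re.sub(r"\b[A-Z]{2,}\b", spell, text)


def normalize(text: str, pron: dict) -> str:
    # Dictionary first, so entries that look like acronyms are not spelled out.
    return spell_acronyms(apply_pronunciation(text, pron))


pron = {"XI": "Schi"}  # hypothetical entry that happens to look like an acronym
right_order = normalize("XI spricht in der ARD", pron)
wrong_order = apply_pronunciation(spell_acronyms("XI spricht in der ARD"), pron)
```

With the dictionary applied first, `right_order` keeps the intended pronunciation; in `wrong_order` the name has already been spelled letter by letter, so the dictionary entry no longer matches.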

### Text chunking
Three modes (chosen by CLI flags):
- **sentence_mode** (default): `split_into_sentences()`, one sentence per TTS call, lowest latency to first audio
- **conversation_mode**: `split_for_conversation()`, first chunk is small (`--first-chunk-len`, default 80 chars), rest up to `--len` (400)
- **plain**: `split_long_text()`, paragraph-aware chunking up to `--len`

`SENTENCE_END_RE` handles edge cases like ordinal numbers, ellipses, and CJK punctuation. `SEPARATOR_LINE_RE` silently drops lines like `--- Ende ---`.
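
The idea behind conversation mode (a short first chunk for fast first audio, larger chunks afterwards) can be sketched as follows. `split_first_small` and `take_chunk` are hypothetical helpers that split on whitespace only; the real `split_for_conversation()` presumably respects sentence boundaries and may differ in detail.

```python
def take_chunk(words: list, limit: int) -> tuple:
    """Greedily take words until adding another would exceed `limit` characters."""
    taken = []
    length = 0
    while words:
        extra = len(words[0]) + (1 if taken else 0)  # +1 for the joining space
        if taken and length + extra > limit:
            break
        taken.append(words[0])
        length += extra
        words = words[1:]
    return " ".join(taken), words


def split_first_small(text: str, first_len: int = 80, max_len: int = 400) -> list:
    """First chunk is capped at `first_len`, all later chunks at `max_len`."""
    words = text.split()
    chunks = []
    limit = first_len
    while words:
        chunk, words = take_chunk(words, limit)
        chunks.append(chunk)
        limit = max_len  # only the first chunk uses the small limit
    return chunks


text = " ".join(["wort"] * 200)
chunks = split_first_small(text)
```

A single word longer than the limit is still emitted as its own chunk, so the loop always makes progress.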

### Model loading (`load_model`)
- `--lang en` → `ChatterboxTTS` (mono, always available)
- Other languages → `ChatterboxMultilingualTTS` (requires multilingual package; `HAS_MULTILINGUAL` flag guards import)
- `--t3-model v3` (default) or `v2` selects the multilingual T3 checkpoint
- Models are downloaded to `~/.cache/huggingface/` on first use (~2–3 GB)
- **Critical**: `attn_implementation = "eager"` is forced at import time because SDPA returns `None` attention weights, breaking the `AlignmentStreamAnalyzer` hook
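
The selection rule in the first two bullets can be sketched as a small function. `select_model_class` is a hypothetical helper that returns class names as strings; raising an error when the multilingual package is missing is an assumption, since the text only states that the import is guarded.

```python
def select_model_class(lang: str, has_multilingual: bool) -> str:
    """Map a language code to the name of the model class described above."""
    if lang == "en":
        return "ChatterboxTTS"  # mono model, always available
    if not has_multilingual:
        # Assumption: the real script may handle this case differently.
        raise RuntimeError(f"language {lang!r} requires the multilingual package")
    return "ChatterboxMultilingualTTS"


english = select_model_class("en", has_multilingual=False)
german = select_model_class("de", has_multilingual=True)
```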

### Audio output (`PlaybackWorker`)
- Uses `sounddevice.OutputStream` with a callback at 48 kHz (PipeWire/PulseAudio standard)
- Internal producer thread converts Torch tensors → `CALLBACK_BLOCK`-sized (2048 samples) numpy arrays
- If `--speed != 1.0`: pyrubberband R3-Engine (`--fine` flag) stretches time without pitch change before resampling
- Resampling: `torchaudio.functional.resample(chunk, model_sr, 48000)`
- `PlaybackWorker.stop()` sends `None` sentinel into the queue and joins the thread
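
The block arithmetic behind the producer thread can be checked with a few lines of plain Python. The helper names are hypothetical, and lists stand in for the numpy arrays the real worker uses; zero-padding the last partial block is an assumption about how the tail is flushed.

```python
CALLBACK_BLOCK = 2048    # samples per callback block, as stated above
PLAYBACK_RATE = 48_000   # Hz


def split_into_blocks(samples: list, remainder: list) -> tuple:
    """Return (full blocks, new remainder) for a producer-style loop."""
    data = remainder + samples
    n_full = len(data) // CALLBACK_BLOCK
    blocks = [data[i * CALLBACK_BLOCK:(i + 1) * CALLBACK_BLOCK] for i in range(n_full)]
    return blocks, data[n_full * CALLBACK_BLOCK:]


def flush(remainder: list) -> list:
    """Zero-pad the final partial block so the callback always gets a full one."""
    return remainder + [0.0] * (CALLBACK_BLOCK - len(remainder))


block_ms = CALLBACK_BLOCK / PLAYBACK_RATE * 1000  # duration of one block in ms

blocks, rest = split_into_blocks([0.1] * 5000, [])
last = flush(rest)
```

Carrying the remainder into the next call keeps every queued block exactly `CALLBACK_BLOCK` samples long, which lets the audio callback copy data without any size checks.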

### Two synthesis paths
- **`synthesize_non_streaming`**: generates each chunk fully, feeds finished tensors to `PlaybackWorker`, concatenates all wavs for `--save`
- **`synthesize_streaming`**: calls `model.generate_stream()` with `chunk_size`; each yielded audio sub-chunk goes directly to `PlaybackWorker`; marked experimental in docs
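
The difference between the two paths is mainly the granularity of what reaches the playback queue. The sketch below uses stand-in functions (`fake_generate`, `fake_generate_stream`) instead of the real model calls, and plain lists instead of tensors; it only illustrates the control flow, not the actual API.

```python
from typing import Iterator, List


def fake_generate(text: str) -> List[int]:
    """Stand-in for a blocking model call: returns the whole chunk at once."""
    return list(range(len(text)))


def fake_generate_stream(text: str, chunk_size: int) -> Iterator[List[int]]:
    """Stand-in for a streaming call: yields sub-chunks of `chunk_size` samples."""
    samples = fake_generate(text)
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def play_non_streaming(chunks: List[str], sink: list) -> None:
    for text in chunks:
        sink.append(fake_generate(text))  # one queue item per text chunk


def play_streaming(chunks: List[str], sink: list, chunk_size: int = 25) -> None:
    for text in chunks:
        for sub in fake_generate_stream(text, chunk_size):
            sink.append(sub)  # many small queue items, available earlier


full_items: list = []
stream_items: list = []
play_non_streaming(["x" * 60, "y" * 40], full_items)
play_streaming(["x" * 60, "y" * 40], stream_items)
```

Both sinks end up with the same total number of samples; the streaming path simply delivers them in smaller pieces.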

## Planned extensions (Ideen/)

The `Ideen/` folder documents a planned **REST/MCP bridge**:
- `tts_service.py` (FastAPI): `POST /speak`, `POST /stop`, `GET /health`, `GET /voices`
- `mcp_adapter.py`: thin MCP wrapper calling the REST API
- `chatterbox_backend.py`: imports `chatterbox_cli_v4.py` via `importlib` and calls `synthesize_non_streaming()` directly

Key gaps to address before building the service:
1. **Stop/interrupt**: `PlaybackWorker.stop()` drains the audio queue, but a blocking `model.generate()` call cannot be interrupted mid-run. A `threading.Event`-based cancel token threaded through `synthesize_non_streaming` is the planned approach.
2. **Model caching**: `load_model()` reloads from disk on every call; a service needs a per-language singleton.
3. **Status object**: progress is `print()`-based; a service needs structured state.
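
The first two gaps can be sketched together: a cancel token that is checked between chunks, and a cache that loads each model only once. This is a minimal illustration with stand-ins (`synthesize`, `ModelCache`, `fake_generate` are hypothetical names), not the code used in the repository.

```python
import threading
from typing import Callable, Dict, List, Tuple


def synthesize(chunks: List[str], generate: Callable[[str], str],
               stop_event: threading.Event) -> List[str]:
    """Generate chunk by chunk; stop between chunks once the event is set."""
    out: List[str] = []
    for chunk in chunks:
        if stop_event.is_set():      # checked only between chunks
            break
        out.append(generate(chunk))  # a call that is already running is not interrupted
    return out


class ModelCache:
    """Load each (lang, t3_model) combination at most once."""

    def __init__(self, loader: Callable[[str, str], object]):
        self._loader = loader
        self._models: Dict[Tuple[str, str], object] = {}
        self._lock = threading.Lock()

    def get(self, lang: str, t3_model: str) -> object:
        key = (lang, t3_model)
        with self._lock:             # serialize loads across threads
            if key not in self._models:
                self._models[key] = self._loader(lang, t3_model)
            return self._models[key]


stop = threading.Event()


def fake_generate(text: str) -> str:
    if text == "b":
        stop.set()                   # simulate a stop request arriving mid-run
    return text.upper()


done = synthesize(["a", "b", "c"], fake_generate, stop)

loads: list = []
cache = ModelCache(lambda lang, t3: loads.append((lang, t3)) or object())
first = cache.get("de", "v3")
second = cache.get("de", "v3")
```

Because the event is only polled between chunks, the chunk being generated when the stop arrives still completes; the worst-case stop latency is therefore one chunk.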

chatterbox_cli_v4.py

@@ -10,6 +10,20 @@ import time
 from pathlib import Path
 from typing import List, Optional, Tuple
 
+# ---------------------------------------------------------------------------
+# Cooperative stop mechanism
+# ---------------------------------------------------------------------------
+STOP_REQUESTED = threading.Event()
+
+def request_stop() -> None:
+    STOP_REQUESTED.set()
+
+def clear_stop() -> None:
+    STOP_REQUESTED.clear()
+
+def stop_requested() -> bool:
+    return STOP_REQUESTED.is_set()
+
 import torch
 import torchaudio as ta
 
@@ -556,10 +570,12 @@ class PlaybackWorker:
     PLAYBACK_RATE = 48000  # PipeWire/PulseAudio standard
     CALLBACK_BLOCK = 2048  # ~43 ms per callback block at 48 kHz
 
-    def __init__(self, sample_rate: int, device: Optional[str] = "pulse", speed: float = 1.0):
+    def __init__(self, sample_rate: int, device: Optional[str] = "pulse", speed: float = 1.0,
+                 stop_event: Optional[threading.Event] = None):
         self.sample_rate = sample_rate
         self.device = device
         self.speed = speed
+        self.stop_event = stop_event
         # Input: Torch tensors from the TTS model
         self.audio_queue: "queue.Queue[Optional[torch.Tensor]]" = queue.Queue()
         # Internal: fully prepared numpy blocks for the callback
@@ -579,6 +595,9 @@ class PlaybackWorker:
 
     def _callback(self, outdata, frames, time_info, status):
         # Runs in the audio thread: as fast as possible, no lock needed.
+        if self.stop_event and self.stop_event.is_set():
+            outdata[:] = 0.0
+            return
         try:
             data = self._block_queue.get_nowait()
             outdata[:, 0] = data
@@ -593,6 +612,8 @@ class PlaybackWorker:
         remainder = np.zeros(0, dtype="float32")
 
         while True:
+            if self.stop_event and self.stop_event.is_set():
+                break
             item = self.audio_queue.get()
             if item is None:
                 break
@@ -690,6 +711,7 @@ def synthesize_non_streaming(
     debug_delay: float = 0.0,
     t3_model: Optional[str] = None,
     pronunciation_dict: Optional[dict] = None,
+    stop_event: Optional[threading.Event] = None,
 ) -> Optional[Path]:
     if lang not in SUPPORTED_LANGS:
         raise ValueError(
@@ -739,7 +761,8 @@ def synthesize_non_streaming(
         print(f"Ausgabe: {output_path}")
 
     if play_audio:
-        playback = PlaybackWorker(sample_rate=sr, device=audio_device, speed=speed)
+        playback = PlaybackWorker(sample_rate=sr, device=audio_device, speed=speed,
+                                  stop_event=stop_event)
         playback.start()
     else:
         playback = None
@@ -747,6 +770,10 @@ def synthesize_non_streaming(
     wavs = []
     try:
         for i, chunk in enumerate(chunks, start=1):
+            if stop_event and stop_event.is_set():
+                if show_progress:
+                    print("Abbruch angefordert – Synthese gestoppt.")
+                break
             if debug_delay > 0:
                 if show_progress:
                     print(f"[{i}/{len(chunks)}] Warte {debug_delay:.0f}s (debug_delay) ...")
@@ -793,6 +820,7 @@ def synthesize_streaming(
     save_wav: bool = True,
     stream_chunk_size: int = 25,
     audio_device: Optional[str] = None,
+    stop_event: Optional[threading.Event] = None,
 ) -> Optional[Path]:
     if lang not in SUPPORTED_LANGS:
         raise ValueError(
@@ -829,7 +857,7 @@ def synthesize_streaming(
         raise ValueError("Kein verwertbarer Text nach dem Einlesen gefunden.")
 
     if play_audio:
-        playback = PlaybackWorker(sample_rate=sr, device=audio_device)
+        playback = PlaybackWorker(sample_rate=sr, device=audio_device, stop_event=stop_event)
         playback.start()
     else:
         playback = None
@@ -853,6 +881,10 @@ def synthesize_streaming(
 
     try:
         for text_idx, text_chunk in enumerate(text_chunks, start=1):
+            if stop_event and stop_event.is_set():
+                if show_progress:
+                    print("Abbruch angefordert – Streaming gestoppt.")
+                break
             if show_progress:
                 print(f"[Text {text_idx}/{len(text_chunks)}] Starte Streaming für {len(text_chunk)} Zeichen ...")
 
@@ -866,6 +898,8 @@ def synthesize_streaming(
             )
 
             for audio_idx, item in enumerate(stream_iter, start=1):
+                if stop_event and stop_event.is_set():
+                    break
                 if isinstance(item, tuple) and len(item) == 2:
                     audio_chunk, metrics = item
                 else:
@@ -944,6 +978,7 @@ def build_argparser() -> argparse.ArgumentParser:
     p.add_argument("--debug-delay", type=float, default=0.0, help="Sekunden Pause vor jedem Satz (simuliert langsame KI). Nur zum Testen.")
     p.add_argument("--t3-model", type=str, default="v3", help="Multilingual T3-Modell: 'v3' (default), 'v2' oder Dateiname.")
     p.add_argument("--no-conversation-mode", action="store_true", help="Ersten Chunk nicht künstlich kleiner machen (nur ohne --no-sentence-mode).")
+    p.add_argument("--stop", action="store_true", help="Globales Stop-Signal setzen (für Tests und Service-Integration).")
     return p
 
 
@@ -951,6 +986,11 @@ def main() -> int:
     parser = build_argparser()
     args = parser.parse_args()
 
+    if args.stop:
+        request_stop()
+        print("Stop-Signal gesetzt.")
+        return 0
+
     try:
         text = read_input_text(args.text, args.input)
         device = get_device(args.device)
@@ -970,6 +1010,8 @@ def main() -> int:
                 raise FileNotFoundError(f"Aussprache-Dict nicht gefunden: {pron_path}")
             pronunciation_dict = json.loads(pron_path.read_text(encoding="utf-8"))
 
+        clear_stop()
+
         if args.stream:
             out = synthesize_streaming(
                 text=text,
@@ -990,6 +1032,7 @@ def main() -> int:
                 save_wav=save_wav,
                 stream_chunk_size=args.stream_chunk_size,
                 audio_device=args.audio_device,
+                stop_event=STOP_REQUESTED,
             )
         else:
             out = synthesize_non_streaming(
@@ -1015,6 +1058,7 @@ def main() -> int:
                 debug_delay=args.debug_delay,
                 t3_model=args.t3_model,
                 pronunciation_dict=pronunciation_dict,
+                stop_event=STOP_REQUESTED,
             )
 
         if out is not None:

mcp_adapter.py (new file), 134 lines

@@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""
Chatterbox TTS: MCP adapter

Requires a running tts_service.py (default: http://127.0.0.1:8000).

Start (streamable-http, port 8001, for arbitrary MCP clients):
    python mcp_adapter.py

Start (stdio, for Claude Code / Claude Desktop):
    python mcp_adapter.py --stdio

Claude Code configuration (.mcp.json in the project root):
    {
      "mcpServers": {
        "chatterbox-tts": {
          "command": "python",
          "args": ["/home/dschlueter/chatterbox-tts-cli/mcp_adapter.py", "--stdio"]
        }
      }
    }

The TTS_URL environment variable overrides the service address:
    TTS_URL=http://192.168.1.10:8000 python mcp_adapter.py --stdio
"""
from __future__ import annotations

import argparse
import os

import httpx
from mcp.server.fastmcp import FastMCP

TTS_URL = os.environ.get("TTS_URL", "http://127.0.0.1:8000").rstrip("/")

mcp = FastMCP(
    "Chatterbox TTS",
    instructions=(
        "Lokaler Text-to-Speech-Service. Liest Texte auf Deutsch und 20+ weiteren "
        "Sprachen vor. Unterstützt Voice Cloning, Geschwindigkeitsanpassung und "
        "Aussprache-Wörterbücher."
    ),
    port=8001,
)


# ---------------------------------------------------------------------------
# Tools
# ---------------------------------------------------------------------------

@mcp.tool()
async def speak(
    text: str,
    lang: str = "de",
    voice: str | None = None,
    interrupt: bool = False,
    speed: float = 1.0,
) -> dict:
    """Speak text aloud.

    Adds the text to the output queue. The model generates sentence by
    sentence and starts playback immediately.

    Args:
        text: Text to speak (max. 4000 characters).
        lang: Language code, e.g. 'de', 'en', 'fr'. Default: 'de'.
        voice: Optional path to a WAV reference file (10–30 s) for
               voice cloning.
        interrupt: True = interrupt the running output immediately and
                   play this text ahead of everything else.
        speed: Playback speed (0.5–2.0). Pitch stays the same.
    """
    async with httpx.AsyncClient(timeout=15) as client:
        r = await client.post(f"{TTS_URL}/speak", json={
            "text": text,
            "lang": lang,
            "voice": voice,
            "interrupt": interrupt,
            "speed": speed,
        })
        r.raise_for_status()
        return r.json()


@mcp.tool()
async def stop() -> dict:
    """Stop the running speech output immediately and clear the queue."""
    async with httpx.AsyncClient(timeout=5) as client:
        r = await client.post(f"{TTS_URL}/stop")
        r.raise_for_status()
        return r.json()


@mcp.tool()
async def get_status() -> dict:
    """Query the current output status.

    Returns the running job (with chunk progress), the queue length, and
    the most recently completed jobs.
    """
    async with httpx.AsyncClient(timeout=5) as client:
        r = await client.get(f"{TTS_URL}/status")
        r.raise_for_status()
        return r.json()


@mcp.tool()
async def list_voices() -> dict:
    """Query supported languages and notes on voice cloning."""
    async with httpx.AsyncClient(timeout=5) as client:
        r = await client.get(f"{TTS_URL}/voices")
        r.raise_for_status()
        return r.json()


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chatterbox TTS MCP-Adapter")
    parser.add_argument(
        "--stdio", action="store_true",
        help="stdio-Transport (für Claude Code / Claude Desktop)",
    )
    parser.add_argument("--host", default="127.0.0.1",
                        help="Host für streamable-http (Standard: 127.0.0.1)")
    parser.add_argument("--port", type=int, default=8001,
                        help="Port für streamable-http (Standard: 8001)")
    args = parser.parse_args()

    if args.stdio:
        mcp.run()  # stdio is the default transport
    else:
        # FastMCP.run() takes no host/port arguments; they live on the settings object.
        mcp.settings.host = args.host
        mcp.settings.port = args.port
        mcp.run(transport="streamable-http")

requirements.txt

@@ -15,3 +15,13 @@ sounddevice>=0.4.0
 pyrubberband>=0.4.0
 # rubberband-cli must additionally be installed as a system package:
 # sudo apt install rubberband-cli
+
+# HTTP service (phase 2)
+fastapi>=0.115.0
+uvicorn[standard]>=0.32.0
+
+# HTTP client for the MCP adapter (phase 3)
+httpx>=0.28.0
+
+# MCP adapter (phase 3)
+mcp>=1.0.0

tts_service.py (new file), 282 lines

@@ -0,0 +1,282 @@
#!/usr/bin/env python3
"""
Chatterbox TTS: local HTTP service

Start:
    uvicorn tts_service:app --host 127.0.0.1 --port 8000

Endpoints:
    POST /speak   - add text to the queue
    POST /stop    - cancel the running output, clear the queue
    GET  /health  - service status
    GET  /status  - current job + queue length
    GET  /voices  - supported languages
"""
from __future__ import annotations

import queue
import sys
import threading
import uuid
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Optional

# Load the CLI module from the same directory
sys.path.insert(0, str(Path(__file__).parent))
import chatterbox_cli_v4 as tts  # noqa: E402

import torch
import torchaudio as ta
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

# ---------------------------------------------------------------------------
# Determine the device once
# ---------------------------------------------------------------------------
_DEVICE = tts.get_device(None)

# ---------------------------------------------------------------------------
# Model cache (lang, t3_model) → (model, model_kind, sr)
# ---------------------------------------------------------------------------
_model_cache: dict[tuple, tuple] = {}
_model_lock = threading.Lock()


def _get_or_load_model(lang: str, t3_model: str) -> tuple:
    key = (lang, t3_model)
    with _model_lock:
        if key not in _model_cache:
            _model_cache[key] = tts.load_model(lang, _DEVICE, t3_model=t3_model)
        return _model_cache[key]


# ---------------------------------------------------------------------------
# Job data model
# ---------------------------------------------------------------------------
class JobStatus(str, Enum):
    pending = "pending"
    running = "running"
    done = "done"
    cancelled = "cancelled"
    error = "error"


@dataclass
class SpeakJob:
    id: str
    text: str
    lang: str
    t3_model: str
    voice: Optional[str]
    speed: float
    audio_device: str
    max_len: int
    save_wav: bool
    output_path: Optional[str]
    pronunciation_dict: Optional[dict]
    session_id: Optional[str]
    status: JobStatus = field(default=JobStatus.pending)
    text_preview: str = field(default="")
    chunks_total: int = 0
    chunks_done: int = 0
    error: Optional[str] = None


# ---------------------------------------------------------------------------
# Worker thread
# ---------------------------------------------------------------------------
_job_queue: queue.Queue[SpeakJob] = queue.Queue()
_current_job: Optional[SpeakJob] = None
_state_lock = threading.Lock()
_recent_jobs: list[SpeakJob] = []
_MAX_RECENT = 20


def _worker() -> None:
    global _current_job

    while True:
        job = _job_queue.get()

        with _state_lock:
            _current_job = job
            job.status = JobStatus.running

        tts.clear_stop()

        try:
            model, model_kind, sr = _get_or_load_model(job.lang, job.t3_model)

            raw = tts.clean_raw_text(job.text)
            raw_chunks = tts.split_into_sentences(raw, max_len=job.max_len)
            chunks = [
                tts.preprocess_tts_text(c, lang=job.lang,
                                        pronunciation_dict=job.pronunciation_dict)
                for c in raw_chunks
            ]
            chunks = [c for c in chunks if c.strip()]

            job.chunks_total = len(chunks)
            job.text_preview = job.text[:80]

            playback = tts.PlaybackWorker(
                sample_rate=sr,
                device=job.audio_device,
                speed=job.speed,
                stop_event=tts.STOP_REQUESTED,
            )
            playback.start()

            wavs: list[torch.Tensor] = []
            try:
                for chunk in chunks:
                    if tts.stop_requested():
                        break
                    wav = tts.generate_chunk(model, model_kind, chunk, job.lang, job.voice)
                    wavs.append(wav)
                    playback.put(wav)
                    job.chunks_done += 1
            finally:
                playback.stop()

            if job.save_wav and job.output_path and wavs:
                out = Path(job.output_path)
                out.parent.mkdir(parents=True, exist_ok=True)
                final = wavs[0] if len(wavs) == 1 else torch.cat(wavs, dim=-1)
                ta.save(str(out), final, sr)

            job.status = (
                JobStatus.cancelled if tts.stop_requested() else JobStatus.done
            )

        except Exception as exc:  # noqa: BLE001
            job.status = JobStatus.error
            job.error = str(exc)

        finally:
            with _state_lock:
                _current_job = None
                _recent_jobs.append(job)
                if len(_recent_jobs) > _MAX_RECENT:
                    _recent_jobs.pop(0)
            _job_queue.task_done()


_worker_thread = threading.Thread(target=_worker, daemon=True, name="tts-worker")
_worker_thread.start()


# ---------------------------------------------------------------------------
# API models
# ---------------------------------------------------------------------------
class SpeakRequest(BaseModel):
    text: str = Field(min_length=1, max_length=4000)
    lang: str = "de"
    voice: Optional[str] = None
    interrupt: bool = False
    speed: float = Field(default=1.0, ge=0.5, le=2.0)
    t3_model: str = "v3"
    audio_device: str = "pulse"
    max_len: int = Field(default=400, ge=50, le=1000)
    save_wav: bool = False
    output_path: Optional[str] = None
    session_id: Optional[str] = None
    pronunciation_dict: Optional[dict] = None


def _job_to_dict(j: SpeakJob) -> dict:
    return {
        "id": j.id,
        "status": j.status,
        "lang": j.lang,
        "text_preview": j.text_preview,
        "chunks_total": j.chunks_total,
        "chunks_done": j.chunks_done,
        "error": j.error,
    }


def _drain_queue() -> None:
    while not _job_queue.empty():
        try:
            _job_queue.get_nowait()
            _job_queue.task_done()
        except queue.Empty:
            break


# ---------------------------------------------------------------------------
# FastAPI app
# ---------------------------------------------------------------------------
app = FastAPI(title="Chatterbox TTS Service", version="1.0")


@app.get("/health")
def health():
    return {"status": "ok", "device": _DEVICE}


@app.get("/voices")
def voices():
    return {
        "languages": sorted(tts.SUPPORTED_LANGS),
        "note": "Voice cloning via 'voice' field (WAV-Pfad, 10–30s Aufnahme)",
    }


@app.post("/speak")
def speak(req: SpeakRequest):
    if req.lang not in tts.SUPPORTED_LANGS:
        raise HTTPException(status_code=422,
                            detail=f"Sprache nicht unterstützt: {req.lang}")
    if req.voice and not Path(req.voice).exists():
        raise HTTPException(status_code=422,
                            detail=f"Voice-Datei nicht gefunden: {req.voice}")

    if req.interrupt:
        tts.request_stop()
        _drain_queue()

    job = SpeakJob(
        id=str(uuid.uuid4()),
        text=req.text,
        lang=req.lang,
        t3_model=req.t3_model,
        voice=req.voice,
        speed=req.speed,
        audio_device=req.audio_device,
        max_len=req.max_len,
        save_wav=req.save_wav,
        output_path=req.output_path,
        pronunciation_dict=req.pronunciation_dict,
        session_id=req.session_id,
    )
    _job_queue.put(job)

    return {
        "job_id": job.id,
        "status": job.status,
        "queue_position": _job_queue.qsize(),
    }


@app.post("/stop")
def stop():
    tts.request_stop()
    _drain_queue()
    return {"stopped": True}


@app.get("/status")
def status():
    with _state_lock:
        cur = _current_job
        recent = list(_recent_jobs)

    return {
        "current_job": _job_to_dict(cur) if cur else None,
        "queue_length": _job_queue.qsize(),
        "recent_jobs": [_job_to_dict(j) for j in reversed(recent)],
    }