kotoba audio gateway: Speech-to-Text (Chirp 2) + Text-to-Speech (Neural2), Japanese

① Speech-to-Text (status: idle)

Streams 16 kHz mono PCM audio to /v1/stt/stream
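As a rough illustration of what streaming 16 kHz mono PCM involves on the client side, the sketch below splits a raw PCM buffer into fixed-duration chunks ready to be sent one at a time. The 16-bit sample width, the 100 ms chunk size, and the name `pcm_chunks` are assumptions for illustration; the page does not say which transport (WebSocket, chunked HTTP, or something else) /v1/stt/stream expects, so the sending step is left out.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # stated on the page
CHANNELS = 1             # mono, stated on the page
BYTES_PER_SAMPLE = 2     # assumption: 16-bit signed samples


def pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunk_ms-long slices of raw PCM (the last may be shorter)."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


if __name__ == "__main__":
    # One second of silence: 16,000 samples * 2 bytes = 32,000 bytes.
    one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    chunks = list(pcm_chunks(one_second))
    print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

Small chunks keep latency low for interim transcripts; the right size depends on the gateway, which the page does not specify.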

② Text-to-Speech (status: idle)

POST /v1/tts · cached: repeating the same text returns a cache HIT with no Google call
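The caching behaviour described for /v1/tts can be pictured with a minimal in-memory sketch: key the synthesized audio by a hash of the input text and call the backend only on a miss. The class name `TtsCache`, the SHA-256 key, the "MISS" label, and the injected `synthesize` callable are all hypothetical; the gateway's actual cache key and storage are not described on the page.

```python
import hashlib
from typing import Callable, Dict, Tuple


class TtsCache:
    """Text-keyed cache in front of a synthesis backend (illustrative only)."""

    def __init__(self, synthesize: Callable[[str], bytes]) -> None:
        self._synthesize = synthesize        # stands in for the Google TTS call
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def key_for(text: str) -> str:
        # Assumption: the text alone identifies the audio (one fixed voice).
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> Tuple[str, bytes]:
        """Return ("HIT" | "MISS", audio bytes) for the given text."""
        key = self.key_for(text)
        if key in self._store:
            return "HIT", self._store[key]   # no backend call
        audio = self._synthesize(text)
        self._store[key] = audio
        return "MISS", audio


if __name__ == "__main__":
    backend_calls = []

    def fake_synthesize(text: str) -> bytes:
        backend_calls.append(text)
        return text.encode("utf-8")          # placeholder for real audio

    cache = TtsCache(fake_synthesize)
    print(cache.get("こんにちは")[0])  # prints "MISS"
    print(cache.get("こんにちは")[0])  # prints "HIT"
    print(len(backend_calls))          # prints "1"
```

If the gateway supported several voices or speaking rates, those parameters would need to be part of the key as well.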

③ Pronunciation analysis (status: idle)

Records a short word, POSTs the raw 16 kHz PCM to /v1/pronunciation/analyze, and gets back pitch-accent + timing
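A hedged sketch of how a client might build the POST to /v1/pronunciation/analyze using only the Python standard library. The base URL, the `application/octet-stream` content type, the 16-bit sample assumption, and the function name `build_analyze_request` are illustrative guesses; the page gives only the path and the fact that the body is raw 16 kHz PCM. The response format is not documented here, so the sketch stops at building the request.

```python
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical; the page does not give a host


def build_analyze_request(pcm: bytes, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST carrying raw PCM as the request body."""
    if not pcm:
        raise ValueError("empty recording")
    if len(pcm) % 2 != 0:
        # Assumption: 16-bit samples, so the byte length must be even.
        raise ValueError("expected whole 16-bit samples")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/pronunciation/analyze",
        data=pcm,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},  # assumed
    )


if __name__ == "__main__":
    half_second = bytes(16_000)  # 8,000 silent 16-bit samples = 0.5 s at 16 kHz
    req = build_analyze_request(half_second)
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```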