kotoba audio gateway: Speech-to-Text (Chirp 2) + Text-to-Speech (Neural2), Japanese

Firebase ID token (or DEV_AUTH_TOKEN locally)

① Speech-to-Text (status: idle)

streams 16 kHz mono PCM → /v1/stt/stream
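As a rough illustration of what "16 kHz mono PCM" could look like on the client side, the sketch below converts float samples to PCM bytes and splits them into fixed-duration chunks for streaming. The page does not state the sample width or chunk size, so 16-bit signed little-endian samples and 100 ms chunks are assumptions, and `float_to_pcm16` and `chunk_pcm` are hypothetical helper names.

```python
import struct

SAMPLE_RATE_HZ = 16_000  # from the page: 16 kHz mono
BYTES_PER_SAMPLE = 2     # assumption: 16-bit signed little-endian samples


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    # Clip first so out-of-range input cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def chunk_pcm(pcm, chunk_ms=100):
    """Yield fixed-duration chunks; the final chunk may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

Under these assumptions each 100 ms chunk is 3200 bytes. How chunks are framed on the wire to /v1/stt/stream is not specified here and is left out of the sketch.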

Text (Japanese)

Voice (Neural2)

Encoding

POST /v1/tts · cache: — (repeat the same text → HIT, no Google call)
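The HIT behaviour described above can be pictured as a lookup keyed on the request fields. The following is only a minimal in-memory sketch, not the gateway's actual cache: keying on text, voice, and encoding together is an assumption (the page only mentions repeating the same text), and `cache_key`, `TtsCache`, and the `synthesize` callable are hypothetical names.

```python
import hashlib


def cache_key(text, voice, encoding):
    """Hash the fields assumed to determine the synthesized audio."""
    # A unit-separator byte keeps ("ab", "c") distinct from ("a", "bc").
    raw = "\x1f".join([text, voice, encoding]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class TtsCache:
    def __init__(self, synthesize):
        # Hypothetical backend callable: (text, voice, encoding) -> bytes
        self._synthesize = synthesize
        self._store = {}

    def get(self, text, voice, encoding):
        """Return (status, audio), where status is "HIT" or "MISS"."""
        key = cache_key(text, voice, encoding)
        if key in self._store:
            return "HIT", self._store[key]  # no backend call on a hit
        audio = self._synthesize(text, voice, encoding)
        self._store[key] = audio
        return "MISS", audio
```

With this shape, the first request for a given text calls the backend once and reports MISS; an identical repeat reports HIT and skips the backend.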

pitchRef (seeded word)

records a short word → POSTs raw 16 kHz PCM to /v1/pronunciation/analyze → returns pitch-accent + timing
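A client call to this endpoint might be assembled as in the sketch below, which builds the request without sending it. The `Authorization: Bearer` scheme, the `application/octet-stream` content type, and the base URL are assumptions not stated on the page, and `build_analyze_request` is a hypothetical helper; how the pitchRef value is passed is not specified, so it is omitted.

```python
import urllib.request


def build_analyze_request(base_url, token, pcm):
    """Build (but do not send) a POST carrying raw PCM bytes."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/pronunciation/analyze",
        data=pcm,
        method="POST",
        headers={
            # Assumption: the Firebase ID token (or DEV_AUTH_TOKEN) is sent
            # as a Bearer token.
            "Authorization": "Bearer " + token,
            # Assumption: the raw PCM body is labelled as generic binary.
            "Content-Type": "application/octet-stream",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; the response format for pitch-accent and timing is not documented here, so parsing is left to the reader.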