Show HN: Voice AI Toys on ESP32 with Cloudflare Durable Objects

Source: hackernews · Summarized and analyzed by Genesis Park

Summary

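A backend built on Cloudflare Workers and Durable Objects has been created for ESP32-based voice AI devices. The system uses Cloudflare Workers AI for STT and TTS and OpenAI for the LLM, and it processes audio in real time over WebSocket. Each session is managed by its own Durable Object, which keeps session state isolated. The backend is offered as an alternative to the existing Deno server, but authentication and database writes are not yet implemented.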

Body

Cloudflare Workers + Durable Objects backend for Elato's ESP32 realtime voice flow.

This server keeps the existing Elato device protocol and routes audio through Cloudflare-hosted services:

- STT: Cloudflare Workers AI via @cloudflare/voice
- LLM: OpenAI Chat Completions
- TTS: Cloudflare Workers AI Deepgram Aura
- Transport: WebSocket + Opus packetization for ESP32

If you are new to the overall project, start with the root README first:

- README.md
- server/README.md

This backend is meant to be an alternative to the Deno edge server, not a separate firmware protocol. The ESP32 still talks to the same Elato-style control surface:

- auth
- AUDIO.COMMITTED
- RESPONSE.CREATED
- binary audio frames
- RESPONSE.COMPLETE
- SESSION.END

Public route: /ws/esp32

Health check: /healthz

Project layout:

    server/cloudflare/
    ├── models/
    │   ├── llm.ts
    │   ├── session.ts
    │   ├── stt.ts
    │   └── tts.ts
    ├── src/
    │   ├── index.ts
    │   ├── opus.ts
    │   ├── prompt.ts
    │   └── types.ts
    ├── package.json
    └── wrangler.toml

Session flow:

- The ESP32 opens a secure websocket to /ws/esp32.
- The Worker creates a fresh Durable Object session for that websocket.
- The server sends the Elato auth payload.
- The server triggers the first assistant turn.
- LLM output is synthesized to audio.
- Audio is packetized into Opus frames and streamed back to the ESP32.
- After playback, the ESP32 goes back to listening.
- Incoming mic audio is fed to the STT session for the next turn.

You need:

- Node.js 22+
- npm
- a Cloudflare account with Workers enabled
- a Workers AI binding
- an OpenAI API key for the LLM path

Install dependencies from the repository root:

    cd server/cloudflare
    npm install

Copy the example file:

    cp .dev.vars.example .dev.vars

Then fill in the values you actually need. Typical local file:

    OPENAI_API_KEY=...
    ELATO_OPENAI_MODEL=gpt-4.1-mini
    ELATO_OPENAI_SYSTEM_PROMPT=You are a friendly toy character.
    ELATO_OPENAI_FIRST_MESSAGE=Say hello first in one short sentence.

Notes:

- JWT_SECRET_KEY is not currently required for the stripped-down iteration unless you wire auth back in.
- Do not commit real secrets.

Start the local server:

    npm run dev

This uses:

    wrangler dev --ip 0.0.0.0 --port 8787

So local access is typically:

    http://:8787/healthz
    ws://:8787/ws/esp32

For local firmware testing:

- point the ESP32 at your machine's LAN IP, not 0.0.0.0
- local plain ws:// is fine for quick testing if your firmware build allows it
- production firmware should use wss://

Set the runtime secrets in Cloudflare:

- OPENAI_API_KEY
- optionally ELATO_OPENAI_MODEL
- optionally ELATO_OPENAI_SYSTEM_PROMPT
- optionally ELATO_OPENAI_FIRST_MESSAGE

Then deploy from the repository root:

    cd server/cloudflare
    npm run deploy

Example production route:

    wss://.workers.dev/ws/esp32

The current setup uses one fresh Durable Object per websocket voice session. That is the sensible default for realtime voice apps because:

- each call/session gets isolated state
- reconnects do not inherit stale memory
- turn state is easier to reason about
- cleanup is straightforward

This is what the Worker does in server/cloudflare/src/index.ts.

This backend already has a Durable Object rename migration in server/cloudflare/wrangler.toml:

    ElatoOpenAiVoiceAgent -> ElatoVoiceSession

If you rename the DO again later, add another migration instead of just changing the class name.

Common commands:

- Typecheck: npm run typecheck
- Local dev: npm run dev
- Deploy: npm run deploy

A few things matter in practice:

- Rapid reconnect testing can trigger Workers AI rate limits, especially on TTS.
- If you redeploy while a websocket session is active, Cloudflare may log "This script has been upgraded. Please send a new request to connect to the new version." That is expected during deploy churn.
- If the ESP32 flips into speaking briefly and then falls back, check whether TTS actually produced audio or hit a 429.
- If STT does not advance turns, inspect the STT provider logs first before debugging firmware state.

This Cloudflare backend is still a pragmatic project backend, not a polished platform product. Current caveats:

- auth is still intentionally stubbed out with comments
- DB writes are still placeholders
- Workers AI rate limiting can affect repeated testing
- the stack is still operationally rough compared with the more mature Deno path

If you are modifying this backend, read these first:

- server/cloudflare/src/index.ts
- server/cloudflare/models/session.ts
- server/cloudflare/models/stt.ts
- server/cloudflare/models/llm.ts
- server/cloudflare/models/tts.ts
- firmware-arduino/src/Audio.cpp
- firmware-arduino/src/Config.cpp

Elato currently includes multiple backend paths:

- server/deno
- server/cloudflare
- server/fastapi

Use Cloudflare when you want:

- Workers + Durable Objects
- Cloudflare-hosted STT/TTS
- a stateful edge session model

Use Deno when you want:

- the most battle-tested Elato path right now
- direct provider integrations already working in production
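
The "one fresh Durable Object per websocket" routing described earlier can be sketched roughly as follows. This is not the project's index.ts, which the source does not show. The interfaces `RequestLike`, `SessionStubLike`, and `SessionNamespaceLike` are hypothetical stand-ins that mirror only the `newUniqueId()`, `get(id)`, and `fetch(request)` members of the Workers runtime, so the sketch type-checks on its own. The `Routed` result type and `routeRequest` are also illustrative names.

```typescript
// Hypothetical structural stand-ins for Workers runtime types (only the members used here).
interface RequestLike {
  url: string;
  headers: { get(name: string): string | null };
}
interface SessionStubLike<R> {
  fetch(request: RequestLike): Promise<R>;
}
interface SessionNamespaceLike<R> {
  newUniqueId(): unknown; // mirrors DurableObjectNamespace.newUniqueId()
  get(id: unknown): SessionStubLike<R>; // mirrors DurableObjectNamespace.get(id)
}

// Illustrative result type: what the Worker would do with the request.
type Routed<R> =
  | { kind: "health" }
  | { kind: "upgrade-required" }
  | { kind: "not-found" }
  | { kind: "session"; response: Promise<R> };

function routeRequest<R>(
  request: RequestLike,
  sessions: SessionNamespaceLike<R>,
): Routed<R> {
  const pathname = new URL(request.url).pathname;
  if (pathname === "/healthz") return { kind: "health" };
  if (pathname !== "/ws/esp32") return { kind: "not-found" };

  const upgrade = request.headers.get("Upgrade");
  if (upgrade === null || upgrade.toLowerCase() !== "websocket") {
    return { kind: "upgrade-required" };
  }

  // A unique ID per connection means a reconnect never lands on an
  // object that still holds state from an earlier session.
  const stub = sessions.get(sessions.newUniqueId());
  return { kind: "session", response: stub.fetch(request) };
}
```

Deriving the ID from a stable name instead (for example a device ID) would route every reconnect to the same object, which is the stale-memory case the per-session design avoids.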
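
The turn flow listed earlier (first assistant turn, playback, back to listening, next user turn) can be modeled as a small state machine, which may help when debugging a device that gets stuck in one state. The message semantics here are assumptions, since the source only lists the message names. The sketch assumes AUDIO.COMMITTED marks the end of a user utterance, RESPONSE.CREATED marks the start of assistant audio, and RESPONSE.COMPLETE marks its end. `TurnState`, `ControlEvent`, `nextTurnState`, and `replay` are hypothetical names.

```typescript
// Hypothetical model of the session turn cycle; not taken from the Elato source.
type TurnState = "listening" | "thinking" | "speaking" | "ended";

// Control message names as listed in the device protocol.
type ControlEvent =
  | "AUDIO.COMMITTED"
  | "RESPONSE.CREATED"
  | "RESPONSE.COMPLETE"
  | "SESSION.END";

function nextTurnState(state: TurnState, event: ControlEvent): TurnState {
  // Once ended, the session stays ended; SESSION.END ends it from any state.
  if (state === "ended" || event === "SESSION.END") return "ended";
  // Assumed meaning: the user's utterance is final, so the server starts a reply.
  if (event === "AUDIO.COMMITTED" && state === "listening") return "thinking";
  // Assumed meaning: assistant audio frames are about to be streamed.
  if (event === "RESPONSE.CREATED" && state === "thinking") return "speaking";
  // Assumed meaning: playback data is done, so the device listens again.
  if (event === "RESPONSE.COMPLETE" && state === "speaking") return "listening";
  // Out-of-order events leave the state unchanged.
  return state;
}

// Apply a sequence of events, e.g. to replay a captured session log.
function replay(initial: TurnState, events: ControlEvent[]): TurnState {
  let state = initial;
  for (const event of events) {
    state = nextTurnState(state, event);
  }
  return state;
}
```

Because the server triggers the first assistant turn, a session in this model would start in "thinking" rather than "listening".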
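
The Opus packetization step mentioned earlier depends on fixed-size frames, because an Opus encoder accepts only specific frame durations. The source does not show opus.ts, so the following is only a sketch of the framing step that would precede encoding, assuming the TTS stage yields mono 16-bit PCM. The function name `splitIntoFrames` and the sample rate and frame duration used in the usage note are assumptions, and no encoding is performed here.

```typescript
// Frame durations (in milliseconds) that Opus supports.
const OPUS_FRAME_DURATIONS_MS = [2.5, 5, 10, 20, 40, 60];

// Split mono 16-bit PCM into equal-length frames, zero-padding the last one.
// Hypothetical helper: it only prepares encoder input and does not encode.
function splitIntoFrames(
  pcm: Int16Array,
  sampleRate: number,
  frameMs: number,
): Int16Array[] {
  if (OPUS_FRAME_DURATIONS_MS.indexOf(frameMs) === -1) {
    throw new RangeError("unsupported Opus frame duration: " + frameMs + " ms");
  }
  const samplesPerFrame = (sampleRate * frameMs) / 1000;
  if (!Number.isInteger(samplesPerFrame)) {
    throw new RangeError("frame duration does not divide evenly at " + sampleRate + " Hz");
  }

  const frames: Int16Array[] = [];
  for (let offset = 0; offset < pcm.length; offset += samplesPerFrame) {
    // A new Int16Array is zero-filled, so a short final chunk is padded with silence.
    const frame = new Int16Array(samplesPerFrame);
    frame.set(pcm.subarray(offset, offset + samplesPerFrame));
    frames.push(frame);
  }
  return frames;
}
```

For example, at an assumed 24000 Hz with 20 ms frames, each frame holds 480 samples, so 1000 input samples produce three frames with the last one mostly silence.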
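
Since the notes above mention TTS calls hitting a 429 during rapid reconnect testing, one common mitigation is to retry with a capped exponential backoff. The source does not describe this, so it is a suggestion rather than what the project does. In the sketch below, `HttpLikeResponse`, `retryDelayMs`, and `callWithRetry` are hypothetical names, the base and cap values are arbitrary, and it is an assumption that the provider sends a Retry-After header at all. The fallback path covers the case where it does not.

```typescript
// Hypothetical minimal response shape; a fetch Response satisfies it structurally.
interface HttpLikeResponse {
  status: number;
  headers: { get(name: string): string | null };
}

// Delay before the next attempt: honor a numeric Retry-After (seconds) if
// present, otherwise use exponential backoff. Both paths are capped.
function retryDelayMs(
  attempt: number,
  retryAfter: string | null,
  baseMs: number = 500,
  capMs: number = 8000,
): number {
  const trimmed = retryAfter === null ? "" : retryAfter.trim();
  const seconds = trimmed === "" ? NaN : Number(trimmed);
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, capMs);
  }
  // HTTP-date values and missing headers fall through to backoff.
  return Math.min(baseMs * Math.pow(2, attempt), capMs);
}

// Repeat the call while it returns 429, up to maxAttempts calls in total.
async function callWithRetry<R extends HttpLikeResponse>(
  call: () => Promise<R>,
  maxAttempts: number = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<R> {
  let response = await call();
  for (let attempt = 0; response.status === 429 && attempt < maxAttempts - 1; attempt++) {
    await sleep(retryDelayMs(attempt, response.headers.get("Retry-After")));
    response = await call();
  }
  // The caller still has to handle a final 429 if every attempt was limited.
  return response;
}
```

In a realtime voice turn, long waits are noticeable to the user, so a low cap and a small attempt count are probably more appropriate than they would be for a batch job.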

This analysis was written by the Genesis Park editorial team with the help of AI. The original is available through the source link.
