Show HN: Voice-tracking teleprompter with on-device ASR in the browser
hackernews
📦 Open source
#asr
#on-device
#review
#teleprompter
#browser
#speech-recognition
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
A new web-based teleprompter has been released that uses speech recognition to track the speaker's pace and position in the script in real time. Built on an ONNX model and Web Workers, it runs entirely in the browser with no backend server or account, executing a roughly 100 MB model via WebGPU or WASM. It also applies Double Metaphone normalization and an edit-distance algorithm, so it can distinguish homophones and still locate the correct script position through mispronunciations and ad-libs.
Body
Most teleprompters scroll at a fixed speed and trust you to keep up. This one inverts that: it listens continuously and tracks your position in the script in real time. If you pause, it waits. If you ad-lib a sentence, it finds its way back. The speech recognition runs entirely inside the browser tab, embedded as an ONNX model. Nothing is sent anywhere: no backend, no accounts, no audio leaving the tab.

Recognition is handled by Moonshine (Useful Sensors), a compact ASR model designed for on-device use. It's loaded via Transformers.js and runs in dedicated Web Workers (one for voice activity detection, one for transcription), so it never competes with the UI thread. The full pipeline looks like this:

Microphone → AudioWorklet (PCM @ 16 kHz) → Silero VAD (skips inference on silent frames) → Moonshine Tiny ONNX (speech-to-text) → Main thread (script matching + scroll)

┌─────────────────────────────────────────────────────────────────┐
│ Browser main thread (app.js)                                    │
│                                                                 │
│   ┌──────────┐  PCM   ┌───────────────────┐                     │
│   │ Audio    │───────▶│ VAD Worker        │                     │
│   │ Worklet  │ frames │ (vad.worker.js)   │                     │
│   │ 16 kHz   │        │                   │                     │
│   └──────────┘        │ Silero VAD ONNX   │                     │
│                       └────────┬──────────┘                     │
│                                │ speech segments                │
│                                ▼                                │
│                       ┌───────────────────┐                     │
│                       │ TX Worker         │                     │
│                       │ (transcribe.      │                     │
│                       │  worker.js)       │                     │
│                       │                   │                     │
│                       │ Moonshine ONNX    │                     │
│                       │ (WebGPU / WASM)   │                     │
│                       └────────┬──────────┘                     │
│                                │ transcript text                │
│                                ▼                                │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │ Script Matcher                                           │  │
│   │                                                          │  │
│   │ Token index ──▶ Banded Levenshtein ──▶ Beam tracker      │  │
│   │ (Double Metaphone)    (O(n·k))         (multi-hyp)       │  │
│   └──────────────────────────┬───────────────────────────────┘  │
│                              │ confirmed word position          │
│                              ▼                                  │
│   ┌──────────────────────────────────────────────────────────┐  │
│   │ UI: highlight pill + rAF scroll lerp + creep ticker      │  │
│   └──────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Where WebGPU is available the model runs on the GPU; otherwise it falls back to WASM. On first load the model weights (~100 MB) are fetched and cached; after that the page works offline.

Getting the position right is the hard part. Speech is messy: filler words, mispronunciations, homophones, repetition across sections, and the fact that Moonshine emits results in ~600 ms batches rather than word by word. A simple substring search breaks almost immediately. The tracker uses several layers to handle this.

Position tracking is fundamentally a fuzzy matching problem. Spoken tokens are compared to sliding windows of script tokens using word-level Levenshtein distance. To avoid O(n²) cost over every window, only the diagonal band of width k is computed, bringing complexity down to O(n·k), with an early exit once the running minimum exceeds the edit budget. DP rows use pre-allocated Int16Array buffers that are reused across calls, keeping allocator pressure close to zero.

Before any DP runs, the script is indexed into a map of token → [positions]. Candidate windows are found by hash lookup on the incoming spoken tokens rather than by scanning the full script. The vast majority of windows are eliminated before a single Levenshtein cell is computed.
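The VAD gating step earlier in the pipeline exists so the expensive ASR model never runs on silence. The project uses the Silero VAD ONNX model for this; as a minimal sketch of the same control flow, a trivial RMS energy gate stands in for the learned model below (the threshold and function names are illustrative assumptions):

```javascript
// Stand-in for the VAD stage: forward only frames whose RMS energy
// crosses a threshold, skipping inference on silent frames.
// NOTE: the real project uses Silero VAD, not an energy gate.
function rms(frame) {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

function gateFrames(frames, threshold = 0.01) {
  const speech = [];
  for (const frame of frames) {
    if (rms(frame) >= threshold) speech.push(frame); // else: skip inference
  }
  return speech;
}
```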
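The token → [positions] index can be sketched as follows. This is a plain inverted index; the phonetic-key step is replaced by lowercasing here, and names like `buildIndex` and `candidateWindows` are illustrative, not the project's actual API:

```javascript
// Inverted index from script tokens to every position where they occur.
function buildIndex(scriptTokens) {
  const index = new Map();
  scriptTokens.forEach((tok, pos) => {
    const key = tok.toLowerCase();
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(pos);
  });
  return index;
}

// Propose candidate windows by hash lookup on the spoken tokens,
// aligning each hit so the spoken phrase would start the window.
function candidateWindows(index, spokenTokens, windowSize) {
  const starts = new Set();
  spokenTokens.forEach((tok, offset) => {
    for (const pos of index.get(tok.toLowerCase()) ?? []) {
      const start = pos - offset;
      if (start >= 0) starts.add(start);
    }
  });
  return [...starts].map(s => ({ start: s, end: s + windowSize }));
}
```

Only the windows returned here ever reach the edit-distance stage; everything else is discarded by the hash lookup alone.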
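The banded edit distance with reusable rows and an early exit might look like the sketch below. The function name, the fixed `MAX_LEN`, and the sentinel handling are assumptions about the shape of such a matcher, not the project's code:

```javascript
// Word-level Levenshtein restricted to a diagonal band of half-width k.
// Rows are pre-allocated Int16Array buffers reused across calls; the loop
// exits early once the best value in a row exceeds the edit budget.
const MAX_LEN = 512;
const INF = 30000; // sentinel large enough to survive repeated +1 in Int16
const rowA = new Int16Array(MAX_LEN + 1);
const rowB = new Int16Array(MAX_LEN + 1);

function bandedLevenshtein(spoken, script, k, budget) {
  const n = spoken.length, m = script.length;
  if (Math.abs(n - m) > k) return INF; // band can never reach cell (n, m)
  let prev = rowA, curr = rowB;
  prev.fill(INF, 0, m + 1);
  for (let j = 0; j <= Math.min(m, k); j++) prev[j] = j;
  for (let i = 1; i <= n; i++) {
    curr.fill(INF, 0, m + 1); // cells outside the band stay "infinite"
    const lo = Math.max(1, i - k);
    const hi = Math.min(m, i + k);
    if (lo === 1) curr[0] = i;
    let rowMin = INF;
    for (let j = lo; j <= hi; j++) {
      const cost = spoken[i - 1] === script[j - 1] ? 0 : 1;
      const best = Math.min(prev[j - 1] + cost, prev[j] + 1, curr[j - 1] + 1);
      curr[j] = best;
      if (best < rowMin) rowMin = best;
    }
    if (rowMin > budget) return INF; // early exit: budget blown
    const tmp = prev; prev = curr; curr = tmp;
  }
  return prev[m];
}
```

Because the two row buffers are allocated once at module scope and only swapped, steady-state matching performs no allocations at all, which is the "allocator pressure close to zero" property described above.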
Before any matching runs, each token is converted to its Double Metaphone code, a consonant-cluster encoding (Lawrence Philips, 2000) that maps homophones to the same key automatically. right / write / rite all become RT; to / two / too all become T; their / there / they're, peace / piece, whether / weather and thousands of other pairs collapse without any manual list. The encoding is computed once at script parse time, so query-time cost is just a hash lookup.

A four-entry ASR_OVERRIDES table handles the two cases Double Metaphone cannot resolve algorithmically: articles (a / the, which ASR frequently swaps) and the an / and pair (which DM encodes differently). Everything else is covered by the algorithm.

Scripts repeat words. Without a locality signal, the matcher has no way to distinguish the third occurrence of "however" from the first. A distance penalty is applied to candidates further ahead of the last confirmed position, halving the effective score every 20 words of offset. This keeps the tracker anchored to the actual reading position rather than jumping to any high-scoring window in the script.

Moonshine's ~600 ms batch latency would cause the highlight to jump forward in visible lurches. Instead, the tracker measures your current speaking rate and speculatively advances the highlight at ~85% of the measured WPM between confirmed results. When the next transcript arrives, the position snaps to the confirmed match.
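The normalization layer might be structured like the sketch below. A real Double Metaphone implementation runs to several hundred lines, so a crude consonant-skeleton key stands in for it here; only the shape of the four-entry override table mirrors what the article describes, and all names are illustrative:

```javascript
// Hypothetical override table: collapse the two pairs the phonetic
// encoding cannot, as described above (a/the and an/and).
const ASR_OVERRIDES = new Map([
  ["a", "the"], ["the", "the"],
  ["an", "and"], ["and", "and"],
]);

// Crude stand-in for Double Metaphone: drop "gh", strip vowels and
// semi-vowels, collapse repeats. The real encoding is far richer.
function phoneticKey(word) {
  return word.toLowerCase()
    .replace(/[^a-z]/g, "")
    .replace(/gh/g, "")
    .replace(/[aeiouwhy]/g, "")
    .replace(/(.)\1+/g, "$1")
    .toUpperCase();
}

// Overrides win; everything else falls through to the phonetic key.
// Keys would be precomputed at script parse time and cached.
function normalizeToken(word) {
  const lower = word.toLowerCase();
  return ASR_OVERRIDES.get(lower) ?? phoneticKey(lower);
}
```

Even this toy key collapses several of the pairs mentioned above (right / write / rite, their / there), while an / and correctly require the override because their consonant skeletons differ.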
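The locality penalty reduces to one line of math: halve the score per 20 words of offset. A minimal sketch, assuming a symmetric penalty around the last confirmed position (the article only specifies the forward direction) and illustrative names:

```javascript
// Halve a candidate's effective score for every 20 words it sits away
// from the last confirmed position, so a repeated phrase near the reader
// outranks an identical phrase elsewhere in the script.
const HALF_LIFE_WORDS = 20;

function localityScore(rawScore, candidatePos, lastConfirmedPos) {
  const offset = Math.abs(candidatePos - lastConfirmedPos);
  return rawScore * Math.pow(0.5, offset / HALF_LIFE_WORDS);
}
```

A perfect match 60 words ahead is thus worth one eighth of a perfect match at the cursor, which is usually enough to keep the beam anchored.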
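The speculative creep between batches is a small rate calculation that a render loop (e.g. requestAnimationFrame) would evaluate each frame. A minimal sketch, with the function and constant names as assumptions:

```javascript
// Between ASR batches (~600 ms apart), advance the highlight at 85% of
// the measured speaking rate; when the next transcript lands, the
// position snaps to the confirmed match.
const CREEP_RATIO = 0.85;

function creepPosition(confirmedPos, wpm, msSinceConfirm) {
  const wordsPerMs = (wpm * CREEP_RATIO) / 60000; // WPM → words per ms
  return confirmedPos + wordsPerMs * msSinceConfirm;
}
```

Creeping at slightly under the measured rate means the speculative highlight tends to trail the speaker rather than overshoot, so the snap on the next confirmed result is a small forward correction instead of a visible jump backward.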
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.