Wispr Flow vs. Death Metal

Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

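The author, who had given up on speech-to-text (STT) software because of a non-American accent, tested Wispr Flow, a much-discussed dictation tool. Because speech models are trained mostly on clean conversational English, the author picked death metal as the hardest input available. The test is self-described as unscientific: it probes how far the tool holds up on audio well outside its training distribution, which is the same reason STT tends to fail on accents.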

Full text

Wispr Flow vs. Death Metal

An unscientific experiment to see if it's as good as people say it is

I have a non-American accent. Most STT software hears me say “deploy the build” and writes “destroy the bill,” or just gives up. The failure mode is so consistent that I stopped reaching for these tools years ago.

Then a few months ago everyone in my feed started talking about Wispr Flow. I tried it. It just worked. It got my accent right. It got my friends’ accents right. So, as an engineer, I wanted to know how far I could push it.

The Experiment

Speech recognition fails on accents for the same reason it fails on anything else outside its training distribution. Models are trained overwhelmingly on clean conversational English. Death metal is the platonic adversarial input: guttural and unintelligible screams, growls, and sustained notes at frequencies the conversational corpus simply doesn’t contain.

So, this is what I did:

The content: A song that exists in two versions: Iron Maiden’s “Hallowed Be Thy Name” and a death metal cover by Cradle of Filth. The Iron Maiden version is melodic, with mostly clean vocals. The Cradle of Filth version sounds like someone gargling a chainsaw. Same lyrics underneath. I found isolated vocal tracks for both on the Isolated Tracks YouTube channel, plus the full original songs.

Four transcribers: Wispr Flow, OpenAI Dictate, Claude voice mode, and one unnamed human listening cold without the lyrics in front of them.

Equipment: A MacBook with an M2 chip running the Wispr Flow, OpenAI, and Claude apps. A podcast-quality microphone. An iPhone 17 Pro for producing sound.

The setup: Start the app and enable native dictation → Play sound from the phone directly into the physical microphone → Pause the source after getting through the intro → Capture results in a Notion doc.

The Results

Just to ground the discussion, here are the original lyrics, taken from genius.com:

I’m waiting in my cold cell when the bell begins to chime.
Reflecting on my past life and it doesn’t have much time.
‘Cause at five o’clock they take me to the gallows pole.
The sands of time for me are running low

The table below shows what I got back from the different transcribers.

Iron Maiden

On the Iron Maiden isolated vocals, all three models do roughly fine. There are small slips like “looting” for “waiting,” “rotting” for “waiting,” and “saddens” for “sands,” but the song is intact. On the full song, OpenAI does a surprisingly good job - better than on the clean vocals - and I suspect it’s doing something similar to Shazam here. Claude doesn’t do well with interspersed music + vocals.

Cradle of Filth

On the Cradle of Filth tracks, the gap really opens. On the isolated vocals, Wispr gets it cleanly, including the death-growled “motherfuckers!” tag at the end!! I’m surprised at how close to the actual lyrics Wispr is. OpenAI Dictate loses the thread mid-line and starts paraphrasing. Claude voice gives up entirely after “gallows” and produces a single trailing word. The human, listening without the lyrics in front of them, also did badly (death metal is hard for humans too).

On the full song, Wispr does a pretty good job. Way better than OpenAI as well as the human!

Wispr

I ran this test because for a decade STT was something I’d try, get burned by, and put down. I type at >100 wpm, which is fast, and dictation used to be a downgrade once you counted cleanup time. It isn’t anymore. I can talk faster than I type, and the bottleneck on getting thoughts out of my head has moved from my fingers to my brain.

The other thing is that typing pins you to a screen and talking doesn’t. I can think out loud at an LLM while walking the dog, and the conversation is still there when I get back. That’s time I didn’t have access to before.

I started this expecting Wispr to fail. It didn’t. The tool got most of it right, including a death-growled “motherfuckers” that isn’t in the original lyrics!
If anyone from Wispr Flow is reading - do you train on death metal? And if so, please tell me it happens in the office, at full volume, with the engineers head-banging through QA.
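
The comparison above was done by eye. For anyone who wants to repeat it with a number attached, one common option is word error rate (WER): the word-level edit distance between the reference lyrics and a transcript, divided by the number of reference words. The sketch below is not the author's method, and the function names and the sample transcript are hypothetical. The transcript is the first lyric line with one slip the article mentions ("looting" for "waiting") swapped in, not real output from any of the tools.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase, unify curly apostrophes, and split into word tokens."""
    text = text.lower().replace("’", "'").replace("‘", "'")
    return re.findall(r"[a-z0-9']+", text)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# First lyric line as quoted in the article.
reference = "I’m waiting in my cold cell when the bell begins to chime"
# Hypothetical transcript: illustrative only, not actual tool output.
hypothesis = "I'm looting in my cold cell when the bell begins to chime"

# One substituted word out of twelve reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

WER penalizes paraphrasing and dropped lines heavily, so a transcript that "loses the thread" would score much worse than one with a few misheard words. It says nothing about extra words that are really in the audio but absent from the published lyrics, such as the growled tag at the end of the cover.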

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available through the source link.
