Meta Muse Spark is really good
hackernews
🔬 Research
#ai deals
#ai models
#chatgpt
#claude
#gemini
#gpt-5
#meta
#muse spark
#multimodal
#reasoning models
Source: hackernews · Summarized and analyzed by Genesis Park
Summary
A head-to-head test of Meta's newly released multimodal reasoning model, Muse Spark, against four other frontier AI models ended with Meta in first place overall. In a handwritten chalkboard-menu image-reading test, Meta proved its accuracy by producing the fewest hallucinations, and in a real-time stock analysis of NVIDIA, AMD, and Intel, all models picked NVIDIA as the best value stock, but Meta's depth of data analysis stood out most. However, in a test generating 3D snow globe code with Three.js, every model fell short of a complete implementation, and Meta, despite writing the technically strongest code, hit a rendering problem that left the screen black and lost that round to GPT-5.4. Even so, with clear wins in vision and data analysis, Meta Muse Spark delivered the best overall performance.
Full Text
Meta just shipped Muse Spark — their first natively multimodal reasoning model. The blog post was the usual parade of benchmark tables and cherry-picked demos. Benchmarks are benchmarks. I wanted to see for myself. So I ran three tests across five frontier models: Meta Muse Spark (Thinking), Claude Opus 4.6 (Thinking), GPT-5.4, Gemini 3.1 (Thinking), and Grok 4.2 (Expert). Same prompts, same context, zero retries. One shot each.

Test 1: Read the Menu

I grabbed a photo of a chalkboard menu from Yezzi's — handwritten chalk, glass reflections, multiple sections with prices, add-ons, and fine print. Then I asked each model: "What's on the menu?" This tests whether a model can actually read messy real-world images, not just describe them. The difference between "Boursin Turkey Sandwich" and "Brazini Sandwich" is the difference between useful and hallucinated.

[Chart] Menu Reading: Accuracy vs. Hallucinations. Green = items correctly identified (of 17); red = hallucinated items or prices.

Consensus Scores

Instead of hand-verifying every item, I used a consensus method: if 4+ out of 5 models agreed on an item or price, it was marked as ground truth. Here's how each model aligned with that consensus.

The Hallucination Hall of Fame

- Grok described the spicy chicken sandwich as having "Carolina spices, turbo w/ lemon, and a salty martini olive." It was Nashville spices, lettuce, and spicy mayo. Not even close.
- GPT-5.4 confidently identified a "Slapped Wagyu Dog" for $8. It was a Salami Wrapped Dog. I wish the Wagyu version existed.
- Claude read "Junior Beef" as "Lemon Beef," "Boursin" as "Brazini," and "Fried Chx" as "French Dip." Barely recognizable. Like reading a menu through a waterfall.
- Gemini added "apples" to the chicken salad that weren't there and renamed "Spicy Aioli" to "Garlic Aioli." At least these sound like real food. The smoothest hallucinator.

The most telling pattern: each model handles uncertainty differently. Meta gets it right or stays vague. Gemini smooths things over. GPT-5.4 guesses confidently. Grok invents food. Claude gives up and misreads the word entirely.

Test 2: Stock Analysis with Real Numbers

I asked each model to find current stock prices for NVIDIA, AMD, and Intel, calculate their P/E ratios from latest reported earnings, and pick the best value. This tests tool use (can it fetch live data?), math (does the arithmetic check out?), and reasoning (does the recommendation follow from the evidence?).

| Data Point | Meta | Claude | GPT-5.4 | Gemini | Grok |
|---|---|---|---|---|---|
| NVDA Price | $177.64 | $177.64 | $181.53 | $177.64 | $181.60 |
| NVDA EPS | $4.90 | $4.90 | $4.90 | $4.90 | $4.90 |
| NVDA P/E | 36.3x | 36.3x | 37.0x | 36.3x | 37.1x |
| AMD P/E | 82–84x | 81.5x | 87.5x | 83.6x | 89.2x |
| INTC P/E | N/A | N/A | N/A | N/A | N/A |
| Best Value | NVDA | NVDA | NVDA | NVDA | NVDA |

All five models picked NVIDIA. Unanimous. NVIDIA's $4.90 EPS was the one hard number every model nailed, and the P/E figures follow directly from it (e.g., $177.64 / $4.90 ≈ 36.3x). The differentiation was in the depth of analysis, not the conclusion.

Test 3: Build a Snow Globe

I asked each model to generate a single HTML file with a 3D snow globe using Three.js — glass sphere with refraction, pine trees and a house inside, falling snow particles, auto-orbiting camera, translucent materials, the works. One prompt, one shot, no iteration. This is where things got interesting.

I analyzed the code before opening the files in a browser. The code-level ranking looked like this: Meta had the best glass material, Gemini had 478 lines of sophisticated particle physics with shadow maps, and GPT-5.4 wrote the simplest, least ambitious code. Then I opened them in a browser.

[Chart] Code Analysis vs. What You Actually See (code quality by static analysis vs. rendered result).

Meta wrote the most technically correct Three.js. Then it rendered a black screen. The ranking completely inverted. Meta's technically correct glass material produced a black screen — the lighting couldn't penetrate. Gemini's 478 lines of sophisticated snow physics had a clock bug (getElapsedTime() internally calls getDelta(), so calling both in the same frame leaves dt ≈ 0) that froze the snow completely.
GPT-5.4's simple, safe code was the only one where you could actually see what was happening. Nobody passed this test. Zero models produced a snow globe you'd actually want to look at. But GPT-5.4 came closest — not by writing the best code, but by getting the basics right: enough light, visible glass, a scene you can see.

See For Yourself

[Interactive demo] All five snow globes, running live. Same prompt, one shot each.

The Final Scoreboard

Each model gets a composite score from 0–100. The formula: convert each test's rank (1st–5th) to points (100, 75, 50, 25, 0), then average across all three tests. Meta Muse Spark takes it. Two first-place finishes on vision and analysis gave it enough runway to absorb a 4th-place on code generation. Nobody else won more than one test. GPT-5.4's snow globe victory wasn't enough to offset a mediocre showing on the other two. Claude was the most consistent — never first, never last.
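As a sanity check on that formula, here is a small sketch. The helper names are mine, and only Meta's ranks (1st on vision, 1st on analysis, 4th on code) are stated in the article.

```js
// Scoreboard formula: rank 1-5 maps to 100/75/50/25/0 points,
// averaged across the three tests. Helper names are mine; only
// Meta's ranks (1st, 1st, 4th) are given in the article.
const points = (rank) => (5 - rank) * 25;

const composite = (ranks) =>
  ranks.map(points).reduce((sum, p) => sum + p, 0) / ranks.length;

console.log(composite([1, 1, 4])); // Meta Muse Spark: (100 + 100 + 25) / 3 = 75
```

Under that mapping, two wins absorb a fourth-place finish comfortably, which is consistent with Meta taking the overall crown.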
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.