Show HN: AI Benchy – AI Benchmarks and Comparison

hackernews | March 6, 2026 14:13 | 🔬 Research

#ai #gemini #gpt-5 #openai #review #leaderboard #model comparison #benchmark #performance evaluation

Source: hackernews · Summarized and analyzed by Genesis Park

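Summary

A new version of the AI benchmarking platform 'AI Benchy' has been released. The update substantially improves the desktop user experience (UX), and browsing model pages in particular has been redesigned to be more intuitive and more fun.

Body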

AI BENCHY: AI Benchmark Leaderboard

Last updated at: 2026-03-30
Models Evaluated: 78 (78/80)

Score: Summarizes broad quality across our full private benchmark suite, so ranking reflects consistent performance.
Tests Correct: Shows how many tests are fully passed (all runs pass). A test is fully passed only if every run passed for that test.
Archived model: this model is no longer updated or tested on new tests.

| Rank | Model | Score | Company | Total Cost | Response Time (avg) | Response Time (max) | Response Time (total) | Tests Correct (failure breakdown) |
|---|---|---|---|---|---|---|---|---|
| #1 🥇 | Gemini 3 Flash Preview (medium) | 10.0… | | $0.166… | 11.39s | 50.16s | 113.86s | No failed answers |
| #2 🥈 | Gemini 3.1 Pro Preview (medium) | 9.6… | | $0.522… | 15.56s | 40.61s | 155.64s | Wrong answer: 1 |
| #3 🥉 | Gemini 3 Flash Preview (low) | 8.7… | | $0.081… | 5.95s | 14.72s | 101.19s | Wrong answer: 3 |
| #4 | Gemini 3 PRO Preview (medium) | 8.7… | | $0.197… | 9.06s | 26.24s | 90.58s | Wrong answer: 3 |
| #5 | Seed-2.0-Lite (medium) | 8.5… | Bytedance Seed | $0.105… | 27.78s | 168.71s | 472.24s | Wrong answer: 3; Did not follow instructions: 2 |
| #6 | Qwen3.6 Plus Preview (medium) | 8.5… | Qwen | $0.000… | 13.94s | 43.55s | 237.01s | Wrong answer: 3; Did not follow instructions: 1 |
| #7 | GPT-5.3-Codex (medium) | 8.5… | OpenAI | $0.544… | 15.76s | 100.93s | 267.97s | Wrong answer: 3; Did not follow instructions: 2 |
| #8 | Gemini 3.1 Flash Lite Preview (high, archived model) | 8.4… | | $2.310… | 68.83s | 280.52s | 1101.32s | Wrong answer: 3; Did not follow instructions: 1 |
| #9 | Qwen3.5 Plus 2026-02-15 (medium) | 8.4… | Qwen | $0.189… | 39.13s | 81.20s | 391.29s | Timed out: 2; Wrong answer: 2 |
| #10 | Qwen3.5-122B-A10B (medium) | 8.4… | Qwen | $0.505… | 29.05s | 119.29s | 493.86s | Wrong answer: 3; Timed out: 1 |
| #11 | Qwen3.5-27B (medium) | 8.3… | Qwen | $0.467… | 52.01s | 163.96s | 884.10s | Did not follow instructions: 2; Extra formatting: 1; Timed out: 1; Wrong answer: 1 |
| #12 | GLM 5 (medium) | 8.3… | Z.ai | $0.108… | 17.15s | 28.96s | 154.32s | Wrong answer: 2; Did not follow instructions: 1; No answer: 1; Timed out: 1 |
| #13 | DeepSeek V3.2 (medium) | 8.2… | DeepSeek | $0.026… | 38.49s | 93.11s | 654.41s | Wrong answer: 3; Did not follow instructions: 1; Timed out: 1 |
| #14 | Gemini 2.5 Flash (medium) | 8.1… | | $0.292… | 11.88s | 95.48s | 201.89s | Wrong answer: 4; Did not follow instructions: 1 |
| #15 | Gemini 3.1 Flash Lite Preview (medium) | 8.1… | | $0.050… | 3.70s | 14.93s | 62.97s | Wrong answer: 4; Did not follow instructions: 1 |
| #16 | GPT-5.4 (medium) | 8.1… | OpenAI | $0.794… | 18.95s | 100.41s | 322.23s | Wrong answer: 3; Did not follow instructions: 2 |
| #17 | GLM 5 Turbo (medium) | 8.0… | Z.ai | $0.166… | 17.98s | … | … | Wrong answer: 3; Did not follow instructions: 2; Timed out: … |
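The leaderboard states that a test is fully passed only if every run passed for that test, and it reports average, maximum, and total response time per model. A minimal Python sketch of how such figures could be computed follows. The function names, the `(test_id, passed)` record shape, and the sample data are assumptions for illustration only; AI Benchy's actual implementation is not described in the source.

```python
from collections import defaultdict


def fully_passed_tests(runs):
    """Return the set of test ids for which every recorded run passed.

    `runs` is an iterable of (test_id, passed) pairs, one pair per run.
    This record shape is hypothetical, not taken from AI Benchy.
    """
    results_by_test = defaultdict(list)
    for test_id, passed in runs:
        results_by_test[test_id].append(passed)
    # A single failing run disqualifies the whole test.
    return {t for t, results in results_by_test.items() if all(results)}


def summarize_times(times):
    """Aggregate per-response durations (seconds) into avg, max, and total."""
    total = sum(times)
    return {"avg": total / len(times), "max": max(times), "total": total}


# Hypothetical sample data: t2 has one failing run, so it is not fully passed.
sample_runs = [
    ("t1", True), ("t1", True),
    ("t2", True), ("t2", False),
    ("t3", True),
]
passed = fully_passed_tests(sample_runs)
timing = summarize_times([1.0, 2.0, 3.0])
```

Under this rule a model that passes most runs of a test but fails one gets no credit for that test, which rewards consistent behavior over occasional correct answers.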

View original (hackernews)

This analysis was written by the Genesis Park editorial team with the help of AI. The original can be found via the source link.
