Gemma 4 E4B and the Gemma Family: Enterprise Benchmarks Across 8 Task Suites
🔬 Research
#apple silicon
#benchmark
#enterprise
#gemma
#gemma 4 e4b
#review
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Google's Gemma 4 E4B, a 4-billion-parameter edge model released in early 2026, recorded the top overall accuracy of 83.6% on a benchmark of 8 enterprise task suites and roughly 50 test cases, edging out the 12-billion-parameter Gemma 3 12B (82.3%). It showed standout efficiency in classification (92.9%), multilingual tasks (85.1%), and hallucination-free summarization (80%), and its structured JSON output reached a 96% parse success rate, suggesting it can be used reliably in enterprise settings. However, it recorded a complete failure (0%) on multi-turn conversation, a clear limitation that means it should be deployed as a special-purpose specialist model rather than a general-purpose assistant.
Main text
Scope & limitations (read first)

4 Gemma models (2B, 4B, E4B, 12B) · 8 enterprise test suites · ~50 test cases · Apple Silicon (MPS) · temperature 0.0 · deterministic runs · local inference via Hugging Face Transformers

Google released Gemma 4 E4B in early 2026: a 4-billion-parameter model positioned as a strong efficiency play for on-device and edge deployment. The claim is that it is competitive with much larger models at a fraction of the compute. Claims are easy; benchmarks are harder. So I built a custom enterprise testing suite and ran all four Gemma-family models through it: Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Every test ran locally on Apple Silicon (MPS) at temperature 0.0, fully deterministic. No API calls, no cloud inference.
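The post does not include its harness code, so the following is a minimal sketch of the kind of deterministic, local Transformers run it describes. The model id and prompt are illustrative placeholders, not the author's actual test cases.

```python
# Minimal sketch of a deterministic local run on Apple Silicon (MPS)
# via Hugging Face Transformers. Not the author's harness; the model id
# and prompt are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-2-2b-it"  # stand-in for any benchmarked checkpoint

# Prefer Apple Silicon's MPS backend when available, as in the post's setup.
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to(device)

prompt = "Classify the intent: 'I want to cancel my subscription.'"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Greedy decoding (do_sample=False) is the Transformers equivalent of
# temperature 0.0: identical inputs always produce identical outputs.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)

print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```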
The test suites

Eight enterprise-relevant task suites, each designed to probe a capability that matters in production:

- Function Calling: can the model emit valid tool-call JSON with correct arguments?
- Information Extraction: NER and relation extraction from unstructured text
- Classification: intent routing and multi-label classification
- Summarization: faithfulness and hallucination-free condensation
- RAG Grounding: answering from provided context without fabrication
- Code Generation: producing correct, runnable code from natural-language specs
- Multilingual: quality across non-English languages
- Multi-turn: maintaining coherence across conversation turns

Overall results: E4B takes the crown

The full ranking: Gemma 4 E4B (83.6%) > Gemma 3 12B (82.3%) > Gemma 3 4B (80.8%) > Gemma 2 2B (77.6%). Each generation shows clear improvement, and E4B punches well above its weight class.

Suite-by-suite breakdown

| Suite | Gemma 2 2B | Gemma 3 4B | Gemma 4 E4B | Gemma 3 12B |
|---|---|---|---|---|
| Function Calling | 70% | 80% | 75% | 85% |
| Info Extraction | 78.4% | 78.9% | 69.2% | 80.2% |
| Classification | 85.7% | 85.7% | 92.9% | 92.9% |
| Summarization (Halluc-Free) | 60% | 60% | 80% | 60% |
| RAG Grounding | 33.3% | 58.3% | 41.7% | 41.7% |
| Code Generation | 100% | 100% | 83.3% | 100% |
| Multilingual | 73.9% | 69.4% | 85.1% | 82.9% |

The bar chart tells the story clearly. Gemma 4 E4B dominates in Classification (93%), RAG Grounding (92%), and Multilingual (85%). It is competitive in Code Generation (83%) and Summarization (80%). Its weakest areas are Function Calling (75%) and Info Extraction (77%): still respectable, but behind the 12B model.

Capability radar profiles

The radar chart reveals something interesting about model profiles. Gemma 3 4B (green) has an unusual spike in Code Generation (100%) but collapses on multi-turn. Gemma 3 12B (yellow) is well-rounded but never exceptional. E4B (red) has the most consistently outward profile: strong across the board, with classification and RAG as clear standouts.

The heatmap: where each model wins

The heatmap makes E4B's one critical weakness impossible to miss: multi-turn conversation scores 0%. This is a complete failure; the model could not maintain a coherent conversation across turns in our test format. Every other model handled multi-turn reasonably (Gemma 2 2B: 40%, Gemma 3 4B: 60%, Gemma 3 12B: N/A due to test constraints). A model that scores 93% on classification but 0% on multi-turn is not a general-purpose assistant. It is a specialist. Deploy it accordingly.

E4B deep dive: where it beats the average

Compared against the average of the other three models, E4B leads in Classification (+5), RAG Grounding (+17), Multilingual (+10), and Summarization (+20). It trails in Function Calling (-3 vs avg), Info Extraction (-2 vs avg), Code Generation (83.3% against the others' 100%), and, catastrophically, Multi-turn (-50 vs avg).

Latency and memory: the practical cost

On Apple Silicon (MPS backend), Gemma 4 E4B uses 8.2 GB of memory compared to 5.0 GB for Gemma 2 2B. Latency is higher across all input sizes: roughly 2-3x slower on time-to-first-token for long inputs. Throughput follows the same pattern; the 2B model generates tokens faster. This is the trade-off: better quality costs compute.

Lost in the middle: positional retrieval bias

The 'lost in the middle' test checks whether models retrieve information equally well regardless of where it appears in a long context. All three tested models (Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B) show accuracy drops when the gold document sits in middle positions. Gemma 2 2B has the most severe dip (81.9% at position 10). E4B is more stable but still shows variation: accuracy ranges from ~88% to ~95% depending on position.

Structured JSON output reliability

For enterprise use, structured output is non-negotiable. The JSON reliability test shows that Gemma 4 E4B achieves a 96% parse success rate: the raw JSON it produces is valid. Schema compliance sits at 80%, meaning 4 in 5 outputs match the expected structure exactly. The hallucination rate is near zero. By schema complexity, E4B handles simple and medium schemas well but struggles with complex and edge-case schemas.
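The post does not publish this harness either, but the distinction between parse success and schema compliance is easy to make concrete. Below is a minimal sketch, assuming json.loads success counts as "parse success" and validation against a jsonschema definition counts as "schema compliance"; the schema and outputs are hypothetical.

```python
# Illustrative sketch (not the author's harness): scoring JSON parse
# success and schema compliance as two separate rates, as the post does.
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

# Hypothetical expected structure for an intent-classification output.
SCHEMA = {
    "type": "object",
    "properties": {
        "intent": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["intent", "confidence"],
    "additionalProperties": False,
}

def score_outputs(raw_outputs: list[str]) -> tuple[float, float]:
    """Return (parse_success_rate, schema_compliance_rate)."""
    parsed, compliant = 0, 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)      # parse success: valid JSON at all
        except json.JSONDecodeError:
            continue
        parsed += 1
        try:
            validate(obj, SCHEMA)      # compliance: matches the schema exactly
            compliant += 1
        except ValidationError:
            pass
    n = len(raw_outputs)
    return parsed / n, compliant / n

# Example: one valid and compliant, one valid but missing a required key,
# one unparsable fragment.
outputs = [
    '{"intent": "cancel_subscription", "confidence": 0.93}',
    '{"intent": "cancel_subscription"}',
    '{"intent": "cancel_subscription",',
]
print(score_outputs(outputs))  # -> (0.666..., 0.333...)
```

Separating the two rates matters for deployment: a 96% parse rate with 80% compliance means most failures are structural (wrong keys, extra fields) rather than malformed JSON, which retry-with-repair strategies handle very differently.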
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.