Sarvamai/Sarvam-105B
hackernews
🔬 Research
#ai-model
#mix
#review
#sarvam-105b
#sarvam-30b
#benchmarks
#llm
#mixtral
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Sarvam-105B is an open-source Mixture-of-Experts (MoE) AI model with 10.3B active parameters that delivers strong reasoning performance on coding, math, and agentic tasks. It offers state-of-the-art support for 22 Indian languages and matches major closed-source models on several benchmarks, including MMLU 90.6 and Math500 98.6. Released under the Apache License, the model is designed to handle long contexts of up to 128K tokens via YaRN scaling.
Full Text
Want a smaller model? Download Sarvam-30B!

- Introduction
- Architecture
- Benchmarks
- Knowledge & Coding
- Reasoning & Math
- Agentic
- Inference
- Footnote
- Citation

Introduction

Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding. Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting. A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size. Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192 split into RoPE and noPE components, v_head_dim=128) and a large head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This approach improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context). It has an intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, which increases per-token active capacity while keeping activation cost manageable. The model has one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing (a minimal routing sketch follows the benchmark tables below).

Benchmarks

Knowledge & Coding

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |

Reasoning & Math

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |

Agentic

| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |

See footnote for evaluation details.
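To make the MoE hyperparameters from the Architecture section concrete, here is a minimal sketch of the routing path they describe: top-8 routing over 128 experts, one always-active shared expert, and a routed scaling factor of 2.5. This is an illustration only, not the released implementation: the module and field names (`GatedExpert`, `MoEBlock`) are hypothetical, the expert activation is an assumption, and the auxiliary-loss-free balancing bias (a training-time detail) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters restated from the Architecture section (names are illustrative).
HIDDEN_SIZE = 4096
MOE_INTERMEDIATE_SIZE = 2048
NUM_ROUTED_EXPERTS = 128
TOP_K = 8                    # top-8 routing
NUM_SHARED_EXPERTS = 1       # one always-active shared expert
ROUTED_SCALING_FACTOR = 2.5


class GatedExpert(nn.Module):
    """One gated-MLP expert (the exact activation is an assumption, not from the card)."""

    def __init__(self, hidden: int, inter: int) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(hidden, inter, bias=False)
        self.up_proj = nn.Linear(hidden, inter, bias=False)
        self.down_proj = nn.Linear(inter, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class MoEBlock(nn.Module):
    """Hypothetical top-k-of-n routing with a shared expert and routed scaling."""

    def __init__(self, hidden: int, inter: int, n_experts: int, top_k: int,
                 n_shared: int, routed_scale: float) -> None:
        super().__init__()
        self.top_k = top_k
        self.routed_scale = routed_scale
        self.router = nn.Linear(hidden, n_experts, bias=False)
        self.experts = nn.ModuleList([GatedExpert(hidden, inter) for _ in range(n_experts)])
        self.shared_expert = GatedExpert(hidden, inter * n_shared)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden). Score all experts, keep the top-k per token.
        scores = self.router(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        routed = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    routed[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])

        # The shared expert always runs; the routed path is scaled (2.5 per the card).
        return self.shared_expert(x) + self.routed_scale * routed


if __name__ == "__main__":
    # Shape check at toy sizes: a full-width block (4096/2048, 128 experts) would
    # allocate several GB of weights, so the demo shrinks every dimension.
    block = MoEBlock(hidden=64, inter=32, n_experts=16, top_k=4,
                     n_shared=1, routed_scale=ROUTED_SCALING_FACTOR)
    print(block(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

Per-token cost in this scheme depends on the 8 selected experts plus the shared expert rather than on all 128 experts, which is why active capacity grows without a matching increase in activation cost.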
Inference

Huggingface

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
)


def generate_text(
    prompt: str,
    max_new_tokens: int = 2048,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.0,
) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
    )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


prompts = [
    "Which country won the FIFA World Cup in 2012?",
]

for prompt in prompts:
    templated_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    output = generate_text(templated_prompt, max_new_tokens=512)
    print("Prompt: ", prompt)
    print("Generated text: ", output)
    print("=" * 100)
```

SGLang

Install the latest SGLang from source:

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

Instantiate the model and run:

```python
import sglang as sgl
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_path)  # needed for chat templating below

engine = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    mem_fraction_static=0.70,
    trust_remote_code=True,
    dtype="bfloat16",
    moe_runner_backend="flashinfer_cutedsl",
    prefill_attention_backend="fa3",
    decode_attention_backend="flashmla",
    disable_radix_cache=False,
)

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 2048,
    "repetition_penalty": 1.0,
}

prompts = [
    "Which band released the album Dark Side of the Moon in 1973?",
]

outputs = engine.generate(
    [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        for prompt in prompts
    ],
    sampling_params,
)

# Each item in `outputs` is a dict containing the generated text.
for prompt, output in zip(prompts, outputs):
    print("Prompt: ", prompt)
    print("Generated text: ", output["text"])
    print("=" * 100)
```
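The 128K YaRN-scaled context window mentioned above can be exercised with the same helpers defined in the Huggingface example. The sketch below is purely illustrative: `report.txt` is a placeholder path that is not part of the original card, it reuses the `tokenizer` and `generate_text` defined earlier, and it assumes the GPUs have enough memory for a long prompt.

```python
# Illustrative long-context usage, reusing `tokenizer` and `generate_text` from the
# Huggingface example above. The file path is a placeholder.
with open("report.txt", "r", encoding="utf-8") as f:
    document = f.read()

question = "Summarize the key findings of this report in five bullet points."
long_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": f"{document}\n\n{question}"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

# Stay within the 128K window provided by YaRN scaling (factor 40).
n_tokens = len(tokenizer(long_prompt)["input_ids"])
assert n_tokens < 128 * 1024, f"prompt is {n_tokens} tokens, over the 128K window"

print(generate_text(long_prompt, max_new_tokens=1024))
```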
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.