MiniMax M2.7 is 230B params. Can you actually run it at home?
twitter
#llama
#review
Summary
To see whether the 230B-parameter MiniMax M2.7 can be run at home, the author benchmarked Unsloth's UD-IQ3_XXS (80GB) quant across four hardware setups. The quad RTX 5090 rig was fastest at 120.54 tok/s but drew up to 2,300W at peak, while a single RTX PRO 6000 reached a comparable 118.74 tok/s on only about 600W, making it the best performer in power efficiency per token. The DGX Spark generated only 24.41 tok/s, but the whole system drew around 240W, making it a good fit for low-power environments.
Why it matters
Related entities
MiniMax
Unsloth
llama.cpp
RTX 4090
RTX 5090
RTX PRO 6000
DGX Spark
Body
MiniMax M2.7 is 230B params. Can you actually run it at home?
I tested Unsloth's UD-IQ3_XXS (80GB) on 4 different rigs:
🟠 4x RTX 4090 (96GB): 71.52 tok/s, TTFT 1045ms
🟢 4x RTX 5090 (128GB): 120.54 tok/s, TTFT 725ms
🟡 1x RTX PRO 6000 (96GB): 118.74 tok/s, TTFT 765ms
🟣 DGX Spark (128GB): 24.41 tok/s, TTFT 741ms
Backend: llama.cpp. Context: 32k. Max tokens: 4096.
I went with IQ3_XXS because it's the biggest quant that fits in 96GB VRAM while still leaving safe headroom for 32k context. Same quant across all four rigs, fairest comparison I could run.
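For anyone wanting to reproduce the timing methodology, here is a minimal sketch of measuring TTFT and decode tok/s against llama-server's OpenAI-compatible streaming endpoint. It is not the exact harness behind the numbers above: it assumes a server already running on localhost:8080 with the quant loaded (e.g. started with `llama-server -m <model>.gguf -c 32768`), and the prompt and model name are placeholders.

```python
# Minimal TTFT / tok/s probe against llama-server's OpenAI-compatible
# streaming endpoint (assumed at localhost:8080). Illustrative sketch,
# not the exact harness used for the numbers above.
import json
import time
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"  # llama-server default port

payload = {
    "model": "minimax-m2.7",  # placeholder; llama-server serves whatever it loaded
    "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
    "max_tokens": 4096,       # matches the test setup above
    "stream": True,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

t0 = time.perf_counter()
ttft = None
n_chunks = 0
with urllib.request.urlopen(req) as resp:
    for raw in resp:  # server-sent events, one "data: {...}" line per chunk
        line = raw.decode().strip()
        if not line.startswith("data: ") or line == "data: [DONE]":
            continue
        delta = json.loads(line[len("data: "):])["choices"][0]["delta"]
        if delta.get("content"):
            if ttft is None:
                ttft = time.perf_counter() - t0  # time to first token
            n_chunks += 1

elapsed = time.perf_counter() - t0
print(f"TTFT: {ttft * 1000:.0f} ms")
# Each content delta approximates one token, so chunks/second ~ tok/s.
print(f"Decode: {n_chunks / (elapsed - ttft):.2f} tok/s")
```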
Now look at rough peak GPU power draw:
🟠 4x4090 → 1,800W peak (450W × 4)
🟢 4x5090 → 2,300W peak (575W × 4)
🟡 RTX PRO 6000 → 600W peak
🟣 DGX Spark → 240W peak (whole system)
The RTX PRO 6000 is the quiet winner. One card, 96GB, matching a 4x5090 rig at roughly a quarter of the power and zero multi-GPU headaches. Best tokens-per-watt by a wide margin.
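Putting rough numbers on that claim, here is a quick back-of-the-envelope using only the figures above. Peak GPU power overstates real draw during decode, so read these as ballpark ratios, not measurements:

```python
# Rough tokens-per-watt from the reported peak figures above.
# Peak power is a worst case, so these are ballpark comparisons.
rigs = {
    "4x RTX 4090":     (71.52, 1800),   # (tok/s, peak W)
    "4x RTX 5090":     (120.54, 2300),
    "1x RTX PRO 6000": (118.74, 600),
    "DGX Spark":       (24.41, 240),    # whole-system power
}

for name, (tps, watts) in sorted(rigs.items(), key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{name:>16}: {tps / watts:.3f} tok/s per watt")
```

By this crude measure the RTX PRO 6000 lands around 0.20 tok/s per watt, roughly double the DGX Spark and four to five times either multi-GPU rig.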
DGX Spark is slow on generation but pulls the least power of any rig here, around 240W for the whole system. Prefill-friendly, memory-rich, wall-socket-friendly.
And yes, plenty of people cap their cards. Even then, a 4x 4090 or 4x 5090 rig still pulls well over 1,200W from the GPUs alone.