Running Gemma 4 31B on a Mac with Ollama
hackernews
🤖 AI Models
#ai models
#google gemma 4
#headless cli
#llama
#lm studio
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
This post introduces a practical way to run the Gemma 4 31B model on a 32 GB M5 Mac using Ollama and GGUF quantization. Rather than simply chasing higher benchmark scores, it aims for a setup that runs a large language model while keeping everyday work apps and development tools fully usable. To that end, it lays out a concrete strategy built around a model aggressively compressed to the IQ3_XXS quantization level to work within the memory constraints.
Full text
A practical configuration for a 32 GB M5 Mac that still needs to remain usable

Running large language models locally has become surprisingly practical on Apple Silicon. With a modern Mac, Ollama, and a carefully quantized GGUF model, it is possible to run models that only a short time ago would have felt out of reach for a personal machine. This post collects the practical findings from configuring Gemma 4 31B on a 32 GB Apple Silicon Mac with an M5 processor, using Ollama and a highly compressed GGUF quantization.

The goal is not to squeeze every last token per second out of the machine. The goal is more realistic: run a capable 31B local model while keeping the Mac usable for normal work: browser, IDE, terminal, notes, chat apps, and light development tools. That distinction matters. A configuration that works for a benchmark is not necessarily a configuration you want to live with all day.

The model

The model used in this setup is:

gemma-4-31B-it-UD-IQ3_XXS.gguf

This is an aggressively quantized GGUF build of Gemma 4 31B. The IQ3_XXS quantization makes the model small enough to fit on machines that would otherwise be unable to run a 31B model at all. The trade-off is obvious: this is not the highest-quality quantization, but it gives access to a much larger model class on consumer hardware.

The Ollama Modelfile starts from the local GGUF file:

FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf

Understanding the main Ollama parameters

Before tuning the configuration, it helps to understand what the key parameters actually control.

num_ctx

num_ctx controls the maximum context window used by the model. A larger context means the model can keep more conversation, documents, code, or instructions in memory, but it also increases memory usage, especially through the KV cache. For this model, useful values are:

6144 (conservative)
8192 (balanced)
12288 (aggressive)

For daily use on a 32 GB Mac, 8192 is a good target.

num_batch

num_batch affects how many tokens are processed together during prompt ingestion. It mostly impacts the speed at which the model reads the input prompt, not necessarily the speed at which it generates the answer token by token. Higher values can improve responsiveness with longer prompts, but they also increase temporary memory pressure. Good values for this setup are:

32 (conservative)
64 (balanced)
96 (aggressive)

For a daily-driver configuration, 64 is a reasonable compromise. If the Mac becomes sluggish or the runner crashes, this is one of the first values to reduce.

num_gpu

In Ollama, num_gpu does not mean "number of GPUs". It means how many model layers are offloaded to the GPU. For Gemma 4 31B, the theoretical maximum is:

num_gpu 60

because the model has 60 layers. However, full offload is not always the best practical choice. On a machine that must remain usable for other work, leaving some margin is often better than maximizing GPU offload. For a 32 GB M5 Mac, a good range is:

50 (conservative)
55 (balanced)
60 (aggressive / full offload)

The recommended daily value is 55.
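As a quick sanity check, you can ask Ollama how much of the model actually landed on the GPU. This is a sketch rather than part of the original post: it assumes the model has already been created under the name gemma4-31b-iq3 (matching the directory used in the next section), and the exact ollama ps output layout can differ between Ollama versions.

# Load the model with a trivial prompt, then inspect where it was placed.
ollama run gemma4-31b-iq3 "ping" >/dev/null
ollama ps

The PROCESSOR column reports the CPU/GPU split. With num_gpu 55 out of 60 layers, a small CPU share is expected; 100% GPU would indicate full offload.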
The recommended daily configuration

This is the configuration I would use as a balanced daily driver. It gives a useful context window, keeps prompt processing reasonably fast, and avoids pushing the system too close to the edge.

FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64

Create or replace the model with:

cd ~/ollama-models/gemma4-31b-iq3
cat > Modelfile <<'EOF'
FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
EOF
ollama create gemma4-31b-iq3 -f Modelfile

If the system becomes unstable

If the machine becomes sluggish or the Ollama runner crashes, back off in this order:

First reduce num_batch: 64 -> 32
Then reduce num_ctx: 8192 -> 6144
Then reduce num_gpu: 55 -> 50
Finally reduce the wired memory limit: 22000 -> 20000

A stable machine is more useful than a theoretical maximum configuration that crashes during real work.

About Metal crashes

During experimentation, one possible failure mode is a Metal backend crash, with logs similar to:

ggml-metal-device.m:608: GGML_ASSERT([rsets->data count] == 0) failed
panic during panic

When this happens, it is often better to reset the setup rather than keep pushing the same configuration. A practical recovery sequence is:

killall Ollama
killall ollama
launchctl unsetenv OLLAMA_FLASH_ATTENTION
launchctl unsetenv OLLAMA_KV_CACHE_TYPE
launchctl unsetenv OLLAMA_CONTEXT_LENGTH
sudo sysctl iogpu.wired_limit_mb=20000

Then reboot the Mac and restart from a conservative profile.

Final recommendation

For a 32 GB M5 Mac that should remain useful as a normal workstation, I would use this:

sudo sysctl iogpu.wired_limit_mb=22000
launchctl setenv OLLAMA_FLASH_ATTENTION "1"
launchctl setenv OLLAMA_KV_CACHE_TYPE "q8_0"
launchctl setenv OLLAMA_CONTEXT_LENGTH "8192"
launchctl setenv OLLAMA_KEEP_ALIVE "5m"

And this Modelfile:

FROM ./gemma-4-31B-it-UD-IQ3_XXS.gguf
PARAMETER num_ctx 8192
PARAMETER num_batch 64
PARAMETER num_gpu 55
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64

This is not the most extreme configuration. It is the one I would actually want to use. It gives enough context for serious work, enough GPU offload for acceptable performance, and enough memory headroom to keep the Mac usable while doing other things. That is usually the sweet spot for local LLMs: not maximum throughput, but sustainable performance.
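One last sketch, not part of the original post: before loading the model, it can be worth confirming that the sysctl and launchctl settings actually took effect. The model name gemma4-31b-iq3 is an assumption carried over from the directory name above.

# Expect: iogpu.wired_limit_mb: 22000
sysctl iogpu.wired_limit_mb
# Expect: 1 and 8192 respectively
launchctl getenv OLLAMA_FLASH_ATTENTION
launchctl getenv OLLAMA_CONTEXT_LENGTH
# Quick end-to-end smoke test
ollama run gemma4-31b-iq3 "Reply with one short sentence."

Keep in mind that launchctl setenv only affects processes launched afterward, so restart the Ollama app after changing these values.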
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.