Mistral 3.5 - How to Run It Locally | 느림보 Docs

#mistral #tip #llama #openai #news
Original source: url_exploration · Summarized and analyzed by Genesis Park

Summary

Mistral has released 'Mistral-Medium-3.5-128B', a new hybrid reasoning model that supports text and image input and features a 256K context window. The model delivers performance comparable to models five times its size, and at least 64 GB of memory is required to run it efficiently.

Full Text

Mistral has released Mistral-Medium-3.5-128B, its new dense 128B-parameter, multimodal, hybrid reasoning model. It supports text and image input, text output, and a 256K context window, and excels at reasoning, coding, long-context tasks, tool use, agentic workflows, and multimodal document/image understanding. Mistral Medium 3.5 offers highly competitive performance against models 5x its size, and runs locally on ~64 GB of RAM.

Note: vision is not supported for the GGUFs for now; support will come later.

Hardware requirements

Recommended hardware for Mistral Medium 3.5, where units are total memory (RAM + VRAM, or unified memory): Medium 3.5 128B runs in 64 GB, 80 GB, or 128-170 GB depending on the quantization you choose. Your total available memory should at least exceed the size of the quantized model you download. If it does not, llama.cpp can still run with partial RAM/disk offload, but generation will be slower. You will also need more memory for long contexts, larger batches, tool-heavy agent runs, and image prompts.

Recommended settings

Use Mistral's recommended reasoning settings:
- reasoning_effort="none" → fast, instant replies; good for chat, extraction, and simple instructions.
- reasoning_effort="high" → reasoning mode; recommended for complex prompts, coding, research, math, and agentic usage.

Recommended sampling defaults:
- Use temperature = 0.7 for reasoning_effort="high".
- Use temperature = 0.0 to 0.7 for reasoning_effort="none", depending on the task.
- Keep repetition and presence penalties disabled (or at 1.0) unless you see looping.
- Maximum context length: 262,144 tokens.

Reasoning mode

Mistral Medium 3.5 supports an instant instruct mode and a reasoning mode with a 'high' option, toggled via flags when launching llama.cpp / llama-server (Windows PowerShell needs its own quoting for the same flags).

Run Mistral 3.5 tutorials

Because Mistral Medium 3.5 is a dense 128B model, the recommended starting point is the Dynamic 4-bit GGUFs for local inference.
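The reasoning toggle can be sketched as llama-server launches. This is a sketch, not the vendor's exact commands: the flag names follow current llama.cpp conventions, and the assumption that the chat template reads a `reasoning_effort` variable (as other hybrid-reasoning GGUFs do) and the quant filename should be checked against the model card.

```shell
# Enable high reasoning (assumes the chat template exposes `reasoning_effort`;
# the GGUF filename is an assumption - check the repo listing)
./llama-server -m Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf \
  --jinja \
  --chat-template-kwargs '{"reasoning_effort": "high"}' \
  --temp 0.7

# Disable reasoning entirely via llama-server's thinking-budget switch
./llama-server -m Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf \
  --jinja --reasoning-budget 0
```

On Windows PowerShell, the single-quoted JSON argument must be rewritten with PowerShell-style quoting, since `'...'` and embedded `"` are handled differently there.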
GGUF: unsloth/Mistral-Medium-3.5-128B-GGUF. You can run it in Unsloth Studio or in llama.cpp.

🦥 Unsloth Studio guide

For this tutorial we will use Unsloth Studio, our new web UI for running and training LLMs. With Unsloth Studio you can run models locally on Mac, Windows, and Linux, feed them audio, image, and text input, and compare models side-by-side.

Search and download Mistral Medium 3.5. On first launch you will need to create a password to secure your account and to sign in with later. Then go to the Studio Chat tab, search for Mistral 3.5 in the search bar, and download your desired model and quant.

Run Mistral 3.5. Inference parameters are set automatically in Unsloth Studio, but you can still change them manually, and you can edit the context length, chat template, and other settings. For more information, see our Unsloth Studio inference guide.

🦙 llama.cpp guide

For this guide we will use the Unsloth Dynamic 4-bit GGUF of Mistral Medium 3.5 (see unsloth/Mistral-Medium-3.5-128B-GGUF) with llama.cpp for fast local inference, which works especially well on CPU-only or high-memory unified-memory machines.

1. Build llama.cpp. Obtain the latest llama.cpp from GitHub. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or only want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF; Metal support is on by default.

2. Run directly from Hugging Face, optionally in high reasoning mode.

3. Or download the model manually after installing huggingface_hub and hf_transfer; if downloads get stuck, toggle the hf_transfer environment variable.

4. Run the local GGUF. If a multimodal projector GGUF is included, pass it alongside the model.

llama-server deployment

To deploy Mistral Medium 3.5 on llama-server, launch the server with your chosen reasoning flags (Windows PowerShell needs its own quoting), then ping llama-server with an OpenAI-compatible request.
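The llama.cpp steps above can be sketched end to end. This is a minimal sketch, not the original tutorial's exact commands: the quant filename, the `--include` pattern, the projector filename, and the context/port values are assumptions to check against the unsloth/Mistral-Medium-3.5-128B-GGUF repo listing.

```shell
# 1. Build llama.cpp (set -DGGML_CUDA=OFF for CPU-only or Apple/Metal machines)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j

# 3. Manual download with fast transfer enabled
pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1   # flip to 0 if downloads get stuck
hf download unsloth/Mistral-Medium-3.5-128B-GGUF \
  --include "*UD-Q4_K_XL*" --local-dir .

# 4. Run the local GGUF (pass the projector with --mmproj if one is shipped)
./llama.cpp/build/bin/llama-cli -m Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf \
  --temp 0.7 -c 16384 --mmproj mmproj-F16.gguf

# llama-server deployment, then an OpenAI-compatible ping from another terminal
./llama.cpp/build/bin/llama-server \
  -m Mistral-Medium-3.5-128B-UD-Q4_K_XL.gguf --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"temperature":0.7}'
```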
Mistral 3.5 best practices

Prompting examples:
- Simple reasoning prompt: use reasoning_effort="high" for this style of prompt.
- OCR / document prompt: for OCR and document extraction, put the image first and ask for structured output.
- Multi-modal comparison prompt.
- Coding agent prompt: use reasoning_effort="high" and tool calling for codebase exploration.
- JSON / function calling prompt.

Benchmarks
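To make the JSON / function-calling prompting style concrete, here is a minimal OpenAI-compatible tools request against a local llama-server. The tool name, schema, port, and model path are illustrative assumptions, not part of the original guide:

```shell
# Hypothetical `get_weather` tool; the model should reply with a tool call
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "temperature": 0.0
  }'
```

For deterministic extraction like this, temperature 0.0 with reasoning_effort="none" matches the sampling defaults above.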

This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.
