Show HN: RelayFreeLLM — Free AI Gateway with Auto-Failover (Updates)
📦 Open Source
#ai gateway
#ai models
#anthropic
#gemini
#llama
#llm
#mistral
#openai
#free ai
#auto failover
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
RelayFreeLLM is an open-source AI gateway that puts multiple free AI providers, such as Google Gemini, Groq, and Mistral, behind a single endpoint. Existing OpenAI-compatible code works unchanged, and when a provider exceeds its rate limit or returns an error, traffic is automatically routed to the next available model so the app avoids downtime. Four context-management modes built on TF-based extractive summarization, together with session affinity, keep long conversation context stable and ensure a consistent output format even when providers switch.
Full Text
One endpoint. More free AI than any single provider. Fewer rate limit headaches.

Don't want to pay?

```python
# Your existing code works. Just change the URL.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="fake")
```

No code changes. No retry logic. No 429 errors breaking your app.

Free AI APIs are useful, but using them directly can be painful:

- ❌ Groq hits a rate limit → your app crashes
- ❌ Gemini quota exhausted → the user sees an error
- ❌ Switching providers → rewrite your integration
- ❌ Testing 5 providers → 5 different SDKs to manage

With RelayFreeLLM:

- ✅ Gemini fails → automatically tries Groq
- ✅ One provider down → traffic routes to the others
- ✅ Same API for everyone → OpenAI-compatible
- ✅ More providers = more throughput

You get a meta-model: a single endpoint that routes to the next available free provider, offers flexible context management, maintains session affinity, and fails over automatically to keep your app running.

| Feature | Why It Matters |
|---|---|
| OpenAI-compatible | Drop-in for your existing code. LangChain, LlamaIndex, any SDK. |
| Session Affinity | Lock users to specific providers via `X-Session-ID`. Faster responses via provider-side context caching. |
| Context Management | 4 modes (Static, Dynamic, Reservoir, Adaptive). Smartly prunes long histories with multi-turn extractive summarization. |
| Automatic Failover | Provider down? One model hit its limits? We try the next one automatically. Zero downtime. |
| Consistent Output Style | Universal style guidance and response normalizers eliminate provider-specific quirks. |
| Strict Boot Validation | The server verifies all models, registry entries, and API keys before binding, to ensure a healthy gateway. |
| Real-time Streaming | Full SSE streaming support from every backend provider (see the sketch below this table). |
| Local Models | Seamlessly mix cloud free tiers with your private Ollama instance. |
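The streaming row above is easiest to see in code. A minimal sketch of consuming the gateway's SSE stream through the OpenAI Python SDK, assuming the server is running locally on port 8000 as in the quick-start steps below:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="relay-free")

# stream=True makes the SDK consume the SSE stream and yield chunks as they arrive
stream = client.chat.completions.create(
    model="meta-model",
    messages=[{"role": "user", "content": "Write a haiku about failover."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```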
| User | Use Case |
|---|---|
| Independent developers | Ship AI features without a $$$/month API bill |
| Students & hobbyists | GPT-level AI, no credit card or phone number needed |
| Self-hosters | Combine Ollama privacy with cloud capacity |
| Researchers | Batch queries across providers for higher throughput |

Clone and install:

```bash
git clone https://github.com/msmarkgu/RelayFreeLLM.git
cd RelayFreeLLM
pip install -r requirements.txt
```

Create a `.env` file:

```bash
# --- Providers ---
GEMINI_APIKEY=                            # ai.google.dev
GROQ_APIKEY=                              # console.groq.com
MISTRAL_APIKEY=                           # console.mistral.ai
CEREBRAS_APIKEY=                          # cloud.cerebras.ai
DEEPSEEK_APIKEY=                          # Optional
OLLAMA_BASE_URL=http://localhost:11434    # Optional

# --- Selection Strategies ---
PROVIDER_STRATEGY=roundrobin   # options: roundrobin, random, weight
MODEL_STRATEGY=roundrobin      # options: roundrobin, random, weight

# --- Session & Affinity ---
SESSION_AFFINITY_ENABLED=True  # Pin sessions to providers
SESSION_TTL_HOURS=24           # How long to keep affinity locks

# --- HTTP Configuration ---
REQUEST_TIMEOUT_SECONDS=60     # Timeout for all API requests (seconds)

# --- Context Management ---
# Modes: static, dynamic, reservoir, adaptive
CONTEXT_MANAGEMENT_MODE=reservoir
CONTEXT_RESERVOIR_RECENT_KEEP=10        # Verbatim messages
CONTEXT_RESERVOIR_SUMMARY_BUDGET=400    # Tokens for old history summary
```

Check which models are reachable with your keys:

```bash
python -m tests.test_models_availability
```

Depending on your providers, the result should look like:

```
==================================================
MODEL AVAILABILITY SUMMARY
==================================================
✅ PASS | Groq    | llama-3.2-3b-preview             | Success
✅ PASS | Groq    | llama-3.2-11b-vision-preview     | Success
✅ PASS | Groq    | llama-3.2-90b-vision-preview     | Success
✅ PASS | Groq    | llama-3.1-405b-reasoning         | Success
✅ PASS | Groq    | moonshotai/kimi-k2-instruct-0905 | Success
✅ PASS | Groq    | moonshotai/kimi-k2-instruct      | Success
✅ PASS | Groq    | groq/compound                    | Success
✅ PASS | Mistral | mistral-large-latest             | Success
✅ PASS | Mistral | mistral-medium-latest            | Success
✅ PASS | Mistral | codestral-latest                 | Success
✅ PASS | Mistral | mistral-large-2512               | Success
✅ PASS | Mistral | mistral-medium-2508              | Success
✅ PASS | Mistral | mistral-medium-2505              | Success
✅ PASS | Mistral | mistral-medium                   | Success
✅ PASS | Mistral | codestral-2508                   | Success
✅ PASS | Gemini  | gemini-2.5-flash                 | Success
==================================================
TOTAL: 17/17 models available.
```

Start the server:

```bash
python -m src.server
```

The console should show something like:

```
INFO:     Started server process [203452]
INFO:     Waiting for application startup.
...
2026-04-01 19:44:04,123 - src.model_selector - INFO - Provider sequence: ['Cerebras', 'Groq', 'Mistral', 'Gemini', 'Ollama'], Provider Strategy: roundrobin, Model Strategy: roundrobin
2026-04-01 19:44:04,123 - __main__ - INFO - Meta model 'meta-model' ready with providers: ['Cerebras', 'Cloudflare', 'Gemini', 'Groq', 'Mistral', 'Ollama']
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

Python SDK:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="relay-free"
)

# Automatic routing - picks the next available free provider
response = client.chat.completions.create(
    model="meta-model",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Or route to a specific provider
response = client.chat.completions.create(
    model="groq/llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Note on consistent output: regardless of which provider (Gemini, Groq, Mistral, etc.) handles your request, RelayFreeLLM ensures a consistent output style through universal style guidance and response normalization. This means no jarring changes in tone or formatting when the system automatically fails over between providers.
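Session affinity (covered in more detail below) is keyed off the `X-Session-ID` header. One way to attach it from the OpenAI Python SDK is `default_headers`; a minimal sketch, assuming any OpenAI-compatible request can carry the header:

```python
from openai import OpenAI

# Send X-Session-ID with every request so the gateway can pin this
# conversation to one provider (and benefit from provider-side context caching).
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="relay-free",
    default_headers={"X-Session-ID": "user-123"},
)

response = client.chat.completions.create(
    model="meta-model",
    messages=[{"role": "user", "content": "Remember that my favorite color is teal."}],
)
```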
cURL:

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Authorization: Bearer relay-free" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-model", "messages": [{"role": "user", "content": "Hi"}]}'
```

LangChain:

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="relay-free",
    model="meta-model"
)
```

REST Client example (using the VS Code REST Client extension):

```http
POST http://localhost:8000/v1/chat/completions HTTP/1.1
content-type: application/json

{
  "model": "meta-model",
  "messages": [
    {"role": "system", "content": "Format response in JSON."},
    {"role": "user", "content": "When was the country Romania founded?"}
  ]
}

### Specific Model Routing
# Directly target a specific provider and model
POST http://localhost:8000/v1/chat/completions HTTP/1.1
content-type: application/json

{
  "model": "Mistral/mistral-large-latest",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}
```

See more examples in `./tests/api.http`.

Tell RelayFreeLLM what you need:

```javascript
// "Any model from any provider; RelayFreeLLM will choose the next available"
{"model": "meta-model", "messages": [...]}

// "Give me a coding model from any provider"
{"model": "meta-model", "model_type": "coding", "messages": [...]}

// "I prefer small models that run fast and give simple responses"
{"model": "meta-model", "model_scale": "small", "messages": [...]}

// "I want large models for the most capable reasoning"
{"model": "meta-model", "model_scale": "large", "messages": [...]}

// "I want DeepSeek models if available"
{"model": "meta-model", "model_name": "deepseek", "messages": [...]}

// "Specific provider/model"
{"model": "Gemini/gemini-2.5-flash", "messages": [...]}
```

When a provider hits a rate limit:

```
Request → Groq (rate limited) → Circuit breaker activates → Retry → Gemini → Retry → Mistral → Success ✓
```

Despite automatic switching between providers, RelayFreeLLM maintains a consistent output style:

- A universal style guide is injected into every request's system prompt
- Response normalization removes provider-specific quirks
- No jarring style switches when failing over between providers
- Consistent tone, formatting, and quality regardless of backend

In multi-turn conversations, many providers (such as Gemini and Anthropic) offer context-caching optimizations. To benefit from this, RelayFreeLLM supports session affinity: by passing the `X-Session-ID` header, the gateway will try to "pin" a user to the same provider for the duration of their session.

- The user sends a request with `X-Session-ID: user-123`.
- The gateway routes to Gemini and locks that session ID to Gemini.
- Subsequent requests from `user-123` bypass the round-robin logic and go straight back to Gemini.
- If Gemini fails or hits its limits, the gateway automatically migrates the session to the next best provider and re-pins it.

As conversations grow, they exceed free-tier context limits. RelayFreeLLM's ContextManager uses advanced pruning to keep chats alive:

| Mode | Behavior |
|---|---|
| Static | Keeps a fixed window of the most recent messages. |
| Dynamic | Uses real-time token tracking to grow the context window when usage is low and shrink it when usage spikes, ensuring you never exceed model context limits. |
| Reservoir | Keeps recent messages verbatim and adds an extractive summary of the older conversation. |
| Adaptive | Detects the task type (e.g., coding vs. chat) and switches between Reservoir and Static modes automatically. |

Extractive summarization: unlike simple truncation, Reservoir mode preserves the "essence" of your history. It uses a TF (term frequency) scoring algorithm to identify the sentences carrying the most unique information, applies a position bias for topicality, and greedily selects the highest-scoring segments to fit within your token budget.
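A minimal sketch of that TF-scoring idea, not the project's actual implementation: score each sentence by the average frequency of its terms across the history, nudge the score by sentence position (biased toward earlier sentences here, which is an assumption), and greedily keep the top sentences within a rough token budget.

```python
import re
from collections import Counter

def extractive_summary(history: str, token_budget: int = 400) -> str:
    """Greedy TF-based extractive summary (illustrative sketch only)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", history) if s.strip()]
    tf = Counter(re.findall(r"\w+", history.lower()))  # term frequency over the whole history

    def score(index: int, sentence: str) -> float:
        terms = re.findall(r"\w+", sentence.lower())
        if not terms:
            return 0.0
        avg_tf = sum(tf[t] for t in terms) / len(terms)   # reward information-dense sentences
        position_bias = 1.0 / (1.0 + 0.05 * index)        # assumed bias toward earlier turns
        return avg_tf * position_bias

    ranked = sorted(enumerate(sentences), key=lambda item: score(*item), reverse=True)
    kept, used = [], 0
    for index, sentence in ranked:
        cost = max(1, len(sentence) // 4)  # crude estimate: ~4 characters per token
        if used + cost <= token_budget:
            kept.append((index, sentence))
            used += cost
    # Emit the selected sentences in their original order so the summary still reads naturally.
    return " ".join(sentence for _, sentence in sorted(kept))
```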
How normalization plays out in practice:

```
Request → Gemini (adds "As an AI..." preamble) → Normalizer removes preamble → Clean, direct response returned
Request → Groq (adds "Sure thing!" opener)     → Normalizer removes opener   → Same clean, direct response style
```

| Parameter | Type | Description |
|---|---|---|
| model | string | `"meta-model"` for auto-routing, or `"provider/model"` for direct routing |
| messages | array | Standard OpenAI message format |
| stream | bool | Enable SSE streaming (default: false) |
| model_type | string | Filter: `text`, `coding`, `ocr` |
| model_scale | string | Filter: `large`, `medium`, `small` |
| model_name | string | Match a model name substring |

List available models with their status:

```bash
curl "http://localhost:8000/v1/models?type=coding&scale=large"
```

Track your aggregated usage:

```bash
curl http://localhost:8000/v1/usage
```

```
┌─────────────────────────────────────────────────┐
│                Your Application                 │
│          (OpenAI SDK, LangChain, etc.)          │
└─────────────────────┬───────────────────────────┘
                      │ OpenAI-compatible API
                      │ (with optional X-Session-ID)
┌─────────────────────▼───────────────────────────┐
│              RelayFreeLLM Gateway               │
│  ┌───────────┐   ┌───────────┐   ┌──────────┐   │
│  │  Router   │──▶│Dispatcher │──▶│ContextMgr│   │
│  │ /v1/chat  │   │ (Retries) │   │(Summary) │   │
│  └───────────┘   └─────┬─────┘   └──────────┘   │
│                        │         ┌──────────┐   │
│                        └────────▶│ Affinity │   │
│                                  │   Map    │   │
│                                  └──────────┘   │
└──────────────────────────┬──────────────────────┘
                           │
     ┌──────────┬──────────┼──────────┬──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼          ▼
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Gemini │ │  Groq  │ │ Mistral│ │Cerebras│ │DeepSeek│ │ Ollama │
└────────┘ └────────┘ └────────┘ └────────┘ └────────┘ └────────┘
```

To ensure a consistent user experience despite provider switching:

- Style Directive Injection: a universal style guide is added to every request's system prompt
- Response Normalization: post-processing removes provider-specific quirks:
  - Strips AI preambles ("As an AI", "Certainly!", etc.)
  - Standardizes markdown and code formatting
  - Fixes and extracts JSON from code fences
  - Ensures consistent tone and formatting

This means users get the same high-quality, consistent output whether their request was handled by Gemini, Groq, Mistral, or any other provider.
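A minimal sketch of that normalization step, not the project's actual normalizer: strip a known set of provider openers, and pull JSON out of fenced code blocks when the caller asked for JSON. The opener list and function names here are illustrative.

```python
import json
import re

# Illustrative opener list; the gateway's real rules are more extensive.
PREAMBLES = ("As an AI", "Certainly!", "Sure thing!", "Of course!")
FENCE = "`" * 3  # the triple-backtick marker that surrounds fenced code blocks

def strip_preamble(text: str) -> str:
    """Remove provider-specific openers so failover does not change the response style."""
    cleaned = text.lstrip()
    for preamble in PREAMBLES:
        if cleaned.startswith(preamble):
            cleaned = cleaned[len(preamble):].lstrip(" ,.!:\n")
            break
    return cleaned

def extract_json(text: str):
    """Prefer JSON found inside a fenced json block; fall back to parsing the raw text."""
    pattern = FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE
    match = re.search(pattern, text, re.DOTALL)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None
```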
```
RelayFreeLLM/
├── src/
│   ├── server.py              # Entry point
│   ├── router.py              # API endpoints
│   ├── model_dispatcher.py    # Retry & circuit breaker logic
│   ├── model_selector.py      # Quota-aware routing
│   ├── provider_registry.py   # Auto-discovers providers
│   ├── models.py              # Request/response models
│   └── api_clients/           # Provider implementations
│       ├── gemini_client.py
│       ├── groq_client.py
│       ├── mistral_client.py
│       └── ...
├── tests/                     # Unit & integration tests
└── provider_model_limits.json # Rate limit configuration
```

Planned next:

- Web dashboard for live provider status
- Persistent rate limit state
- Prompt caching layer
- Embeddings & image generation routing
- One-command Docker deploy

Found a new free provider? Adding one takes about 50 lines:

```python
# src/api_clients/my_provider_client.py
class MyProviderClient(ApiInterface):
    PROVIDER_NAME = "myprovider"

    async def call_model_api(self, request, stream):
        # Your API logic here
        pass
```

PRs welcome.
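Purely as an illustration of what that skeleton's body might do, here is a sketch of calling an OpenAI-compatible upstream with httpx (which the project already uses). The endpoint URL, the function name, and the parameters are hypothetical; the real contract is defined by `ApiInterface` and the existing clients under `src/api_clients/`.

```python
import httpx

# Hypothetical upstream; real clients read their endpoint and key from configuration.
BASE_URL = "https://api.myprovider.example/v1"

async def call_my_provider(api_key: str, model: str, messages: list, stream: bool = False) -> dict:
    """Sketch of the request a provider client might make (non-streaming case)."""
    payload = {"model": model, "messages": messages, "stream": stream}
    headers = {"Authorization": f"Bearer {api_key}"}
    async with httpx.AsyncClient(timeout=60) as client:
        response = await client.post(
            f"{BASE_URL}/chat/completions", json=payload, headers=headers
        )
        response.raise_for_status()
        # A streaming variant would iterate the SSE lines instead of returning one body.
        return response.json()
```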
Built with FastAPI, Pydantic, httpx, and AI coding tools. Powered by the generous free tiers of Google Gemini, Groq, Mistral AI, Cerebras, and Ollama. Built for developers who want great AI without the bill.

This analysis was written by the Genesis Park editorial team with the help of AI. The original post is available via the source link.