Show HN: Docker-whisper: Self-hosted Whisper speech-to-text server (OpenAI API)
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary
Built on the open-source faster-whisper project, this Docker image lets users run a private speech-to-text (STT) environment on their own server. The server is fully compatible with OpenAI's audio transcription API, so existing applications can migrate by changing a single line: the API base URL. It supports offline, air-gapped environments, handles major audio formats such as mp3 and wav, and can return responses as SRT or WebVTT subtitles. Both ARM64 platforms (such as the Raspberry Pi) and standard AMD64 platforms are supported, and models can be pre-cached so the server works without network access. A dedicated management script (whisper_manage) makes it straightforward to manage and switch between Whisper models, including the turbo model (large-v3-turbo). Combined with other open-source LLM routers and TTS tools, it can form a fully private voice AI pipeline that sends no data to outside services.

Full text
A Docker image to run a Whisper speech-to-text server, powered by faster-whisper. Provides an OpenAI-compatible audio transcription API. Based on Debian (`python:3.12-slim`). Designed to be simple, private, and self-hosted.

- OpenAI-compatible `POST /v1/audio/transcriptions` endpoint — any app using the OpenAI Whisper API switches with a one-line change
- Supports all Whisper models: `tiny`, `base`, `small`, `medium`, `large-v3`, `large-v3-turbo` and more
- Model management via a helper script (`whisper_manage`)
- Audio stays on your server — no data sent to third parties
- All major audio formats supported (mp3, m4a, wav, webm, ogg, flac, and all ffmpeg formats)
- Multiple response formats: JSON, plain text, verbose JSON, SRT subtitles, WebVTT subtitles
- Offline/air-gapped mode — run without internet access using pre-cached models (`WHISPER_LOCAL_ONLY`)
- Automatically built and published via GitHub Actions
- Persistent model cache via a Docker volume
- Multi-arch: `linux/amd64`, `linux/arm64`

Tip: Whisper, LiteLLM, and Kokoro TTS can be used together to build a complete voice AI pipeline on your own server.

Use this command to set up a Whisper server:

```bash
docker run \
  --name whisper \
  --restart=always \
  -v whisper-data:/var/lib/whisper \
  -p 9000:9000 \
  -d hwdsl2/whisper-server
```

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also replace `-p 9000:9000` with `-p 127.0.0.1:9000:9000` in the `docker run` command above, to prevent direct access to the unencrypted port.

The Whisper `base` model (~145 MB) is downloaded and cached on first start.
Check the logs to confirm the server is ready:

```bash
docker logs whisper
```

Once you see "Whisper speech-to-text server is ready", transcribe your first audio file:

```bash
curl http://your_server_ip:9000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F model=whisper-1
```

Response:

```json
{"text": "Your transcribed text appears here."}
```

Requirements:

- A Linux server (local or cloud) with Docker installed
- Supported architectures: `amd64` (x86_64), `arm64` (e.g. Raspberry Pi 4/5, AWS Graviton)
- Minimum RAM: ~500 MB free for the default `base` model (see model table)
- Internet access for the initial model download (the model is cached locally afterwards). Not required if using `WHISPER_LOCAL_ONLY=true` with pre-cached models.

For internet-facing deployments, see Using a reverse proxy to add HTTPS.

Get the trusted build from the Docker Hub registry:

```bash
docker pull hwdsl2/whisper-server
```

Alternatively, you may download from Quay.io:

```bash
docker pull quay.io/hwdsl2/whisper-server
docker image tag quay.io/hwdsl2/whisper-server hwdsl2/whisper-server
```

Supported platforms: `linux/amd64` and `linux/arm64`.

All variables are optional. If not set, secure defaults are used automatically. This Docker image uses the following variables, which can be declared in an env file (see example):

| Variable | Description | Default |
|---|---|---|
| `WHISPER_MODEL` | Whisper model to use. See model table for options. | `base` |
| `WHISPER_LANGUAGE` | Default transcription language. BCP-47 code (e.g. `en`, `fr`, `de`, `zh`, `ja`) or `auto` to autodetect. | `auto` |
| `WHISPER_PORT` | HTTP port for the API (1–65535). | `9000` |
| `WHISPER_DEVICE` | Compute device for inference. | `cpu` |
| `WHISPER_COMPUTE_TYPE` | Quantization / compute type. `int8` is recommended. | `int8` |
| `WHISPER_THREADS` | CPU threads for inference. Set to the number of physical cores for best latency. | `2` |
| `WHISPER_API_KEY` | Optional Bearer token. If set, all API requests must include `Authorization: Bearer <key>`. | (not set) |
| `WHISPER_LOG_LEVEL` | Log level: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`. | `INFO` |
| `WHISPER_BEAM` | Beam size for transcription decoding. Higher values may improve accuracy at the cost of speed. Use `1` for fastest (greedy) decoding. | `5` |
| `WHISPER_LOCAL_ONLY` | When set to any non-empty value (e.g. `true`), disables all HuggingFace model downloads. For offline or air-gapped deployments with pre-cached models. | (not set) |

Note: In your env file, you may enclose values in single quotes, e.g. `VAR='value'`. Do not add spaces around `=`. If you change `WHISPER_PORT`, update the `-p` flag in the `docker run` command accordingly.

Example using an env file:

```bash
cp whisper.env.example whisper.env
# Edit whisper.env with your settings, then:
docker run \
  --name whisper \
  --restart=always \
  -v whisper-data:/var/lib/whisper \
  -v ./whisper.env:/whisper.env:ro \
  -p 9000:9000 \
  -d hwdsl2/whisper-server
```

The env file is bind-mounted into the container, so changes are picked up on every restart without recreating the container. Alternatively, pass it with `--env-file`:

```bash
docker run \
  --name whisper \
  --restart=always \
  -v whisper-data:/var/lib/whisper \
  -p 9000:9000 \
  --env-file=whisper.env \
  -d hwdsl2/whisper-server
```

Or use Docker Compose:

```bash
cp whisper.env.example whisper.env
# Edit whisper.env as needed, then:
docker compose up -d
docker logs whisper
```

Example `docker-compose.yml` (already included):

```yaml
services:
  whisper:
    image: hwdsl2/whisper-server
    container_name: whisper
    restart: always
    ports:
      # For a host-based reverse proxy, change to "127.0.0.1:9000:9000/tcp"
      - "9000:9000/tcp"
    volumes:
      - whisper-data:/var/lib/whisper
      - ./whisper.env:/whisper.env:ro

volumes:
  whisper-data:
```

Note: For internet-facing deployments, using a reverse proxy to add HTTPS is strongly recommended. In that case, also change `"9000:9000/tcp"` to `"127.0.0.1:9000:9000/tcp"` in `docker-compose.yml`, to prevent direct access to the unencrypted port.

The API is fully compatible with OpenAI's audio transcription endpoint.
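For reference, a filled-in env file for the setups above might look like the following. The values are purely illustrative; the variable names come from the table above, and all of them are optional:

```ini
# whisper.env -- illustrative example (all variables are optional)
WHISPER_MODEL='large-v3-turbo'
WHISPER_LANGUAGE='auto'
WHISPER_THREADS='4'
WHISPER_API_KEY='replace_with_a_random_token'
```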
Any application already calling `https://api.openai.com/v1/audio/transcriptions` can switch to self-hosted by setting:

```
OPENAI_BASE_URL=http://your_server_ip:9000
```

`POST /v1/audio/transcriptions`

Content-Type: `multipart/form-data`

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| `file` | file | ✅ | Audio file. Supported formats: `mp3`, `mp4`, `m4a`, `wav`, `webm`, `ogg`, `flac` and all other formats supported by ffmpeg. |
| `model` | string | ✅ | Pass `whisper-1` (value is accepted but the active model is always used). |
| `language` | string | — | BCP-47 language code. Overrides `WHISPER_LANGUAGE` for this request. |
| `prompt` | string | — | Optional text to guide the model's style or continue a previous segment. |
| `response_format` | string | — | Output format. Default: `json`. See response formats. |
| `temperature` | float | — | Sampling temperature (0–1). Default: `0`. |

Example:

```bash
curl http://your_server_ip:9000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F model=whisper-1 \
  -F language=en
```

With API key authentication:

```bash
curl http://your_server_ip:9000/v1/audio/transcriptions \
  -H "Authorization: Bearer your_api_key" \
  -F "file=@audio.mp3" \
  -F model=whisper-1
```

Response formats:

| `response_format` | Description |
|---|---|
| `json` | `{"text": "..."}` — default, matches OpenAI's basic response |
| `text` | Plain text, no JSON wrapper |
| `verbose_json` | Full JSON with language, duration, per-segment timestamps, log-probabilities |
| `srt` | SubRip subtitle format (`.srt`) |
| `vtt` | WebVTT subtitle format (`.vtt`) |

Example — get SRT subtitles:

```bash
curl http://your_server_ip:9000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F model=whisper-1 \
  -F response_format=srt
```

Example — verbose JSON with timestamps:

```bash
curl http://your_server_ip:9000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F model=whisper-1 \
  -F response_format=verbose_json
```

`GET /v1/models`

Returns the active model in OpenAI-compatible format.
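The per-segment timestamps in a `verbose_json` response are easy to post-process. A minimal sketch, assuming the response follows OpenAI's documented `verbose_json` shape (the sample dict below is illustrative, not real server output):

```python
# Extract "start-end: text" lines from a verbose_json transcription response.
# Field names follow OpenAI's verbose_json schema, which this server aims to
# match; the sample response below is invented for illustration.

def segment_lines(resp: dict) -> list[str]:
    return [
        f"{seg['start']:.2f}-{seg['end']:.2f}: {seg['text'].strip()}"
        for seg in resp.get("segments", [])
    ]

sample = {
    "task": "transcribe",
    "language": "en",
    "duration": 3.84,
    "text": "Hello world. This is a test.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 1.92, "text": " Hello world."},
        {"id": 1, "start": 1.92, "end": 3.84, "text": " This is a test."},
    ],
}

print(segment_lines(sample))
```

The same segment list is what the server renders into SRT or WebVTT when you request those formats.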
```bash
curl http://your_server_ip:9000/v1/models
```

An interactive Swagger UI is available at: `http://your_server_ip:9000/docs`

All server data is stored in the Docker volume (`/var/lib/whisper` inside the container):

```
/var/lib/whisper/
├── models--Systran--faster-whisper-*/  # Cached Whisper model files (downloaded from HuggingFace)
├── .port                               # Active port (used by whisper_manage)
├── .model                              # Active model name (used by whisper_manage)
└── .server_addr                        # Cached server IP (used by whisper_manage)
```

Back up the Docker volume to preserve downloaded models. Models are large (145 MB – 3 GB) and can take several minutes to download on first start; preserving the volume avoids re-downloading on container recreation.

Use `whisper_manage` inside the running container to inspect and manage the server.

Show server info:

```bash
docker exec whisper whisper_manage --showinfo
```

List available models:

```bash
docker exec whisper whisper_manage --listmodels
```

Pre-download a model:

```bash
docker exec whisper whisper_manage --downloadmodel large-v3-turbo
```

To change the active model:

- (Optional but recommended) Pre-download the new model while the server is running: `docker exec whisper whisper_manage --downloadmodel large-v3-turbo`
- Update `WHISPER_MODEL` in your `whisper.env` file (or add `-e WHISPER_MODEL=large-v3-turbo` to your `docker run` command).
- Restart the container: `docker restart whisper`

Available models:

| Model | Disk | RAM (approx) | Notes |
|---|---|---|---|
| `tiny` | ~75 MB | ~250 MB | Fastest; lower accuracy |
| `tiny.en` | ~75 MB | ~250 MB | English-only |
| `base` | ~145 MB | ~500 MB | Good balance — default |
| `base.en` | ~145 MB | ~500 MB | English-only |
| `small` | ~465 MB | ~1.5 GB | Better accuracy |
| `small.en` | ~465 MB | ~1.5 GB | English-only |
| `medium` | ~1.5 GB | ~5 GB | High accuracy |
| `medium.en` | ~1.5 GB | ~5 GB | English-only |
| `large-v2` | ~3 GB | ~10 GB | Very high accuracy |
| `large-v3` | ~3 GB | ~10 GB | Best accuracy |
| `large-v3-turbo` | ~1.6 GB | ~6 GB | Fast + high accuracy ⭐ |

Tip: `large-v3-turbo` offers accuracy close to `large-v3` at roughly half the resource cost. It is the recommended upgrade path from `base` for most production deployments.

RAM figures are approximate and reflect INT8 quantization (default). Models are cached in the `/var/lib/whisper` Docker volume and only downloaded once.

For internet-facing deployments, place a reverse proxy in front of Whisper to handle HTTPS termination. The server works without HTTPS on a local or trusted network, but HTTPS is recommended when the API endpoint is exposed to the internet.

Use one of the following addresses to reach the Whisper container from your reverse proxy:

- `whisper:9000` — if your reverse proxy runs as a container in the same Docker network as Whisper (e.g. defined in the same `docker-compose.yml`).
- `127.0.0.1:9000` — if your reverse proxy runs on the host and port `9000` is published (the default `docker-compose.yml` publishes it).
Example with Caddy (Docker image) (automatic TLS via Let's Encrypt, reverse proxy in the same Docker network). `Caddyfile`:

```
whisper.example.com {
    reverse_proxy whisper:9000
}
```

Example with nginx (reverse proxy on the host):

```nginx
server {
    listen 443 ssl;
    server_name whisper.example.com;

    ssl_certificate     /path/to/cert.pem;
    ssl_certificate_key /path/to/key.pem;

    # Audio files can be large — increase the upload limit as needed
    client_max_body_size 100M;

    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
    }
}
```

Set `WHISPER_API_KEY` in your env file when the server is accessible from the public internet.

To update the Docker image and container, first download the latest version:

```bash
docker pull hwdsl2/whisper-server
```

If the Docker image is already up to date, you should see:

```
Status: Image is up to date for hwdsl2/whisper-server:latest
```

Otherwise, it will download the latest version. Remove and re-create the container:

```bash
docker rm -f whisper
# Then re-run the docker run command from Quick start with the same volume and port.
```

Your downloaded models are preserved in the `whisper-data` volume.

The Whisper, LiteLLM, and Kokoro TTS images can be combined to build a private, self-hosted voice AI assistant entirely on your own server, with no voice data sent to third parties.
```mermaid
graph LR
  A["🎤 Audio input"] -->|transcribe| W["Whisper (speech-to-text)"]
  W -->|text| L["LiteLLM (AI gateway)"]
  L -->|response| T["Kokoro TTS (text-to-speech)"]
  T --> B["🔊 Audio output"]
```

- Whisper — transcribes spoken audio to text (port `9000`)
- LiteLLM — routes the text to an LLM and returns a response (port `4000`)
- Kokoro TTS — converts the response back to speech (port `8880`)

Once all three containers are running, you can chain their APIs together:

```bash
# Step 1: Transcribe audio to text (Whisper)
TEXT=$(curl -s http://localhost:9000/v1/audio/transcriptions \
  -F "file=@audio.mp3" -F model=whisper-1 | jq -r .text)

# Step 2: Send text to an LLM and get a response (LiteLLM)
RESPONSE=$(curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer your_litellm_api_key" \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"gpt-4o\",\"messages\":[{\"role\":\"user\",\"content\":\"$TEXT\"}]}" \
  | jq -r '.choices[0].message.content')

# Step 3: Convert the response to speech (Kokoro TTS)
curl -s http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d "{\"model\":\"tts-1\",\"input\":\"$RESPONSE\",\"voice\":\"af_heart\"}" \
  --output response.mp3
```

- Base image: `python:3.12-slim` (Debian)
- Runtime: Python 3 (virtual environment at `/opt/venv`)
- STT engine: faster-whisper with CTranslate2 (INT8 by default)
- API framework: FastAPI + Uvicorn
- Audio decoding: ffmpeg (installed from Debian package)
- Data directory: `/var/lib/whisper` (Docker volume)
- Model storage: HuggingFace Hub format inside the volume — downloaded once, reused on restarts

Note: The software components inside the pre-built image (such as faster-whisper and its dependencies) are licensed under the terms chosen by their respective copyright holders. As with any pre-built image, it is the image user's responsibility to ensure that any use of this image complies with the relevant licenses for all software contained within.
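The three-step chain above can also be wired in application code. Below is a minimal sketch with the HTTP layer injected, so the flow is visible without the three servers running; the URLs and payload shapes mirror the curl chain, while the function names and the `post` callable are illustrative stand-ins for a real HTTP client (e.g. `requests.post`):

```python
# Whisper -> LiteLLM -> Kokoro TTS round trip, mirroring the curl chain above.
# `post(url, payload)` stands in for an HTTP call and is injected so the
# wiring can be shown (and exercised) without live servers.

WHISPER_URL = "http://localhost:9000/v1/audio/transcriptions"
LITELLM_URL = "http://localhost:4000/v1/chat/completions"
KOKORO_URL = "http://localhost:8880/v1/audio/speech"

def voice_round_trip(audio_bytes: bytes, post):
    # 1) Speech to text (Whisper)
    text = post(WHISPER_URL, {"file": audio_bytes, "model": "whisper-1"})["text"]
    # 2) Text to LLM response (LiteLLM)
    reply = post(LITELLM_URL, {
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": text}],
    })["choices"][0]["message"]["content"]
    # 3) LLM response back to speech (Kokoro TTS); the real API returns audio bytes
    return post(KOKORO_URL, {"model": "tts-1", "input": reply, "voice": "af_heart"})
```

Injecting `post` also makes the pipeline easy to unit-test with stubbed responses before pointing it at the real containers.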
Copyright (C) 2026 Lin Song This work is licensed under the MIT License. faster-whisper is Copyright (C) SYSTRAN, and is distributed under the MIT License. This project is an independent Docker setup for Whisper and is not affiliated with, endorsed by, or sponsored by OpenAI or SYSTRAN.
This analysis was written by the Genesis Park editorial team with the help of AI. The original article is available via the source link.