Show HN: PPO agent cuts elevator wait times by 84% compared to classical dispatching
🔬 Research
#gymnasium
#ppo
#review
#reinforcement-learning
#simulation
#elevator
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Using Python and Gymnasium, the author built a reinforcement learning environment simulating a 20-floor building with 4 elevators. Comparing a PPO agent against a classical Destination Dispatching algorithm, passenger wait times dropped by 84% under morning and evening rush-hour traffic patterns. Work is now underway on more realistic elevator physics to refine performance, and the code has been released on GitHub.
Full text
Comparing a classical Destination Dispatching algorithm against a PPO-trained reinforcement learning agent in a custom-built Gymnasium simulation environment.

Stack: Python · Gymnasium · stable-baselines3 · FastAPI (in progress) · Next.js (in progress)

Results

|  | Classic Agent | PPO Agent |
|---|---|---|
| Mean Avg Reward | -0.67 | +0.14 |
| Mean Dropoffs / Episode | 2,664 | 2,855 (+7%) |
| Mean Avg Wait (steps) | 600.61 | 93.06 (6× faster) |

Evaluated over 1,000 independent episodes of 10,000 steps each.

Repository structure

```
elevator-ai/
├── environment/
│   ├── building.py           # Core simulation entities
│   ├── elevator_env.py       # Gymnasium environment
│   └── traffic_patterns.py   # Probabilistic passenger spawning
├── agents/
│   ├── classic_agent.py      # Destination Dispatching baseline
│   └── ppo_agent.py          # PPO model definition
├── training/
│   └── train.py              # Training script
├── tests/
│   ├── test_classic.py       # Classic agent evaluation
│   └── test_ppo.py           # PPO agent evaluation
└── README.md
```

Install dependencies:

```bash
pip install gymnasium stable-baselines3 numpy
```

Train the PPO agent:

```bash
cd training
python train.py
```

Evaluate both agents:

```bash
cd tests
python test_classic.py
python test_ppo.py
```

Roadmap

- Simulation environment (Phase 1 + Phase 2)
- Classical Destination Dispatching baseline
- PPO agent training and evaluation
- Realistic elevator speed simulation (in progress)
- FastAPI inference server
- Next.js visualization frontend

Reinforcement Learning for Elevator Dispatch: A Comparative Study of PPO Against Classical Destination Dispatching

Jonas Brahmst · [email protected] · Independent Research Project · 2025

Abstract

This paper investigates whether a Proximal Policy Optimization (PPO) reinforcement learning agent can outperform a classical Destination Dispatching algorithm in a simulated multi-elevator environment. A custom Gymnasium environment was designed to model a 20-floor building with 4 elevators and probabilistic passenger traffic. After extensive reward engineering and architectural iteration, the PPO agent achieved a mean average reward of +0.14 per step compared to -0.67 for the classical baseline, a significant improvement driven primarily by a 6× reduction in average passenger wait time (93 vs. 601 steps).

Introduction

Elevator dispatch algorithms are a well-studied problem in operations research. Modern buildings typically rely on rule-based systems such as SCAN or Destination Dispatching: deterministic algorithms that assign elevators to floors based on proximity and direction. While effective in predictable traffic conditions, such systems cannot adapt to learned patterns. This project explores whether a reinforcement learning agent trained via PPO can discover a superior dispatch policy from experience alone, without any hand-coded routing logic. The comparison is designed to be rigorous: both agents operate in identical environments under identical traffic conditions, and performance is averaged over 1,000 independent evaluation episodes.

Simulation Environment

The simulation models a 20-floor building with 4 independent elevators. Each elevator maintains a list of target floors representing passengers currently aboard. Each floor tracks whether passengers are waiting to travel up or down, along with how long they have been waiting.

- Building: 20 floors, 4 elevators
- Max steps: 10,000 per episode
- Time model: 1 step ≈ 1 elevator floor traversal

A deliberate design decision was made to model elevator call buttons as boolean flags (waitingUp, waitingDown) rather than integer counters. This reflects reality: a floor call button does not convey how many people are waiting, only that someone is. This keeps the simulation grounded in the information a real elevator system would actually have access to.
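The post does not include the floor bookkeeping itself; below is a minimal sketch of what such call-button state could look like, using hypothetical names (the repository's building.py may structure this differently):

```python
from dataclasses import dataclass


@dataclass
class Floor:
    """Per-floor call state: boolean flags only, mirroring a real call button."""
    number: int
    waiting_up: bool = False    # an "up" call is pending (not how many passengers)
    waiting_down: bool = False  # a "down" call is pending
    wait_steps_up: int = 0      # steps since the oldest unanswered "up" call
    wait_steps_down: int = 0    # steps since the oldest unanswered "down" call

    def tick(self) -> None:
        """Advance waiting-time counters by one simulation step."""
        if self.waiting_up:
            self.wait_steps_up += 1
        if self.waiting_down:
            self.wait_steps_down += 1
```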
Passenger spawning is probabilistic with time-of-day modifiers:

| Period | Floors affected | Multiplier | Direction bias |
|---|---|---|---|
| Off-peak | All | 1× | Random |
| Morning Rush (07–10) | Floor 0 | 3× | Up (100%) |
| Morning Rush (07–10) | Floors 1–19 | 0.3× | Random |
| Evening Rush (16–20) | Floors 1–19 | 2× | Down (80%) |

Base probability was set to 0.02 per floor per step, yielding approximately 0.4 passengers per step across the building, a realistic load for 4 elevators.

A critical design decision was to implement a full two-phase passenger model:

- Phase 1: The passenger waits on a floor with a directional call.
- Phase 2: The passenger boards an elevator, a random destination floor is assigned, and the passenger travels to that destination.

Upon boarding, destination floors are sampled uniformly from valid floors in the travel direction:

```python
import random

# Boarding going up from floor N:
target = random.randint(floor.number + 1, num_floors - 1)

# Boarding going down from floor N:
target = random.randint(0, floor.number - 1)
```

This models the real-world scenario where the elevator system does not know a passenger's destination until they board and press a button.

The observation vector has 92 values:

| Component | Count | Description |
|---|---|---|
| Elevator position | 4 | Current floor (0–19) |
| Elevator direction | 4 | Moving value (-1, 0, +1) |
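The training script itself is not reproduced in the post. As a rough sketch, training with stable-baselines3's PPO on the custom environment could look like the following; the ElevatorEnv import path and all hyperparameters are assumptions for illustration, not the author's actual settings:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from environment.elevator_env import ElevatorEnv  # hypothetical import path

# Vectorize the custom Gymnasium environment for on-policy rollouts.
env = make_vec_env(ElevatorEnv, n_envs=8)

# Default MLP policy; hyperparameters are illustrative only.
model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,
    batch_size=256,
    gamma=0.99,
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
model.save("ppo_elevator")
```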
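Likewise, a minimal sketch of how the 1,000-episode evaluation could be run against the trained model, again assuming a hypothetical ElevatorEnv and the standard Gymnasium step API; the repository's test scripts may differ:

```python
import numpy as np
from stable_baselines3 import PPO

from environment.elevator_env import ElevatorEnv  # hypothetical import path

env = ElevatorEnv()
model = PPO.load("ppo_elevator")

per_step_rewards = []
for _ in range(1_000):                       # 1,000 independent episodes
    obs, _ = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:                          # each episode truncates at 10,000 steps
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        steps += 1
        done = terminated or truncated
    per_step_rewards.append(total_reward / steps)

print("mean average reward per step:", np.mean(per_step_rewards))
```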
This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.