Postgres HA on OCI: Why Failover Passes Every Test but Production Still Goes Down
🔬 Research
#ha
#oci
#patroni
#postgresql
#review
#failover
Original source: hackernews · Summarized and analyzed by Genesis Park
Summary
Unlike on AWS or Azure, a virtual IP (VIP) on OCI cannot move between instances automatically, so after a PostgreSQL failover, traffic can keep flowing to the old node. Changing the IP at the OS level is not enough: the OCI control plane API must be called to explicitly reassign the IP at the network level before the failover is actually complete. One way to avoid the problem entirely is to skip the VIP and route through HAProxy, which uses the Patroni API to check each node's state.
Body
Your PostgreSQL HA cluster promotes a new primary. Patroni says everything is healthy. But your application is still talking to the old, dead node. Welcome to the OCI VIP problem.

If you have built PostgreSQL high availability clusters on AWS or Azure, you have probably gotten comfortable with how virtual IPs work. You assign a VIP, your failover tool moves it, and your application reconnects to the new primary. Clean. Simple. Done.

Then you try the same thing on Oracle Cloud Infrastructure and something quietly goes wrong. The cluster promotes. Patroni (or repmgr, or whatever you are using) does its job. The standby becomes the new primary. But the VIP does not follow. Your application keeps sending traffic to the old node — the one that just failed. From the outside, it looks like the database is down. From the inside, everything looks green.

This is one of the more frustrating failure modes we have worked through in production. Not because it is hard to fix, but because it is hard to catch. It passes every test you throw at it right up until the moment it matters. Let me walk you through why this happens, how to fix it, and how to pick the right approach for your environment.

Why OCI Handles VIPs Differently

On AWS, a secondary private IP can float between instances within a subnet. You call assign-private-ip-addresses and it moves. On Azure, you update a NIC's IP configuration. In both cases, your failover tool can handle this natively, or with a small callback script.

OCI does not work that way. On OCI, a virtual IP (implemented as a secondary private IP on a VNIC) is explicitly bound to a specific instance's Virtual Network Interface Card. It cannot float between instances the way it does on AWS or Azure. When your primary fails and the standby gets promoted, the VIP stays attached to the old instance's VNIC. It does not move on its own, and the standard failover tooling does not know it needs to make OCI API calls to move it.

Here is what is happening under the hood. On OCI, each compute instance has a primary VNIC with a primary private IP. You can add secondary private IPs to that VNIC, and those secondary IPs function as your VIPs. But reassigning that secondary IP to a different instance requires an explicit API call to detach it from one VNIC and attach it to another. The networking layer will not do this for you just because the IP was brought down on one host and brought up on another.

This is worth repeating: the IP address being "up" at the OS level on the new node does not mean OCI's networking fabric is routing traffic to it. You have to tell OCI to move the secondary IP assignment at the cloud control plane level. Otherwise, packets continue to arrive at the old VNIC — which is attached to an instance that is either down or no longer the primary.

This is not a PostgreSQL limitation. It is not a Patroni limitation. It is an OCI networking behavior that breaks assumptions baked into most HA tooling.
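To make that control-plane step concrete, here is a minimal sketch (not from the original article) of a Patroni on_role_change callback that reassigns the secondary private IP via the OCI CLI. The VNIC OCID, VIP address, interface name, and netmask are placeholder assumptions, and it presumes an installed, authenticated OCI CLI (for example via instance principals):

```bash
#!/bin/bash
# Hypothetical Patroni on_role_change callback; names and OCIDs are
# illustrative placeholders, not values from the original article.
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>.
set -euo pipefail

ACTION="$1"
ROLE="$2"

VNIC_OCID="ocid1.vnic.oc1..aaaaexample"  # assumption: this node's VNIC OCID
VIP="10.0.0.100"                         # assumption: the cluster's VIP
DEV="ens3"                               # assumption: OCI instance NIC name

# Act only when this node has just been promoted.
# (Older Patroni reports the role as "master", newer as "primary".)
if [ "$ACTION" = "on_role_change" ] && { [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; }; then
    # The control-plane step that AWS/Azure tooling does implicitly:
    # tell OCI to reassign the secondary private IP to this node's VNIC.
    # --unassign-if-already-assigned detaches it from the old VNIC first.
    oci network vnic assign-private-ip \
        --vnic-id "$VNIC_OCID" \
        --ip-address "$VIP" \
        --unassign-if-already-assigned

    # The OS-level step alone is not enough on OCI, but it is still needed
    # so the new primary actually answers traffic for the address.
    ip addr add "${VIP}/24" dev "$DEV" 2>/dev/null || true
fi
```

The key line is the assign-private-ip call: that is the cloud-level reassignment that ip addr add alone cannot accomplish on OCI.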
The Silent Failure: What It Actually Looks Like

Here is the scenario that catches teams off guard. You set up a three-node Patroni cluster on OCI. Primary on node-1, synchronous standby on node-2, async standby on node-3. You configure a VIP on node-1. You test failover. Patroni promotes node-2. You check patronictl list and everything looks correct — node-2 is the leader, node-1 is a replica (or down). But the application never reconnects. The VIP is still registered on node-1's VNIC inside OCI's networking fabric.

Even though node-2 brought the VIP up at the OS level (using ip addr add), OCI is not routing traffic to node-2. The packets are still going to node-1's VNIC. If node-1 is completely down, the application gets connection timeouts. If node-1 is up but demoted to a replica, something worse happens — the application connects successfully but hits a read-only node and starts throwing errors on every write.

Either way, your failover did not actually fail over from the application's perspective. And unless you are specifically testing application connectivity after failover (not just checking cluster state), you will miss this during testing.

Two Approaches That Work in Production

We have deployed both of these in production on OCI with PostgreSQL clusters targeting 99.99% availability. They solve the same problem from different angles, and the right choice depends on your architecture and your team's operational preferences.

Approach 1: HAProxy with Health Checks (Skip the VIP Entirely)

The most direct way to sidestep the OCI VIP problem is to stop relying on a VIP for routing. Instead, put HAProxy in front of your PostgreSQL cluster and let it figure out which node is the primary.

Here is how it works. HAProxy sits between your application and the PostgreSQL nodes. It performs health checks against each node's Patroni REST API. Patroni exposes endpoints that tell you exactly what role each node is playing (a configuration sketch follows this list):

- GET /primary returns HTTP 200 only on the current primary
- GET /replica returns HTTP 200 only on healthy replicas
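A minimal haproxy.cfg sketch of this routing, assuming Patroni's REST API on its default port 8008 and placeholder node addresses (10.0.0.11-13); none of this is verbatim from the article:

```
# Route writes to whichever node answers 200 on Patroni's /primary endpoint.
listen postgres_primary
    bind *:5432
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.11:5432 check port 8008
    server node2 10.0.0.12:5432 check port 8008
    server node3 10.0.0.13:5432 check port 8008

# Optional: route read-only traffic to healthy replicas via /replica.
listen postgres_replicas
    bind *:5433
    mode tcp
    option httpchk GET /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.11:5432 check port 8008
    server node2 10.0.0.12:5432 check port 8008
    server node3 10.0.0.13:5432 check port 8008
```

When Patroni promotes node-2, its /primary check starts returning 200 and node-1's stops; HAProxy moves new connections over within a few check intervals, and no OCI control-plane call is needed because no VIP is involved.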
This analysis was produced by the Genesis Park editorial team with the help of AI. The original article is available via the source link.