ZFS, not FUSE: 13 GB/s from $0.02/GB GCS object storage on a single VM

hackernews | 🔬 Research
#gcs #review #vm #zfs #object-storage #storage-optimization
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

The article argues that FUSE-based bucket mounts are the wrong bridge between POSIX applications and cheap object storage, citing Linus Torvalds' criticism of userspace filesystems and the USENIX FAST '17 finding of up to 83% performance degradation. Instead, it treats each object storage bucket as a ZFS block device: a hybrid pool with a local NVMe special vdev for metadata and 12 striped GCS Standard buckets for bulk data, exposed through MayaNAS as an ordinary POSIX filesystem. On a single c4-highcpu-144 GCP instance this setup delivered 13.1 GB/s cold reads and 10.7 GB/s sustained reads from $0.02/GB storage, with ZFS checksums, snapshots, and in-kernel caching included.

Body

The Untapped Goldmine

Every hyperscaler has abundant object storage at $0.02/GB/month with 11 nines of durability. Petabytes of AI training data, genomic datasets, media archives — all sitting in GCS, S3, and Azure Blob. Cheap. Durable. Available. And largely untapped for high-performance workloads, because applications need POSIX file access and object storage doesn't speak POSIX.

The POSIX Bridge Problem

The industry agrees a bridge is needed. Google routes even its premium storage tiers through a filesystem mount so that "common AI frameworks such as TensorFlow and PyTorch can access object storage without having to modify any code." The question is what kind of bridge.

Today, the common answer is FUSE — Filesystem in Userspace. Every major cloud offers one: mount a bucket, get a filesystem. Simple in concept. FUSE was the simple, kneejerk choice — and it shows. Performance is the first casualty: kernel-to-userspace context switches on every I/O operation are an architectural tax you can't optimize away. Linus Torvalds was direct about this on linux-fsdevel: "People who think that userspace filesystems are realistic for anything but toys are just misguided." A USENIX FAST '17 study ("To FUSE or Not to FUSE") confirmed it: FUSE overhead caused up to 83% performance degradation, with 31% higher CPU utilization — even when optimized.

But performance is only the start. The deeper problem is that a FUSE bridge never becomes a real filesystem — it just accumulates caveats. No hard links. No file locking. Non-atomic renames. Whole-object rewrites for any modification. No POSIX access control. Concurrent writers get stale file handle errors instead of serialized I/O. The documentation itself recommends "retry strategies" for transient errors.

Cloud providers know about the performance gap. Their response has been to build faster backends — premium storage tiers, NVMe caching layers, managed parallel filesystems — so the backend is fast enough that FUSE overhead becomes tolerable. Impressive engineering, but it attacks the performance problem from the wrong end, and none of it fixes the limitations above.

A Different Approach: ZFS as the Bridge

What if you fix the access layer instead of retrofitting the backend? ZFS is a battle-tested kernel filesystem trusted for two decades, and it natively stripes data across multiple block devices. The insight: make each object storage bucket a block device. This is what MayaNAS does — a cloud-native ZFS storage solution that presents object storage as a standard filesystem. Deploy it via Terraform on GCP, AWS, or Azure, or self-serve from the cloud marketplaces.

Hybrid Storage Architecture

- ZFS special vdev (local NVMe) — metadata and small blocks
- Object storage vdevs (GCS/S3/Azure Blob buckets) — bulk data, striped across multiple buckets

This is not NVMe caching with copy-in/copy-out. ZFS places data automatically based on block size — metadata lands on NVMe, large data blocks land directly on object storage. One filesystem, one namespace, one mount point.

What you get — the whole platter:

- Real POSIX — open(), read(), seek(), stat(). Every application works unmodified.
- End-to-end checksums — every block verified on read. Silent corruption caught automatically.
- Snapshots and clones — instant, zero-cost, backed by object storage durability.
- In-kernel I/O — ZFS prefetch, ARC caching, and I/O scheduling all run in the kernel. No FUSE tax.
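To make the pool layout concrete, here is a minimal sketch of how such a hybrid pool could be assembled with stock OpenZFS tooling. It assumes the bucket-backed block devices have already been exposed by the object storage layer; the device names (/dev/gcsblk0 through /dev/gcsblk11, /dev/nvme0n1) and the pool name are hypothetical and not taken from the article.

```python
import subprocess

# Hypothetical device nodes: 12 block devices, each backed by one GCS bucket
# (exposed by the object storage layer), plus one local NVMe disk.
BUCKET_DEVICES = [f"/dev/gcsblk{i}" for i in range(12)]  # bulk data, striped
SPECIAL_DEVICE = "/dev/nvme0n1"                          # metadata
POOL = "tank"

def run(cmd):
    """Print and run a command, raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Listing plain devices stripes data across all of them; the trailing
# `special` group is the OpenZFS allocation class that keeps metadata
# on the local NVMe device instead of on the bucket-backed vdevs.
run(["zpool", "create", POOL, *BUCKET_DEVICES, "special", SPECIAL_DEVICE])

# One filesystem, one namespace, one mount point; POSIX from here on.
run(["zfs", "create", "-o", "mountpoint=/data", f"{POOL}/data"])
```

By default only metadata is allocated to the special vdev, which matches the test configuration below; the `special_small_blocks` dataset property could additionally route small data blocks there, but the article's setup keeps the special vdev metadata-only.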
Test Configuration

Hardware Specification

| Component | Specification |
| --- | --- |
| Compute Instance | c4-highcpu-144 (72 cores / 144 threads, Intel Xeon 6985P-C, 282 GB RAM) |
| Network | 150 Gbps Tier_1 networking, MTU 8896 (jumbo frames) |
| Storage Backend | 12 GCS Standard buckets ($0.02/GB/month), striped |
| Special Vdev | 10 GB local NVMe persistent disk (metadata only) |
| Zone | GCP us-central1 |

ZFS Configuration

| Parameter | Value |
| --- | --- |
| Recordsize | 1 MB (aligned with GCS object size for optimal throughput) |
| Compression | Disabled (compression breaks 1 MB alignment) |
| ARC Max | 80 GB (dataset is 210 GB = 2.6× ARC, ensuring reads hit GCS) |
| Prefetch Distance | 512 MB (default 8 MB — 64× deeper read-ahead) |
| Async Read Max Active | 96 per vdev (default 3 — 32× more concurrent reads per bucket) |

The Proof: 13 GB/s from Standard GCS

- 13.1 GB/s cold read (zero cache)
- 10.7 GB/s sustained 5-minute read

Cold Read: 13.1 GB/s

Zero cache — every byte from GCS. 210 GB in 17 seconds. dstat (1-second intervals) confirms pure network, zero local disk:

```
recv send | read writ
 13G  28M |    0    0
 15G  32M |    0    0   ← peak
 14G  31M |    0    0
 14G  31M |    0    0
 13G  28M |    0    0
```

13–15 GB/s network receive, zero disk reads. Every byte is live from GCS. (GCS Cloud Monitoring has a 60-second minimum granularity — a 17-second burst can't be captured cleanly, so dstat network telemetry is the proof for the cold run.)

Sustained 5-Minute Read: 10.7 GB/s — with Google's Own Proof

Continuous read for 5 minutes, 3 TB total. For this run, we have third-party proof. Google Cloud Monitoring — Google's own infrastructure telemetry — confirms per-mi
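As a rough illustration of how the tuning table above maps onto standard OpenZFS knobs, the sketch below sets the dataset properties with `zfs set` and writes the module parameters under /sys/module/zfs/parameters. The names zfetch_max_distance and zfs_vdev_async_read_max_active are a plausible reading of "Prefetch Distance" and "Async Read Max Active" (their stock defaults of 8 MB and 3 match the table), so treat that mapping as an assumption and verify it against the OpenZFS version in use.

```python
import subprocess
from pathlib import Path

POOL = "tank"  # hypothetical pool name from the earlier sketch

# Dataset properties from the "ZFS Configuration" table.
subprocess.run(["zfs", "set", "recordsize=1M", POOL], check=True)    # align with 1 MB GCS objects
subprocess.run(["zfs", "set", "compression=off", POOL], check=True)  # keep the 1 MB alignment intact

# Runtime-tunable OpenZFS module parameters (requires root).
PARAMS = {
    "zfs_arc_max": str(80 * 1024**3),            # 80 GB ARC cap
    "zfetch_max_distance": str(512 * 1024**2),   # 512 MB prefetch distance (assumed mapping)
    "zfs_vdev_async_read_max_active": "96",      # concurrent async reads per vdev (assumed mapping)
}

for name, value in PARAMS.items():
    Path("/sys/module/zfs/parameters", name).write_text(value)
```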

This analysis was written by the Genesis Park editorial team with AI assistance. The original article is available via the source link.
