Show HN: Volresample – 3D volume resampling up to 13× faster than PyTorch on CPU

hackernews | 📰 News
#3d-volume #cython #openmp #pytorch #performance-optimization #hardware/semiconductors #parallel-processing
Original source: hackernews · Summarized and analyzed by Genesis Park

Summary

A developer has released 'volresample', an open-source library for resampling 3D volumes (such as medical images) stored as NumPy arrays, with no PyTorch dependency. In benchmarks on an Intel i7, the tool outperformed PyTorch by up to 1.8× for trilinear interpolation, up to 9.5× for area mode, and up to 13× for nearest-neighbor resampling of int16 data. These speedups come from optimizations such as precomputed index tables, OpenMP parallelization, and the elimination of dtype casts.
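To make the index-table optimization concrete, here is a rough pure-NumPy sketch of the idea for nearest-neighbor downsampling. This is an illustration only, not the library's actual Cython code; the function name `nearest_resample_3d` is hypothetical.

```python
import numpy as np

def nearest_resample_3d(vol, out_shape):
    """Nearest-neighbor 3D resampling via precomputed per-axis index tables.

    Illustrative pure-NumPy sketch of the optimization idea; the library
    itself does this in Cython with OpenMP, and avoids dtype casts by
    working on the input dtype directly.
    """
    # Precompute one source-index table per axis (PyTorch 'nearest-exact'
    # style: src = floor((dst + 0.5) * in_size / out_size), clamped), then
    # reuse it for every voxel instead of recomputing coordinates per voxel.
    tables = [
        np.minimum(((np.arange(n_out) + 0.5) * n_in / n_out).astype(np.intp),
                   n_in - 1)
        for n_in, n_out in zip(vol.shape, out_shape)
    ]
    # One fancy-indexing pass; the output keeps the input dtype.
    return vol[np.ix_(tables[0], tables[1], tables[2])]

vol = np.random.randint(0, 1000, size=(128, 128, 128)).astype(np.int16)
out = nearest_resample_3d(vol, (64, 64, 64))
print(out.shape, out.dtype)  # (64, 64, 64) int16
```

Because the gather works directly on the stored dtype, no intermediate float buffer is needed, which is where the large int16 speedup over PyTorch comes from.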

Full text

Fast 3D volume resampling with Cython and OpenMP parallelization. Implemented against PyTorch's F.interpolate and F.grid_sample as a reference, producing identical results. Can be used as a drop-in replacement when PyTorch is not available or when better performance is desired on CPU.

Features

- Cython-optimized with OpenMP parallelization
- Simple API: resample() and grid_sample()
- Interpolation modes: nearest, linear, and area
- Supports 3D, 4D (multi-channel), and 5D (batched) volumes
- Supports uint8 and int16 (nearest only) and float32 (all modes)

Installation

```
pip install volresample
```

Or build from source:

```
git clone https://github.com/JoHof/volresample.git
cd volresample
uv sync
```

Usage

```python
import numpy as np
import volresample

# Create a 3D volume
volume = np.random.rand(128, 128, 128).astype(np.float32)

# Resample to a different size
resampled = volresample.resample(volume, (64, 64, 64), mode='linear')
print(resampled.shape)  # (64, 64, 64)

# 4D volume with 4 channels
volume_4d = np.random.rand(4, 128, 128, 128).astype(np.float32)

# Resample all channels
resampled_4d = volresample.resample(volume_4d, (64, 64, 64), mode='linear')
print(resampled_4d.shape)  # (4, 64, 64, 64)

# 5D volume with batch dimension (N, C, D, H, W)
volume_5d = np.random.rand(2, 4, 128, 128, 128).astype(np.float32)

# Resample all batches and channels
resampled_5d = volresample.resample(volume_5d, (64, 64, 64), mode='linear')
print(resampled_5d.shape)  # (2, 4, 64, 64, 64)
```

grid_sample():

```python
# Input volume: (N, C, D, H, W)
input = np.random.rand(2, 3, 32, 32, 32).astype(np.float32)

# Sampling grid with normalized coordinates in [-1, 1]
grid = np.random.uniform(-1, 1, (2, 24, 24, 24, 3)).astype(np.float32)

# Sample with linear interpolation
output = volresample.grid_sample(input, grid, mode='linear', padding_mode='zeros')
print(output.shape)  # (2, 3, 24, 24, 24)
```

Thread control:

```python
import volresample

# Check default thread count (min of cpu_count and 4)
print(volresample.get_num_threads())  # e.g., 4

# Set custom thread count
volresample.set_num_threads(8)

# All subsequent operations use 8 threads
resampled = volresample.resample(volume, (64, 64, 64), mode='linear')
```

API

resample()

Resample a 3D, 4D, or 5D volume to a new size.

Parameters:

- data (ndarray): Input volume of shape (D, H, W), (C, D, H, W), or (N, C, D, H, W)
- size (tuple): Target size (D_out, H_out, W_out)
- mode (str): Interpolation mode:
  - 'nearest': nearest neighbor (works with all dtypes)
  - 'linear': trilinear interpolation (float32 only)
  - 'area': area-based averaging (float32 only, suited for downsampling)

PyTorch correspondence:

| volresample | PyTorch F.interpolate |
|---|---|
| mode='nearest' | mode='nearest-exact' |
| mode='linear' | mode='trilinear' |
| mode='area' | mode='area' |

volresample does not expose an align_corners parameter; the behavior matches PyTorch's align_corners=False (the default).

Returns:

- Resampled array with the same number of dimensions as the input

Supported dtypes:

- uint8, int16: only with mode='nearest'
- float32: all modes

grid_sample()

Sample the input at arbitrary locations specified by a grid.

Parameters:

- input (ndarray): Input volume of shape (N, C, D, H, W)
- grid (ndarray): Sampling grid of shape (N, D_out, H_out, W_out, 3), with values in [-1, 1] where -1 maps to the first voxel and 1 to the last
- mode (str): 'nearest' or 'linear'
- padding_mode (str): 'zeros', 'border', or 'reflection'

PyTorch correspondence:

| volresample | PyTorch F.grid_sample |
|---|---|
| mode='nearest' | mode='nearest' |
| mode='linear' | mode='bilinear' |

The behavior matches PyTorch's grid_sample with align_corners=False.

Returns:

- Sampled array of shape (N, C, D_out, H_out, W_out)

set_num_threads()

Set the number of threads used for parallel operations.

Parameters:

- num_threads (int): Number of threads to use (must be >= 1)

get_num_threads()

Get the current number of threads used for parallel operations.

Returns:

- Current thread count (default: min(cpu_count, 4))

Benchmarks

Benchmarks on an Intel i7-8565U against PyTorch 2.8.0. Times are means over 10 iterations.
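The post does not include the benchmark harness itself. For readers who want to reproduce such numbers, a minimal mean-over-N-iterations timing sketch might look like this; `bench` and the stand-in workload are assumptions, not the author's code.

```python
import time
import numpy as np

def bench(fn, *args, iters=10, warmup=2):
    """Mean wall-clock time of fn(*args) over `iters` runs, after
    `warmup` untimed runs to exclude one-off setup costs."""
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Stand-in workload; in the real comparison the two timed calls would be
# volresample.resample(...) and torch.nn.functional.interpolate(...).
vol = np.random.rand(64, 64, 64).astype(np.float32)
mean_s = bench(np.sort, vol)
print(f"mean: {mean_s * 1e3:.2f} ms")
```

When comparing against PyTorch on CPU, thread counts should be pinned on both sides (e.g., torch.set_num_threads and volresample.set_num_threads) so the single-thread and 4-thread columns are measured under the same conditions.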
resample() — single large 3D volume:

| Operation | Mode | volresample (1 thread) | PyTorch (1 thread) | Speedup | volresample (4 threads) | PyTorch (4 threads) | Speedup |
|---|---|---|---|---|---|---|---|
| 512³ → 256³ | nearest | 23.6 ms | 38.0 ms | 1.6× | 12.6 ms | 16.7 ms | 1.3× |
| 512³ → 256³ | linear | 99.9 ms | 182 ms | 1.8× | 34.3 ms | 54.6 ms | 1.6× |
| 512³ → 256³ | area | 230 ms | 611 ms | 2.7× | 64.5 ms | 613 ms | 9.5× |
| 512³ → 256³ | nearest (uint8) | 13.7 ms | 33.8 ms | 2.5× | 4.3 ms | 10.4 ms | 2.4× |
| 512³ → 256³ | nearest (int16) | 16.5 ms | 217 ms | 13.2× | 8.4 ms | 93.2 ms | 11.2× |

grid_sample() — single large 3D volume (128³ input):

| Mode | Padding | volresample (1 thread) | PyTorch (1 thread) | Speedup | volresample (4 threads) | PyTorch (4 threads) | Speedup |
|---|---|---|---|---|---|---|---|
| linear | zeros | 118 ms | 181 ms | 1.5× | 38.1 ms | 169 ms | 4.4× |
| linear | reflection | 103 ms | 211 ms | 2.1× | 33.2 ms | 194 ms | 5.9× |

Average speedup across all benchmarks: 3.1× at 1 thread, 6.0× at 4 threads.

Notes:

- Area mode: at 1 thread the speedup is 2.7×; at 4 threads it rises to 9.5×, since PyTorch's area-mode time is essentially unchanged by threading (611 ms vs. 613 ms in the table above).
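For an integer scale factor such as 512³ → 256³, area mode reduces to simple block averaging, which helps explain why a specialized kernel does so well on this benchmark. A reshape-and-mean sketch of the idea in NumPy (`area_downsample` is a hypothetical helper, assuming the factor divides the input evenly):

```python
import numpy as np

def area_downsample(vol, factor):
    """Area-mode downsampling by an integer factor: average each
    factor x factor x factor block. Matches area-based interpolation
    when the factor divides the input evenly; a general implementation
    must also handle non-integer factors.
    """
    d, h, w = vol.shape
    f = factor
    assert d % f == 0 and h % f == 0 and w % f == 0
    # Split each axis into (blocks, within-block) and average the
    # within-block axes in one vectorized pass.
    return vol.reshape(d // f, f, h // f, f, w // f, f).mean(axis=(1, 3, 5))

vol = np.random.rand(8, 8, 8).astype(np.float32)
out = area_downsample(vol, 2)
print(out.shape)  # (4, 4, 4)
```

Each output voxel touches a disjoint block of the input, so the loop over output voxels parallelizes trivially, which matches the large 4-thread gain in the area row above.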

This analysis was produced by the Genesis Park editorial team with the help of AI. The original post is available via the source link.
