How the Fourier Transform Turns Sound into Frequencies

Towards Data Science | 🔬 Research
#review #sound #frequency #Fourier Transform
Original source: Towards Data Science · Summarized and analyzed by Genesis Park

Summary

This article explains how the Fourier Transform decomposes a complex sound signal into its individual frequency components, with an emphasis on visual intuition. Using concrete examples such as the winding machine and spectrograms, the author shows in detail how an abstract mathematical formula actually processes and transforms data.

Main Text

Why This Piece Exists

Before we get into the Fourier Transform, you should have a basic understanding of how digital sound is stored — specifically sampling and quantization. Let me quickly cover it here so we’re on the same page.

Sound in the real world is a continuous wave — air pressure changing smoothly over time. But computers can’t store continuous things; they need discrete numeric values. To store sound digitally, we do two things.

First, sampling — we take “snapshots” of the sound wave’s amplitude at regular intervals. How many snapshots per second? That’s the sampling rate. CD-quality audio takes 44,100 snapshots per second (44.1 kHz). For speech in ML pipelines, 16,000 per second (16 kHz) is common and mostly sufficient. I’ve worked with 16 kHz speech data extensively, and it captures pretty much everything that matters for speech. The key idea is that we’re converting a smooth continuous wave into a series of discrete points in time.

Second, quantization — each snapshot needs to record how loud the wave is at that moment, and with how much precision. This is the bit depth. With 16-bit audio, each amplitude value can be one of 65,536 possible levels (2¹⁶) — more than enough that the human ear can’t notice any difference from the original. With only 8 bits, you’d have just 256 levels, and the audio would sound rough and grainy because the gap between the true amplitude and the closest storable value (this gap is called quantization error) becomes audible.

After sampling and quantization, what we have is a sequence of numbers — amplitude values at evenly spaced time steps — stored in the computer. That’s our time domain signal. That’s g(t). And that’s what the Fourier Transform takes as input.

I’ve spent a good amount of time working hands-on with audio data preprocessing and model training, mostly dealing with speech data. While this piece builds everything from first principles, a lot of what’s written here comes from actually running into these things in real pipelines, not just textbook reading. Also, a promise — no AI slop here. Let’s get into it.
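To make those two steps concrete, here is a minimal sketch of sampling and quantization in Python (the language of the NumPy/librosa tools mentioned later). The 440 Hz test tone, the one-second duration, and the `quantize` helper are illustrative assumptions of mine, not anything specified in the original article.

```python
# A minimal sketch of sampling and quantization (assumed 440 Hz test tone).
import numpy as np

sample_rate = 16_000                       # 16 kHz, common for speech pipelines
duration = 1.0                             # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate  # sampling: discrete time steps

# The "continuous" wave, evaluated only at the sampled instants
g = np.sin(2 * np.pi * 440 * t)            # amplitude in [-1, 1]

def quantize(signal, bits):
    """Round each amplitude to the nearest of 2**bits storable levels."""
    levels = 2 ** bits
    q = np.round((signal + 1) / 2 * (levels - 1))   # map [-1, 1] onto integer levels
    return q / (levels - 1) * 2 - 1                 # map back to [-1, 1]

g_16bit = quantize(g, 16)                  # 65,536 levels: error is inaudible
g_8bit = quantize(g, 8)                    # 256 levels: error becomes audible graininess

print("max quantization error, 16-bit:", np.max(np.abs(g - g_16bit)))
print("max quantization error,  8-bit:", np.max(np.abs(g - g_8bit)))
```

The gap printed at the end is exactly the quantization error described above; it shrinks by half for every extra bit of depth.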
The Setup: What We’re Starting With

The original audio signal — for complex sounds (including harmonic ones) like the human voice or musical instruments — is typically made up of a combination of frequencies: constituent frequencies, or a superposition of frequencies. The sound we’re talking about lives in the time domain: an amplitude vs. time graph. That is how the sampled points from the original sound are stored in a computer in digital form.

The Fourier Transform (FT) is the mechanism through which we convert that graph from the time domain (x-axis → time, y-axis → amplitude) into a frequency domain representation (x-axis → frequency, y-axis → amplitude of contribution). If you’ve ever used librosa.stft() or np.fft.rfft() in your ML pipeline and wondered what’s actually happening under the hood when you go from raw audio to a spectrogram — this is it. The Fourier Transform is the foundation underneath all of it. Let’s talk at an intuition level about what we’re aiming for and how the Fourier Transform delivers it, in an organized way.

Our Goal

We want to find the values of those frequencies whose combination makes up the original sound. By “original sound,” I mean the digital signal we’ve stored through sampling and quantization, via an ADC, in our digital system. In simpler terms, we want to extract the constituent frequencies from which the complex sound is composed.

It’s analogous to having a bucket in which all colours are mixed, and wanting to separate out the constituent colours. The bucket of mixed colours is the original audio signal; the constituent colours are the constituent frequencies. We want a graph that easily tells us which frequencies contribute what amplitude to the original sound. The x-axis of that graph should have all the frequency values, and the y-axis should have the amplitude of contribution corresponding to each frequency. The frequencies that are actually present in the signal will show up as peaks; everything else will be near zero. Our input is the amplitude-time graph, and our output is the amplitude-frequency graph from the Fourier Transform.

Since these two graphs look so different, it’s clear there is real mathematics involved. And to be honest, advanced mathematical tools — the Fourier Transform and complex numbers — are what convert our input (the time domain graph) into our output (the frequency domain graph). But to build an intuition for why the Fourier Transform does the job correctly, it’s essential to first understand what the Fourier Transform does so that our goal is achieved, and then how it achieves it. The WHAT, the HOW, and the WHY.

The WHAT: What Does FT Actually Do?

In answering the WHAT, we don’t need to see what math is going on inside — we just want to know what input it takes and what output it produces.
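As a rough illustration of that input/output relationship, here is a minimal sketch that builds a toy two-tone signal and runs it through np.fft.rfft. The specific tones (440 Hz and 1,000 Hz), their amplitudes, and the peak threshold are assumptions made for the demo; the point is only that the constituent frequencies show up as peaks on the amplitude-frequency graph while everything else stays near zero.

```python
# A minimal sketch of the amplitude-vs-frequency "output graph" described above,
# using an assumed toy signal made of two known tones (440 Hz and 1,000 Hz).
import numpy as np

sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate           # 1 second of samples (time domain)

# The "bucket of mixed colours": a superposition of two frequencies
signal = 1.0 * np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1_000 * t)

# Fourier Transform of the real-valued signal: time domain -> frequency domain
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)   # x-axis: frequency values
magnitude = np.abs(spectrum) / (len(signal) / 2)          # y-axis: amplitude of contribution

# The constituent frequencies appear as peaks; all other bins are near zero
peaks = freqs[magnitude > 0.1]
print(peaks)   # -> [ 440. 1000.]
```

Plotting `magnitude` against `freqs` would give exactly the goal graph described above: two spikes at the constituent frequencies, with heights matching their amplitudes of contribution.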

This analysis was written by the Genesis Park editorial team with the help of AI. The original article can be found via the source link.
