Techniques

Non-negative matrix factorization (NMF)

Non-negative matrix factorization (NMF) is a popular signal processing technique for source separation problems. The input is a K×N matrix V representing a spectrogram (either a regular STFT spectrogram or a Mel spectrogram). The algorithm finds two matrices, W (of size K×S) and H (of size S×N), such that:

The matrix product WH approximates V.
Entries of W and H are non-negative.

The columns of W contain spectral features of the audio, and the rows of H contain the activations of those features over time. The common dimension S, called the rank of the factorization, is chosen by the user and is typically tied to the number of sources we wish to extract.

source: https://medium.com/@zahrahafida.benslimane/audio-source-separation-using-non-negative-matrix-factorization-nmf-a8b204490c7d

W and H are usually obtained by minimizing a cost function with iterative gradient descent updates. Specifically, we start from a random choice of W and H. The cost function measures the distance between V and WH, and its negative gradient gives the direction of steepest descent, so we repeatedly update W and H by taking a small step in that direction. Care must be taken so that the updated W and H remain non-negative, for example by clipping negative entries to zero after each update.

source: https://halfrost.me/post/how-to-understand-gradient-descent/
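The gradient descent procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a squared Euclidean cost, keeps W and H non-negative by clipping after every step, and picks the step sizes from the spectral norms of H Hᵀ and Wᵀ W so that the cost does not increase. The function name `nmf_projected_gradient` and its parameters are hypothetical.

```python
import numpy as np

def nmf_projected_gradient(V, rank, n_iter=200, seed=0):
    """Approximate V (K x N, non-negative) by W @ H using projected gradient descent.

    Assumed cost: 0.5 * ||V - W @ H||_F^2 (squared Euclidean distance).
    """
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((K, rank))   # random non-negative start
    H = rng.random((rank, N))
    errors = []
    for _ in range(n_iter):
        # Gradient step in W with H fixed, then clip negatives to zero.
        grad_W = (W @ H - V) @ H.T
        step_W = 1.0 / (np.linalg.norm(H @ H.T, 2) + 1e-12)
        W = np.maximum(W - step_W * grad_W, 0.0)

        # Gradient step in H with W fixed, then clip negatives to zero.
        grad_H = W.T @ (W @ H - V)
        step_H = 1.0 / (np.linalg.norm(W.T @ W, 2) + 1e-12)
        H = np.maximum(H - step_H * grad_H, 0.0)

        errors.append(0.5 * np.linalg.norm(V - W @ H) ** 2)
    return W, H, errors

# Usage on a small synthetic "spectrogram" that is exactly rank 2.
rng = np.random.default_rng(1)
V = rng.random((8, 2)) @ rng.random((2, 10))
W, H, errors = nmf_projected_gradient(V, rank=2, n_iter=100)
```

In practice, many NMF implementations use multiplicative update rules or other cost functions instead of plain clipping; the sketch only mirrors the description given here.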

Finally, the spectrogram of the i-th source is obtained by multiplying the i-th column of W by the i-th row of H. This outer product yields a matrix of the same size as V.
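As a small illustration of this reconstruction step, the sketch below builds one spectrogram per source from given W and H. The helper name `source_spectrograms` and the toy matrices are made up for the example; the only property relied on is that the per-source outer products add up to WH.

```python
import numpy as np

def source_spectrograms(W, H):
    """Return an array of shape (S, K, N); entry i is outer(W[:, i], H[i, :])."""
    # einsum keeps the source index s instead of summing over it.
    return np.einsum("ks,sn->skn", W, H)

# Toy factors (hypothetical values): K = 3 frequency bins, S = 2 sources, N = 2 frames.
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
H = np.array([[1.0, 2.0],
              [3.0, 0.0]])

sources = source_spectrograms(W, H)
first_source = sources[0]           # same as np.outer(W[:, 0], H[0, :])
total = sources.sum(axis=0)         # adds back up to W @ H
```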


Harmonic-percussive source separation (HPSS)

Harmonic-percussive source separation (HPSS) is a signal processing technique used to separate an audio signal into its harmonic (e.g., bass) and percussive (e.g., drums) components. Harmonic components vary smoothly and remain stable over time, whereas percussive components are transient and change rapidly over time.
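The text above does not say which HPSS algorithm is used. One common approach, shown here purely as an assumed illustration, is median filtering of the magnitude spectrogram: filtering along time keeps what is stable (harmonic), filtering along frequency keeps what is broadband and short-lived (percussive), and soft masks split the original between the two. The function names, the kernel size and the edge padding are all choices made for this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_1d(S, size, axis):
    """Median-filter a 2-D array along one axis (size must be odd)."""
    pad = size // 2
    pad_width = [(0, 0), (0, 0)]
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="edge")
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)

def hpss_median(S, kernel=5, eps=1e-10):
    """Split a magnitude spectrogram S (frequency x time) into harmonic and percussive parts."""
    harm = median_filter_1d(S, kernel, axis=1)   # smooth along time
    perc = median_filter_1d(S, kernel, axis=0)   # smooth along frequency
    mask_h = harm**2 / (harm**2 + perc**2 + eps)  # soft mask in [0, 1]
    return S * mask_h, S * (1.0 - mask_h)

# Synthetic spectrogram: one sustained tone (horizontal line)
# and one click (vertical line).
S = np.zeros((20, 30))
S[5, :] = 1.0    # stable over time: harmonic
S[:, 12] = 1.0   # present at all frequencies in one frame: percussive
S_harm, S_perc = hpss_median(S)
```

On this toy input the horizontal line ends up almost entirely in `S_harm` and the vertical line in `S_perc`, and the two parts always add back up to `S`.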

Mel spectrograms

The Mel spectrogram is a spectrogram whose frequency axis is expressed on the Mel scale. Human ears do not perceive frequency linearly: they are better at telling apart low frequencies than high frequencies. The Mel scale accounts for this. It is a perceptual scale on which pitches that listeners judge to be equally far apart are equally spaced.
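To make the non-linearity concrete, here is a small sketch of the Hz-to-Mel mapping. Several Mel formulas are in use; this one assumes the common 2595 · log10(1 + f / 700) convention, and the function names are chosen for the example.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to Mel (assumed 2595 * log10(1 + f / 700) convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel under the same convention."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers far more Mels at low frequencies than at high ones.
low_gap = hz_to_mel(200.0) - hz_to_mel(100.0)
high_gap = hz_to_mel(5100.0) - hz_to_mel(5000.0)
```

Under this convention `low_gap` comes out several times larger than `high_gap`, which matches the observation that low frequencies are easier to tell apart.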