ECE 512 – Project Report
Separating Music and Voice Signals from an Audio Signal
Music is the most glorious creation of humans: it touches the soul and the lives of human beings, and its effect on us is one of the things that separates us from other animals. Music has a tremendous effect on human life; it helps unite people from different backgrounds and cultural heritages by breaking boundaries. Songs play a crucial part in our everyday life. A song is, in general, a mixture of two components: the vocals and the background music. The background music is a mixture of different musical instruments, and the vocals are usually the voice of the singer.
Separating the music and vocals from a song helps us study its characteristics, pattern, pitch, rhythm, texture, and lyrics, which is useful for analysis, teaching, and composing purposes. A music and vocal separator system can work on the principles of repetition in music and blind source separation, and then does not depend on prior training, hand-picked features, or complex frameworks like the existing separation models.
Repetition is an important principle in a song: there is typically a repeating background and a non-repeating foreground vocal. The music and vocal separation system makes use of this principle to separate the music and vocals from a given audio signal. Blind source separation (BSS) is the process of separating a set of source signals from a given set of mixed signals with little or no information about the sources or the mixing process. This blind source separation technique can be applied effectively to multidimensional data.
The music and vocal separation system involves the extraction of the repeating structure in the song. The period of the repeating structure is found first; the spectrogram is then segmented at the period boundaries and averaged to create a repeating segment model. Each individual time-frequency bin in a segment is compared with the model, and binary time-frequency masking is used to partition the mixture by labelling the bins similar to the model as the repeating background. In this project, a simple music and vocal separator system has been built using the REpeating Pattern Extraction Technique (REPET). Unlike existing techniques, it is based only on self-similarity and can be applied to any audio signal as long as there is a repeating structure. This method is simple, fast, blind, and completely automatable.
In general, the music and vocal separation involves three major parts: finding the repeating period, building the repeating segment model, and binary time-frequency masking. The repeating period p is found by estimating the period of the repeating musical structure, computing the periodicities with the autocorrelation function, and using its peaks. The repeating segment model is estimated by using the estimated period p of the repeating musical structure and evenly segmenting the spectrogram into segments of length p. The binary time-frequency mask M is calculated by taking the logarithm of each bin of the spectrogram divided by the model to obtain a modified spectrogram, and by adding a tolerance to the binary time-frequency mask.
Repeating Period:
The repeating period p can be estimated by first identifying the repeating segments in a song and then calculating the periodicities using the autocorrelation function, which measures the similarity between a segment and its lagged versions over successive time intervals. The spectrogram V of the mixture is calculated by applying the Short-Time Fourier Transform with a Hamming window of length N and taking the magnitude, discarding the symmetric part while retaining the DC component. Autocorrelation is then performed on each frequency row of the squared spectrogram V² to obtain the autocorrelation matrix B; the squaring also helps to enhance the peaks in the autocorrelation matrix. If the given audio signal has multiple channels, V² is obtained by averaging over the channels. The vector b, which estimates the self-similarity of the given song as a function of time lag, is obtained by taking the mean over the rows of the autocorrelation matrix; b is then normalized by its first coefficient. Thus, a similarity matrix is not calculated explicitly, and the method provides a precise beat-pattern visualization, which is why b is called the beat spectrum. The first coefficient of the beat spectrum b measures the overall similarity present in the signal, and if a repeating pattern is present, peaks are formed in b according to the repeating pattern.
The period p is defined as the period of the longest strong repeating pattern, and it is represented by the peaks with the largest amplitude that repeat at the longest period in b.
Repeating Segment Model:
The repeating segment model S can be estimated by using the obtained period p. The spectrogram is divided into r segments of equal length p. It is assumed that the time-frequency bins belonging to the repeating pattern have the same values at each period p, equal to those of the repeating segment model. In practice, the element-wise geometric mean was seen to extract the repeating musical structure better than the arithmetic mean. The repeating segment model is therefore given by the element-wise geometric mean of the r segments,

S(i, j) = ( ∏ₗ₌₁ʳ V(i, j + (l − 1)p) )^(1/r),  for each frequency bin i and each frame j in 1 … p.
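As a sketch, the element-wise geometric mean of the r segments can be computed as below; the helper name and the small epsilon guard against log(0) are assumptions added for illustration.

```python
import numpy as np

def repeating_segment_model(V, p):
    """Element-wise geometric mean of the r full segments of length p."""
    r = V.shape[1] // p                       # number of full periods
    # Stack the r full segments: shape (r, n_freq, p)
    segs = np.stack([V[:, k * p:(k + 1) * p] for k in range(r)])
    eps = 1e-12                               # guard against log(0)
    # Geometric mean = exp of the arithmetic mean of the logs.
    return np.exp(np.log(segs + eps).mean(axis=0))

# Toy spectrogram (3 frequency bins) that repeats exactly every 4 frames:
V = np.tile(np.arange(1.0, 5.0), (3, 2))      # two identical periods
S = repeating_segment_model(V, 4)             # recovers one period
```

When the segments are identical, as in the toy example, the model simply recovers one period of the spectrogram.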
Time Frequency Masking:
The third important part is the calculation of the binary time-frequency mask M. The spectrogram V is divided element-wise by the repeating segment model, tiled to the full length of the spectrogram, and the absolute value of the logarithm of each bin is taken to obtain a modified spectrogram. V can then be partitioned by assigning the time-frequency bins whose values in the modified spectrogram are near 0 to the repeating background, based on the assumption that the repeating background structure and the varying foreground sound have sparse and disjoint time-frequency representations. In practice, however, the time-frequency bins of music and voice can overlap, and furthermore the repeating musical structure generally involves variations. Therefore, a tolerance t is added to the binary time-frequency mask M. A tolerance of t = 1 was seen to give good separation results for both the repeating background (music) and the non-repeating foreground (voice).
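A minimal sketch of the masking step, assuming the magnitude spectrogram V and the segment model S are already computed; the function name and the epsilon guard are illustrative additions.

```python
import numpy as np

def binary_mask(V, S, t=1.0):
    """Label bins whose log-ratio to the tiled model is within tolerance t."""
    n, p = V.shape[1], S.shape[1]
    reps = -(-n // p)                          # ceil(n / p)
    W = np.tile(S, (1, reps))[:, :n]           # tile model to full length
    eps = 1e-12                                # avoid log(0)
    D = np.abs(np.log((V + eps) / (W + eps)))  # modified spectrogram
    return (D <= t).astype(float)              # 1 = repeating background

# Toy example: one frequency bin; the third frame (value 8) deviates from
# the trivially repeating all-ones model, so it is labelled foreground.
V = np.array([[1.0, 1.0, 8.0, 1.0]])
S = np.array([[1.0]])
M = binary_mask(V, S, t=1.0)                   # -> [[1., 1., 0., 1.]]
```

With t = 1, a bin stays in the background as long as its magnitude is within a factor of e of the model's value.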
After the computation of the binary time-frequency mask M, it is symmetrized (mirrored onto the negative frequencies) and applied to the Short-Time Fourier Transform X of the audio signal x to get the STFT of the music, while the complementary mask (1 − M) gives the STFT of the voice. The estimated music signal and voice signal are finally obtained by performing the inverse Short-Time Fourier Transform back into the time domain.
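The final masking-and-inversion step can be sketched with `scipy.signal.stft`/`istft`. The all-ones mask below is only a placeholder standing in for the computed M, to show the mechanics; with scipy's one-sided STFT the explicit symmetrization is handled internally on inversion.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 8000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)   # 1 s test tone

# One-sided STFT: scipy mirrors the negative frequencies when inverting,
# so the mask does not need to be symmetrized explicitly here.
_, _, X = stft(x, fs, window="hamming", nperseg=1024)
M = np.ones_like(np.abs(X))        # placeholder for the computed mask

music_stft = M * X                 # masked STFT -> music estimate
voice_stft = (1 - M) * X           # complementary mask -> voice estimate

_, music = istft(music_stft, fs, window="hamming", nperseg=1024)
_, voice = istft(voice_stft, fs, window="hamming", nperseg=1024)
```

With the placeholder mask everything lands in the music estimate; swapping in the real M splits the mixture into the two time-domain signals.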
Results and Discussions: