Practical music classification project

Practical music classification project - panden's Machine Learning notes

Music structure analysis

  • Enjoy a song first

"Unconditional" (Eason Chan)

  • View its waveform

View the waveform of the first second of "Unconditional"

#%%Waveform display
import os

import matplotlib.pyplot as plt
import librosa
import librosa.display
import numpy as np
from pydub import AudioSegment

# 1 second = 1000 milliseconds (pydub slices audio in milliseconds)
SECOND = 1000
# Music file
AUDIO_PATH = 'Unconditional-Eason Chan(30 second).wav'

def split_music(begin, end, filepath):
    # Import music
    song = AudioSegment.from_wav(filepath)

    # Take the segment from begin seconds to end seconds
    song = song[begin*SECOND: end*SECOND]

    # Export the segment to a temporary file next to the original
    root, ext = os.path.splitext(filepath)
    temp_path = root + '(Waveform display)' + ext
    song.export(temp_path, format='wav')

    return temp_path

# By default librosa.load resamples to 22050 Hz and mixes down to mono
music, sr = librosa.load(split_music(0, 1, AUDIO_PATH))

# Figure with aspect ratio of 14:5
plt.figure(figsize=(14, 5))
# waveplot was removed in librosa 0.10; on older versions use librosa.display.waveplot
librosa.display.waveshow(music, sr=sr)
plt.show()
  • The waveform within 1 second is as follows:

That's still too dense to make anything out, so zoom in to within 0.1 seconds.

Continuing from the code above:

# Zoom in on 1000 samples (about 0.045 s at librosa's default 22050 Hz)
n0 = 9000
n1 = 10000
 
# librosa.load already returns a NumPy array, so no conversion is needed
plt.figure(figsize=(14, 5))
plt.plot(music[n0:n1])
plt.grid()
 
# Display diagram
plt.show()
  • The waveform within 0.1s is as follows:

What if we look only at the positive half of the axis?

# Show only the positive half: keep the samples above zero
# (the other samples are dropped, so the time axis is compressed)
music = music[music > 0]
plt.figure(figsize=(14, 5))
plt.plot(music[n0:n1])
plt.grid()
 
# Display diagram
plt.show()
  • The positive half of the waveform within 0.1 s is as follows:

Fourier analysis of audio structure

Time domain and frequency domain (preliminary knowledge)

Time domain: the view of a signal as it changes over time, the way we experience things in real life. The figures above are in the time domain: the horizontal axis is time and the vertical axis is amplitude.

Frequency domain: the view of a signal in terms of its frequency components, which do not change with time.

From the moment we are born, the world we see runs through time. Stock prices, people's heights, and the trajectories of cars all change over time. This way of observing a dynamic world with time as the reference is called time domain analysis. We take it for granted that everything in the world keeps changing over time and never stops. But what if I told you that, observed another way, the world is eternal? Do you think I'm crazy? I'm not. This static world is called the frequency domain.

Fourier series

The simplest way to understand a Fourier series is that any (reasonably well-behaved) periodic function can be decomposed into a pile of sinusoids of the form $A\sin(\omega x + \varphi)$. The pile may be infinite. And because $\sin(\alpha + \beta) = \sin\alpha\cos\beta + \sin\beta\cos\alpha$, we can equally say that any such periodic function can be decomposed into a pile of sine and cosine functions.
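To spell out that last step, here is a short derivation. The symbols $a$ and $b$ are introduced only for illustration and are not used elsewhere in these notes. Applying the angle-sum identity with $\alpha = \omega x$ and $\beta = \varphi$ gives:

```latex
A\sin(\omega x + \varphi)
  = \underbrace{A\cos\varphi}_{a}\,\sin(\omega x)
  + \underbrace{A\sin\varphi}_{b}\,\cos(\omega x),
\qquad A = \sqrt{a^{2} + b^{2}}
```

So one sinusoid with a phase shift carries the same information as a sine plus a cosine at the same frequency.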

Take the following example:

As shown in the figure, it is a square wave:

So how can we build such a square wave out of sine waves?

  • Step 1: one sine wave

$$y = \frac{4}{\pi}\sin x$$

  • Step 2: two sine waves

Add another sine wave to the one above:
$$y = \frac{4}{3\pi}\sin 3x$$

Stack them:
$$y = \frac{4}{\pi}\sin x + \frac{4}{3\pi}\sin 3x$$

  • Step 3: stack again
    $$y = \frac{4}{\pi}\sin x + \frac{4}{3\pi}\sin 3x + \frac{4}{5\pi}\sin 5x$$

  • Step n: keep adding
    $$y = \frac{4}{\pi}\sin x + \frac{4}{3\pi}\sin 3x + \frac{4}{5\pi}\sin 5x + \cdots$$

With 50 terms superimposed, the sum is very close to the square wave.
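Written compactly, the pattern in the steps above is the sum below. This is a standard way of writing the series; the index $k$ and the name $y_N$ are introduced here for illustration.

```latex
y_N(x) = \frac{4}{\pi}\sum_{k=0}^{N-1}\frac{\sin\big((2k+1)\,x\big)}{2k+1}
```

With $N = 3$ this is the sum of the $\sin x$, $\sin 3x$ and $\sin 5x$ terms, and $N = 50$ gives the 50-term approximation.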

Sound waves can be described by Fourier series like this one. Next, we will look at the Fourier series in both the time domain and the frequency domain.

Graphical Fourier series (taking 3 terms as an example)

$$y = \frac{4}{\pi}\sin x + \frac{4}{3\pi}\sin 3x + \frac{4}{5\pi}\sin 5x$$

  • Time domain and frequency domain

In the figure, the black curve is y and the other colors are its components. What looks very complex in the time domain (y) is just a handful of sine waves in the frequency domain, and those sine waves are eternal.

  • Just look at the frequency domain

However, knowing only the frequencies of the sine waves clearly does not pin down a unique sound wave. The initial phases may differ, that is, the components need not start from the same value, even though the superposition is still periodic.
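A minimal sketch of this point, using NumPy's `numpy.fft.rfft` and two made-up test signals (the sample count `N` and the signals are illustrative assumptions, not taken from the song). Both signals contain the same frequencies with the same amplitudes; only the phase of one component differs.

```python
import numpy as np

# One full period sampled at N points (N and the test signals are illustrative)
N = 256
x = np.arange(N) * 2 * np.pi / N

# Same frequencies and amplitudes; only the phase of the second component differs
y_a = np.sin(x) + np.sin(3 * x) / 3
y_b = np.sin(x) + np.sin(3 * x + np.pi / 2) / 3

# Amplitude of each frequency component (bin k corresponds to k cycles per period)
amp_a = 2 * np.abs(np.fft.rfft(y_a)) / N
amp_b = 2 * np.abs(np.fft.rfft(y_b)) / N

print(np.allclose(amp_a, amp_b))  # do the amplitude spectra match?
print(np.allclose(y_a, y_b))      # do the waveforms match?
```

The amplitude spectra agree while the waveforms do not, which is why the phase has to be recorded as well.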

We can look at the phase using something called a phase spectrum.

First pick a baseline; here I take time 0 as the baseline.

Then, for each sine wave, find the time of its first maximum after the baseline and subtract the baseline time to get the time difference. Divide that by the period and multiply by $2\pi$:

$$\text{phase difference} = \frac{\text{time difference}}{\text{period}} \times 2\pi$$

So we can say that the time domain is the view from the front, the frequency domain is the view from the side, and the phase spectrum is the view from below.

  • time difference

  • phase difference

We can see that the phase difference is the same for every component, because every term in our series has an initial phase of 0.
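The phase-difference formula can be checked numerically for the three components $\sin x$, $\sin 3x$ and $\sin 5x$. This is a small sketch; the helper name `phase_difference` is hypothetical and chosen only for this example.

```python
import numpy as np

def phase_difference(time_difference, period):
    """Phase difference = time difference / period * 2*pi."""
    return time_difference / period * 2 * np.pi

# For sin(k*x): the period is 2*pi/k and the first peak after x = 0 is at pi/(2*k)
for k in (1, 3, 5):
    period = 2 * np.pi / k
    first_peak = np.pi / (2 * k)
    print(k, phase_difference(first_peak, period))
```

Measured from the baseline to the first peak, every component comes out at $\pi/2$, so the bars in the phase plot all have the same length.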

$$y = A\sin(\omega x + \varphi)$$ where $\varphi$ is the initial phase.

Well, that's how a Fourier series describes a periodic function. That's really all there is to it.

The Fourier transform goes the other way: it takes the result (y) of a Fourier series and splits it back into its many sine wave components.
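As a rough illustration of that decomposition, the sketch below builds the three-term sum and recovers the amplitude of each component with `numpy.fft.rfft`. The sample count `N` and the `2/N` scaling convention are choices made for this example.

```python
import numpy as np

# Sample exactly one period of the three-term series (N is an illustrative choice)
N = 1024
x = np.arange(N) * 2 * np.pi / N
y = ((4 / np.pi) * np.sin(x)
     + (4 / (3 * np.pi)) * np.sin(3 * x)
     + (4 / (5 * np.pi)) * np.sin(5 * x))

# Real-input FFT; scaling by 2/N turns bin magnitudes into sine amplitudes
amplitudes = 2 * np.abs(np.fft.rfft(y)) / N

# Compare the recovered amplitude at k = 1, 3, 5 with the coefficient 4/(k*pi)
for k in (1, 3, 5):
    print(k, amplitudes[k], 4 / (k * np.pi))
```

The recovered amplitudes at 1, 3 and 5 match the coefficients of the series, and all other bins are essentially zero.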

The code for all the figures above is as follows:

#%%Graphical Fourier series
import matplotlib.pyplot as plt
import numpy as np


x = np.linspace(0.1, 10, 100)
# Target square wave: +1 where sin(x) > 0, -1 where sin(x) < 0
y = np.sign(np.sin(x))
y1 = 4/np.pi*np.sin(x)
y2 = 4/(3*np.pi)*np.sin(3*x)
y3 = y1 + y2
y4 = 4/(5*np.pi)*np.sin(5*x)
y5 = y3 + y4
y6 = 4/(7*np.pi)*np.sin(7*x)
y7 = y5 + y6
plt.rcParams['axes.facecolor']='black'
plt.figure(figsize=(10,6))
plt.plot(x, y1)
plt.plot(x, y2, 'g')
plt.plot(x, y3, 'r')
plt.plot(x, y5, 'y')
plt.plot(x, y7, 'gray')
plt.show()

# Sum of the first 50 terms (odd harmonics 1, 3, ..., 99), kept in a separate
# array so that y1 stays the fundamental for the plots further down
y50 = y1.copy()
for i in range(3, 100, 2):
    y50 += 4/(i*np.pi)*np.sin(i*x)

# Square wave (gray) against its 50-term approximation (yellow)
plt.figure(figsize=(10,6))
plt.plot(x, y, 'gray')
plt.plot(x, y50, 'y')
plt.show()
#%%Time domain and frequency domain
plt.rcParams['axes.facecolor']='snow'
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.view_init(elev=30, azim=335)


# Each component is drawn at its frequency (1, 3, 5); the sum is drawn at 0
ax.plot(x, [1]*len(x), y1, 'red')
ax.plot(x, [3]*len(x), y2, 'green')
ax.plot(x, [5]*len(x), y4, 'cyan')
ax.plot(x, [0]*len(x), y5, 'black')
ax.set_xlabel('time domain')
ax.set_ylabel('frequency domain')
ax.set_zlabel('amplitude')
plt.show()
#%%Time difference
plt.rcParams['axes.facecolor']='snow'
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.view_init(elev=45, azim=330)

# Time from the baseline (x = 0) to the first peak of each component:
# sin(x) peaks at pi/2, sin(3x) at pi/6, sin(5x) at pi/10
x1 = [i for i in x if i <= np.pi/2]
x2 = [i for i in x if i <= np.pi/6]
x4 = [i for i in x if i <= np.pi/10]

# Blue line: the baseline; colored bars: the time differences
ax.plot([0]*6, range(0, 6), 0, 'blue')
ax.plot(x1, [1]*len(x1), 0, 'red')
ax.plot(x2, [3]*len(x2), 0, 'green')
ax.plot(x4, [5]*len(x4), 0, 'cyan')
# ax.plot(x, [0]*len(x), y5, 'black')
ax.set_xlabel('time domain')
ax.set_xlim([0, 2])
ax.set_ylabel('frequency domain')
ax.set_ylim([0, 5.5])
ax.set_zlabel('amplitude')
ax.set_zlim([0, 1])
plt.show()

#%%Phase spectrum
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.view_init(elev=45, azim=330)

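ax.plot([0]*6, range(0, 6), 0, 'blue')
# Phase = time difference / period * 2*pi, which is pi/2 for every component,
# so all three bars have the same length as x1
ax.plot(x1, [1]*len(x1), 0, 'red')
ax.plot(x1, [3]*len(x1), 0, 'green')
ax.plot(x1, [5]*len(x1), 0, 'cyan')
# ax.plot(x, [0]*len(x), y5, 'black')
ax.set_xlabel('phase')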
ax.set_xlim([0, 2])
ax.set_ylabel('frequency domain')
ax.set_ylim([0, 5.5])
ax.set_zlabel('amplitude')
ax.set_zlim([0, 1])
plt.show()

Fourier transform of "Unconditional"

Python's SciPy library has a function called fft, short for fast Fourier transform. Since we aren't electronics and information engineering majors, it's enough to just use it; don't get tangled up in the algorithm. Euler's formula and imaginary numbers are not easy going.

matplotlib also has a function called specgram, which draws a spectrogram: it shows the time domain and the frequency domain at the same time.

Only a 30-second clip is used here; a longer clip would look much the same, so this is enough as an example.

The audio can be cut with Adobe Audition or, of course, with Python.

The pydub library has a class called AudioSegment that supports 'mp3', 'wav', 'raw', 'ogg', and any other format that ffmpeg handles.

Enough talk, on to the code!

  • Audio clip code
#%%Music editing
from pydub import AudioSegment

# 1 second = 1000 milliseconds
SECOND = 1000
# Import music
song = AudioSegment.from_wav("Unconditional-Eason Chan.wav")

# Take the segment between 33 seconds and 63 seconds
song = song[33*SECOND:63*SECOND]

# # Boost the first 10 seconds by 6 dB and reduce the last 5 seconds by 5 dB
# ten_seconds = 10 * SECOND
# last_five_seconds = -5 * SECOND
# beginning = song[:ten_seconds] + 6
# ending = song[last_five_seconds:] - 5

# # Form new fragments
# new_song = beginning + song[ten_seconds:last_five_seconds] + ending
new_song = song
# Export music
new_song.export('Unconditional-Eason Chan(30 second).wav', format='wav')

  • fft Fourier transform code
#%%Fourier transform of "Unconditional"
from scipy.fft import fft  # fast Fourier transform
from scipy.io import wavfile
from matplotlib.pyplot import specgram
import matplotlib.pyplot as plt

path = r'C:\Users\Panden\Documents\python Full series\artificial intelligence\Unconditional-Eason Chan(30 second).wav'
(sample_rate,x) = wavfile.read(path)
# sample_rate is the number of samples per second. x has shape (1395072, 2): two columns
# mean the song is stereo. x.shape[0]/sample_rate gives the duration of the song in seconds
print(sample_rate,x.shape)



def plotSpec(file_name):
    plt.subplot(1,2,1)
    sample_rate,x = wavfile.read(file_name)
    x = x[:,0] # Keep only the first channel of the stereo signal
    specgram(x,Fs = sample_rate,xextent = (0,30))
    plt.xlabel('time')
    plt.ylabel('frequency')
    plt.grid(True,linestyle='-',color = '0.25')
    plt.title('Spectrogram--Unconditional-Eason Chan(30 second)')
    plt.subplot(1,2,2)
    plt.xlabel('frequency')
    plt.xlim(0, 4000)
    plt.ylabel('amplitude')
    plt.title('FFT of Unconditional-Eason Chan(30 second)')
    # n=sample_rate transforms only the first second, so bin k corresponds to k Hz;
    # abs() takes the magnitude of the complex result
    plt.plot(abs(fft(x,n=sample_rate)))
    plt.show()

    
plt.figure(figsize = (18,9),dpi = 80, facecolor = 'w', edgecolor = 'k')
plotSpec(path)

The results are as follows:
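A note on the second argument in the `fft(x, sample_rate)` call above: it sets the transform length, so only the first second of audio is transformed and the bins end up 1 Hz apart. The sketch below checks this on a synthetic tone. It uses `numpy.fft.fft` rather than SciPy's fft, and the 8000 Hz sample rate and 440 Hz tone are illustrative assumptions.

```python
import numpy as np

sample_rate = 8000                              # illustrative sample rate in Hz
t = np.arange(2 * sample_rate) / sample_rate    # two seconds of sample times
tone = np.sin(2 * np.pi * 440 * t)              # a 440 Hz sine wave

# n=sample_rate keeps only the first second, giving a bin spacing of exactly 1 Hz
spectrum = np.abs(np.fft.fft(tone, n=sample_rate))

# Look only at the first half; for real input the second half mirrors it
peak_bin = int(np.argmax(spectrum[: sample_rate // 2]))
print(peak_bin)
```

The peak lands in bin 440, which is why the horizontal axis of the FFT plot can be read directly in Hz.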

Music classification project

After all that groundwork, we finally get to the hands-on part.

  • Objective: classify a piece of input music into one of these categories:

$$\begin{bmatrix} \text{classical} & \text{jazz} & \text{pop} & \text{blues} & \text{country}\\ \text{metal} & \text{rock} & \text{hiphop} & \text{disco} & \text{reggae} \end{bmatrix}$$

  • Data: 100 songs per category

First, look at the songs in the time domain

Obviously we can't make much out of the plots above; they are just a jumble of sound waves.

import os
from scipy.fft import fft  # fast Fourier transform
from scipy.io import wavfile
from matplotlib.pyplot import specgram
import matplotlib.pyplot as plt

GENRE_DIR = r'C:\Users\Panden\Documents\python Full series\artificial intelligence\genres'

(sample_rate,x) = wavfile.read(os.path.join(GENRE_DIR, 'blues', 'converted', 'blues.00000.au.wav'))
# sample_rate is the number of samples per second. x has shape (661794,): a one-element
# tuple means the song is mono. x.shape[0]/sample_rate gives the duration in seconds
print(sample_rate,x.shape)

# plt.figure(figsize = (10,4),dpi = 80)
# plt.xlabel('time')
# plt.ylabel('frequency')
# plt.grid(True,linestyle='-',color = '0.25')
# specgram(x,Fs = sample_rate,xextent = (0,30))
# plt.show()

def plotSpec(g,n):
    # Build <GENRE_DIR>\<genre>\converted\<genre>.<number>.au.wav
    file_name = os.path.join(GENRE_DIR, g, 'converted', g + '.' + n + '.au.wav')
    sample_rate,x = wavfile.read(file_name)
    specgram(x,Fs = sample_rate,xextent = (0,30))
    plt.xlabel('time')
    plt.ylabel('frequency')
    plt.grid(True,linestyle='-',color = '0.25')
    plt.title(g+'_'+n[-1])
    
plt.figure(figsize = (18,9),dpi = 80, facecolor = 'w', edgecolor = 'k')
# 6 genres x 3 clips each, laid out on a 6x3 grid (one genre per row)
genres = ['classical','jazz','pop','rock','country','metal']
clips = ['00001','00002','00003']
for row, g in enumerate(genres):
    for col, n in enumerate(clips):
        plt.subplot(6,3,row*3 + col + 1)
        plotSpec(g,n)

plt.tight_layout(pad = 0.4,w_pad = 0,h_pad = 1)
plt.show()

Look at the songs in the frequency domain

The frequency-domain plots do differ from genre to genre, but telling them apart by eye is still too hard for us.

#%%Graphical Fourier transform
import os
from scipy.fft import fft

def plotFFT(g,n):
    file_name = os.path.join(r'C:\Users\Panden\Documents\python Full series\artificial intelligence\genres', g, 'converted', g + '.' + n + '.au.wav')
    sample_rate,x = wavfile.read(file_name)
    # n=sample_rate transforms the first second only (bin k = k Hz); abs() gives the magnitude
    plt.plot(abs(fft(x,n=sample_rate)))
    plt.xlabel('frequency')
    plt.xlim(0,3000)   # Only show the low-frequency range
    plt.ylabel('amplitude')
    plt.title(g+'_'+n[-1])

plt.figure(num = None,figsize = (10,8),dpi = 80,facecolor = 'w',edgecolor = 'k')
plt.subplot(6,2,1);plotSpec('classical','00001')
plt.subplot(6,2,2);plotFFT('classical','00001')
plt.subplot(6,2,3);plotSpec('jazz','00001')
plt.subplot(6,2,4);plotFFT('jazz','00001')
plt.subplot(6,2,5);plotSpec('country','00001')
plt.subplot(6,2,6);plotFFT('country','00001')
plt.subplot(6,2,7);plotSpec('pop','00001')
plt.subplot(6,2,8);plotFFT('pop','00001')
plt.subplot(6,2,9);plotSpec('rock','00001')
plt.subplot(6,2,10);plotFFT('rock','00001')
plt.subplot(6,2,11);plotSpec('metal','00001')
plt.subplot(6,2,12);plotFFT('metal','00001')
plt.show()

Do Fourier transform and extract features

Fourier transform all the audio and save the results for later use.

#Feature extraction
import os
import numpy as np
from scipy.fft import fft
from scipy.io import wavfile

BASE_DIR = r'C:\Users\Panden\Documents\python Full series\artificial intelligence'

def create_fft(g,n):
    file_name = os.path.join(BASE_DIR, 'genres', g, 'converted', g + '.' + str(n).zfill(5) + '.au.wav')
    sample_rate,x = wavfile.read(file_name)
    # Keep only the magnitudes of the first 1000 components; more data is not needed and mostly adds noise
    fft_features = abs(fft(x)[:1000])
    # np.save appends '.npy', so the file is stored as <genre>.<number>.fft.npy
    save_path = os.path.join(BASE_DIR, 'trainset', g + '.' + str(n).zfill(5) + '.fft')
    np.save(save_path,fft_features)

genre_list = ['classical','jazz','country','pop','rock','metal']
for g in genre_list:
    for n in range(100):
        create_fft(g,n)
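It can help to know which frequencies those first 1000 components cover. For an FFT over the whole clip, bin k sits at k × sample_rate / n_samples Hz. The sketch below works this out for a clip of 661794 samples (the shape reported for blues.00000 earlier). The 22050 Hz sample rate is an assumption, since these notes do not print it, and the helper name `fft_bin_frequency` is hypothetical.

```python
def fft_bin_frequency(k, sample_rate, n_samples):
    """Frequency in Hz of FFT bin k for a signal of n_samples samples."""
    return k * sample_rate / n_samples

# Assumed illustrative values: a mono clip of 661794 samples at 22050 Hz
sample_rate = 22050
n_samples = 661794

duration = n_samples / sample_rate
bin_spacing = fft_bin_frequency(1, sample_rate, n_samples)
last_kept = fft_bin_frequency(999, sample_rate, n_samples)

print(duration)      # clip duration in seconds
print(bin_spacing)   # spacing between neighbouring bins in Hz
print(last_kept)     # frequency of the last of the 1000 kept bins
```

Under that assumption the clip lasts about 30 seconds, the bins are about 1/30 Hz apart, and the 1000 kept features span roughly 0 to 33 Hz. Keeping more bins, or transforming a shorter window, would be one way to cover a wider band if that turns out to matter.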

Convert features to x and labels to y

  • Use the Fourier transform features as x (after some simple screening), then encode the categories as numbers and use them as y;

  • Split the data into a training set and a test set;

  • Train the model and save it so it is easy to reuse;

#%%Read the data set after Fourier transform and convert it into x and y required by machine learning
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pickle
genre_list = ['classical','jazz','country','pop','rock','metal']
x = []
y = []
for g in genre_list:
    for n in range(100):
        file_name = r'C:\Users\Panden\Documents\python Full series\artificial intelligence\trainset' + '\\' + g +  '.' + str(n).zfill(5) + '.fft.npy'
        fft_features = np.load(file_name)
        x.append(fft_features)
        y.append(genre_list.index(g))
            
x = np.array(x)
y = np.array(y)

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.4,random_state=5)

#Train the model and save the model
# multinomial is best used together with the sag solver; at the time of writing the
# default solver raised an error with it. Note: scikit-learn 1.5 and later deprecate
# the multi_class argument
model = LogisticRegression(multi_class='multinomial',solver = 'sag',max_iter=1000)
model.fit(x_train,y_train)

# Save the model to model.pkl ('wb' = write in binary mode)
with open('model.pkl','wb') as output:
    pickle.dump(model,output)

Read the model and predict

import pickle
from pprint import pprint 
import numpy as np
from sklearn.metrics import confusion_matrix

with open('model.pkl','rb') as pkl_file:
    model_loaded = pickle.load(pkl_file)
pprint(model_loaded)

temp = model_loaded.predict(x_test)
cm = confusion_matrix(y_test,temp,labels = list(range(len(genre_list))))
print(cm)
# Accuracy = correct predictions (the diagonal) / number of test samples.
# With 600 songs and test_size=0.4 the test set has 240 samples, not 180
print(np.trace(cm)/len(y_test))
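To make the accuracy calculation concrete, here is a small sketch with a made-up 3-class confusion matrix. The numbers are hypothetical and are not results from this model.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = np.array([
    [8, 1, 1],
    [2, 7, 1],
    [0, 3, 7],
])

correct = np.trace(cm)   # predictions on the diagonal are correct
total = cm.sum()         # every test sample appears exactly once in the matrix
accuracy = correct / total
print(correct, total, accuracy)
```

Dividing the trace by the total number of test samples (here 22 out of 30) gives the accuracy; dividing by any other number gives a figure that is not a proportion of the test set.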

However, this result is not quite what we are after: within a single dataset, the songs are either very typical of their genre or share a similar tonality, so the predictions look better than they otherwise would. So let's find a few other songs and take a look.

Find some songs to try

Beethoven's Moonlight

Same as before: trim the clip first, then run it through the code above.

The code is as follows:

#%%Test the model on music downloaded from the Internet
from scipy.fft import fft

print('Starting read wavfile...')

music_name = 'Moonlight song(30 second).wav'
sample_rate,x = wavfile.read(music_name)

print(x.shape)
# Convert stereo to mono by keeping the first channel
# (reshaping to a single row would interleave the two channels instead)
if x.ndim > 1:
    x = x[:,0]

test_fft_features = abs(fft(x)[:1000])

temp = model_loaded.predict([test_fft_features])
print(genre_list[int(temp[0])])

Look at the result. Ooh!

Let's do one more: "Unconditional", to come full circle.

#%%Test the model on music downloaded from the Internet
from scipy.fft import fft

print('Starting read wavfile...')
music_name = 'Unconditional-Eason Chan(30 second).wav'

sample_rate,x = wavfile.read(music_name)

print(x.shape)
# Convert stereo to mono by keeping the first channel
# (reshaping to a single row would interleave the two channels instead)
if x.ndim > 1:
    x = x[:,0]

test_fft_features = abs(fft(x)[:1000])

temp = model_loaded.predict([test_fft_features])
print(genre_list[int(temp[0])])

Look at the result. Ooh!
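A note on turning a stereo array into mono, which both test snippets above need to do before the FFT. The sketch below uses a tiny made-up stereo array to compare three options: `np.reshape(x, (1, -1))[0]`, taking the first channel, and averaging the channels.

```python
import numpy as np

# A tiny made-up stereo signal: 4 samples, columns = (left, right)
stereo = np.array([[1, 10],
                   [2, 20],
                   [3, 30],
                   [4, 40]])

interleaved = np.reshape(stereo, (1, -1))[0]   # L, R, L, R, ... and twice as long
first_channel = stereo[:, 0]                   # left channel only
mixed = stereo.mean(axis=1)                    # average of both channels

print(interleaved)
print(first_channel)
print(mixed)
```

Reshaping to one row keeps every sample from both channels in alternating order, so the result is twice as long and is not a mono version of the song. Taking one channel or averaging the two both give a signal of the original length.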

And so the hands-on multi-class music classification project ends on a cheerful note. On to the next chapter! pd's Machine Learning

Posted on Sun, 24 Oct 2021 08:39:00 -0400 by eskimo42