Python Music Arranging in Practice (5): Writing a Crawler to Collect MIDI Files at Scale and Prepare a Dataset (with Baidu Cloud Download Link)

Preface

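  1. My graduation project is a music style transfer system built on CycleGAN, which needs a large number of music files to train the neural network. MIDI is the most widely used storage format for computer music arrangement, so files are easy to collect, and there are many mature Python libraries for processing them.
  2. The dataset most commonly used in this field today is The Lakh MIDI Dataset. It contains more than 100,000 MIDI files, which is plenty in terms of quantity, but in my own tests its quality fell short of my expectations. The main reasons are that a single song often maps to several MIDI files, and that the metadata is far sparser than the data itself. These two issues made the dataset awkward to use and prompted me to build my own. Below is how I crawled data from the Free Midi Files Download site and built the dataset. It involves basic web-scraping knowledge and simple use of PyMongo, the Python API for MongoDB.
  3. If you have questions about using Session and Cookies in a crawler, this article may also serve as a reference. If the crawler itself is of no interest, feel free to scroll to the resource link at the bottom and download the dataset from Baidu Netdisk. Please cite this article when you use it. I hope it helps!
  4. The source code for the process described below is at josephding23/Free-Midi-Library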

Implementation

Writing the crawler

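The crawler code is in Free-Midi-Library/src/midi_scratch.py
While searching for MIDI resources I browsed many sites. Free Midi Files Download was the best of them in both the number of files and the way they are organized. It covers 17 music genres, and the rock genre alone has 9866 MIDI files. Because the crawl is structured, the metadata of every file (genre, performer, song title) is stored completely in MongoDB, which is convenient for later training and testing.
The crawl is divided into three stages, plus one intermediate step that only prepares the database:

  1. Crawl the genre pages to get each genre's performer names and links, and add them to the database. The function below does this and is fairly simple: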
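def free_midi_get_genres():
    # Store one document per genre, holding its performers' names and relative URLs.
    genres_collection = get_genre_collection()
    for genre in get_genres():
        # Skip genres that are already stored, so the function can be re-run safely.
        if genres_collection.count_documents({'Name': genre}) != 0:
            continue
        url = 'https://freemidi.org/genre-' + genre
        text = get_html_text(url)
        soup = BeautifulSoup(text, 'html.parser')
        urls = []
        performers = []
        for item in soup.find_all(name='div', attrs={'class': 'genre-link-text'}):
            try:
                href = item.a['href']
                name = item.text
                urls.append(href)
                performers.append(name)
            except (TypeError, KeyError):
                # Entry without a usable link; skip it.
                pass
        # Field names match the ones read by free_midi_get_performers().
        genres_collection.insert_one({
            'Name': genre,
            'PerformersNum': len(urls),
            'Performers': performers,
            'PerformersUrls': urls,
            'Finished': False
        })
        print(genre, len(urls))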
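
The helper get_html_text used above is not shown in the article. The following is a minimal sketch of what such a helper might look like, using only the standard library. The signature, the 20-second timeout and the choice of urllib are assumptions, and the author's version may well use requests instead. It returns an empty string on failure, which matches the `text == ''` check in the performer-page step.

```python
import urllib.request


def get_html_text(url, headers=None, timeout=20):
    """Hypothetical helper: fetch a page and return its text, or '' on failure."""
    try:
        request = urllib.request.Request(url, headers=headers or {})
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Fall back to UTF-8 when no charset is declared.
            charset = response.headers.get_content_charset() or 'utf-8'
        return raw.decode(charset, errors='replace')
    except (OSError, ValueError, LookupError):
        # URLError and timeouts are OSError subclasses, a malformed URL
        # raises ValueError, and an unknown charset raises LookupError.
        return ''
```

Returning '' rather than raising keeps the calling loops simple: they can log the failure and move on to the next page.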

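  2. Build the performer collection and add every performer's information to it. This step does no crawling; it prepares for the crawling that follows: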
def free_midi_get_performers():
    root_url = 'https://freemidi.org/'
    genres_collection = get_genre_collection()
    performers_collection = get_performer_collection()
    for genre in genres_collection.find({'Finished': False}):
        genre_name = genre['Name']
        performers = genre['Performers']
        performer_urls = genre['PerformersUrls']
        num = genre['PerformersNum']
        for index in range(num):
            name = performers[index]
            url = root_url + performer_urls[index]
            print(name, url)

            performers_collection.insert_one({
                'Name': name,
                'Url': url,
                'Genre': genre_name,
                'Finished': False
            })
        genres_collection.update_one(
            {'_id': genre['_id']},
            {'$set': {'Finished': True}})
        print('Progress: {:.2%}\n'.format(genres_collection.count_documents({'Finished': True}) / genres_collection.count_documents({})))
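  3. Crawl each performer's page and add all of that performer's songs to a new MIDI collection. Note that this step has to set the request headers to get past the site's anti-crawler mechanism, and the key part is the Cookie. I will not go into the details here; if anything is unclear, leave a comment below.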
def get_free_midi_songs_and_add_performers_info():
    root_url = 'https://freemidi.org/'
    midi_collection = get_midi_collection()
    performer_collection = get_performer_collection()
    while performer_collection.count_documents({'Finished': False}) != 0:
        for performer in performer_collection.find({'Finished': False}):
            num = 0
            performer_url = performer['Url']
            performer_name = performer['Name']
            genre = performer['Genre']
            try:
                params = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
                    'Cookie': cookie_str,
                    'Referer': root_url + genre,
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                    'Accept-Encoding': 'gzip, deflate, br',
                    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                    'Connection': 'keep-alive'
                }
                text = get_html_text(performer_url, params)
                if text == '':
                    print('connection error')
                    continue
                soup = BeautifulSoup(text, 'html.parser')
                # print(soup)
                for item in soup.find_all(name='div', attrs={'itemprop': 'tracks'}):
                    try:
                        download_url = root_url + item.span.a['href']
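                        # Strip newlines first so the duplicate check uses the same name that gets stored.
                        name = item.span.text.replace('\n', '')
                        if midi_collection.count_documents({'Genre': genre, 'Name': name}) == 0: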
                            midi_collection.insert_one({
                                'Name': name.replace('\n', ''),
                                'DownloadPage': download_url,
                                'Performer': performer_name,
                                'PerformerUrl': performer_url,
                                'Genre': genre,
                                'Downloaded': False
                            })
                        num = num + 1
                    except:
                        pass
                if num != 0:
                    performer_collection.update_one(
                        {'_id': performer['_id']},
                        {'$set': {'Finished': True, 'Num': num}}
                    )
                    time.sleep(uniform(1, 1.6))
                    print('Performer ' + performer_name + ' finished.')
                    print('Progress: {:.2%}\n'.format(performer_collection.count_documents({'Finished': True}) / performer_collection.count_documents({})))
            except:
                print('Error connecting.')
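
The same browser-like header dictionary is rebuilt in several places in these functions. As a sketch of how it could be factored out, the hypothetical helper below (build_headers and DEFAULT_USER_AGENT are illustrative names, not part of the author's code) assembles the headers and adds the Cookie only when one is supplied. Accept-Encoding is deliberately left out here, on the assumption that the HTTP library should advertise only the encodings it can actually decode.

```python
DEFAULT_USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
)


def build_headers(referer, cookie_str=None, user_agent=DEFAULT_USER_AGENT):
    """Hypothetical helper: assemble browser-like request headers."""
    headers = {
        'User-Agent': user_agent,
        'Referer': referer,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
    }
    if cookie_str:
        # Cookie string copied from a browser session, e.g. 'a=1; b=2'.
        headers['Cookie'] = cookie_str
    return headers
```

A call such as build_headers('https://freemidi.org/genre-rock', cookie_str) would then replace the inline dictionary in the loop.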
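  4. The final step is downloading the MIDI files. It is the most complicated step, and the one where I fought the site's anti-crawler mechanism the hardest. The site does not expose a direct download link for each MIDI file, only a getter link, so a GET request has to be sent to the getter link and the file is taken from the response after any redirects. Setting cookies through the request headers is no longer enough here, so I used a Session to keep the cookies consistent across requests. This imitates a real visitor more closely and makes the crawler harder to detect. The method is not foolproof: errors often repeat over several loop iterations before downloads start, and a few items that could not be downloaded had to be removed from the database.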
def download_free_midi():
    root_url = 'https://freemidi.org/'
    root_path = 'E:/free_MIDI'
    cookie_path = './cookies.txt'
    params = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
        # 'Cookie': cookie,
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive'
    }
    midi_collection = get_midi_collection()

    session = requests.Session()
    requests.packages.urllib3.disable_warnings()
    session.headers.update(params)
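    # NOTE: `cookies` is not defined in this excerpt; it is assumed to be a cookie jar prepared elsewhere in the script.
    session.cookies = cookies
    while midi_collection.count_documents({'Downloaded': False}) != 0: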
        for midi in midi_collection.find({'Downloaded': False}, no_cursor_timeout = True):
            performer_link = midi['PerformerUrl']
            download_link = midi['DownloadPage']
            name = midi['Name']
            genre = midi['Genre']
            performer = midi['Performer']
            try:
                # Only the Referer changes per request; the other headers were already set on the session.
                session.headers.update({'Referer': performer_link})
                r = session.get(download_link, verify=False, timeout=20)
                # r.encoding = 'utf-8'
                if r.cookies.get_dict():
                    print(r.cookies.get_dict())
                    # Merge new cookies into the session jar instead of replacing the whole jar.
                    session.cookies.update(r.cookies)
                if r.status_code != 200:
                    print('connection error ' + str(r.status_code))
                soup = BeautifulSoup(r.text, 'html.parser')
                r.close()
                try:
                    getter_link = root_url + soup.find(name='a', attrs={'id': 'downloadmidi'})['href']
                    print(getter_link)
                    download_header = {
                        # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
                        'Accept-Encoding': 'gzip, deflate, br',
                        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
                        'Referer': download_link,
                        # 'Cookie': cookie_str,
                        'Sec-Fetch-Mode': 'navigate',
                        'Sec-Fetch-Site': 'same-origin',
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
                    }
                    session.headers.update(download_header)
                    dir = root_path + '/' + genre
                    if not os.path.exists(dir):
                        os.mkdir(dir)
                    rstr = r'[\\/:*?"<>|\r\n\t]+'  # '/ \ : * ? " < > |'
                    name = re.sub(rstr, '', name).strip()
                    performer = re.sub(rstr, '', performer).strip()
                    file_name = name + ' - ' + performer + '.mid'
                    path = dir + '/' +  file_name
                    try:
                        with open(path, 'wb') as output:
                            with session.get(getter_link, allow_redirects=True, verify=False, timeout=20) as r:
                                if r.history:
                                    print('Request was redirected')
                                    for resp in r.history:
                                        print(resp.url)
                                    print('Final: ' + str(r.url))
                                r.raise_for_status()
                                if r.cookies.get_dict():
                                    print(r.cookies)
                                    session.cookies.update(r.cookies)
                                output.write(r.content)
                        time.sleep(uniform(2, 3))
                        # cookie_opener.open(getter_link)
                        # cj.save(cookie_path, ignore_discard=True)
                        if is_valid_midi(path):
                            print(file_name + ' downloaded')
                            midi_collection.update_one(
                                {'_id': midi['_id']},
                                {'$set': {'Downloaded': True, 'GetterLink': getter_link}}
                            )
                            print('Progress: {:.2%}\n'.format(midi_collection.count_documents({'Downloaded': True}) / midi_collection.count_documents({})))
                        else:
                            print('Cannot successfully download midi.')
                            os.remove(path)
                    except:
                        print(traceback.format_exc())
                except:
                    print('Found no download link')
            except:
                print(traceback.format_exc())
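
The download function calls is_valid_midi, whose body is not shown in the article. One plausible way to implement such a check, given here only as a sketch, is to verify that the file starts with a Standard MIDI File header chunk. The author's actual check may differ, for example by trying to parse the file with a MIDI library. This catches the common failure mode where an HTML error page is saved under a .mid name.

```python
import struct


def is_valid_midi(path):
    """Hypothetical check: True if the file starts with a Standard MIDI File header."""
    try:
        with open(path, 'rb') as f:
            header = f.read(14)
    except OSError:
        return False
    if len(header) < 14 or header[:4] != b'MThd':
        return False
    # After 'MThd' come a 4-byte big-endian chunk length (always 6),
    # then 2 bytes each for format, track count and time division.
    length, midi_format, n_tracks, _division = struct.unpack('>IHHH', header[4:14])
    return length == 6 and midi_format in (0, 1, 2) and n_tracks > 0
```

A header check is cheap but shallow: a file can pass it and still contain damaged track data, so a full parse remains the stricter test.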

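Hashing the file names

The downloaded MIDI folder is structured as follows, with each subfolder representing one genre:
[Figure: midifiles]
Each subfolder contains all the MIDI files of that genre:
[Figure: midi_files]
To make processing easier, I hashed every file name with MD5 and stored the resulting hash in the database, so a file can be looked up with a simple find query. The hashing code is in Free-Midi-Library/src/md5_reorganize.py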
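
The renaming script itself is not reproduced in the article. As a rough sketch of the idea, assuming the hash is taken over the UTF-8 encoded original file name (the function names below are illustrative), the new name can be derived like this. The later processing functions read the hash from an `md5` field of each MIDI document, so the digest would be stored under that key.

```python
import hashlib
import os


def md5_of_name(file_name):
    """Return the MD5 hex digest of a file name (UTF-8 encoded)."""
    return hashlib.md5(file_name.encode('utf-8')).hexdigest()


def hashed_path(genre_dir, file_name):
    """Build the new path for a file: <genre_dir>/<md5 of original name>.mid"""
    return os.path.join(genre_dir, md5_of_name(file_name) + '.mid')
```

Hashed names avoid problems with spaces, punctuation and non-ASCII characters in song titles, at the cost of needing the database to map a file back to its title.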

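Unifying tempo and key

To improve training, I set the tempo of every MIDI file to 120 BPM and transposed it to the key of C. The code for these two operations is in src/unify_tempo.py and src/transpose_tone.py.

  1. Transposing to C
    Key function: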
def transpose_to_c():
    root_dir = 'E:/free_midi_library/'
    transpose_root_dir = 'E:/transposed_midi/'
    midi_collection = get_midi_collection()
    for midi in midi_collection.find({'Transposed': False}, no_cursor_timeout = True):
        original_path = os.path.join(root_dir, midi['Genre'] + '/', midi['md5'] + '.mid')

        if not os.path.exists(os.path.join(transpose_root_dir, midi['Genre'])):
            os.mkdir(os.path.join(transpose_root_dir, midi['Genre']))

        transposed_path = os.path.join(transpose_root_dir, midi['Genre'] + '/', midi['md5'] + '.mid')
        try:
            original_stream = converter.parse(original_path)

            estimate_key = original_stream.analyze('key')

            estimate_tone, estimate_mode = (estimate_key.tonic, estimate_key.mode)

            c_key = key.Key('C', 'major')
            c_tone, c_mode = (c_key.tonic, c_key.mode)
            margin = interval.Interval(estimate_tone, c_tone)
            semitones = margin.semitones

            mid = pretty_midi.PrettyMIDI(original_path)
            for instr in mid.instruments:
                if not instr.is_drum:
                    for note in instr.notes:
                        # Valid MIDI pitches are 0..127 inclusive.
                        if 0 <= note.pitch + semitones < 128:
                            note.pitch += semitones

            mid.write(transposed_path)
            midi_collection.update_one({'_id': midi['_id']}, {'$set': {'Transposed': True}})
            print('Progress: {:.2%}\n'.format(midi_collection.count_documents({'Transposed': True}) / midi_collection.count_documents({})))
        except:
            print(traceback.format_exc())

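This function parses the MIDI file with music21's converter, estimates its key with the stream's key analysis (analyze('key')), and then shifts every non-drum note by the interval between the estimated tonic and C, so the piece ends up in C major or C minor.
Example
Before transposition:
[Figure: 1]
After transposition:
[Figure: 2]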
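
The arithmetic behind the shift can be illustrated without any MIDI library. The sketch below is not the author's code: it works on pitch classes (0 = C up to 11 = B) and picks the shift of smallest magnitude, which is one possible refinement that keeps notes closer to their original register. Leaving a note unchanged when the shifted pitch would fall outside 0..127 mirrors the range guard in the function above.

```python
def smallest_shift_to_c(tonic_pitch_class):
    """Semitone shift of smallest magnitude that moves a tonic (0=C ... 11=B) onto C."""
    shift = (-tonic_pitch_class) % 12   # upward shift in 0..11
    if shift > 6:
        shift -= 12                     # going down is closer
    return shift


def transpose_pitches(pitches, semitones):
    """Shift MIDI pitches, leaving a pitch unchanged if the result would leave 0..127."""
    return [p + semitones if 0 <= p + semitones <= 127 else p for p in pitches]
```

For a piece estimated to be in G (pitch class 7), smallest_shift_to_c returns 5, so every melodic note moves up a perfect fourth.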

  2. Unifying the tempo (BPM)
    Key function:
def tempo_unify_and_merge():

    midi_collection = get_midi_collection()
    root_dir = 'E:/transposed_midi/'
    merged_root_dir = 'E:/merged_midi/'

    for midi in midi_collection.find({'MergedAndScaled': False}, no_cursor_timeout = True):
        original_path = os.path.join(root_dir, midi['Genre'] + '/', midi['md5'] + '.mid')
        try:
            original_tempo = get_tempo(original_path)[0]
            changed_rate = original_tempo / 120

            if not os.path.exists(os.path.join(merged_root_dir, midi['Genre'])):
                os.mkdir(os.path.join(merged_root_dir, midi['Genre']))

            pm = pretty_midi.PrettyMIDI(original_path)
            for instr in pm.instruments:
                for note in instr.notes:
                    note.start *= changed_rate
                    note.end *= changed_rate

            merged_path = os.path.join(merged_root_dir, midi['Genre'] + '/', midi['md5'] + '.mid')
            merged = get_merged_from_pm(pm)
            merged.write(merged_path)

            midi_collection.update_one({'_id': midi['_id']}, {'$set': {'MergedAndScaled': True}})

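            print('Progress: {:.2%}\n'.format(midi_collection.count_documents({'MergedAndScaled': True}) / midi_collection.count_documents({})))

        except Exception:
            # Report the failure instead of silently skipping the file.
            print(traceback.format_exc())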

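This function uses the MIDI operations provided by pretty_midi: it multiplies the start and end time of every note by the ratio of the source file's BPM to 120 BPM.
Example
After tempo unification:
[Figure: 3]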
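
To see why multiplying by original BPM / 120 is the right factor, it helps to work in plain numbers. Note times here are in seconds, and a beat lasts 60 / BPM seconds, so keeping each note on the same beat at the new tempo means scaling its time by original / target. The sketch below (illustrative function names, no MIDI library involved) shows the calculation on (start, end) pairs.

```python
def seconds_to_beats(seconds, bpm):
    """Convert a time in seconds to a position in beats at the given tempo."""
    return seconds * bpm / 60.0


def scale_note_times(notes, original_bpm, target_bpm=120.0):
    """Rescale (start, end) times in seconds so each note keeps its beat position."""
    rate = original_bpm / target_bpm
    return [(start * rate, end * rate) for start, end in notes]
```

For example, a note lasting one second at 60 BPM is one beat long. After scaling it lasts 0.5 seconds, which is again one beat at 120 BPM.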

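Download link

Baidu Cloud download link, extraction code: fm8f
Contents:
[Figure: resource]

  • unhashed: MIDI files with their original names, in the form "song title - performer"
  • raw_midi: the same files with only their names replaced by MD5 hashes
  • transposed_midi: MIDI files after transposition to the key of C
  • merged_midi: transposed files with the tempo set to 120 BPM
  • meta: JSON files exported from MongoDB, one each for the genre, performers and midi collections, which can be loaded into MongoDB with the mongoimport command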
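
The exported metadata can also be inspected without a running MongoDB instance. The following sketch assumes the files are either a single JSON array or one JSON document per line (the two layouts mongoexport can produce). The path in the usage note is a guess at the archive layout, while the `Genre` field is the one used for MIDI documents in the crawler code.

```python
import json
from collections import Counter


def load_exported_collection(path):
    """Load documents from a JSON export: one JSON array, or one document per line."""
    with open(path, encoding='utf-8') as f:
        text = f.read().strip()
    if text.startswith('['):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_by_genre(documents):
    """Count documents per value of their 'Genre' field."""
    return Counter(doc.get('Genre') for doc in documents)
```

Something like count_by_genre(load_exported_collection('meta/midi.json')) would then give the number of files per genre, with the path adjusted to wherever the archive was extracted.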