Assignment | 05-week3 - Part_2 - Trigger Word Detection

This series only adds my personal study notes to the programming-assignment part of the original course. If there are any mistakes, corrections and feedback are welcome. - ZJ

Coursera course | deeplearning.ai | 網易雲課堂

CSDN: https://blog.csdn.net/JUNJUN_ZHAO/article/details/79699845


Welcome to the final programming assignment of this specialization!

In this week’s videos, you learned about applying deep learning to speech recognition. In this assignment, you will construct a speech dataset and implement an algorithm for trigger word detection (sometimes also called keyword detection, or wakeword detection). Trigger word detection is the technology that allows devices like Amazon Alexa, Google Home, Apple Siri, and Baidu DuerOS to wake up upon hearing a certain word.

For this exercise, our trigger word will be “Activate.” Every time it hears you say “activate,” it will make a “chiming” sound. By the end of this assignment, you will be able to record a clip of yourself talking, and have the algorithm trigger a chime when it detects you saying “activate.”

After completing this assignment, perhaps you can also extend it to run on your laptop so that every time you say “activate” it starts up your favorite app, or turns on a network connected lamp in your house, or triggers some other event?

In this assignment you will learn to:
- Structure a speech recognition project
- Synthesize and process audio recordings to create train/dev datasets
- Train a trigger word detection model and make predictions

Let’s get started! Run the following cell to load the packages you are going to use.

import numpy as np
from pydub import AudioSegment
import random
import sys
import io
import os
import glob
import IPython
from td_utils import *
%matplotlib inline
d:\program files\python36\lib\site-packages\pydub\utils.py:165: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
  warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
'''
td_utils.py

'''

import matplotlib.pyplot as plt
from scipy.io import wavfile
import os
from pydub import AudioSegment

# Calculate and plot spectrogram for a wav audio file
def graph_spectrogram(wav_file):
    rate, data = get_wav_info(wav_file)
    nfft = 200 # Length of each window segment
    fs = 8000 # Sampling frequencies
    noverlap = 120 # Overlap between windows
    nchannels = data.ndim
    if nchannels == 1:
        pxx, freqs, bins, im = plt.specgram(data, nfft, fs, noverlap = noverlap)
    elif nchannels == 2:
        pxx, freqs, bins, im = plt.specgram(data[:,0], nfft, fs, noverlap = noverlap)
    return pxx

# Load a wav file
def get_wav_info(wav_file):
    rate, data = wavfile.read(wav_file)
    return rate, data

# Used to standardize volume of audio clip
def match_target_amplitude(sound, target_dBFS):
    change_in_dBFS = target_dBFS - sound.dBFS
    return sound.apply_gain(change_in_dBFS)

# Load raw audio files for speech synthesis
def load_raw_audio():
    activates = []
    backgrounds = []
    negatives = []
    for filename in os.listdir("./raw_data/activates"):
        if filename.endswith("wav"):
            activate = AudioSegment.from_wav("./raw_data/activates/"+filename)
            activates.append(activate)
    for filename in os.listdir("./raw_data/backgrounds"):
        if filename.endswith("wav"):
            background = AudioSegment.from_wav("./raw_data/backgrounds/"+filename)
            backgrounds.append(background)
    for filename in os.listdir("./raw_data/negatives"):
        if filename.endswith("wav"):
            negative = AudioSegment.from_wav("./raw_data/negatives/"+filename)
            negatives.append(negative)
    return activates, negatives, backgrounds

1 - Data synthesis: Creating a speech dataset

Let’s start by building a dataset for your trigger word detection algorithm. A speech dataset should ideally be as close as possible to the application you will want to run it on. In this case, you’d like to detect the word “activate” in working environments (library, home, offices, open-spaces …). You thus need to create recordings with a mix of positive words (“activate”) and negative words (random words other than activate) on different background sounds. Let’s see how you can create such a dataset.

1.1 - Listening to the data

One of your friends is helping you out on this project, and they’ve gone to libraries, cafes, restaurants, homes and offices all around the region to record background noises, as well as snippets of audio of people saying positive/negative words. This dataset includes people speaking in a variety of accents.

In the raw_data directory, you can find a subset of the raw audio files of the positive words, negative words, and background noise. You will use these audio files to synthesize a dataset to train the model. The “activate” directory contains positive examples of people saying the word “activate”. The “negatives” directory contains negative examples of people saying random words other than “activate”. There is one word per audio recording. The “backgrounds” directory contains 10 second clips of background noise in different environments.

Run the cells below to listen to some examples.

IPython.display.Audio("./raw_data/activates/1.wav")
IPython.display.Audio("./raw_data/negatives/4.wav")
IPython.display.Audio("./raw_data/backgrounds/2.wav")
# My local copy of 1.wav is 0 bytes and cannot be played

You will use these three types of recordings (positives/negatives/backgrounds) to create a labelled dataset.

1.2 - From audio recordings to spectrograms

What really is an audio recording? A microphone records little variations in air pressure over time, and it is these little variations in air pressure that your ear also perceives as sound. You can think of an audio recording as a long list of numbers measuring the little air pressure changes detected by the microphone. We will use audio sampled at 44100 Hz (or 44100 Hertz). This means the microphone gives us 44100 numbers per second. Thus, a 10 second audio clip is represented by 441000 numbers (= 10 × 44100).

It is quite difficult to figure out from this “raw” representation of audio whether the word “activate” was said. In order to help your sequence model more easily learn to detect triggerwords, we will compute a spectrogram of the audio. The spectrogram tells us how much different frequencies are present in an audio clip at a moment in time.

(If you’ve ever taken an advanced class on signal processing or on Fourier transforms, a spectrogram is computed by sliding a window over the raw audio signal and calculating the most active frequencies in each window using a Fourier transform. If you don’t understand the previous sentence, don’t worry about it.)

Let’s see an example.

IPython.display.Audio("audio_examples/example_train.wav")
x = graph_spectrogram("audio_examples/example_train.wav")

[Plot: spectrogram of audio_examples/example_train.wav]

The graph above represents how active each frequency is (y axis) over a number of time-steps (x axis).

Figure 1: Spectrogram of an audio recording, where the color shows the degree to which different frequencies are present (loud) in the audio at different points in time. Green squares mean a certain frequency is more active or more present in the audio clip (louder); blue squares denote less active frequencies.

The dimension of the output spectrogram depends upon the hyperparameters of the spectrogram software and the length of the input. In this notebook, we will be working with 10 second audio clips as the “standard length” for our training examples. The number of timesteps of the spectrogram will be 5511. You’ll see later that the spectrogram will be the input x into the network, and so Tx = 5511.

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)
Time steps in audio recording before spectrogram (441000,)
Time steps in input after spectrogram (101, 5511)

Now, you can define:

Tx = 5511 # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram

Note that even with 10 seconds being our default training example length, 10 seconds of time can be discretized to different numbers of values. You’ve seen 441000 (raw audio) and 5511 (spectrogram). In the former case, each step represents 10/441000 ≈ 0.000023 seconds. In the second case, each step represents 10/5511 ≈ 0.0018 seconds.

For the 10sec of audio, the key values you will see in this assignment are:

  • 441000 (raw audio)
  • 5511 = Tx (spectrogram output, and dimension of input to the neural network)
  • 10000 (used by the pydub module to synthesize audio)
  • 1375 = Ty (the number of steps in the output of the GRU you’ll build)

Note that each of these representations corresponds to exactly 10 seconds of time. It’s just that they are discretizing them to different degrees. All of these are hyperparameters and can be changed (except the 441000, which is a function of the microphone). We have chosen values that are within the standard ranges used for speech systems.

Consider the Ty = 1375 number above. This means that for the output of the model, we discretize the 10s into 1375 time-intervals (each one of length 10/1375 ≈ 0.0072 s) and try to predict for each of these intervals whether someone recently finished saying “activate.”

Consider also the 10000 number above. This corresponds to discretizing the 10sec clip into 10/10000 = 0.001 second intervals. 0.001 seconds is also called 1 millisecond, or 1ms. So when we say we are discretizing according to 1ms intervals, it means we are using 10,000 steps.

Ty = 1375 # The number of time steps in the output of our model
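
As a quick sanity check on these numbers, the small sketch below (not part of the graded notebook) recomputes the per-step duration for each representation of the same 10 seconds:

# Each representation covers the same 10 seconds, just discretized differently.
clip_seconds = 10
representations = {
    "raw audio (44100 Hz)": 441000,
    "spectrogram time steps (Tx)": 5511,
    "pydub steps (1 ms each)": 10000,
    "model output steps (Ty)": 1375,
}
for name, steps in representations.items():
    print(name, ":", steps, "steps,", clip_seconds / steps, "seconds per step")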

1.3 - Generating a single training example

Because speech data is hard to acquire and label, you will synthesize your training data using the audio clips of activates, negatives, and backgrounds. It is quite slow to record lots of 10 second audio clips with random “activates” in them. Instead, it is easier to record lots of positive and negative words, and record background noise separately (or download background noise from free online sources).

To synthesize a single training example, you will:

  • Pick a random 10 second background audio clip
  • Randomly insert 0-4 audio clips of “activate” into this 10sec clip
  • Randomly insert 0-2 audio clips of negative words into this 10sec clip

Because you had synthesized the word “activate” into the background clip, you know exactly when in the 10sec clip the “activate” makes its appearance. You’ll see later that this makes it easier to generate the labels y⟨t⟩ as well.

You will use the pydub package to manipulate audio. Pydub converts raw audio files into lists of Pydub data structures (it is not important to know the details here). Pydub uses 1ms as the discretization interval (1ms is 1 millisecond = 1/1000 seconds) which is why a 10sec clip is always represented using 10,000 steps.

# Load audio segments using pydub 
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0])))    # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))     # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))     # Different "activate" clips can have different lengths 
background len: 19841
activate[0] len: 721
activate[1] len: 731

Overlaying positive/negative words on the background:

Given a 10sec background clip and a short audio clip (positive or negative word), you need to be able to “add” or “insert” the word’s short audio clip onto the background. To ensure audio segments inserted onto the background do not overlap, you will keep track of the times of previously inserted audio clips. You will be inserting multiple clips of positive/negative words onto the background, and you don’t want to insert an “activate” or a random word somewhere that overlaps with another clip you had previously added.

For clarity, when you insert a 1sec “activate” onto a 10sec clip of cafe noise, you end up with a 10sec clip that sounds like someone saying “activate” in a cafe, with “activate” superimposed on the background cafe noise. You do not end up with an 11 sec clip. You’ll see later how pydub allows you to do this.

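To make the overlay behaviour concrete, here is a minimal pydub sketch (the choice of clips is just illustrative); len() of an AudioSegment is its duration in milliseconds, and overlay never changes the length of the background:

from pydub import AudioSegment

# Illustrative clip choices; any background/activate files from raw_data would do.
background = AudioSegment.from_wav("./raw_data/backgrounds/2.wav")[:10000]  # first 10,000 ms
activate = AudioSegment.from_wav("./raw_data/activates/2.wav")              # roughly 1,000 ms

combined = background.overlay(activate, position=3000)  # start "activate" at the 3 sec mark
print(len(background), len(activate), len(combined))    # combined is still 10,000 ms long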

Creating the labels at the same time you overlay:

Recall also that the labels y⟨t⟩ represent whether or not someone has just finished saying “activate.” Given a background clip, we can initialize y⟨t⟩ = 0 for all t, since the clip doesn’t contain any “activates.”

When you insert or overlay an “activate” clip, you will also update the labels y⟨t⟩, so that 50 steps of the output now have target label 1. You will train a GRU to detect when someone has finished saying “activate”. For example, suppose the synthesized “activate” clip ends at the 5sec mark in the 10sec audio, exactly halfway into the clip. Recall that Ty = 1375, so timestep 687 = int(1375 * 0.5) corresponds to the moment 5sec into the audio. So, you will set y⟨688⟩ = 1. Further, you would be quite satisfied if the GRU detects “activate” anywhere within a short time interval after this moment, so we actually set 50 consecutive values of the label y⟨t⟩ to 1. Specifically, we have y⟨688⟩ = y⟨689⟩ = … = y⟨737⟩ = 1.

This is another reason for synthesizing the training data: It’s relatively straightforward to generate these labels y⟨t⟩ as described above. In contrast, if you have 10sec of audio recorded on a microphone, it’s quite time consuming for a person to listen to it and mark manually exactly when “activate” finished.

Here’s a figure illustrating the labels y⟨t⟩ for a clip into which we have inserted “activate”, “innocent”, “activate”, “baby.” Note that the positive labels “1” are associated only with the positive words.

Figure 2

To implement the training set synthesis process, you will use the following helper functions. All of these functions will use a 1ms discretization interval, so the 10sec of audio is always discretized into 10,000 steps.

  1. get_random_time_segment(segment_ms) gets a random time segment in our background audio
  2. is_overlapping(segment_time, existing_segments) checks if a time segment overlaps with existing segments
  3. insert_audio_clip(background, audio_clip, existing_times) inserts an audio segment at a random time in our background audio, using get_random_time_segment and is_overlapping
  4. insert_ones(y, segment_end_ms) inserts 1’s into our label vector y after the word “activate”

The function get_random_time_segment(segment_ms) returns a random time segment onto which we can insert an audio clip of duration segment_ms. Read through the code to make sure you understand what it is doing.

def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.

    Arguments:
    segment_ms -- the duration of the audio clip in ms ("ms" stands for "milliseconds")

    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """

    segment_start = np.random.randint(low=0, high=10000-segment_ms)   # Make sure segment doesn't run past the 10sec background 
    segment_end = segment_start + segment_ms - 1

    return (segment_start, segment_end)

Next, suppose you have inserted audio clips at segments (1000,1800) and (3400,4500). I.e., the first segment starts at step 1000, and ends at step 1800. Now, if we are considering inserting a new audio clip at (3000,3600) does this overlap with one of the previously inserted segments? In this case, (3000,3600) and (3400,4500) overlap, so we should decide against inserting a clip here.

For the purpose of this function, define (100,200) and (200,250) to be overlapping, since they overlap at timestep 200. However, (100,199) and (200,250) are non-overlapping.

Exercise: Implement is_overlapping(segment_time, existing_segments) to check if a new time segment overlaps with any of the previous segments. You will need to carry out 2 steps:

  1. Create a “False” flag, that you will later set to “True” if you find that there is an overlap.
  2. Loop over the previous_segments’ start and end times. Compare these times to the segment’s start and end times. If there is an overlap, set the flag defined in (1) as True. You can use:
for ....:
        if ... <= ... and ... >= ...:
            ...

Hint: There is overlap if the segment starts before the previous segment ends, and the segment ends after the previous segment starts.

# GRADED FUNCTION: is_overlapping

def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.

    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments

    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """

    segment_start, segment_end = segment_time

    ### START CODE HERE ### (≈ 4 line)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False

    # Step 2: loop over the previous_segments start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if segment_start <= previous_end and segment_end >= previous_start:
            overlap = True
    ### END CODE HERE ###

    return overlap
overlap1 = is_overlapping((950, 1430), [(2000, 2550), (260, 949)])
overlap2 = is_overlapping((2305, 2950), [(824, 1532), (1900, 2305), (3424, 3656)])
print("Overlap 1 = ", overlap1)
print("Overlap 2 = ", overlap2)
Overlap 1 =  False
Overlap 2 =  True

Expected Output:

**Overlap 1** False
**Overlap 2** True

Now, let’s use the previous helper functions to insert a new audio clip onto the 10sec background at a random time, but making sure that any newly inserted segment doesn’t overlap with the previous segments.

Exercise: Implement insert_audio_clip() to overlay an audio clip onto the background 10sec clip. You will need to carry out 4 steps:

  1. Get a random time segment of the right duration in ms.
  2. Make sure that the time segment does not overlap with any of the previous time segments. If it is overlapping, then go back to step 1 and pick a new time segment.
  3. Add the new time segment to the list of existing time segments, so as to keep track of all the segments you’ve inserted.
  4. Overlay the audio clip over the background using pydub. We have implemented this for you.

# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the 
    audio segment does not overlap with existing segments.

    Arguments:
    background -- a 10 second background audio recording.  
    audio_clip -- the audio clip to be inserted/overlaid. 
    previous_segments -- times where audio segments have already been placed

    Returns:
    new_background -- the updated background audio
    """

    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)

    ### START CODE HERE ### 
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert 
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)

    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep 
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###

    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position = segment_time[0])

    return new_background, segment_time
np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
audio_clip.export("insert_test.wav", format="wav")
print("Segment Time: ", segment_time)
IPython.display.Audio("insert_test.wav")
Segment Time:  (2915, 3635)

Expected Output

**Segment Time** (2254, 3169)
# Expected audio
IPython.display.Audio("audio_examples/insert_reference.wav")

Finally, implement code to update the labels y⟨t⟩, assuming you just inserted an “activate.” In the code below, y is a (1, 1375) dimensional vector, since Ty = 1375.

If the “activate” ended at time step t, then set y⟨t+1⟩ = 1, as well as up to 49 additional consecutive values. However, make sure you don’t run off the end of the array and try to update y[0][1375], since the valid indices are y[0][0] through y[0][1374] because Ty = 1375. So if “activate” ends at step 1370, you would get only y[0][1371] = y[0][1372] = y[0][1373] = y[0][1374] = 1

Exercise: Implement insert_ones(). You can use a for loop. (If you are an expert in python’s slice operations, feel free also to use slicing to vectorize this.) If a segment ends at segment_end_ms (using a 10000 step discretization), to convert it to the indexing for the outputs y (using a 1375 step discretization), we will use this formula:

    segment_end_y = int(segment_end_ms * Ty / 10000.0)
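
As a quick worked example of this conversion (it matches the sanity check further below), segment_end_ms = 9700 maps to output index int(9700 * 1375 / 10000) = 1333, so only indices 1334 through 1374 can be set before the end of the array:

# Worked example of the ms -> output-index conversion (illustration only).
for segment_end_ms in (9700, 4251):
    segment_end_y = int(segment_end_ms * Ty / 10000.0)
    last_index_set = min(segment_end_y + 50, Ty - 1)
    print(segment_end_ms, "ms ->", segment_end_y,
          "; ones at indices", segment_end_y + 1, "to", last_index_set)
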
# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly we mean that the label of segment_end_y should be 0 while the
    50 following labels should be ones.

    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms

    Returns:
    y -- updated labels
    """

    # duration of the background (in terms of spectrogram time-steps)
    segment_end_y = int(segment_end_ms * Ty / 10000.0)

    # Add 1 to the correct index in the background label (y)
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y+1, segment_end_y+51):
        if i < Ty:
            y[0, i] = 1.0
    ### END CODE HERE ###

    return y
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
plt.plot(insert_ones(arr1, 4251)[0,:])
print("sanity checks:", arr1[0][1333], arr1[0][634], arr1[0][635])
sanity checks: 0.0 1.0 0.0

[Plot: the label vector y after inserting ones following 9700 ms and 4251 ms]

Expected Output

**sanity checks**: 0.0 1.0 0.0

Finally, you can use insert_audio_clip and insert_ones to create a new training example.

Exercise: Implement create_training_example(). You will need to carry out the following steps:

  1. Initialize the label vector y as a numpy array of zeros and shape (1,Ty) .
  2. Initialize the set of existing segments to an empty list.
  3. Randomly select 0 to 4 “activate” audio clips, and insert them onto the 10sec clip. Also insert labels at the correct position in the label vector y .
  4. Randomly select 0 to 2 negative audio clips, and insert them into the 10sec clip.

# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.

    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"

    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """

    # Set the random seed
    np.random.seed(18)

    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1,Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###

    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]

    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background 
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###

    # Standardize the volume of the audio clip 
    background = match_target_amplitude(background, -20.0)

    # Export new training example 
    file_handle = background.export("train" + ".wav", format="wav")
    print("File (train.wav) was saved in your directory.")

    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")

    return x, y
x, y = create_training_example(backgrounds[0], activates, negatives)
File (train.wav) was saved in your directory.


d:\program files\python36\lib\site-packages\matplotlib\axes\_axes.py:7172: RuntimeWarning: divide by zero encountered in log10
  Z = 10. * np.log10(spec)

[Plot: spectrogram of the synthesized training example (train.wav)]

Expected Output

[Expected output: a similar spectrogram plot]

Now you can listen to the training example you created and compare it to the spectrogram generated above.

IPython.display.Audio("train.wav")

Expected Output

IPython.display.Audio("audio_examples/train_reference.wav")

Finally, you can plot the associated labels for the generated training example.

plt.plot(y[0])
[<matplotlib.lines.Line2D at 0x242efc38550>]

[Plot: the labels y for the generated training example]

Expected Output

[Expected output: a similar label plot]

1.4 - Full training set

You’ve now implemented the code needed to generate a single training example. We used this process to generate a large training set. To save time, we’ve already generated a set of training examples.

# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")
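
The arrays above were produced with the same synthesis pipeline. A minimal sketch of how such a set could be generated and saved (the number of examples, file names, and array layout here are assumptions, not the course’s actual script) might look like the following; note that create_training_example above fixes np.random.seed(18), so that line would need to be removed or varied to get distinct examples:

# Sketch only: synthesize a small training set and save it as .npy files.
num_examples = 32                      # illustrative; the course used about 4000
X_list, Y_list = [], []
for i in range(num_examples):
    background = backgrounds[i % len(backgrounds)]
    x, y = create_training_example(background, activates, negatives)
    X_list.append(x.swapaxes(0, 1))    # (Tx, n_freq), the orientation the model expects
    Y_list.append(y.swapaxes(0, 1))    # (Ty, 1)
np.save("./XY_train/X_synth.npy", np.array(X_list))
np.save("./XY_train/Y_synth.npy", np.array(Y_list))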

1.5 - Development set

To test our model, we recorded a development set of 25 examples. While our training data is synthesized, we want to create a development set using the same distribution as the real inputs. Thus, we recorded 25 10-second audio clips of people saying “activate” and other random words, and labeled them by hand. This follows the principle described in Course 3 that we should create the dev set to be as similar as possible to the test set distribution; that’s why our dev set uses real rather than synthesized audio.

# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")

2 - Model

Now that you’ve built a dataset, let’s write and train a trigger word detection model!

The model will use 1-D convolutional layers, GRU layers, and dense layers. Let’s load the packages that will allow you to use these layers in Keras. This might take a minute to load.

from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam

2.1 - Build the model

Here is the architecture we will use. Take some time to look over the model and see if it makes sense.

Figure 3: the model architecture (a Conv1D layer, two GRU layers, and a time-distributed dense + sigmoid output), as implemented below.

One key step of this model is the 1D convolutional step (near the bottom of Figure 3). It inputs the 5511 step spectrogram, and outputs a 1375 step output, which is then further processed by multiple layers to get the final Ty = 1375 step output. This layer plays a role similar to the 2D convolutions you saw in Course 4, of extracting low-level features and then possibly generating an output of a smaller dimension.

Computationally, the 1-D conv layer also helps speed up the model because now the GRU has to process only 1375 timesteps rather than 5511 timesteps. The two GRU layers read the sequence of inputs from left to right, and then ultimately a dense + sigmoid layer is used to make a prediction for y⟨t⟩. Because y is binary valued (0 or 1), we use a sigmoid output at the last layer to estimate the chance of the output being 1, corresponding to the user having just said “activate.”

Note that we use a uni-directional RNN rather than a bi-directional RNN. This is really important for trigger word detection, since we want to be able to detect the trigger word almost immediately after it is said. If we used a bi-directional RNN, we would have to wait for the whole 10sec of audio to be recorded before we could tell if “activate” was said in the first second of the audio clip.

Implementing the model can be done in four steps:

Step 1: CONV layer. Use Conv1D() to implement this, with 196 filters,
a filter size of 15 (kernel_size=15), and stride of 4. [See documentation.]

Step 2: First GRU layer. To generate the GRU layer, use:

X = GRU(units = 128, return_sequences = True)(X)

Setting return_sequences=True ensures that all the GRU’s hidden states are fed to the next layer. Remember to follow this with Dropout and BatchNorm layers.

Step 3: Second GRU layer. This is similar to the previous GRU layer (remember to use return_sequences=True), but has an extra dropout layer.

Step 4: Create a time-distributed dense layer as follows:

X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)

This creates a dense layer followed by a sigmoid, so that the parameters used for the dense layer are the same for every time step. [See documentation.]

Exercise: Implement model(); the architecture is presented in Figure 3.

# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """

    X_input = Input(shape = input_shape)

    ### START CODE HERE ###

    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(196, 15, strides=4)(X_input)                                 # CONV1D
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Activation('relu')(X)                                 # ReLu activation
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units=128, return_sequences=True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization

    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units=128, return_sequences=True)(X)                                 # GRU (use 128 units and return the sequences)
    X = Dropout(0.8)(X)                                 # dropout (use 0.8)
    X = BatchNormalization()(X)                                 # Batch normalization
    X = Dropout(0.8)(X)                              # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation = "sigmoid"))(X) # time distributed  (sigmoid)

    ### END CODE HERE ###

    model = Model(inputs = X_input, outputs = X)

    return model  
model = model(input_shape = (Tx, n_freq))

Let’s print the model summary to keep track of the shapes.

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 5511, 101)         0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 1375, 196)         297136    
_________________________________________________________________
batch_normalization_3 (Batch (None, 1375, 196)         784       
_________________________________________________________________
activation_2 (Activation)    (None, 1375, 196)         0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 1375, 196)         0         
_________________________________________________________________
gru_3 (GRU)                  (None, 1375, 128)         124800    
_________________________________________________________________
dropout_5 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_4 (Batch (None, 1375, 128)         512       
_________________________________________________________________
gru_4 (GRU)                  (None, 1375, 128)         98688     
_________________________________________________________________
dropout_6 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
batch_normalization_5 (Batch (None, 1375, 128)         512       
_________________________________________________________________
dropout_7 (Dropout)          (None, 1375, 128)         0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 1375, 1)           129       
=================================================================
Total params: 522,561
Trainable params: 521,657
Non-trainable params: 904
_________________________________________________________________

Expected Output:

**Total params** 522,561
**Trainable params** 521,657
**Non-trainable params** 904

The output of the network is of shape (None, 1375, 1) while the input is (None, 5511, 101). The Conv1D has reduced the number of steps from 5511 (the spectrogram timesteps) to 1375.

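The 1375 figure is just the standard output length of a “valid” 1-D convolution with the hyperparameters used above (kernel size 15, stride 4, no padding), which you can verify directly:

# Output length of a 'valid' Conv1D: floor((input_steps - kernel_size) / stride) + 1
input_steps, kernel_size, stride = 5511, 15, 4
conv_output_steps = (input_steps - kernel_size) // stride + 1
print(conv_output_steps)  # 1375, matching Ty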

2.2 - Fit the model

Trigger word detection takes a long time to train. To save time, we’ve already trained a model for about 3 hours on a GPU using the architecture you built above, and a large training set of about 4000 examples. Let’s load the model.

model = load_model('./models/tr_model.h5')

You can train the model further, using the Adam optimizer and binary cross entropy loss, as follows. This will run quickly because we are training just for one epoch and with a small training set of 26 examples.

opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)
Epoch 1/1
26/26 [==============================] - 11s 427ms/step - loss: 0.0558 - acc: 0.9797





<keras.callbacks.History at 0x242938d8da0>
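
The ModelCheckpoint callback imported earlier is not actually used in this notebook. If you wanted to train for more epochs yourself, a minimal sketch (the file name and epoch count are illustrative assumptions) could be:

# Sketch only: longer training that keeps the weights with the best dev-set loss.
checkpoint = ModelCheckpoint("tr_model_checkpoint.h5", monitor="val_loss", save_best_only=True)
model.fit(X, Y, batch_size=5, epochs=20,
          validation_data=(X_dev, Y_dev), callbacks=[checkpoint])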

2.3 - Test the model

Finally, let’s see how your model performs on the dev set.

loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)
25/25 [==============================] - 2s 68ms/step
Dev set accuracy =  0.9355927109718323

This looks pretty good! However, accuracy isn’t a great metric for this task, since the labels are heavily skewed to 0’s, so a neural network that just outputs 0’s would get slightly over 90% accuracy. We could define more useful metrics such as F1 score or Precision/Recall. But let’s not bother with that here, and instead just empirically see how the model does.

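If you did want a more informative metric, here is a small sketch (not part of the graded assignment) that computes per-timestep precision, recall, and F1 on the dev set, thresholding the sigmoid outputs at 0.5 and assuming Y_dev has the same shape as the model output:

# Sketch only: per-timestep precision / recall / F1 at a 0.5 threshold.
probs = model.predict(X_dev)                    # shape (25, Ty, 1)
preds = (probs > 0.5).astype(np.float32)
tp = np.sum((preds == 1) & (Y_dev == 1))
fp = np.sum((preds == 1) & (Y_dev == 0))
fn = np.sum((preds == 0) & (Y_dev == 1))
precision = tp / (tp + fp + 1e-8)
recall = tp / (tp + fn + 1e-8)
f1 = 2 * precision * recall / (precision + recall + 1e-8)
print("Precision:", precision, "Recall:", recall, "F1:", f1)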

3 - Making Predictions

Now that you have built a working model for trigger word detection, let’s use it to make predictions. This code snippet runs audio (saved in a wav file) through the network.

def detect_triggerword(filename):
    plt.subplot(2, 1, 1)

    x = graph_spectrogram(filename)
    # the spectrogram outputs (freqs, Tx) and we want (Tx, freqs) to input into the model
    x  = x.swapaxes(0,1)
    x = np.expand_dims(x, axis=0)
    predictions = model.predict(x)

    plt.subplot(2, 1, 2)
    plt.plot(predictions[0,:,0])
    plt.ylabel('probability')
    plt.show()
    return predictions

Once you’ve estimated the probability of having detected the word “activate” at each output step, you can trigger a “chiming” sound to play when the probability is above a certain threshold. Further, y⟨t⟩ might be near 1 for many values in a row after “activate” is said, yet we want to chime only once. So we will insert a chime sound at most once every 75 output steps. This will help prevent us from inserting two chimes for a single instance of “activate”. (This plays a role similar to non-max suppression from computer vision.)

chime_file = "audio_examples/chime.wav"
def chime_on_activate(filename, predictions, threshold):
    audio_clip = AudioSegment.from_wav(filename)
    chime = AudioSegment.from_wav(chime_file)
    Ty = predictions.shape[1]
    # Step 1: Initialize the number of consecutive output steps to 0
    consecutive_timesteps = 0
    # Step 2: Loop over the output steps in the y
    for i in range(Ty):
        # Step 3: Increment consecutive output steps
        consecutive_timesteps += 1
        # Step 4: If prediction is higher than the threshold and more than 75 consecutive output steps have passed
        if predictions[0,i,0] > threshold and consecutive_timesteps > 75:
            # Step 5: Superpose audio and background using pydub
            audio_clip = audio_clip.overlay(chime, position = ((i / Ty) * audio_clip.duration_seconds)*1000)
            # Step 6: Reset consecutive output steps to 0
            consecutive_timesteps = 0

    audio_clip.export("chime_output.wav", format='wav')

3.3 - Test on dev examples

Let’s explore how our model performs on two unseen audio clips from the development set. Let’s first listen to the two dev set clips.

IPython.display.Audio("./raw_data/dev/1.wav")
IPython.display.Audio("./raw_data/dev/2.wav")

Now let’s run the model on these audio clips and see if it adds a chime after “activate”!

filename = "./raw_data/dev/1.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

[Plot: spectrogram and predicted probabilities for dev/1.wav]

filename  = "./raw_data/dev/2.wav"
prediction = detect_triggerword(filename)
chime_on_activate(filename, prediction, 0.5)
IPython.display.Audio("./chime_output.wav")

[Plot: spectrogram and predicted probabilities for dev/2.wav]

Congratulations

You’ve come to the end of this assignment!

Here’s what you should remember:
- Data synthesis is an effective way to create a large training set for speech problems, specifically trigger word detection.
- Using a spectrogram and optionally a 1D conv layer is a common pre-processing step prior to passing audio data to an RNN, GRU or LSTM.
- An end-to-end deep learning approach can be used to build a very effective trigger word detection system.

Congratulations on finishing the final assignment!

Thank you for sticking with us through the end and for all the hard work you’ve put into learning deep learning. We hope you have enjoyed the course!

4 - Try your own example! (OPTIONAL/UNGRADED)

In this optional and ungraded portion of this notebook, you can try your model on your own audio clips!

Record a 10 second audio clip of you saying the word “activate” and other random words, and upload it to the Coursera hub as myaudio.wav. Be sure to upload the audio as a wav file. If your audio is recorded in a different format (such as mp3) there is free software that you can find online for converting it to wav. If your audio recording is not 10 seconds, the code below will either trim or pad it as needed to make it 10 seconds.

# Preprocess the audio to the correct format
def preprocess_audio(filename):
    # Trim or pad audio segment to 10000ms
    padding = AudioSegment.silent(duration=10000)
    segment = AudioSegment.from_wav(filename)[:10000]
    segment = padding.overlay(segment)
    # Set frame rate to 44100
    segment = segment.set_frame_rate(44100)
    # Export as wav
    segment.export(filename, format='wav')

Once you’ve uploaded your audio file to Coursera, put the path to your file in the variable below.

your_filename = "audio_examples/my_audio.wav"
preprocess_audio(your_filename)
IPython.display.Audio(your_filename) # listen to the audio you uploaded 

Finally, use the model to predict when you say activate in the 10 second audio clip, and trigger a chime. If beeps are not being added appropriately, try to adjust the chime_threshold.

chime_threshold = 0.5
prediction = detect_triggerword(your_filename)
chime_on_activate(your_filename, prediction, chime_threshold)
IPython.display.Audio("./chime_output.wav")

[Plot: spectrogram and predicted probabilities for your recording]
