Deep Learning Techniques for Music Generation

Performance RNN
MusicVAE
WaveNet

Abstract

Analysis along five dimensions:

Objective
- melody
- polyphony
- accompaniment
- counterpoint
Representation
waveform, spectrogram, note, chord, meter and beat
- format
  MIDI, piano roll or text.
- encoded
  scalar, one-hot or many-hot.
Architecture
- feedforward network
- recurrent network
- autoencoder
- generative adversarial networks
Challenge
variability, interactivity and creativity.
Strategy
single-step feedforward, iterative feedforward, sampling or input manipulation

Introduction

Type

Melody
Single-voice monophonic melody
Polyphony (chords)
Single-voice polyphony (also named Single-track polyphony)
Multivoice or Multitrack
Multivoice polyphony (also named Multitrack polyphony)
Accompaniment
- Counterpoint, composed of one or more melodies (voices)
- Chord progression, which provides some associated harmony.

Destination and Use

Audio system
play the generated content
Sequencer software
process the generated content (MIDI)
Human(s)
music score.

Mode

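Autonomous: generation runs without human intervention
Interactive: some control interface lets a human user steer the generation process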

Style

coherence
coverage (versus sparsity)
scope (specialized versus large breadth)

Representation

Audio

Waveform
Transformed Representations
Spectrogram
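A common transformed representation of audio is the spectrogram, obtained via a Fourier transform
Chromagram
A variation of the spectrogram

Independent of the octave
The figure shows the chromagram of a C major scale played on a piano
The x axis, common to the four subfigures (a to d), represents time (in seconds)
The y axis of (a) represents the note
The y axis of (b) and (d) represents the chroma (pitch class)
The y axis of (c) represents the amplitude
For the chromagrams (b and d), a third axis, rendered as color, represents the intensity.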
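
To make the octave independence concrete, here is a minimal sketch that folds one spectrogram frame into a 12-bin chroma vector. It assumes equal temperament with A4 = 440 Hz and a frame given as (frequency, magnitude) pairs. The function names and the sample frame are illustrative, not taken from the book.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class_of(frequency_hz, a4_hz=440.0):
    """Return the pitch-class index (0 = C, ..., 11 = B) nearest to a frequency."""
    # MIDI note number of the nearest equal-tempered pitch (A4 = 69).
    midi = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return midi % 12


def fold_into_chroma(frame):
    """Sum the magnitudes of (frequency_hz, magnitude) pairs per pitch class."""
    chroma = [0.0] * 12
    for frequency_hz, magnitude in frame:
        chroma[pitch_class_of(frequency_hz)] += magnitude
    return chroma


# Hypothetical frame with energy at A3 (220 Hz), A4 (440 Hz) and E5 (about 659.26 Hz).
frame = [(220.0, 1.0), (440.0, 0.5), (659.26, 0.25)]
chroma = fold_into_chroma(frame)
```

A3 and A4 land in the same bin (A), which is exactly what "independent of the octave" means.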

Main Concepts

Note (Pitch, Duration, Dynamics)

Pitch
- frequency
  unit: Hz
- vertical position (height) on a score
- pitch notation
  $A_4$ (A440, i.e. a frequency of 440 Hz, the general pitch tuning standard)
  a pitch class plus a number (the octave)
Duration
- absolute value: ms
- relative value: a quarter note / an eighth note
Dynamics
- quantitative value (dB)
- qualitative value
  an annotation on a score about how to perform the note
  $\{ppp, pp, p, f, ff, fff\}$
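
The three ways of naming a pitch (notation, MIDI note number, frequency) can be converted into one another. The sketch below assumes equal temperament, A4 = 440 Hz, and the common convention that middle C (C4) is MIDI note 60. The helper names are illustrative.

```python
import re

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTALS = {"": 0, "#": 1, "b": -1}


def pitch_to_midi(name):
    """Convert pitch notation such as 'A4' or 'C#5' to a MIDI note number."""
    match = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", name)
    if match is None:
        raise ValueError(f"not a pitch name: {name!r}")
    letter, accidental, octave = match.groups()
    # Assumed convention: C4 (middle C) is note 60, so octave -1 starts at 0.
    return 12 * (int(octave) + 1) + NOTE_OFFSETS[letter] + ACCIDENTALS[accidental]


def midi_to_hz(midi, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number."""
    return a4_hz * 2 ** ((midi - 69) / 12)


a4 = pitch_to_midi("A4")   # pitch class A, octave 4
a4_hz = midi_to_hz(a4)     # the A440 tuning reference
```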

Rest

absolute value: ms
relative value:
a quarter rest, an eighth rest

Interval

the basis of chords
An interval is a relative transition between two notes
Examples:
a major third (which includes 4 semitones)
a minor third (3 semitones)
a (perfect) fifth (7 semitones)

Interval-based representations are rarely used in deep-learning-based music generation

Chord

A set of at least 3 notes (a triad)

A chord can be represented either by
specifying the precise octave as well as the position (voicing) of each note,
or by using a chord symbol that combines
- the pitch class of its root note, e.g. C
- the type, e.g. major, minor, dominant seventh, or diminished
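
As a small illustration of the chord-symbol form, the sketch below expands a root pitch class and a chord type into the pitch classes of the chord, using standard semitone offsets built from the intervals listed earlier (major third = 4, minor third = 3, perfect fifth = 7). The table covers only the four types named above, and the names are illustrative.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Semitone offsets from the root for each chord type.
CHORD_TYPES = {
    "major": (0, 4, 7),
    "minor": (0, 3, 7),
    "diminished": (0, 3, 6),
    "dominant seventh": (0, 4, 7, 10),
}


def chord_pitch_classes(root, chord_type):
    """Expand a chord symbol (root + type) into its pitch classes."""
    root_index = PITCH_CLASSES.index(root)
    return [PITCH_CLASSES[(root_index + offset) % 12]
            for offset in CHORD_TYPES[chord_type]]


c_major = chord_pitch_classes("C", "major")
d_minor = chord_pitch_classes("D", "minor")
g_seventh = chord_pitch_classes("G", "dominant seventh")
```

The symbol says nothing about octave or voicing. That information exists only in the note-by-note representation.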

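Rhythm

conveys the pulsation as well as the stress on specific beats

Rhythm introduces pulsation and cycles,
giving structure to what would otherwise be a flat linear sequence of notes

Beat and Meter

A beat is the unit of pulsation

meter
- the number of beats within a measure
- the duration of each beat
The most frequent meters are 2/4, 3/4 and 4/4
3/4 means 3 beats per measure, each beat lasting a quarter note

Multivoice/Multitrack

Each voice is a different vocal range (e.g. soprano, alto...) or a different instrument (e.g. piano, bass, drums...). Multivoice music is usually modeled as parallel tracks, each with its own sequence of notes, sharing the same meter but possibly with different strong (stressed) beats

When a single instrument plays simultaneous notes, the representation is single-voice polyphony; common examples are polyphonic instruments such as the piano or guitar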

Format

MIDI

Musical Instrument Digital Interface
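Specifies real-time note performance data as well as control data

Note on
- a channel number
  indicates the instrument or track
  {0, 1, ..., 15}
- a MIDI note number
  indicates the pitch
  {0, 1, ..., 127}
- a velocity
  how loud the note is played (for a keyboard, how fast the key is pressed)
  {0, 1, ..., 127}
- e.g. "Note on, 0, 60, 50"
  means "on channel 1 (channel number 0), start playing middle C with velocity 50"
Note off
- same as above, except that the velocity indicates how fast the note is released
- e.g. "Note off, 0, 60, 20"
  means "on channel 1 (channel number 0), stop playing middle C with velocity 20"

Each note event is actually embedded in a track chunk, a data structure containing a delta-time value, which specifies the timing information, and the event itself.

A delta-time value is expressed as either
- a relative metrical time, in ticks
  the file header specifies the number of ticks per quarter note
- an absolute time
e.g. with 384 ticks per quarter note, a sixteenth note is 96 ticks and an eighth note is 192 ticks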
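
The two example messages and the tick arithmetic can be checked with a short sketch. It uses the standard MIDI status bytes (0x90 for Note on, 0x80 for Note off, with the channel number in the low four bits). The function names are illustrative, and the 384 ticks per quarter note is just the value used in the example above.

```python
from fractions import Fraction

NOTE_OFF_STATUS = 0x80
NOTE_ON_STATUS = 0x90


def channel_message(status, channel, note, velocity):
    """Build the three bytes of a Note on / Note off message."""
    if not 0 <= channel <= 15:
        raise ValueError("channel number must be in 0..15")
    if not (0 <= note <= 127 and 0 <= velocity <= 127):
        raise ValueError("note and velocity must be in 0..127")
    return bytes([status | channel, note, velocity])


def ticks_for(note_value, ticks_per_quarter=384):
    """Ticks for a note value given as a fraction of a whole note."""
    ticks = Fraction(note_value) * 4 * ticks_per_quarter
    if ticks.denominator != 1:
        raise ValueError("duration is not a whole number of ticks")
    return int(ticks)


note_on = channel_message(NOTE_ON_STATUS, 0, 60, 50)    # "Note on, 0, 60, 50"
note_off = channel_message(NOTE_OFF_STATUS, 0, 60, 20)  # "Note off, 0, 60, 20"
sixteenth = ticks_for(Fraction(1, 16))
eighth = ticks_for(Fraction(1, 8))
```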

Drawbacks

It does not effectively preserve the notion of multiple notes being played at once through the use of multiple tracks

Piano Roll

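Each note is drawn as a bar
The length of the bar is the duration of the note

Advantages

Intuitive

Drawbacks compared with MIDI

no note off information
a long note cannot be distinguished from repeated short notes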
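
The ambiguity is easy to reproduce. The sketch below assumes a binary piano roll indexed as `roll[pitch][time_step]` and notes given as hypothetical (pitch, start step, duration in steps) triples.

```python
def to_piano_roll(notes, n_steps, n_pitches=128):
    """Render (pitch, start_step, duration_steps) triples as a binary piano roll."""
    roll = [[0] * n_steps for _ in range(n_pitches)]
    for pitch, start, duration in notes:
        for step in range(start, min(start + duration, n_steps)):
            roll[pitch][step] = 1
    return roll


# One middle C held for four steps...
one_long = to_piano_roll([(60, 0, 4)], n_steps=4)
# ...versus four consecutive one-step middle Cs.
four_short = to_piano_roll([(60, s, 1) for s in range(4)], n_steps=4)

ambiguous = one_long == four_short
```

Both inputs give the same matrix, so the note boundaries are lost unless extra information, such as a hold or onset marker, is added to the representation.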

ABC notation

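Used for folk and traditional music

A melody can be encoded in a textual representation and processed as text.

The first lines of a tune (six in the example) form the header, which carries metadata

T is the title of the tune, M is the meter, L is the default note length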
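
A minimal sketch of splitting an ABC tune into header fields and body follows. The tune text is made up for illustration, and the parser handles only single-letter `X:` style fields, relying on the ABC rule that the key field `K:` ends the header.

```python
# Hypothetical tune, written only to exercise the parser.
ABC_TUNE = """X:1
T:Example Tune
M:4/4
L:1/8
K:D
DEFG A2A2|B2A2 F4|
"""


def parse_abc(text):
    """Split an ABC tune into a dict of header fields and a list of body lines."""
    header, body = {}, []
    in_header = True
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        is_field = len(line) > 1 and line[0].isalpha() and line[1] == ":"
        if in_header and is_field:
            header[line[0]] = line[2:].strip()
            if line[0] == "K":  # the key field closes the header
                in_header = False
        else:
            in_header = False
            body.append(line)
    return header, body


header, body = parse_abc(ABC_TUNE)
```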

Chord and Polyphony

Chord2Vec

MusicXML

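Markup Language
e.g. HTML, XML

Because MIDI files are complex, loosely standardized and playback-oriented, they cannot fully meet the needs of music software for score display and typesetting. MusicXML was created to fill this gap.

MusicXML is an open, freely distributable format for Western music notation, maintained under the World Wide Web Consortium (W3C). MusicXML files are based on standard XML technology and are therefore essentially text files, unlike Standard MIDI Files, which are binary. The main advantage of MusicXML is its precise definition of the display format, so that the same file shows the same score when opened in different environments. Musical semantics in MusicXML are mainly expressed by elements, i.e. XML tags, and the nesting of tags expresses the containment relations between musical elements.

MusicXML files come in two types: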

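score-partwise

Score information, XML file information
Information on each part
Part 1, whole piece:
    Measure 1:
        Attributes
        Note 1
        Note 2
        ...
    Measure 2:
        Note 1
        Note 2
        ...
    ...
Part 2, whole piece:
    Measure 1:
        Attributes
        Note 1
        Note 2
        ...
    Measure 2:
        Note 1
        Note 2
        ...
    ...

score-timewise

Score information, XML file information
Information on each part
Measure 1:
    Part 1:
        Attributes
        Note 1
        Note 2
        ...
    Part 2:
        Attributes
        Note 1
        Note 2
        ...
Measure 2:
    Part 1:
        Attributes
        Note 1
        Note 2
        ...
    Part 2:
        Attributes
        Note 1
        Note 2
        ...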

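An attributes element usually contains the following information:

Divisions: the unit of duration (number of divisions per quarter note)
Key: key signature
Time: time signature
Clef: clef

A note element usually contains the following information:

Step: note name
Octave: octave
Duration: relative length (in divisions)
Type: note type (e.g. quarter, eighth)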
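
The nesting described above can be walked with Python's standard `xml.etree.ElementTree`. The one-note score-partwise document below is a hand-written, deliberately minimal example, and `read_notes` is an illustrative helper. Real files carry many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal score: one part, one measure, one whole note C4.
SCORE = """<score-partwise version="3.1">
  <part-list>
    <score-part id="P1"><part-name>Voice</part-name></score-part>
  </part-list>
  <part id="P1">
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key><fifths>0</fifths></key>
        <time><beats>4</beats><beat-type>4</beat-type></time>
        <clef><sign>G</sign><line>2</line></clef>
      </attributes>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration>
        <type>whole</type>
      </note>
    </measure>
  </part>
</score-partwise>"""


def read_notes(xml_text):
    """Return (part id, measure number, step, octave, duration, type) per pitched note."""
    root = ET.fromstring(xml_text)
    notes = []
    for part in root.iter("part"):
        for measure in part.iter("measure"):
            for note in measure.iter("note"):
                pitch = note.find("pitch")
                if pitch is None:  # rests carry no <pitch> element
                    continue
                notes.append((
                    part.get("id"),
                    measure.get("number"),
                    pitch.findtext("step"),
                    int(pitch.findtext("octave")),
                    int(note.findtext("duration")),
                    note.findtext("type"),
                ))
    return notes


notes = read_notes(SCORE)
```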

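Drawbacks

Verbose and rich, hence not suitable as a direct representation for machine learning tasks

Libraries

In Python

the music21 library can be used to process MusicXML files and MIDI files
the pretty_midi library can be used to process MIDI files.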

Lead Sheet

Used for jazz, pop music, etc.

Flow Machines

Lead Sheet Data Base (LSDB) repository
includes more than 12,000 lead sheets

MidiNet system

Temporal Scope and Granularity

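Scope

Global
Examples are the MiniBach and DeepHear systems
The generated musical content has a fixed length
Time step (or time slice)
The generated musical content can have arbitrary length
Note step
CONCERT system
The generated musical content can have arbitrary length

Granularity

For the Global and Time step temporal scopes, the granularity of the time step must be defined

Set to a relative duration
e.g. a sixteenth note
Set to an absolute duration
e.g. 10 ms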

Encoding

a scalar discrete integer value encoding of A4, the integer number specifying its MIDI note number;
a one-hot encoding of A4
a many-hot encoding of a D minor chord (D4, F4, A4)
a multi-one-hot encoding of a first voice with A4 and a second voice with D3
a multi-many-hot encoding of a first voice with a D minor chord (D4, F4, A4) and a second voice with C3 (corresponding to a minor seventh on bass).
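
The five encodings listed above can be written out directly. This sketch assumes vectors of size 128 indexed by MIDI note number (with C4 = 60) and represents several voices by concatenating one vector per voice. Both choices are illustrative assumptions, not something the book prescribes.

```python
N_PITCHES = 128  # one position per MIDI note number (assumed vector size)

# MIDI note numbers under the C4 = 60 convention.
C3, D3, D4, F4, A4 = 48, 50, 62, 65, 69


def one_hot(index, size=N_PITCHES):
    """Vector with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector


def many_hot(indices, size=N_PITCHES):
    """Vector with a 1 at every given index (simultaneous notes)."""
    vector = [0] * size
    for index in indices:
        vector[index] = 1
    return vector


scalar = A4                                                  # scalar integer encoding
a4_one_hot = one_hot(A4)                                     # one-hot
d_minor = many_hot([D4, F4, A4])                             # many-hot
multi_one_hot = one_hot(A4) + one_hot(D3)                    # two voices, one note each
multi_many_hot = many_hot([D4, F4, A4]) + many_hot([C3])     # chord voice + bass voice
```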

Dataset

Abbr.	Dataset	Description
	the Classical piano MIDI database
JSB	the JSB Chorales dataset
LSDB	Lead Sheet Data Base	more than 12,000 lead sheets (including from all jazz and bossa nova song books), developed within the Flow Machines project
	MuseData	an electronic library of classical music with more than 800 pieces, from CCARH at Stanford University
	MusicNet	a collection of 330 freely-licensed classical music recordings together with over 1 million annotated labels (indicating timing and instrumental information)
	Nottingham	a collection of 1,200 folk tunes in the ABC notation, each tune consisting of a simple melody on top of chords, in other words an ABC equivalent of a lead sheet
	Session	a repository and discussion platform for Celtic music in the ABC notation, containing more than 15,000 songs
	Symbolic Music dataset by Walder	a huge set of cleaned and preprocessed MIDI files
	TheoryTab database	a set of songs represented in a tab format, a combination of a piano roll melody, chords and lyrics, in other words a piano roll equivalent of a lead sheet
	Yamaha e-Piano Competition dataset	the participants' MIDI performance recordings are made available

Architecture

Restricted Boltzmann Machine (RBM)

受限玻爾茲曼機

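An RBM differs from an autoencoder:

an RBM has no output: the input also acts as the output;
an RBM is stochastic, not deterministic;
an RBM is trained in an unsupervised way, using a specific algorithm (contrastive divergence);
the values it manipulates are Boolean.

RBMs became popular after Hinton designed a specific fast learning algorithm for them, named contrastive divergence, and used them for pre-training deep neural networks

An RBM can learn a distribution, and can learn efficiently from few examples

In essence, an RBM is an unsupervised machine learning model that reconstructs its input data, that is, it extracts features from the data and builds a new representation of it for predictive analysis. Its basic function is somewhat like that of an autoencoder (AE). Like AEs, RBMs can therefore be stacked to build deep neural networks that mine features from the data.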

Generative Adversarial Networks (GAN)

Reinforcement learning (RL)

RL-Tuner architecture

Compound Architectures

Composition

RNN Encoder-Decoder
combines an RNN and an autoencoder
RNN-RBM architecture
combining an RNN architecture and an RBM architecture

Refinement

variational autoencoder (VAE) architecture

Pattern instantiation

C-RNN-GAN architecture

Strategy

Ex Nihilo Generation

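MiniBach

Single-Step Feedforward Strategy

Supervised learning
piano roll

Good at generating an accompaniment that matches the input melody (a counterpoint made of three distinct melodies)

Limitations

The length of the music is fixed
The same melody always produces exactly the same accompaniment
No incrementality and no interactivity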

DeepHear

based on an autoencoder architecture

DeepHear Ragtime Melody Symbolic Music Generation System

piano roll with a multi-one-hot encoding

Sample

Metropolis-Hastings algorithm
Gibbs sampling (GS)
block Gibbs sampling

different levels of probability distribution (and sampling):

item-level or vertical dimension
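At the level of a compound musical item, e.g. a chord. In this case the distribution is about the relations between the components of the chord, i.e. it describes the probability of notes occurring together.
sequence-level or horizontal dimension
At the level of a sequence of items, e.g. a melody made of successive notes. In this case the distribution is about the sequence of notes, i.e. it describes the probability of a specific note occurring after a given note.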

RBM-based Chord

RBM-based Chord Music Generation System

Models polyphonic music

sample from the RBM through block Gibbs sampling
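
A minimal sketch of block Gibbs sampling in an RBM with Boolean units: all hidden units are sampled at once given the visible layer, then all visible units at once given the hidden layer. The layer sizes and the random weights below are placeholders (a real system would use trained parameters), and the 12 visible units standing for pitch classes are an illustrative assumption.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_layer(inputs, weights, biases, rng):
    """Sample every unit of one layer in a single block, given the other layer."""
    units = []
    for j, bias in enumerate(biases):
        activation = bias + sum(x * row[j] for x, row in zip(inputs, weights))
        units.append(1 if rng.random() < sigmoid(activation) else 0)
    return units


def block_gibbs(visible, weights, visible_bias, hidden_bias, n_steps, rng):
    """Alternate h ~ p(h | v) and v ~ p(v | h) for n_steps; return the final v."""
    weights_t = [list(column) for column in zip(*weights)]  # hidden x visible
    for _ in range(n_steps):
        hidden = sample_layer(visible, weights, hidden_bias, rng)
        visible = sample_layer(hidden, weights_t, visible_bias, rng)
    return visible


rng = random.Random(0)
n_visible, n_hidden = 12, 4  # e.g. one visible unit per pitch class (assumption)
# Untrained placeholder parameters, for illustration only.
weights = [[rng.gauss(0.0, 0.5) for _ in range(n_hidden)] for _ in range(n_visible)]
visible_bias = [0.0] * n_visible
hidden_bias = [0.0] * n_hidden

start = [rng.randint(0, 1) for _ in range(n_visible)]
sample = block_gibbs(start, weights, visible_bias, hidden_bias, n_steps=10, rng=rng)
```

The result is a many-hot vector that can be read as a set of simultaneous notes, i.e. a chord.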

Length Variability

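An important limitation of the single-step feedforward strategy and of the decoder feedforward strategy is that the length of the generated music (more precisely, the number of time steps or measures) is fixed.

The solution is to use an RNN:

feed a seed item (e.g. the first note of a melody) forward into the recurrent network to produce the next item (e.g. the next note);
use this next item as the next input to produce the item after it;
repeat this process iteratively until a sequence of the desired length (e.g. a sequence of notes, i.e. a melody) has been produced.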
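
The loop described above can be sketched independently of any particular network. Here `step` stands for one forward pass of a recurrent model that maps the current item and hidden state to the next item and the new state. `toy_step` is a hand-written stand-in (it walks up a C major scale), not a trained network.

```python
def generate(step, seed, length, state=None):
    """Iteratively feed each output back in as the next input."""
    sequence = [seed]
    item = seed
    while len(sequence) < length:
        item, state = step(item, state)
        sequence.append(item)
    return sequence


C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]  # MIDI note numbers, C4 to C5


def toy_step(note, state):
    """Stand-in for a recurrent network: move one scale degree up, wrapping around."""
    steps_so_far = 0 if state is None else state
    next_note = C_MAJOR[(C_MAJOR.index(note) + 1) % len(C_MAJOR)]
    return next_note, steps_so_far + 1


short_melody = generate(toy_step, seed=60, length=5)
long_melody = generate(toy_step, seed=60, length=10)
```

The same model yields sequences of any requested length, which is the point of the iterative strategy.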

BluesC

where C stands for chords

Blues Chord Sequence Symbolic Music Generation System

The goal is to learn and generate chord sequences
piano roll

two types of sequences: melody and chords

BluesMC

MC stands for melody and chords

Blues Melody and Chords Symbolic Music Generation System

LSTM

Content Variability

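A limitation of the iterative feedforward strategy on an RNN is that generation is deterministic: feeding forward the same input always produces the same output

the output activation layer is softmax and generation is modeled as a classification task

This makes it easy to switch to a nondeterministic strategy through sampling,

by sampling the note according to the generated distribution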
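
The difference between the deterministic and the sampling strategy comes down to how the softmax output is read. A minimal sketch, with made-up logits over three candidate notes:

```python
import math
import random


def softmax(logits):
    """Turn raw output scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_most_likely(probabilities):
    """Deterministic strategy: always return the most probable index."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def sample_index(probabilities, rng):
    """Sampling strategy: draw an index according to the distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


logits = [2.0, 1.0, 0.1]          # hypothetical network outputs for 3 notes
probabilities = softmax(logits)
greedy = pick_most_likely(probabilities)
rng = random.Random(42)
draws = [sample_index(probabilities, rng) for _ in range(20)]
```

`greedy` is the same on every call, while `draws` varies from note to note in proportion to the predicted probabilities.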

CONCERT

CONCERT Bach Melody Symbolic Music Generation System

RNN

The three main components of its pitch representation are as follows:

the pitch height (PH)
the (modulo) chroma circle (CC) cartesian coordinates
the (harmonic) circle of fifths (CH) cartesian coordinates.

activation function is the sigmoid function
cost function is mean squared error

An early work

Celtic

Celtic Melody Symbolic Music Generation System

input - network - sampling - recursion

Expressiveness

One limitation of most existing systems is that they consider fixed dynamics (amplitude) for all notes as well as an exact quantization (a fixed tempo), which makes the music generated too mechanical, without expressiveness or nuance.

Performance RNN

Performance RNN Piano Polyphony Symbolic Music Generation System

MIDI

LSTM

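a temperature parameter controls the randomness of sampling: values close to 0 make generation nearly deterministic (fixed), while 1 samples from the predicted distribution as is (random)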
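
A common way to implement such a temperature, assumed here rather than taken from the Performance RNN code, is to divide the logits by the temperature before the softmax. The logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature; the temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]                         # hypothetical event scores
sharp = softmax_with_temperature(logits, 0.1)    # close to 0: nearly fixed
plain = softmax_with_temperature(logits, 1.0)    # 1: the distribution as predicted
```

With temperature 0.1 almost all probability mass sits on the top-scoring event. With 1.0 the other events keep a realistic chance of being drawn.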

Melody-Harmony Interaction

RNN-RBM

RNN-RBM Polyphony Symbolic Music Generation System

Considers both the item-level (vertical) dimension and the sequence-level (horizontal) dimension

Hexahedria

Hexahedria Polyphony Symbolic Music Generation Architecture

Considers both the item-level (vertical) dimension and the sequence-level (horizontal) dimension

Classical piano MIDI / piano roll

Bi-Axial LSTM

Bi-Axial LSTM Polyphony Symbolic Music Generation Architecture

Structure

MusicVAE

MusicVAE Multivoice Hierarchical Symbolic Music Generation System

A variational recurrent autoencoder (VRAE) with a two-level hierarchical RNN within the decoder
MIDI

Incrementality

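Single-step feedforward: the feedforward architecture processes, in a single processing step, a global representation that includes all time steps. An example is MiniBach
Iterative feedforward: the recurrent architecture iteratively processes a local representation corresponding to a single time step. An example is CONCERT
Incremental sampling: the feedforward architecture incrementally processes a global representation that includes all time steps, by incrementally instantiating its variables (each variable corresponding to the possible note at a specific time step). An example is DeepBach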

DeepBach

DeepBach Chorale Multivoice Symbolic Music Generation System
Hadjeres

Interactivity

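Some interactivity with human users, to assist them in carrying out musical tasks (composition, counterpoint, harmonization, analysis, arranging, etc.) in an incremental and interactive manner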

Deep-AutoController
DeepBach
