Google AudioSet is an audio dataset released by Google that plays a crucial role in audio-related AI learning and research.
Since the official site is blocked behind the Great Firewall, I have copied the dataset's introduction, download instructions, and parsing format over from Google's official pages.
Dataset introduction
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. It covers a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
2.1 million annotated videos
5.8 thousand hours of audio
527 classes of annotated sounds
For download instructions, see https://blog.csdn.net/qq_39437746/article/details/80793476
Below is the concrete format for parsing the tfrecord files.
Features dataset
Frame-level features are stored as tensorflow.SequenceExample protocol buffers. A tensorflow.SequenceExample proto is reproduced here in text format:
context: {
feature: {
key : "video_id"
value: {
bytes_list: {
value: [YouTube video id string]
}
}
}
feature: {
key : "start_time_seconds"
value: {
float_list: {
value: 6.0
}
}
}
feature: {
key : "end_time_seconds"
value: {
float_list: {
value: 16.0
}
}
}
feature: {
key : "labels"
value: {
int64_list: {
value: [1, 522, 11, 172] # Label indices; the mapping to class names is in the class_labels_indices.csv file of the AudioSet release.
}
}
}
}
feature_lists: {
feature_list: {
key : "audio_embedding"
value: {
feature: {
bytes_list: {
value: [128 8bit quantized features]
}
}
feature: {
bytes_list: {
value: [128 8bit quantized features]
}
}
}
... # Repeated for every second of the segment
}
}
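Each audio_embedding entry above is a 128-byte string of uint8 values, one per second of audio. To use the embeddings as floats you have to dequantize them. A minimal numpy sketch, assuming the [-2.0, 2.0] quantization range used by the released VGGish post-processing code (the function name dequantize is my own):

```python
import numpy as np

# Quantization range assumed from the VGGish post-processing reference code.
QUANTIZE_MIN, QUANTIZE_MAX = -2.0, 2.0

def dequantize(raw_bytes):
    # One second's embedding: 128 uint8 values packed into a bytes string.
    q = np.frombuffer(raw_bytes, dtype=np.uint8).astype(np.float32)
    # Map 0..255 back onto [QUANTIZE_MIN, QUANTIZE_MAX].
    return QUANTIZE_MIN + q * (QUANTIZE_MAX - QUANTIZE_MIN) / 255.0

# Stand-in for one second's raw embedding bytes.
emb = dequantize(bytes(range(128)))
```

Applied to each element of the audio_embedding feature list, this yields one 128-dimensional float vector per second of the clip.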
tfrecord parsing code
import tensorflow as tf

def getParseData(filenames):
    # e.g. filenames = 'audioset_v1_embeddings/bal_train/5v.tfrecord'
    raw_dataset = tf.data.TFRecordDataset(filenames)

    # # To inspect a raw record, parse it as a SequenceExample
    # # (not tf.train.Example -- these records are SequenceExamples):
    # for raw_record in raw_dataset.take(1):
    #     example = tf.train.SequenceExample()
    #     example.ParseFromString(raw_record.numpy())
    #     print(example)

    context_feature = {
        'video_id': tf.io.FixedLenFeature([], tf.string),
        'labels': tf.io.VarLenFeature(tf.int64),
        'start_time_seconds': tf.io.FixedLenFeature([], tf.float32),
        'end_time_seconds': tf.io.FixedLenFeature([], tf.float32),
    }
    sequence_feature = {
        'audio_embedding': tf.io.FixedLenSequenceFeature(
            shape=[], dtype=tf.string, allow_missing=True)
    }

    def _parse_function(example_proto):
        # Returns a (context, sequence) pair of parsed-feature dicts.
        return tf.io.parse_single_sequence_example(
            example_proto, context_feature, sequence_feature)

    # Map the parser over every record and return the parsed dataset.
    return raw_dataset.map(_parse_function)
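To see what the parsed tensors look like without downloading any tfrecord files, you can build a synthetic SequenceExample that follows the schema shown earlier and push it through the same feature specs. A sketch (all field values here are made up):

```python
import tensorflow as tf

# Build one synthetic SequenceExample matching the AudioSet schema.
ex = tf.train.SequenceExample()
ex.context.feature['video_id'].bytes_list.value.append(b'abc123')
ex.context.feature['labels'].int64_list.value.extend([1, 522, 11, 172])
ex.context.feature['start_time_seconds'].float_list.value.append(6.0)
ex.context.feature['end_time_seconds'].float_list.value.append(16.0)
for _ in range(10):  # ten seconds -> ten 128-byte embedding strings
    feat = ex.feature_lists.feature_list['audio_embedding'].feature.add()
    feat.bytes_list.value.append(bytes(128))

# Parse it back with the same feature specs as in getParseData.
context, sequence = tf.io.parse_single_sequence_example(
    ex.SerializeToString(),
    context_features={
        'video_id': tf.io.FixedLenFeature([], tf.string),
        'labels': tf.io.VarLenFeature(tf.int64),
        'start_time_seconds': tf.io.FixedLenFeature([], tf.float32),
        'end_time_seconds': tf.io.FixedLenFeature([], tf.float32),
    },
    sequence_features={
        'audio_embedding': tf.io.FixedLenSequenceFeature(
            shape=[], dtype=tf.string, allow_missing=True)
    })
```

Note that 'labels' comes back as a SparseTensor (it was declared with VarLenFeature), so call tf.sparse.to_dense on it before use, and 'audio_embedding' is a string tensor with one element per second of the clip.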