Anonymizing Data with Autoencoders: Stop Letting Your Data Leak Your Privacy

> This article was originally published on the Towards Data Science blog and is translated and shared by InfoQ China with the permission of the original author, Shuyi Yang.

In this article, we will learn how to use an autoencoder (a special type of artificial neural network) to anonymize data. The latent representation extracted by this approach keeps the original data confidential, yet it can still be used in downstream machine learning prediction tasks without a significant drop in performance.

The article is in two parts. In the first, I introduce the structure of an autoencoder with an example. In the second, I show how to use an autoencoder to encode tabular data so that it can be anonymized and used in other machine learning tasks while preserving privacy.

## Autoencoders

An [autoencoder](https://en.wikipedia.org/wiki/Autoencoder) is a special neural network composed of two parts: an encoder and a decoder. The encoder takes the input data and transforms it into a latent representation; the decoder tries to reconstruct the input data from that latent representation. The loss is the distance between the input data and the reconstructed data.

![Autoencoder architecture](https://static001.infoq.cn/resource/image/f2/0d/f2f846ca852641f455efbce267a82e0d.jpg)

A trained autoencoder provides a good latent representation. This representation is very different from the original data, yet it retains the information present in the input layer.

To illustrate this, let's run an autoencoder on the well-known public dataset [MNIST](https://en.wikipedia.org/wiki/MNIST_database).

Let's import a few packages for this tutorial.

```python
from pandas import read_csv, set_option, get_dummies, DataFrame
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from numpy import mean, max, prod, array, hstack
from numpy.random import choice
from matplotlib.pyplot import barh, yticks, ylabel, xlabel, title, show, scatter, cm, figure, imshow
from tensorflow.keras.layers import Input, Dense, Dropout, Activation, BatchNormalization
from tensorflow.keras import Model
from tensorflow.keras.datasets import mnist
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model
from tqdm import tqdm
```

We will build and train several different autoencoders, so let's define a helper function for that purpose.

```python
def build_autoencoder(dim_input, dim_layer_1, dim_layer_2):
    # Encoder: input -> dim_layer_1 -> bottleneck of size dim_layer_2
    input_layer = Input(shape=(dim_input,))
    x = Activation("relu")(input_layer)
    x = Dense(dim_layer_1)(x)
    x = Activation("relu")(x)
    bottleneck_layer = Dense(dim_layer_2)(x)

    # Decoder: bottleneck -> dim_layer_1 -> reconstruction of the input
    x = Activation("relu")(bottleneck_layer)
    x = Dense(dim_layer_1)(x)
    x = Activation("relu")(x)
    output_layer = Dense(dim_input, activation="relu")(x)

    # The encoder shares its layers with the full autoencoder, so training
    # the autoencoder also trains the encoder.
    encoder = Model(input_layer, bottleneck_layer)
    autoencoder = Model(input_layer, output_layer)
    autoencoder.compile(optimizer="adam", loss="mse")

    return autoencoder, encoder
```
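To make the workflow concrete, here is a minimal sketch of how `build_autoencoder` can be used: train the full autoencoder to reconstruct its input, then keep only the encoder to produce the latent (anonymized) representation. The layer sizes (64 and 2), the random stand-in data in place of MNIST, and the single training epoch are illustrative assumptions, not choices from the article.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Activation
from tensorflow.keras import Model

def build_autoencoder(dim_input, dim_layer_1, dim_layer_2):
    # Same structure as the article's helper: encoder, bottleneck, decoder.
    input_layer = Input(shape=(dim_input,))
    x = Activation("relu")(input_layer)
    x = Dense(dim_layer_1)(x)
    x = Activation("relu")(x)
    bottleneck_layer = Dense(dim_layer_2)(x)
    x = Activation("relu")(bottleneck_layer)
    x = Dense(dim_layer_1)(x)
    x = Activation("relu")(x)
    output_layer = Dense(dim_input, activation="relu")(x)
    encoder = Model(input_layer, bottleneck_layer)
    autoencoder = Model(input_layer, output_layer)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

# Stand-in data: 256 random "images" of 784 pixels scaled to [0, 1],
# the same shape as flattened 28x28 MNIST digits.
x_train = np.random.rand(256, 784).astype("float32")

autoencoder, encoder = build_autoencoder(dim_input=784, dim_layer_1=64, dim_layer_2=2)

# The autoencoder is trained to reproduce its own input (target == input).
autoencoder.fit(x_train, x_train, epochs=1, batch_size=32, verbose=0)

# The encoder alone maps each 784-dimensional input to a 2-dimensional
# latent code, which is the representation used for anonymization.
latent = encoder.predict(x_train, verbose=0)
print(latent.shape)  # (256, 2)
```

Because the encoder and the autoencoder share the same layer objects, fitting the autoencoder is all that is needed; the encoder requires no separate training step.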