Does GPT-2 Know Your Phone Number?

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"先說結論:很可能不會。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"但是,OpenAI 的 GPT-2 語言模型 確實知道如何觸達特定的 Peter W —(爲保護隱私而刪除的名稱)。當出現簡短的 Internet 文本提示時,該模型將準確生成 Peter 的聯繫信息,包括他的工作地址,電子郵件,電話和傳真:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/wechat\/images\/81\/81f9b5ad32c8b14d7fc35f2f75c03aec.png","alt":null,"title":null,"style":null,"href":null,"fromPaste":false,"pastePass":false}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在我們 最近的論文 中,我們評估了大型語言模型如何記憶和輸出訓練數據的這種稀有片段。"},{"type":"text","marks":[{"type":"strong"}],"text":"我們關注 GPT-2,發現至少 0.1%的由它生成的文本(非常保守的估計)包含了從其訓練集中的某篇文檔中逐字“複製 - 粘貼”的長字符串"},{"type":"text","text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"對於在私人數據(例如,用戶的 電子郵件)上訓練的語言模型來說,這種記憶是一個明顯的問題,因爲該模型可能會無意間輸出用戶的敏感對話。而且,即使對於通過 Web 公開的數據訓練的模型(例如 GPT-2、GPT-3、T5、RoBERTa、TuringNLG),對訓練數據的記憶也引起了多個具有挑戰性的監管問題,諸如濫用個人身份信息和侵犯版權等等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"抽取記住的訓練數據"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"關注 BAIR 博客的讀者可能會熟悉語言模型中的數據記憶問題。去年,我們的合著者尼古拉斯·卡利尼(Nicholas Carlini)描述了一篇論文,該論文解決了一個更簡單的問題:度量模型對於明確注入到模型的訓練集中的特定句子(例如信用卡號)的記憶能力。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"相反,我們的目的是提取語言模型記住的自然數據。這個問題更具挑戰性,因爲我們事先不知道要尋找哪種文本。也許模型記憶了信用卡號,或者記憶了整個書本段落,甚至是代碼段。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"請注意,由於大型語言模型的過擬合程度最小(它們的訓練損失和測試損失幾乎相同),因此我們知道,記憶一旦發生,必定是一種罕見的現象。我們的論文 介紹瞭如何使用以下兩步“提取 - 攻擊”找到此類示例:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先,我們通過與 GPT-2 進行交互的方式生成大量樣本,在交互過程中 GPT-2 被當成是一個黑盒(即,我們向其提供簡短提示並收集其生成的樣本)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"其次,我們保留生成的那些可能性異常高的樣本。例如,我們保留了 GPT-2 比其他語言模型(例如較小的 GPT-2 
[Image: https://static001.geekbang.org/wechat/images/ce/ced0d09b2028301d6396b4293db6cab7.png]

We queried GPT-2 with three different sampling strategies, generating 600,000 samples in total. Each sample contains 256 tokens, or roughly 200 words on average. From these, we selected 1,800 samples with abnormally high likelihood for manual inspection. Of those 1,800 samples, we found that 604 contain text copied verbatim from the training set.

Our paper shows that some instantiations of the above extraction attack can reach 70% precision in identifying rare memorized data. In the rest of this post, we focus on what we found among the model's memorized outputs.

Problematic data memorization

We were surprised by the diversity of the memorized data. The model re-generated news headlines, Donald Trump speeches, pieces of software logs, entire software licenses, snippets of source code, passages from the Bible and the Quran, the first 800 digits of pi, and much more!

The figure below summarizes some of the most prominent categories of memorized data.

[Image: https://static001.geekbang.org/wechat/images/c6/c6ad9d556bbe866aa0ad6e89e1ed0936.png]

While some forms of memorization are fairly benign (for example, the memorized digits of pi), others are much more problematic. Below, we showcase the model's ability to memorize personally identifiable data and copyrighted text, and we discuss the still-undetermined legal ramifications of such behavior in machine learning models.

Memorizing personally identifiable information

Recall GPT-2's intimate knowledge of Peter W. An Internet search shows that Peter's information is available on the Web, but only on six professional pages.

[Image: https://static001.geekbang.org/wechat/images/f7/f71259d71270209d1e717f284ba931e3.png]

Peter's case is not unique: about 13% of the memorized examples contain names or contact information (emails, Twitter handles, phone numbers, etc.) of individuals or companies. While none of this personal information is "secret" (anyone can find it online), its inclusion in a language model still raises many privacy concerns. In particular, it might violate user-privacy regulations such as the GDPR, as described below.
Violating contextual integrity and data security

When Peter put his contact information online, it had an intended context of use. Unfortunately, applications built on top of GPT-2 are unaware of this context, and might therefore share Peter's data in ways he did not intend. For example, a customer-service chatbot might inadvertently output Peter's contact information.

Worse, we found many cases where GPT-2 generated memorized personal information in contexts that can be deemed offensive or otherwise inappropriate. In one instance, GPT-2 generated a fictitious IRC conversation about transgender rights between two real users. A redacted excerpt is shown below:

[2015-03-11 14:04:11] ------ or if you're a trans woman
[2015-03-11 14:04:13] ------ you can still have that
[2015-03-11 14:04:20] ------ if you want your dick to be the same
[2015-03-11 14:04:25] ------ as a trans person

The specific username in this conversation appears only twice on the entire Web, both times in private IRC logs that were leaked online as part of the GamerGate harassment campaign.

In another case, the model generated a news story about the murder of M. R. (a real event); however, GPT-2 incorrectly attributed the murder to A. D., who was in fact a murder victim in an unrelated crime.

"A— D—, 35, was arrested after police found the body of his wife, and was indicted by a grand jury in April, M— R—, 36, daughter"

These examples illustrate how personal information in a language model can be far more problematic than in systems with a narrower scope. Search engines, for example, also scrape personal data from the Web, but they only output it in a well-defined context (the search results). Misusing personal data can raise serious legal issues; the EU's GDPR, for instance, states:

"personal data shall be [...] collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes [...] and processed in a manner that ensures appropriate security of the personal data"

Memorizing personal data arguably does not constitute "appropriate security", and there is an argument that implicitly including the data in the outputs of downstream systems is incompatible with the original purpose of data collection, namely modeling general language.

Beyond violations of data-misuse regulations, misrepresenting people's personal information in inappropriate contexts also touches on existing privacy laws that guard against defamation or "false light" torts. Likewise, misrepresenting company or product names could violate trademark law.
Invoking the "right to be forgotten"

The data misuses above could compel individuals to request that their data be removed from the model. They might do so by invoking emerging "right to be forgotten" laws, such as the EU's GDPR or California's CCPA. These laws allow individuals to request that their personal data be deleted from online services such as Google Search.

[Image: https://static001.geekbang.org/wechat/images/14/149bd5a87d1dcc41964ca44faa642ad6.png]

How these regulations apply to machine learning models is a legal gray area. For example, can a user ask to have their data removed from a model's training data? And if such a request were granted, would the model have to be retrained from scratch? The fact that models can memorize and misuse personal information certainly makes the case for data deletion and retraining more compelling.

Memorizing copyrighted data

Another type of content the model memorizes is copyrighted text.

Memorizing books

Our first example actually comes from GPT-3, a model 100 times larger than GPT-2. Our paper shows that larger language models memorize more data, so we expect GPT-3 to memorize more still.
Below, we prompt GPT-3 with the beginning of Chapter 3 of Harry Potter and the Philosopher's Stone. The model correctly reproduces a full page of the book (about 240 words) before making its first mistake:

"The escape of the Brazilian boa constrictor earned Harry his longest-ever punishment. By the time he was allowed out of his cupboard again, the summer holidays had started and Dudley had already broken his new video camera, crashed his remote control airplane, and, first time out on his racing bike, knocked down old Mrs. Figg as she crossed Privet Drive on her crutches.

Harry was glad school was over, but there was no escaping Dudley's gang, who visited the house every single day. Piers, Dennis, Malcolm, and Gordon were all big and stupid, but as Dudley was the biggest and stupidest of the lot, he was the leader. The rest of them were all quite happy to join in Dudley's favorite sport: Harry Hunting.

This was why Harry spent as much time as possible out of the house, wandering around and thinking about the end of the holidays, where he could see a tiny ray of hope. When September came he would be going off to secondary school and, for the first time in his life, he wouldn't be with Dudley. Dudley had been accepted at Uncle Vernon's old school, Smeltings. Piers was going there too. Harry, on the other hand, was going to Stonewall High, the local comprehensive school. Dudley thought this was very funny.

'They stuff people's heads down the toilet the first day at Stonewall,' he told Harry. 'Want to come upstairs and practice?'

'No, thanks,' said Harry. 'The poor toilet's never had anything as horrible as your head down it; it might be sick.'"

Memorizing source code

Language models also memorize other types of copyrighted data, such as source code. For example, GPT-2 can output 264 lines of code from the Bitcoin client (with six minor mistakes). Below is one function that GPT-2 reproduces perfectly:

[Image: https://static001.geekbang.org/wechat/images/0f/0f40248385fcc68ee81e2951f8d2efe7.png]

We also found at least one example where GPT-2 can reliably output an entire file. The document in question is a configuration file for the game "Dirty Bomb". The file contents appear to have been memorized by GPT-2 via an online diff checker. When prompted with the file's first two lines, GPT-2 outputs the remaining 1,446 lines verbatim (with a character-level match above 99%).
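A character-level match like the one reported here can be computed with the Python standard library alone. In the sketch below, `generate_fn` is a hypothetical callable wrapping the model's generation; the two-line prompt and the 99% threshold mirror the experiment described above:

```python
# Hypothetical check for near-verbatim reproduction: prompt the model with
# the start of a document and compare its continuation to the real remainder.
import difflib

def char_match(reference: str, generated: str) -> float:
    """Character-level similarity in [0, 1] (1.0 = identical strings)."""
    return difflib.SequenceMatcher(None, reference, generated).ratio()

def is_memorized(document: str, generate_fn, prompt_lines: int = 2,
                 threshold: float = 0.99) -> bool:
    """Feed the first `prompt_lines` lines to the (placeholder) `generate_fn`
    and test whether the continuation matches the rest of the document."""
    lines = document.splitlines(keepends=True)
    prompt = "".join(lines[:prompt_lines])
    remainder = "".join(lines[prompt_lines:])
    continuation = generate_fn(prompt)
    return char_match(remainder, continuation[:len(remainder)]) >= threshold
```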
These are just a few of the many instances of copyrighted content that the model memorized from its training set. Moreover, note that while books and source code usually carry explicit copyright licenses, the vast majority of Internet content is also automatically copyrighted under U.S. law.

Does training language models infringe copyright?

Given that language models can memorize and output copyrighted content, does that make them guilty of copyright infringement? The legality of training models on copyrighted data has been a topic of debate among legal scholars (see, e.g., "Fair Learning", "Copyright for Literate Robots", and "Artificial Intelligence's Fair Use Crisis"), with arguments both for and against characterizing machine learning as "fair use".

The issue of data memorization certainly plays a role in this debate. Indeed, in response to a request for comments from the U.S. Patent Office, multiple parties argue that machine learning should be characterized as fair use, in part because machine learning models are assumed not to emit memorized data.

For example, the Electronic Frontier Foundation writes:

"to the extent that a work is produced with a machine learning tool trained on a large number of copyrighted works, the degree of copying with respect to any given work is extremely small"

OpenAI makes a similar argument:

"well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus"

Yet, as our work demonstrates, large language models are certainly able to produce large portions of memorized copyrighted data, including, in some cases, entire documents.

Of course, the parties above do not rest their fair-use defense solely on the assumption that models do not memorize their training data, but our findings clearly weaken this line of argument. Ultimately, the answer may depend on how a language model's outputs are used. For example, a model outputting a memorized page of "Harry Potter" in a downstream creative-writing application points to a much clearer case of copyright infringement than the same content being output by a translation system.

Mitigations

We have seen that large language models have a remarkable ability to memorize rare snippets of their training data, which leads to a number of problems. So how can we prevent such memorization from happening?

Differential privacy may not save the day

Differential privacy is a well-established, formally defined notion of privacy that seems like a natural solution to data memorization. In essence, training with differential privacy guarantees that the model will not leak any individual record from its training set.
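For intuition about what this guarantee involves mechanically, here is a minimal sketch of DP-SGD training with the Opacus library (assuming opacus >= 1.0). The toy linear model, data, and hyperparameters are illustrative placeholders; differentially private training of a full language model is substantially more involved:

```python
# Minimal DP-SGD sketch with Opacus. Every record's gradient contribution is
# clipped and noised, bounding what the model can learn from any one record.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Linear(16, 2)  # stand-in for a real language model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

model, optimizer, loader = PrivacyEngine().make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,  # std of Gaussian noise added to summed gradients
    max_grad_norm=1.0,     # per-example gradient clipping bound
)

loss_fn = nn.CrossEntropyLoss()
for features, labels in loader:
    optimizer.zero_grad()
    loss_fn(model(features), labels).backward()  # per-example grads clipped + noised
    optimizer.step()
```

Note that the guarantee is defined per training record, which is exactly why the two obstacles below arise for Web-scraped corpora.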
However, applying differential privacy in a principled and effective way to prevent memorization of Web-scraped data appears challenging. First, differential privacy does not prevent memorization of information that recurs across a large number of records. This is especially problematic for copyrighted works, which may appear thousands of times across the Web.

[Image: https://static001.geekbang.org/wechat/images/a3/a3334f8f3b8d30a409299a56ef8cb827.png]

Second, even when some records appear only a few times in the training data (for example, Peter's personal data appears on just a few pages), applying differential privacy in the most effective way would require aggregating all of those pages into a single record and providing a per-user privacy guarantee over the aggregated record. It is unclear how to perform this aggregation effectively at scale, especially since a single Web page may contain personal information about many different individuals.

Sanitizing the Web is hard too

An alternative mitigation strategy is to simply remove personal information, copyrighted data, and other problematic training data. This too is hard to apply effectively at scale. For example, we might want to automatically remove mentions of Peter W.'s personal data, while keeping mentions of personal information that counts as "general knowledge", such as the biography of a U.S. president. A naive scrubber is sketched below.
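This sketch masks emails and phone-like numbers with regular expressions; the patterns are illustrative and deliberately crude, to show why pattern-matching alone cannot make the "general knowledge" distinction:

```python
# Naive PII scrubber: masks email addresses and phone-like digit runs.
# The patterns are illustrative; real pipelines need NER, allowlists for
# public figures, and per-jurisdiction rules.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d ()-]{7,}\d")  # crude: catches many non-phones too

def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

print(scrub("Contact Peter W. at peter@example.com or +1 (555) 123-4567."))
# -> "Contact Peter W. at [EMAIL] or [PHONE]."
# A scrubber like this cannot tell Peter's number apart from, say, a public
# government switchboard number, which is general knowledge.
```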
Curated datasets as a path forward

If neither differential privacy nor automated data sanitization solves our problems, what are we left with?

Perhaps training language models on data from the open Web is a fundamentally flawed approach. Given the numerous privacy and legal concerns that arise from memorizing Internet text, in addition to the many undesirable biases that Web-trained models perpetuate, the way forward may be better curation of the datasets used to train language models. We posit that if even a small fraction of the millions of dollars invested in training language models were instead put into collecting better training data, significant progress could be made in mitigating language models' harmful side effects.

The paper, Extracting Training Data from Large Language Models, by Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel, is available at: https://arxiv.org/abs/2012.07805