How Do AI Systems Identify Duplicate Data?

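{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When you compare two Salesforce records, or records from any other CRM, side by side, it is easy to tell whether they are duplicates. With 100,000 such records, however, sifting through them one by one and making that comparison is practically impossible. That is why many companies have built tools to automate the process. To do the job well, the machine has to recognize the similarities and differences between records. In this article, we take a closer look at some of the methods data scientists use to train machine learning systems to identify duplicates."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How Do Machine Learning Systems Compare Records?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"One of the main tools researchers use is the string metric. A string metric takes two strings from the data and returns a low value if they are similar and a high value if they are different. How does this work in practice? Let's look at the following two records:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"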

First Name

Last Name

Email

Company Name

Ron 

Burgundy

[email protected]

Acme

Ronald

burgundy

[email protected]

Acme Corp"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"To a human, these two records are obviously duplicates. Machines rely on string metrics to reproduce that human reasoning, which is what artificial intelligence amounts to here."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"One of the best-known string metrics is the "},{"type":"link","attrs":{"href":"https:\/\/baike.baidu.com\/item\/%E6%B1%89%E6%98%8E%E8%B7%9D%E7%A6%BB\/475174","title":"xxx","type":null},"content":[{"type":"text","text":"Hamming distance"}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", which counts the substitutions needed to turn one string into another of the same length. For example, going back to the two records above, a single substitution turns “burgundy” into “Burgundy”, so the Hamming distance is 1."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Many other string metrics can measure the similarity of two strings; what distinguishes them is the set of operations they allow. The Hamming distance mentioned above allows only substitutions, which means it can only be applied to strings of equal length. The edit distance (Levenshtein distance), by contrast, allows deletions, insertions, and substitutions."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"How Is Duplicate Salesforce Data Eliminated?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"AI systems can deduplicate Salesforce data in a number of ways. One of them is "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"blocking"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":", shown below:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

Record 1

Record 2

Ron Burgundy, [email protected], Acme

Ronald burgundy, [email protected], Acme Corp"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Blocking is what makes the approach scalable. Whenever you upload new records to Salesforce, the system automatically groups records that look “similar” into blocks, for example records that share the first three letters of the first name, or that match on any other criterion."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"This reduces the number of comparisons that have to be made. Suppose, for example, that you have 100,000 records in Salesforce and want to upload an Excel sheet containing another 50,000. A traditional rule-based deduplication application has to compare every new record against every existing one, which comes to 5 billion (100,000 x 50,000) comparisons. Imagine how long that takes and how much it raises the chance of error."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Bear in mind, too, that 100,000 records is a fairly modest number for Salesforce. Many organizations hold hundreds of thousands or even millions of records, and the traditional approach scales poorly at that size."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Records can also be compared field by field:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"embedcomp","attrs":{"type":"table","data":{"content":"

 

Record 1

Record 2

First Name

Ron

Ronald

Last Name

Burgundy

burgundy

Email

[email protected]

[email protected]

Company

Acme

Acme Corp"}}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Once the system has grouped “similar” records into a block, it goes on to analyze each record field by field. This is where all the string metrics discussed earlier come into play."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In addition, the system assigns each field a specific “weight”, or importance. Suppose, for example, that the “Email” field matters most for your data set. You can tune the algorithm yourself, or the system can learn the right weights automatically as you label records as duplicates (or not duplicates). This is known as active learning, and it is the preferable approach because the system can work out precisely how important one field is relative to another."}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"What Are the Advantages of the Machine Learning Approach?"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The biggest benefit of machine learning is that it does all the work for you. Active learning automatically assigns each field the weight it needs, so there is no complex setup process and there are no rules to create."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Consider the following scenario. A sales rep discovers a duplicate record and reports it to the Salesforce admin, who creates a rule to prevent that kind of duplicate in the future. The same process has to be repeated every time a new kind of duplicate is discovered, which makes it unsustainable."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"We should also remember that the deduplication features built into Salesforce are rule-based as well, and quite limited. For example, you can merge only three records at a time, custom objects are not supported, and there are many other restrictions."}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Machine learning is the smarter approach because rule creation is merely automation, whereas artificial intelligence and machine learning try to reproduce the human thought process. Another "},{"type":"link","attrs":{"href":"https:\/\/datagroomr.com\/machine-learning-vs-automation-whats-the-difference\/","title":null,"type":null},"content":[{"type":"text","text":"article"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" discusses the differences between machine learning and automation in more detail. There is little point in choosing a deduplication product that simply extends Salesforce's functionality without fixing the process as a whole. That is why the machine learning approach is the best one."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"About the Author"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/dzone.com\/users\/4398324\/ildudkin.html","title":null,"type":null},"content":[{"type":"text","text":"Ilya Dudkin"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" is a business development manager at Softwarium."}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"Original Article"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/dzone.com\/articles\/how-do-ai-systems-identify-duplicate-data","title":null,"type":null},"content":[{"type":"text","text":"How Do AI Systems Identify Duplicate Data?"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}]}
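
The pipeline the article describes (string metrics, blocking, then a weighted field-by-field comparison) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the function names, the blocking rule (first three letters of the first name), the hand-picked weights, and the placeholder email address are all hypothetical. An active-learning system would learn the weights from labeled pairs rather than hard-code them.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to 0..1 (1.0 = identical), ignoring case."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def block_key(record: dict) -> str:
    """Hypothetical blocking rule: first three letters of the first name."""
    return record["first_name"].strip().lower()[:3]


def candidate_pairs(records):
    """Yield only pairs that share a block, instead of every possible pair."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


# Illustrative weights only; the article's point is that these can be learned.
WEIGHTS = {"first_name": 0.2, "last_name": 0.3, "email": 0.4, "company": 0.1}


def score(r1: dict, r2: dict) -> float:
    """Weighted sum of per-field similarities, between 0 and 1."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


# The email addresses are placeholders, not the ones from the article's table.
records = [
    {"first_name": "Ron", "last_name": "Burgundy",
     "email": "ron@acme.example", "company": "Acme"},
    {"first_name": "Ronald", "last_name": "burgundy",
     "email": "ron@acme.example", "company": "Acme Corp"},
    {"first_name": "Veronica", "last_name": "Corningstone",
     "email": "veronica@acme.example", "company": "Acme"},
]

for left, right in candidate_pairs(records):
    print(left["first_name"], right["first_name"], round(score(left, right), 3))
```

With these three records only one pair is ever scored, because the third record falls into a different block. That is the saving blocking is meant to deliver when the record counts run into the hundreds of thousands.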
