How Smart Annotation Works: Understanding How AI Solves the Data Labeling Problem

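{"type":"doc","content":[
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Whether in traditional machine learning or in today's much-discussed deep learning, supervised learning, which trains on samples with explicit labels or outcomes, remains a primary way of training models. Deep learning in particular needs more data to improve model performance. Several large public datasets already exist, such as ImageNet and COCO."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"For deep learning beginners, these public datasets are a great help. Most enterprise developers, however, especially in fields such as medical imaging, autonomous driving, and industrial quality inspection, need to build custom AI models from real business data in their own domain so that the models work well in production. Therefore, "},{"type":"text","marks":[{"type":"strong"}],"text":"collecting and annotating business-scenario data is an indispensable part of real-world AI model development."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The quality and scale of data annotation are usually key factors in improving AI model performance. Yet building a high-quality, large-scale domain-specific dataset entirely by hand is not easy: training annotators and labeling manually is costly and slow. To address this, we can use active learning within a “Human-in-the-loop” interactive framework (Figure 1) to annotate data, which effectively reduces the amount of manual annotation."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f6/f6d8936565612db6c1261a4ac8d52bab.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 1: “Human-in-the-loop” interactive data annotation framework based on active learning"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Active Learning (AL) is an effective way to select highly informative data; it frames data annotation as an interaction between a learning algorithm and the user."},{"type":"text","text":" The algorithm selects the samples that are most valuable for training the AI model, and the user annotates the selected samples. In the “Human-in-the-loop” framework, an AI model is first trained on the portion of data the user has already annotated and is then used to label the remaining data. The samples the model finds hard to label are filtered out for manual annotation, and those samples are then used to optimize the model. After a few rounds, the annotation model reaches higher accuracy and labels data more reliably."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Take image classification as an example. First, a portion of the images is manually selected and annotated to train an initial model. The trained model then predicts labels for the remaining unannotated data, and a “query method” from active learning picks out the samples whose class the model finds hard to distinguish. A human corrects the labels of these “hard” samples, which are added to the training set to fine-tune the model again. "},{"type":"text","marks":[{"type":"strong"}],"text":"The “query method” is central to active learning; the most common ones are uncertainty-based and diversity-based sample query strategies."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"An uncertainty-based query strategy finds the samples that lie close to the decision boundary when the deep learning model makes predictions. In a binary classification problem, for example, an unannotated sample predicted at 50% for each label is “uncertain” to the model and is very likely to be misclassified. Note that "},{"type":"text","marks":[{"type":"strong"}],"text":"active learning is an iterative process: in each iteration the model is fine-tuned on the human-corrected annotations, which directly shifts the model's decision boundary and improves classification accuracy."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"A diversity-based query strategy finds samples whose state is unknown to the current deep learning model. Adding the samples it selects to the training set enriches the feature combinations in the training set and improves generalization. The richer the data features a model has learned, the stronger its generalization and the wider the range of scenarios it can handle."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"To ease the pain of annotating large volumes of data, a smart annotation AI solution based on active learning and combining multiple query strategies was developed. With smart annotation on the EasyDL platform, "},{"type":"text","marks":[{"type":"strong"}],"text":"developers only need to annotate about 30% of a dataset before starting smart annotation, which labels the remaining data automatically in the EasyDL backend, "},{"type":"text","text":"and then returns a small amount of data the backend is unsure about for another round of manual annotation, which also improves the accuracy of automatic labeling. In real project tests, after a few rounds, smart annotation saved users 70% of the annotation workload, greatly reducing the labor and time cost of data annotation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"EasyDL is a zero-barrier AI development platform that offers enterprise developers "},{"type":"text","marks":[{"type":"strong"}],"text":"end-to-end capabilities including smart annotation, model training, and service deployment, "},{"type":"text","text":"providing a convenient and efficient platform solution for the tedious work involved in AI model development. EasyDL comes in three editions for different audiences: Classic, Professional, and Industry. The Professional edition supports in-depth development of high-accuracy business models and includes a rich set of large-scale pre-trained models, so strong results can be achieved with only a small amount of data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Currently, "},{"type":"text","marks":[{"type":"strong"}],"text":"EasyDL's smart annotation supports data annotation in two areas: object detection models in computer vision (CV) and text classification models in natural language processing (NLP)."},{"type":"text","text":" Choose model customization in EasyDL Professional and click “Smart Annotation” to get started. Usage is simple and takes three steps:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"Step 1: Start smart annotation"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After uploading the training dataset under “Data Management/Annotation”, the “Create Smart Annotation Task” button becomes active (Figure 2). Clicking it opens dataset selection. Note that the system automatically validates the selected dataset. The validation rules are:"}]},
{"type":"bulletedlist","content":[
{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Image datasets: "},{"type":"text","text":"every label must have more than 10 annotated bounding boxes."}]}]},
{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Text datasets: "},{"type":"text","text":"more than 600 annotated items in the dataset; more than 50 items for each label; and more than 600 unannotated items."}]}]},
{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"zerowidth"}]}]}
]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Image and text datasets are validated under different rules because, in practice, the two differ considerably in how data is obtained and in dataset size, and the backend AI models for smart annotation need different numbers of samples to start training."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Click “Start Smart Annotation” to enter the validation stage. If validation fails, a “Smart annotation failed to start” message appears; if it passes, the task moves to the data filtering stage, and the user needs to wait briefly."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/3b/3b28479d11a000ab1b42d6be57a313f9.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 2: Creating a smart annotation task"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"Step 2: Annotate part of the data"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"The system automatically selects from the unannotated data the samples that are most representative and most in need of priority annotation."},{"type":"text","text":" The user annotates these recommended samples by hand. To speed this up, the system also provides pre-annotations for the user to edit and confirm. In image smart annotation, the user checks “Show Pre-annotation” in the upper-right corner to enable this aid (Figure 3) and clicks “Satisfied with Pre-annotation Result” to confirm a pre-annotation. In text smart annotation, pre-annotated labels are shown automatically; click “Confirm” to the right of each text item, or “Confirm All on This Page” in the upper-right corner, to confirm them (Figure 4)."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"After confirming the pre-annotations for all recommended data, the user can choose whether to run another round of data filtering. In image smart annotation, the system proceeds to the next round automatically unless the user stops smart annotation. In text smart annotation, text datasets are usually large and confirming pre-annotations is labor-intensive, so to improve the user experience the system does not enter the next iteration by default; the user can click “Optimize Smart Annotation Results” in the upper-right corner to start the next round (Figure 5). Pre-annotation accuracy keeps improving over multiple rounds. To ensure annotation quality, users are advised to run at least one round of data filtering or “Optimize Smart Annotation Results”."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/e6/e604f713a2b7248c3fe287bd3ccca204.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 3: Image smart annotation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/45/45bb1bf09fd2575b6488896ca5ae37a1.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 4: Text smart annotation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/0e/0edb4d23e9e978235e50ba4fe510887c.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 5: Text smart annotation entering the data filtering and optimization iteration"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}},{"type":"strong"}],"text":"Step 3: Finish smart annotation"}]}]}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"Smart annotation enters its final stage when the user considers the pre-annotations of the currently recommended data accurate enough and skips the next round of recommendation and filtering, or when the system determines automatically that enough data has been annotated."},{"type":"text","text":" In image smart annotation, the system shows a prompt (Figure 6): choosing “One-click Annotation” makes the system label all remaining unannotated data automatically, while choosing “Train Now” stops smart annotation so that the confirmed annotations can be used to train a model. In text smart annotation, not choosing “Optimize Smart Annotation Results” is treated as stopping smart annotation; the system automatically labels all unannotated data and files it under the “Annotated · Smart” dataset, which can be used for model training alongside “Annotated · Manual” data."}]},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/ff/ff9d302fc781760c5cc9e2f836cd20d0.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 6: Finishing image smart annotation"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c8/c83f25d379ada9bf18a0e86b5ddad229.jpeg","alt":null,"title":null,"style":null,"href":null,"fromPaste":true,"pastePass":true}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Figure 7: EasyDL smart annotation workflow"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"With smart annotation, "},{"type":"text","marks":[{"type":"strong"}],"text":"repetitive, tedious annotation work is handed to the AI model, greatly reducing time and labor costs."},{"type":"text","text":" On the data side, EasyData, the smart data service platform within EasyDL, provides one-stop data processing covering collection, cleaning, annotation, and processing. It connects seamlessly to model training and supports efficient model iteration through its data closed-loop feature."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Search for “EasyDL” on Baidu to try smart annotation and build your own high-accuracy business model!"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":"br"}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}
]}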
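The uncertainty-based query strategy described in the article can be sketched in a few lines of Python. This is only a minimal illustration, not EasyDL's actual implementation: the sample IDs, the predicted probabilities, the helper names `entropy` and `query_most_uncertain`, and the choice of Shannon entropy as the uncertainty score are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def query_most_uncertain(
    predictions: Dict[str, Sequence[float]], budget: int
) -> List[str]:
    """Return the IDs of the `budget` samples the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda sample_id: entropy(predictions[sample_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples (binary classification).
predictions = {
    "img_001": [0.98, 0.02],  # confident prediction
    "img_002": [0.50, 0.50],  # right on the decision boundary
    "img_003": [0.65, 0.35],
    "img_004": [0.10, 0.90],
}

# Samples to hand to a human annotator in this round.
to_label = query_most_uncertain(predictions, budget=2)
print(to_label)  # ['img_002', 'img_003']
```

The 50/50 sample ranks first, matching the binary-classification example in the text. In a full loop, the returned samples would be corrected by a human, added to the training set, and used to fine-tune the model before the next round of queries.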