Compared with much-hyped AI, metadata may matter more

{"type":"doc","content":[
{"type":"blockquote","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In web search, metadata may gradually displace AI."}]}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Metadata is gradually displacing AI"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Metadata is data that describes other data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The greatest strength of metadata is its openness. With metadata you can easily write a parser for OpenGraph tags and find out what a page contains without any AI model or cloud computing. The barrier to obtaining metadata is also low: it can be retrieved quickly without much interaction or coordination."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Software engineer Cal Paterson argues that "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"in web search, metadata is gradually displacing AI"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":". More precisely, today's AI cannot by itself let a search engine find content and truly understand what it means; what actually does the work behind the scenes is metadata. Once a search engine has found a page, the webmaster needs to supply rich metadata to help the search engine understand the page quickly."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Take Google as an example. Google has always crawled pages across the whole web, but experience has shown that its crawl results can be dismal even for 20 simple websites. In other words, conventional crawling cannot cope with the endless supply of web resources."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"So in 2005 Google introduced the Sitemaps standard, which lets webmasters submit a list of pages directly. With Sitemaps, Google is informed whenever a site is updated, which makes indexing easier and greatly reduces Google's crawling burden. Even so, although most sites now provide a sitemap file, these XML files of page links are so large that Google still has to offer dedicated tools to help webmasters debug specific problems."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"For example, a piece of metadata added to a page lets Google determine which of two similar pages is the true original, so that Google is not misled by links and shows the correct page in its search results."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Google also uses metadata to identify page authors. It previously launched Google+ and encouraged webmasters to provide the Google+ profile of the current page's author. After Google+ was abandoned, Google turned to reading metadata from Facebook's OpenGraph specification to handle content outside its main search results (for example, news stories shown to Android users). For other data, Google parses JSON-LD metadata tags, “microformats”, and other signals."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Although Google has text-analysis capabilities today, its core advantage over other search engines lies not in stronger natural language processing but in its skilled use of metadata: it uses backlinks as a proxy for popularity."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"PageRank, for example, does not really look at a page's actual content. Its approach is much blunter: the more often a page is linked to from other pages, the stronger its standing. This, too, relies on metadata."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Metadata is no “cure-all”, however: a search engine can make full use of it only when the metadata is accurate. Today much metadata comes from neutral third parties, such as public records or a weighted aggregate of many unrelated data points. Google's display of Wikipedia data on its results pages is one such case, and PageRank works on the same basis."}]},
{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"The AI myth is wearing thin"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Clearly, today's AI is not that intelligent when it comes to web search. Does it do any better at answering search queries and returning document content? Again, the answer is no."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Today AI can extract structured data from the infobox on the right-hand side of Wikipedia pages, but the resulting “sidebar extraction” and “zero-click results” have sharply reduced visits to the original Wikipedia pages."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As for search results, what AI assembles consists largely of low-quality, attention-grabbing clickbait from commercial websites. The authors of these sites tend to be experts at designing page metadata, and they know exactly how to exploit Google's algorithms to build traffic for their own sites. To reach the pages they actually want, users have to resort to more elaborate queries, such as adding a site name to the search."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Although Google claims that “webmasters should set metadata aside and focus on the content itself”, such sloganeering does not solve the problem. It only misleads users into believing that today's search engines can solve many problems and that metadata plays merely a supporting role. In fact, metadata outperforms AI not only in web search but also in some faster-moving fields."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Take government surveillance of online activity. Law enforcement agencies prefer to track metadata about emails, voice calls, and chat messages rather than use AI models to analyze, one by one, what people are saying. Likewise, many people assume that self-driving cars read road signs to work out the speed limit on a given road, but most self-driving systems simply look up the speed limit using the current GPS coordinates. In other words, as long as the mobile app and satellite navigation work well together, we get this “intelligent” experience."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"People are racing toward ever more powerful AI, and everything looks rosy. The awkward part is that once data scientists release the AI models they worked so hard to build, people keep turning to metadata to make the models' reasoning more reliable and more explainable. Sometimes a single tag is worth half an hour of AI model computation. The real world is just that surreal."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Reference:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/calpaterson.com\/metadata.html","title":null,"type":null},"content":[{"type":"text","text":"https:\/\/calpaterson.com\/metadata.html"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]}]}
]}