The Next Stop for Big Data Search: Vector Retrieval Today, Its Challenges, and Its Future

Original

2021-05-31 16:03

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Authors | Cai Fangfang (蔡芳芳), Luo Yanshan (羅燕珊), Chu Xingjuan (褚杏娟)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Guest | Hechong (鶴衝)"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The broad adoption of deep learning has transformed traditional data search. Traditional search mostly relies on deterministic methods such as keyword matching and emphasizes finding relevant results. AI has strengthened our ability to process unstructured data (speech, language, images, video, and so on), shifting search from single-modal to multimodal, from keyword retrieval to vector retrieval, and from deterministic to similarity-based retrieval."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Vector retrieval is an emerging technical field worth watching. Proxima, the vector retrieval engine developed in-house at DAMO Academy, already provides the search technology behind many important Alibaba businesses, such as Taobao recommendations, Youku video search, and City Brain."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"How has search over massive data evolved to where it is today? Why is vector retrieval needed? What are the strengths and weaknesses of the mainstream tools in this field? "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"On May 19, Xiao Yunfeng (肖允鋒, alias Hechong), a senior technical expert at Alibaba DAMO Academy, joined "},{"type":"link","attrs":{"href":"https:\/\/www.infoq.cn\/video\/WvEN9x6YGzROSDTpjkKV?utm_source=home_video&utm_medium=article","title":null,"type":null},"content":[{"type":"text","text":"大咖說"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":" to share the current state, challenges, and future directions of vector retrieval, and to explain how to choose a technical solution by comparing the tools available today."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The following comes from that session, edited by InfoQ without changing the original meaning:"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: You have worked on big data search research and applications for more than ten years. Could you first walk us through the history of big data search? As data volumes exploded and search needs changed, how did search technology evolve? Which key technologies and open-source projects stood out at each stage?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" What I remember most is the 2011 China Hi-Tech Fair in Shenzhen, whose theme was big data and cloud computing. It was something of a milestone, because that was probably the year big data and cloud computing became widely recognized."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Back then it was mainly the big internet companies that had big data workloads. Once data gets large, the question becomes how to compute over it with the resources available, starting with AWS and the like. Several major things happened around 2010. Google had published the core ideas behind its big data technology fairly early, including storage and, once storage was in place, how to run MapReduce computation on top of it. That gave us a processing model for big data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Looking at that period, the first problem solved was how to store big data. Large volumes of data used to live in databases, that is, relational databases such as the familiar MySQL, Microsoft SQL Server, and IBM DB2. It later became clear that purely relational databases could not cover every big data workload. And since, as I just said, Google had published its ideas, the open-source community could keep building around them, which gave us Hadoop and HDFS. With that, storage and computation together laid the foundation for big data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Because the old relational databases could not cover everything, the concept of NoSQL emerged to set the new systems apart. Among document-oriented stores, MongoDB is probably the most widely used. Later, as data kept growing and the original Hadoop computing model could no longer keep up with the big internet companies, faster engines appeared, including Spark, which people commonly use today, and so-called stream computing, with stream and batch processing still being reconciled. When big data processing then moved toward unification, frameworks like Flink appeared, which are largely about unifying stream and batch computation. Now big data faces yet another transformation, with the rise of "},{"type":"text","marks":[{"type":"color","attrs":{"color":"#4d5156","name":"user"}}],"text":"companies"},{"type":"text","text":" such as Snowflake. Their focus is on how to do this in a cloud-native way: users no longer need to care how storage and computation work underneath. Put the workload on the cloud and it scales out, purchases resources, and scales in automatically. The question now is how to integrate deeply with existing infrastructure, and that is what big data processing and data products are facing."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How did vector retrieval come about?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" If you really trace it back, vector retrieval goes back to the 1970s and 1980s. We often say AI has gone through three waves, each driven by a leap in computing power. In the 1980s the typical tool was the kd tree, because people then already faced the retrieval problem. An image could be mapped to a vector, but computers were very slow when it came to actually searching, so methods such as the kd tree were developed."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Dimensionality was not very high back then, perhaps a few dozen dimensions. As the field developed, people found that many of the earlier vector search methods, such as the kd tree and some quantization methods, were not as efficient in practice as had been shown in theory, and that their behavior depended heavily on the data distribution. There is a well-known problem in the field called the “curse of dimensionality.” As dimensionality rises, from three to four, five, six, to hundreds or even tens of thousands of dimensions, as can happen in NLP, the space becomes sparse, and retrieval efficiency in a sparse space drops very quickly. High dimensionality is a given, so new methods are needed, and the question is whether quantization or clustering can help. Broadly, the methods fall into three categories."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Vector retrieval matters mainly when the dataset is very large; with a small dataset there is no real search problem. For a very large dataset we do not want to examine every item, so is there a path that leads quickly to the data we want? That gives the first category: space-partitioning methods."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Mathematics offers another elegant option: transformation. A set of vectors stacked together is a matrix, and together they span a space. Can we transform that matrix, for example with a linear transformation, or by constructing some matrix? This approach is called space transformation. Put simply, it maps all those vectors into a smaller space."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"In the smaller space, vectors that had several hundred dimensions are mapped to a few dozen, so computation is faster. Methods of this kind include LSH (locality-sensitive hashing) and some quantization methods, and we can group them under the idea of space mapping, or space transformation."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Later, new methods appeared. Here is an analogy. Vector search is often about finding neighbors. If I am your neighbor and you are his neighbor, then he is probably my neighbor too, so having found you, I can check him next. In short, we store these neighbor relationships up front, they form a graph, and we walk the graph to reach the target as quickly as possible. This third category is the graph-based methods, which preserve the actual neighborhood structure of the data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How does this relate to traditional search?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" By traditional you probably mean search engines. The technical term is the inverted index: given a keyword, I look up the documents that contain it. In traditional search a keyword is either present or not. Either the matching documents are all returned, or nothing is found. So it is essentially a deterministic lookup. There are also Boolean queries, where I want this term, not that term, but also another term. We call those queries with combined conditions."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Combining conditions this way looks like added logic, but the result is just as deterministic as before: either a document matches or it does not. Vectors are different because they are mathematical representations. In the ideal case, every entity, such as an image or even elements within an image, can be expressed mathematically, treated as a point in space, and searched for in that space. There is a step before that, known as embedding: turning so-called unstructured data, such as images, text, speech, and video, into vectors that can be expressed mathematically. It is widely used in search and recommendation today."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"But this conversion is not exact. Traditional search is deterministic: give me the keyword and I have the result. When converting unstructured data, take text as an example. When I say a sentence, people understand it in a fuzzy way, and AI tries to capture that fuzziness as well. So the embedding of a piece of text is not a 100% accurate representation; it carries fuzziness and error. Because of this, one option in vector search is to insist on finding the exact nearest vector, but the nearest is not always the best, while something reasonably near is usually reasonably good. So we can tolerate some error in the search. That creates a trade-off: if you accept, say, 1% error, the search can be made much faster and the user experience improves. In short, search used to be deterministic, and now it is similarity-based and somewhat probabilistic. That is the main difference."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Of course, we cannot simply discard the existing search stack. The two complement each other."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: What is the current state of vector retrieval research and adoption?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" A few years ago, before 2016, this was a niche area. Some methods had not seen a breakthrough, and deep learning only became widely used later. Data volumes were still small, and search and recommendation teams were only starting to experiment. By 2017 and 2018 the trend was unmistakable."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Techniques for mapping video, images, and speech to vectors keep maturing, and data keeps growing, especially now that short video is everywhere and video and image search produce huge volumes of data. So this area keeps gaining importance. The big internet companies have all run into the same need in recommendation. Under pressure from both search and recommendation, vector retrieval has become a critical link in the pipeline, whereas before, without a dedicated search step, brute-force matching may have been enough. Others have recognized this too. Many are working on the core technology, and some companies are building open-source projects or offering it as a service. On the core side, labs such as Facebook AI Research and Google's labs are active. In China, we at DAMO Academy are working on it as well and have built up a good deal of core technology."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Could you describe how it is being applied overall today?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" Search and recommendation: product recommendations when shopping, short-video recommendations, user recommendations. These did not use deep learning and AI as heavily in the past as they do now. Then there is image similarity search. Taobao has a camera button: take a photo and you can search with it. The image is vectorized, similar vectors are retrieved, and the results are then blended. Another example that many people use is Tmall Genie. Suppose you ask what the weather is like today. Traditional search would have to match the words against every possible phrasing. With vectorization, matching is fuzzy: ask the question another way and it will look for similar questions. If the system already holds a similar question about the weather, it finds the matching question and answer. So something that used to require an exact match now only needs a similar one."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"There are many more uses, including copyright protection. With video fingerprints and audio fingerprints you can check whether your video or audio has been pirated or duplicates someone else's. The range of applications is wide enough to cover essentially the whole AI field: wherever AI is involved and produces very large volumes of data, vector retrieval is used. One more example: when you enter a delivery address on Ele.me, you often type a few characters and the full address appears. The input is matched against a standard address library, entries similar to those few characters are found, and the match does not need to be exact."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: If we assess vector retrieval with the technology adoption lifecycle (innovators, early adopters, early majority, late majority, laggards), which stage is it in, and why?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" I think the technology has only just taken off and has yet to be rolled out at scale. It sits somewhere between research and mass adoption, because there are still hurdles to clear. Beyond solving some core technical problems, and given that it is mostly the big internet companies using it so far, we need to turn it into a better service with excellent usability. Users should be able to apply it at scale without caring what technology sits behind it. If users still have to know which retrieval or search method is being used, that holds the technology back. In the normal course of things, people in our field should gradually fade into the background."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: What technical problems still need to be overcome?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" As I said, the applications are very broad. There are many different AI models, and the vectors they produce have different data distributions. Each distribution has to be tuned for as far as possible; no single method retrieves well on everything, because, as mentioned, this is similarity-based, probabilistic search. If one method worked well everywhere there would be nothing left to research. That problem is still open: an engine that fits any data, works the moment it is deployed, and delivers very good results without any parameter tuning."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"We have done some research here and have seen a great many datasets. Alibaba's business is vast, we have tried all kinds of data internally, and in doing so we run into the same problems customers face."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"The other issue is data scale, where the main challenge today is cost. There are more and more models, and a single image may be represented by multiple vectors. Suppose an image contains many objects: with what we call semantic image recognition, each object is extracted and processed, so one image can map to many vectors. That causes a data explosion. Nobody has a truly mature solution in this area yet, and combining vector retrieval with big data and offline processing models such as MapReduce also remains to be solved."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How much are companies in China investing in research in this area?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" As far as I know, DAMO Academy may be the only one doing direct research on retrieval algorithms. Beyond the technology itself, we also support many business applications inside Alibaba. The core technology is the algorithms. What you see is only the search itself, but there are many layers on top: how to expose it as a service, the surrounding engine technology, and so on. My team works more on the system as a whole. For now it is mainly internal to Alibaba. Taobao recommendations and the engines behind them all use our technology."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Which mainstream tools or technical solutions are available today?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" In open source, many people will have come across Faiss, which Facebook AI Research open-sourced in 2017. Faiss basically meets the needs of most companies. If you do not have Alibaba's finer-grained requirements or its demand for higher performance, Faiss is enough: it handles a few hundred thousand or around a million vectors without trouble. As data volume keeps growing, Faiss does hit bottlenecks, but it is open source and can draw on the community. In China there is JD, and there is also an open-source company building Milvus, which improves on existing engines to make them easier for more users. That is very good for the open-source world, because people genuinely find it easy to use. Our own work is used mostly inside Alibaba Group and Ant Group, and it centers on Proxima."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Beyond these, are there other useful tools that people may not know well?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" Ordinary companies do not have especially large data, so they wonder how to wire Faiss into a pipeline. For service orchestration, I recall an open-source company that builds Jina, which chains the pieces of an AI application together. It is a very good tool. After all, retrieval is only one part; there is also the embedding step before it, and the various AI models need a good tool to combine them. There are a great many existing models, including those in TensorFlow, and some are released as a single component. A toolchain that strings them together helps many developers get started quickly."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How should a typical company go about selecting technology?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" For a typical company there is a real barrier to entry. Frankly, this field, and big data in general, is not very friendly to traditional enterprises. These enterprises generate very large volumes of data themselves, but the challenge is how to make big data easier for them to adopt and lower the barrier to big data and AI search. That remains very challenging, and it may involve industry-level change."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: So at the moment it is mainly the leading companies doing this?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":": Yes, because there is a technical barrier. Big data has been developing for more than a decade, at least ten years. Yet many companies, traditional enterprises included, still run into many problems with how to apply it, what to use it for, and how to build it quickly, and they do not know how the industry is iterating. In fact, traditional enterprises should not have to think much about this. Rather, the industry's existing ecosystem should keep iterating to make the technology easier to use."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Is it also related to business needs? The rise of live streaming and short video, for example, has driven vector retrieval forward."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" That is one factor. Let me explain a little. Much of the data we used to deal with was a single piece of text: an ID, a gender, a city. We call this tagged data, because a tag is unambiguous: you are from Shanghai, or from Beijing. Then came images. How do you describe an image? That is a very hard problem. You cannot describe it in words in a way that would let you regenerate the image from the description. The same goes for a recording of someone speaking: it cannot be described by a tag or any single value. Such data is collectively called unstructured data, while data that can be tagged is called structured data."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"When AI computing power was limited, face recognition meant cropping the face out of a photo and working out how to turn it into structured data, which was quite hard. As computing power has grown and models have become richer, you can now find an open-source model online that does this very well. The need to process unstructured data coincided with its nonstop generation. Think how many photos each person takes every day. It snowballs, and demand in this area keeps growing."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: What are the main research directions right now?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" Apart from the difficulties I mentioned, and the three categories of vector techniques, research in recent years has focused more on graph computation and on streaming. Large volumes of incoming data need to be indexed in real time, as with an ordinary index, and become searchable immediately. Another topic is fast deletion from graph structures."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"As I said, traditional search and vector techniques are not in a fight to the death, so the question now is how to combine them, including how tags work together with vectorization to produce the retrieval results you want. Other directions are more general, such as distributed computing and the trade-offs between high performance and retrieval efficiency."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How do you see the future direction of big data retrieval?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" That is a big question. I touched on the history of big data earlier. Traditional enterprises face a technical barrier, and once you step into big data it is as deep as the sea. Why? Think how many components there are: storage components such as Hadoop, HBase, and HDFS, then Spark, Flink, and more, plus vertical-specific ones. Look closely and the list is very long."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"But I think that after a certain stage, unification begins. As the old Chinese saying goes, what has long been united must divide, and what has long been divided must unite. Technical capability used to fall short and some of the theory was incomplete, but at a certain stage things start to converge. Flink's unification of stream and batch, for instance, is one step in unifying big data processing. The other line is storage, also part of big data: how the whole storage stack gets unified. Once both lines are unified, I personally think we will see big data products that are friendlier to traditional enterprises and that help these technologies merge further with them."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Big data also faces changes in infrastructure. We used to talk mostly about cloud computing; now it is containerization, which we call cloud native. Traditional databases such as MySQL and SQL Server were not designed for cloud-native deployment and do not make good use of cloud-native capabilities. For these reasons, some new database products have emerged. The two paths will keep colliding, and after that one approach may come to dominate."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: How long do you think this process will take?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" It feels like the storm is about to arrive. In recent years several big data database companies valued in the tens of billions have emerged. Before big data, most traditional enterprises used MySQL, and after big data that category may be unified as well. Relational databases went through their own unification and ended up with a unified body of theory, relational database theory. Once big data arrived and the old systems could not cope, what had long been united divided. After the division, I am sure there will be a period of coming together."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: Which technology trends in AI deserve particular attention?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" In universities and labs, large numbers of students have been studying AI in recent years. As for AI's main problems, the longer-term one is that AI still cannot be fully derived and explained mathematically, which is why many mathematicians consider it unreliable: it is mathematics that cannot be explained. From an engineering standpoint, there are now many models and the barrier keeps falling. As it falls, specialized vertical applications are likely to become more popular. I do not need to build a search-everything system. If I can recognize one particular large part in a factory with very high accuracy, that is already an application that greatly improves efficiency. As for whether AI can rule humanity, it is nowhere near that powerful yet."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"At a finer level, ultra-large-scale AI training is still needed. AI today is tied to big data. And what is so-called artificial intelligence? Half manual, half intelligent: there is manual work as well as intelligence, because most models need large amounts of labeled data for early training. What was said, what the sentence means: everything is labeled and matched to answers first, then fed to the model. Going deeper, the question is whether models can train themselves, so that with less input data the AI can still automatically recognize what is easy to pick up at the perception layer."}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"InfoQ: What other technical areas should developers pay close attention to?"}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},
{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Xiao Yunfeng:"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" Things like cloud-native scaling for big data processing, online serving, and integration with AI."}]}]}
