中英數據庫專家“達摩院論劍”:數據庫的過去、未來和現在

Chinese and British Database Experts Cross Swords at DAMO Academy: The Past, Future, and Present of Databases

簡介: 數據庫是什麼?未來的數據會被存在DNA裏?數據庫裏的數據湖是什麼? 1月16日,掃地僧做了一場直播,請到我的同事——數據庫資深專家封神,和來自帝國理工的高級講師Thomas Heinis(托馬斯·海尼斯),2人就數據庫這個話題做了比較深入的探討,老僧印象比較深的是一些前沿的DNA儲存大數據等概念。在此老僧奉上雙方談話的全部內容,由於英國學者使用英文講解,所以對全文進行了中英文的翻譯。希望這個速記能幫助對前沿科學有興趣的同好。

Summary: What is a database? Will data be stored in DNA in the future? What is a data lake? On January 16, the Sweeper Monk hosted a livestream with my colleague Fengshen, a senior database expert, and Thomas Heinis, a senior lecturer at Imperial College London. The two had a fairly in-depth discussion about databases, and what impressed me most were frontier concepts such as storing big data in DNA. Here is the full conversation. Because the British scholar spoke in English, the full text is presented in both Chinese and English. I hope this transcript helps fellow enthusiasts of frontier science.

主持人:大家好,歡迎來到阿里達摩院掃地僧的直播間,我是掃地僧的小助理。今天,我們的自動駕駛機器人小蠻驢帶大家逛了一下阿里雲飛天園區,一路沒人講話,不知道盆友們有沒有看急了,那接下來我們就聊聊天。

Moderator: Hello everyone. Welcome to the live streaming studio of the sweeper monks of Alibaba DAMO Academy. I am the assistant to the sweeper monks. This morning, our autonomous robot Xiaomanlv (Little Competent Donkey) took you guys on a tour in Alibaba Cloud Apsara Park without saying anything. I wonder if you felt anxious or not. Now let’s have a good chat.

這次陪我們聊天的是掃地僧的老同事封神。

Today, we have with us Fengshen, a longtime colleague of the Sweeper Monk.

封神(阿里雲智能 雲原生數據湖分析DLA 技術負責人):大家好,我叫封神,來自阿里雲數據庫團隊,09年加入阿里,目前主要做數據庫數據湖分析方向,主要負責雲原生數據湖分析DLA的技術,之前也做了10年左右的數據庫與大數據相關的事。

Hello everyone, I am Fengshen from the Alibaba Cloud database team. I joined Alibaba in 2009. Currently, I work mainly on data lake analytics for databases and am responsible for the technology behind cloud-native Data Lake Analytics, known as DLA. Before that, I spent about 10 years working on databases and big data.

主持人:直播間還請到了1位來自遠方的客人,Mr. Thomas Heinis,他是帝國理工學院的講師,請客人自我介紹。

Moderator: We are also honored to have with us here Mr. Thomas Heinis, lecturer at Imperial College London. Mr. Heinis, please kindly introduce yourself.

托馬斯·海尼斯(帝國理工數據庫專業高級講師):是的,當然。 我叫托馬斯·海尼斯,現在是帝國理工學院的高級講師。我在研究小組做研究,我們研究小組基本上負責一切與數據有關的研究工作。我從事大量的數據分析和數據可視化工作,目前也負責數據存儲工作,包括所有的新技術,確保我們未來能夠有效地分析和理解數據。

Yes, sure. My name is Thomas Heinis. I am a senior lecturer at the moment at Imperial College London. I do research here in the research group, which basically takes care of everything to do with data. So I do a lot of data analytics, also data visualization, and then also data storage at the moment, including all new technologies, such that we can basically analyze data efficiently and understand it in the future.

主持人:既然請到2位數據庫領域的專家,我們今天討論的主題肯定離不開數據庫。數據庫對我們這些非專業人士而言,可能最常聽的詞是刪庫跑路,而對數據庫最直觀的理解就是Excel表格。先給我們的觀衆介紹一下什麼是數據庫。請問Mr. Heinis,你給學生上第一堂課時如何介紹?

Moderator: Since we have invited two experts in the field of databases, our discussion today is certainly inseparable from databases. For non-professionals like us, the phrase we probably hear most often is “drop the database and run”, and our most intuitive understanding of a database is an Excel spreadsheet. First, please introduce to our audience what a database is. Mr. Heinis, how do you introduce databases to students in your first class?

托馬斯·海尼斯:我通常會給學生稍微介紹一下數據的由來,這與銀行密不可分。上個世紀六七十年代期間,銀行儲存了大量的數據。它們需要將這些數據組織起來,關係型數據庫應運而生。數據庫本質上是很多信息和數據的集合,這些數據被組織起來,從而實現高效的分析、訪問、管理和更新的目的。所以,計算機數據庫通常來自於數據文件的導航,從傳統上而言,或者就歷史淵源而言,數據庫確實來自銀行,有很多關於銀行客戶的信息以及他們賬戶餘額的信息。

I usually explain a little bit of history about it. That's all to do with banks. Banks had a lot of data in the 60s and 70s. And they needed to organize that data. That's kind of where relational databases come from.

Essentially, what I tell students is that a database is a collection of lots of information, lots of data, that is organized so that it can be analyzed, accessed, managed, and updated very efficiently, right?

So computer databases typically came from the navigation of data files. Databases, traditionally, or historically, do come from banks. There was a lot of information about bank customers and clients and the balances of their accounts.

And within databases, then, we have this big branch of technology called relational databases, which is really where we organize data into tables, rows, and columns, which contain information about customers, clients, transactions, sales, etc., all kinds of very well-structured information.

而在數據庫裏面,有一個龐大的技術分支,叫做“關係型數據庫”,其實就是我們把客戶、交易、銷售等信息以高度結構化的形式組織到表、行、列當中。而有些同學一開始就已經聽說過SQL,也就是用於查詢數據庫的查詢語言。最近幾年,情況有所變化。但總的來說,關係型數據庫確實是在上世紀六七十年代期間由銀行的應用案例所驅動的。過去二十年間,情況發生了巨大的變化。因爲我們有了需要組織數據的新型應用,比如科學應用,或者是社交網絡等類型的應用。

基本上,這些應用需要略微不同的數據庫。因此,後來我們轉向了NoSQL數據庫或非關係型數據庫,有了不同的用例,也開始管理更多的數據,也就是現在所說的大數據。簡單地說,我們收集了海量的數據,需要用數據庫來分析和存儲這些數據。

And some of the students have, from the start, already heard about SQL, which is the query language used to query a database.
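
To make the banking picture concrete, here is a minimal sketch of a relational table of accounts queried with SQL, using Python's built-in `sqlite3` module. The table name, column names, and sample customers are assumptions made up for illustration; they do not come from the conversation.

```python
import sqlite3

# Hypothetical schema for illustration: one table of bank accounts,
# one row per customer, one column per attribute.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, owner TEXT NOT NULL, balance INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("Alice", 120), ("Bob", 30)],
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between two accounts as one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE owner = ? AND balance >= ?",
            (amount, src, amount),
        )
        if cur.rowcount != 1:
            raise ValueError("insufficient funds or unknown account")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
            (amount, dst),
        )


transfer(conn, "Alice", "Bob", 50)

# SQL is the query language: ask for every owner and balance.
balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))
print(balances)
```

The transfer either applies both updates or neither, which is the transactional behaviour that banking applications revolve around.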

In recent years, things have changed a little bit. So relational databases do come from the 60s and 70s, driven by this banking use case, this banking application.

And in the last 20 years, things have changed drastically, right?

Because we have new applications that need to organize the data such as scientific applications, or the types of applications like social networks, and etc.

Basically, these applications require slightly different databases. And so that's where we then went to NoSQL databases, or non-relational databases, and we moved to different use cases. And we also moved to managing much, much more data, what is nowadays called Big Data. Basically, we collect tremendous amounts of data, and then we need databases to analyze and store this data.

主持人:想問從事這個行業的封神老師,進入到現代,數據庫爲何越來越重要?

Moderator: I have a question for Fengshen. Why are databases becoming more and more important in the modern era?

封神:其實數據庫一直很重要,最爲簡單的一條就是 數據不能丟。企業如果丟了最爲核心的數據庫,則企業可能直接面臨破產。

Actually, databases have always been important. To put it simply, data cannot be lost. If an enterprise loses its core database, it might directly go bankrupt.

爲什麼數據庫越來越重要 那是因爲數據還蘊藏着寶藏,之前數據存着也就存着,一般存核心的數據,如交易、客戶、商品的數據,一些日誌數據存就是了查詢問題。也不會去埋點,更加談不上去爬取或者購買數據了。

Why are databases more and more important? Because data also holds treasure. In the past, data was simply stored and left there. Generally, only core data was stored, such as data on transactions, customers, and goods, while some log data was kept just for troubleshooting. Nobody instrumented applications with tracking points, let alone crawled or purchased data.

互聯網也經歷了好幾個階段,從開始的新聞門戶時代,到可以有交互的類似BBS、淘寶購物時代,到all in無線,到智能時代,產生的數據也越來越多。

The Internet has gone through several stages, from the news portal era in the beginning, to the era of interactions with such platforms as BBS and Taobao, to the era of wireless technology, and then to the era of intelligence. More and more data have been generated in the process of evolution.

-從數據量來看,IDC統計,2005年全球數據是130EB,2019年爲41ZB,漲了322倍;

-According to statistics released by IDC, the amount of global data was 130 EB in 2005 and 41 ZB in 2019, roughly a 322-fold increase.
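
As a quick check of the figure above, the ratio can be recomputed from the two quoted volumes. The transcript does not say which unit convention was used, so the sketch below tries both; the binary convention (1 ZB = 1024 EB) is an assumption that happens to reproduce the quoted 322.

```python
# Volumes quoted in the conversation.
data_2005_eb = 130  # exabytes in 2005
data_2019_zb = 41   # zettabytes in 2019

# Assumption: the transcript does not state the unit convention.
EB_PER_ZB_BINARY = 1024
EB_PER_ZB_DECIMAL = 1000

ratio_binary = data_2019_zb * EB_PER_ZB_BINARY / data_2005_eb
ratio_decimal = data_2019_zb * EB_PER_ZB_DECIMAL / data_2005_eb

print(f"binary convention:  {ratio_binary:.2f}x")   # 322.95x
print(f"decimal convention: {ratio_decimal:.2f}x")  # 315.38x
```

Either way the growth is a bit over 300-fold in 14 years.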

-從數據應用來看,越來越多的公司也使用數據做出了出色的成績,阿里、頭條、百度、滴滴等Top100知名互聯網企業都是數據驅動的企業。我們再來看傳統的產業,有智慧園區、城市大腦、縣域大腦、智慧農業、智慧城市、智慧醫療、工業4.0等等,也都在使用數據技術在賦能各個產品,幫助這些產業數字化轉型,提升效率。在產業應用看,要充分發揮想象的空間,使用數據賦能產業轉型成長。

-From the perspective of data application, more and more companies have made outstanding achievements with data. For example, the Top 100 well-known internet enterprises, including Alibaba, Toutiao, Baidu and DiDi, are all data-driven enterprises. In traditional industries, data technology is used in smart parks, city brain, county brain, smart agriculture, smart cities, smart medicine and Industry 4.0 to empower products and help the respective industries achieve digital transformation and increased efficiency. From the perspective of industrial application, it is necessary to give full play to imagination and use data to empower industrial transformation and growth.

-從國內形勢看,國家也提出了新基建,核心是以 雲計算、大數據、人工智能、5G、區塊鏈爲核心,這些核心中的核心是數據的應用,再過10年,真的是萬物互聯的時代,數據量增長的速度會更加快;在疫情時代,有相關機構研究表明,疫情讓數字化轉型快了5年左右。

-On the domestic front, the Chinese government has also put forward the concept of new infrastructure. Its core technologies include cloud computing, big data, artificial intelligence, 5G and blockchain, and at the core of these technologies is the application of data. In 10 years, we are expected to truly enter the era of the Internet of Everything, when data volume will grow even faster. Research by relevant institutions suggests that the COVID-19 pandemic has accelerated digital transformation by about 5 years.

-從高校和研究機構看,國內高校專業增設最多是 大數據技術、人工智能的專業;總之,21世紀除了人才,還有什麼最貴,那就是數據,數據相當於20世紀的石油,是21世紀整個社會效能運轉的潤滑劑。因爲數據越來越重要,所以數據庫越來越重要,數據庫是這一切的核心載體。

-From the perspective of universities and research institutions, big data technology and artificial intelligence programs have been widely offered these days. In short, in addition to talents in the 21st century, what else is the most expensive? The answer is data. Data are equivalent to the oil of the 20th century. Data are the lubricant for the functioning of the whole society in the 21st Century. Due to the increasing importance of data, databases are also becoming more and more important. And databases are the core carriers of everything.


主持人:中國和英國的對大數據的定義有什麼不同?這會導致雙方程序員對數據庫的理解不同嗎?

Moderator: Do China and the U.K. define big data differently? Would such a difference lead programmers in the two countries to understand databases differently?

封神:大數據其實我認爲沒有準確的定義。比如,如果數據量比較小,但是訓練用的機器比較多,也可以認爲是用到了大數據技術。我一般認爲用數據驅動業務發展就屬於使用了大數據相關的技術。之前國內提Big Data(中文指:大數據)比較多,現在國外提Data Lake(中文指:數據湖)的概念比較多,主要還是雲公司在主導,數據很多存在了對象存儲上。阿里數據庫團隊提的是庫、倉(Data Warehouse)、湖(DataLake)、多模(Multi-Model),並且我們還專門做了一個 雲原生數據湖分析DLA的產品。另外,我們看到著名的諮詢公司Gartner把大數據報告合併到了數據庫報告裏面。

Fengshen: I don’t think there is a precise definition of big data. For example, if the amount of data is small but many machines are used for training, it can still be considered a use of big data technology. I generally think that using data to drive business development counts as using big data-related technologies. In the past, the concept of Big Data was mentioned a lot in China. Nowadays, the concept of the Data Lake is mentioned more abroad. It is mainly driven by cloud companies, and much of the data lives in object storage. Alibaba’s database team talks about databases, data warehouses, data lakes, and multi-model, and we have also built a dedicated product, cloud-native Data Lake Analytics (DLA). In addition, the renowned advisory company Gartner has merged its big data report into its database report.

我認爲數據庫包括傳統的數據庫技術(如MySQL、PG); 也包括數據倉庫、數據湖的技術,如開源的Spark、Hadoop,阿里的ADB、DLA等,也包括最近流行的LakeHouse技術。

In my view, databases include traditional database technologies (such as MySQL and PG) as well as data warehouse and data lake technologies, such as open-source Spark, Hadoop, Alibaba’s ADB and DLA. They also include the Lakehouse technology which has been quite popular these days.

我跟英國的程序員交流比較少,跟北美有一些交流,整體應該理解差不多。技術傳播的速度也比較快,大家理解應該比較類似。由於中國的市場比較大,由於某一些原因,數據也比較多,這些會加快對數據的應用的發展。

I haven’t had much communication with programmers in the UK. But I have had some communication with programmers in North America. Overall, our perceptions are more or less similar. The speed of technology communication is also relatively fast. So, the understanding should be more or less similar. Besides, the huge market in China and the larger amount of data generated here will accelerate the development of data application.

托馬斯·海尼斯:On some level, yes. I think this is a long story. But if you look at it from a technical perspective, the types of data could be the same, or at least very similar. What I do think is that the scale is massively different, just because there's so much more data available in China.

某種程度上,會的。這一點說來話長。但我認爲,從技術角度而言,數據的類型可能相同,或者非常相似。但數據的規模卻差異巨大,這主要是因爲在中國,可獲取的數據更多。

And personally, I think it's because the society is a little bit more technologically advanced, or... I think China adopts technology more easily, which means that, for example, you have more sensors everywhere for traffic measurements. And Fengshen mentioned DiDi, right, which produces the same kind of data as Uber, for example, but on a different scale. And that applies to everything.

我個人認爲,這是因爲相比之下,中國社會的科技更爲發達,並且中國更易採用技術。這就意味着,例如,中國的路面上會有更多的傳感器來監測交通情況。又如,滴滴打車軟件所產生的數據和Uber所產生的數據一樣,但是規模卻不同。其他領域也是如此。

And what he said about the pandemic being a catalyst for this transformation is also absolutely true. We make way more electronic payments now, everything is just more digital, and I think a lot of it will stay digital, and that produces data that we need to analyze.

剛剛他提到說新冠肺炎疫情是推動這一轉型的催化劑,這一點毋庸置疑。比如,我們現在使用電子支付的頻率比以前高了很多,生活的方方面面都更加數字化了,我認爲數字化的趨勢會持續下去,這就產生了需要我們分析的數據。

So what I mentioned about China having a bit more affinity for technology means that we basically have more places where we collect data, many more sensors, and more people interacting online, all of which produces more data.

我剛提到說中國更親近技術,這就意味着在中國,數據的來源更多,有更多的傳感器,網絡用戶互動所產生的數據也更多。

And then also, you know, China's population is huge. So that also means that more data is being produced. So it's, I would say, you know, there are two technical challenges when it comes to big data or databases in general.

此外,中國人口衆多,因而產生的數據也更多。因此,我認爲,大數據或數據庫主要面臨兩大技術挑戰。

One is the data formats. That's changing, as mentioned; it is also going towards data lakes, right, with loads of different formats. But I don't think that differs much between China and the rest of the world.

一是數據格式上的挑戰,其發展趨勢是走向數據湖等不同類型的格式。但在這一點上,中國和世界其他地方的區別不大。

But what is really different is the amount of data. And that means that whatever we develop, the analysis, the visualization, the storing of the data, needs to happen efficiently on a much, much larger scale, which then again brings in a lot of technical challenges as well. Right. So I think that's the difference. But I think the definition as such is roughly the same.

中外真正的區別在於數據量。具體而言,中國需要更大規模和更高效地進行數據分析、可視化和存儲數據等任務。這又帶來了很多技術挑戰。我認爲中外在數據庫領域的區別就在於此,但兩者對數據庫的定義基本相同。

I also have the feeling that the Chinese people have a different understanding of personal data as well.

此外,我認爲中國人對個人資料的理解與外國也不同。

It's very, very difficult for us to get data from companies here, you know, there's always a perception that this breaches data privacy, whereas in China... and a peer of mine works a lot with Chinese universities as well. So we get a lot of data. I don't think it's from DiDi, but from somebody with similar data sets, and it's quite easy. So I feel like China really has this greater affinity towards technology, and more of an attitude of, let's see what we can do with the data, let's see if we can improve things. So we're also collaborating on a traffic optimization project where they collect massive amounts of data about which vehicle passes through the road, where, at what speed, what the congestion level is, whether we can remove traffic, and all these kinds of things. It's really a very pragmatic approach to using data: what can we do to improve, you know, everything. That’s what data does these days.

在英國或者其他國家,從公司獲取數據非常困難,因爲外國企業總是擔心侵犯數據隱私。相對而言,從中國企業或者院校獲取數據容易得多。因此,我們獲取了很多數據,可能不是直接來自滴滴公司,但數據集也比較類似。總地來說,在中國,數據的獲取更爲容易。我感覺中國更親近技術,更願意使用技術,更願意利用技術來改善現狀。我們正在與中方合作一個交通優化的項目,他們收集了大量的數據,包括經過道路的車輛信息,車輛的位置,車輛的行駛速度,路面的擁堵程度,是否可以消除擁堵等。這種數據使用真的非常務實,相當於使用數據來改善一切可改善之處。這便是數據在當今社會所發揮的作用。


主持人:隨着大數據這個概念的出現,數據庫是怎麼進化的,請封神老師講講?

Moderator: With the emergence of the concept of big data, how have databases evolved over the years? Fengshen, please share your view on this question.

封神:數據庫怎麼進化,先看看數據庫怎麼來的。從廣義看,數據記錄的歷史早就有了。在5000年前,人類開始用繩結計數;在2000年前有紙張,到1946第一臺計算機的誕生;計算機的誕生後,纔有了現代意義的數據庫。爲了形象說數據庫是什麼,好比 你有一個管家,管家有一個記賬本,你每天花費多少錢,收入多少都會告知管家,管家記錄下來。你就可以知道你目前多少錢,每個月花費多少錢;數據庫就是管家+賬本;管家提供計算力,賬本提供存儲;

To figure out how databases have evolved, let’s first look at how they came into being. In a broad sense, data recording dates back to a long time ago. Back 5000 years ago, humans began to count with knots; 2000 years ago, paper started to be used. And in 1946, the first computer was invented. After the invention of the first computer, the databases in the modern sense came into being. Let me use an analogy to explain what a database is. Suppose you have a housekeeper, and the housekeeper has a ledger. You inform the housekeeper how much you spend and how much you earn every day. The housekeeper records your income and expenditure in the ledger accordingly. You can then know how much money you currently have and how much you spend each month. In this case, the database is the housekeeper + the ledger: The housekeeper provides the computing power, while the ledger provides the storage.
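
The housekeeper-and-ledger analogy can be written down almost literally. The sketch below is an illustration only: the class and method names are invented here, and a real database does far more (durability, concurrency, indexing). It shows the split the analogy describes, with the ledger as storage and the housekeeper as the computing power that answers questions from it.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Ledger:
    """Storage: an append-only list of (day, amount) entries."""
    entries: list = field(default_factory=list)

    def append(self, day, amount):
        self.entries.append((day, amount))


class Housekeeper:
    """Compute: records what it is told and answers questions from the ledger."""

    def __init__(self, ledger):
        self.ledger = ledger

    def record(self, day, amount):
        # Positive amounts are income, negative amounts are spending.
        self.ledger.append(day, amount)

    def balance(self):
        return sum(amount for _, amount in self.ledger.entries)

    def monthly_spending(self):
        totals = defaultdict(int)
        for day, amount in self.ledger.entries:
            if amount < 0:
                totals[(day.year, day.month)] += -amount
        return dict(totals)


keeper = Housekeeper(Ledger())
keeper.record(date(2021, 1, 1), 5000)   # income
keeper.record(date(2021, 1, 5), -120)   # spending in January
keeper.record(date(2021, 2, 2), -80)    # spending in February

print(keeper.balance())
print(keeper.monthly_spending())
```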

數據庫也發展經歷了很多階段,爲了“記好賬”,數據庫也在不斷演進。我們一般根據把數據庫發展分爲了4個階段:

The development of databases has gone through several stages. And in order to “keep good accounts”, databases have also been evolving. We generally divide the development of databases into 4 stages:

  • 1970~1990商業數據庫時代 收費時代

  • The two decades from 1970 to 1990 were the era of commercial databases, when fees were charged for the use of databases.

  • 1990~2000 開源數據庫時代 開源時代

  • The decade from 1990 to 2000 was an era of open-source databases.

  • 2000~2015 互聯網浪潮 大數據時代(大數據計算、存儲、NoSQL)

  • The period from 2000 to 2015 was an era of Internet and big data (big data computing, storage, NoSQL)

  • 2015~現在 雲的浪潮 雲原生時代+AI

  • Finally, the era from 2015 to the present is an era of cloud technology and the widespread application of Cloud Native and AI technologies.

數據庫的發展跟幾個因素有關, 硬件的發展,需求; 硬件主要指 存儲、網絡、內存、CPU。存儲就是存數據,內存與CPU關係到計算力,網絡就是傳輸。

The development of databases is related to several factors, including the development of hardware and the market demand; hardware mainly refers to storage, network, memory and CPU; storage refers to data storage; memory and CPU are related to computing power; and network concerns transmission.

大數據這個詞語大概在10年前開始流行,大數據系統開始獨立於數據庫系統發展的,隨着最近5年的發展,大數據相關技術又慢慢與數據庫技術結合迴歸到數據庫的大家庭。比如,2020年,著名的諮詢公司Gartner把大數據報告合併到了數據庫報告裏面。最爲典型的是 DataLake的發展,融入了事務&MVCC的概念,NewSQL的發展,NewSQL也融合了分佈式的理論,並且還有一個HTAP的方向在探索。目前數據庫領域分爲TP、NoSQL、AP等領域。TP一般有單機、分佈式、事務型的數據庫;NoSQL就相對散一些:寬表、圖、文檔、時序、時空等;AP有Data Warehouse、DataLake領域。

The term big data became popular around 10 years ago, when big data systems started to develop independently of database systems. Over the last 5 years, big data technologies have slowly merged with database technologies and returned to the database family. For example, in 2020, the well-known research and advisory company Gartner merged its big data report into its database report. The most typical examples are the development of data lakes, which has absorbed the concepts of transactions and Multi-Version Concurrency Control (MVCC), and the development of NewSQL, which has absorbed distributed systems theory. HTAP is another direction being explored. Currently, the database field is divided into segments such as TP, NoSQL and AP. TP generally covers standalone, distributed and transactional databases; NoSQL is more scattered, covering wide-column tables, graphs, documents, time series, spatio-temporal data and so on; AP covers data warehouses and data lakes.
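
MVCC is only named in passing above, so here is a toy sketch of the core idea: writers add new versions instead of overwriting, and each reader sees the latest version committed at or before its snapshot. The class and method names are hypothetical, every write is treated as its own committed transaction, and conflict detection and cleanup of old versions are left out.

```python
class MVCCStore:
    """Toy multi-version key-value store (single writer, auto-commit)."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value), oldest first
        self._clock = 0      # logical commit timestamp

    def begin(self):
        """Return a snapshot timestamp covering everything committed so far."""
        return self._clock

    def write(self, key, value):
        """Commit a new version; older versions stay readable."""
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest value committed at or before snapshot_ts."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("price", 10)
old_snapshot = store.begin()   # a long-running reader starts here
store.write("price", 12)       # a later writer does not disturb that reader

print(store.read("price", old_snapshot))   # the reader still sees 10
print(store.read("price", store.begin()))  # a new reader sees 12
```

Readers never block writers in this scheme, which is one reason the idea is attractive for analytical systems such as data lakes.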

在企業界,肯定是做看得見,並且在5年內能落地的事情。未來5年,數據庫領域核心發展方向是雲原生+分佈式,具體講:Serverless、數據庫與大數據一體化、智能化、安全可信、軟硬件一體化、離在線一體化、多模數據處理。舉個例子,我負責做的雲原生數據湖分析DLA就是傳統大數據、Hadoop、Spark的升級,需要融合傳統數據庫技術,並且基於存儲與計算完全分離的雲原生架構。我們選用對象存儲,支持常見的消息、TP和NoSQL 數據庫系統數據的歸檔。我們一般歸檔到DataLake裏面。還支持了一些事務、版本的東西,並且把Spark、Presto等組件做成雲原生的彈性、隨時可用,即開即用,按需計費,分離後帶寬的損耗通過引入本地的Cache解決。

In industry, we naturally work on things that are tangible and can be put into practice within 5 years. In the next 5 years, the core development direction of the database field is cloud native plus distributed, which specifically includes serverless, integration of databases and big data, intelligence, security and trustworthiness, hardware and software integration, offline and online integration, and multi-model data processing. For example, the cloud-native Data Lake Analytics (DLA) that I am responsible for can be considered an upgrade of traditional big data technologies such as Hadoop and Spark. It needs to integrate traditional database technologies and is based on a cloud-native architecture in which storage and computing are completely separated. We chose object storage and support archiving data from common messaging systems as well as TP and NoSQL database systems. Normally we archive the data into the data lake. DLA also supports some transaction and versioning features, and we have made components such as Spark and Presto cloud native: elastic, available at any time, ready to use once enabled, and billed on demand. The bandwidth loss caused by the separation is addressed by introducing a local cache.
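
The last point, a local cache in front of remote storage, can be sketched as a read-through cache. This is not DLA's implementation: `ObjectStore` is a stand-in for a remote object store, the least-recently-used eviction policy is an assumption, and all names are made up for illustration.

```python
from collections import OrderedDict


class ObjectStore:
    """Stand-in for a remote object store; counts how often it is hit."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.remote_reads = 0

    def get(self, key):
        self.remote_reads += 1  # each call represents a network round trip
        return self._objects[key]


class CachedReader:
    """Read-through cache on the compute node with LRU eviction."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        value = self._store.get(key)         # cache miss: go to remote storage
        self._cache[key] = value
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used
        return value


store = ObjectStore({"part-0": b"aaa", "part-1": b"bbb", "part-2": b"ccc"})
reader = CachedReader(store, capacity=2)

reader.get("part-0")
reader.get("part-0")  # served locally, no remote read
reader.get("part-1")
print(store.remote_reads)
```

Repeated reads of hot objects stay on the compute node, so only misses pay the bandwidth cost of crossing to storage.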

托馬斯·海尼斯:其實封神的分享已經很詳細,並無太多可補充之處。如果非要補充的話,如您方纔所言,開源數據庫大大推動了數據在社區內的擴散。我認爲相較以前,數據庫的使用也變得容易得多了。

Well, there's not much more to add to what Fengshen has already said. Right. But what I do want to add, maybe, is that, like you said absolutely correctly, open source databases have done a tremendous service to the community in getting databases everywhere. I will say that databases have also become much, much easier to use. And, you know, years back, the first-year students had never seen a database. Today, if I ask, they've all seen MongoDB or other technologies, easy-to-use databases, you know, not necessarily relational databases, but easy-to-use technology.

比方說,若干年前,大一的學生從未見過數據庫。而現在的大一新生。幾乎都見過MongoDB或者其他一些易於使用的數據技術。他們未必見過關係型數據庫,但基本都見過易於使用的數據科技。

And that really has made a difference in terms of training people. It has also changed people's understanding of, and approach to, using databases. Back in the day, people were storing data just in raw files. Nowadays they know that if they want efficient access to the data, they need to use a database, and they know how to do so.

這對人員的培訓起到了巨大的助推作用。這也改變了人們對數據庫使用方法的理解。以前,人們只是把數據存儲在原始檔案中。現在,他們知道,需要使用數據庫纔能有效地訪問數據,並且他們也知道如何使用數據庫。

So databases have really changed, or kind of database have become much more pervasive. They're used everywhere these days. So that has definitely changed.

所以,數據庫真的發生了變化,或者說數據庫的使用變得更加普遍,現在各地的人們都在使用數據庫,這是一個很明顯的變化。

And now I unfortunately, forgot your question, which I didn't answer.

不好意思,我忘記你提的問題了,所以沒有回答。

So essentially, what really changed, right, and I said this initially already, is that relational databases were designed for banking applications that revolved around transactions, which were really the centerpiece of banking applications. And that has made a lot of design decisions difficult.

至於數據庫發生了什麼樣的演變。之前我有提到,數據庫,或者說關係型數據庫是爲銀行的應用而設計的,是圍繞交易設計的,而交易是銀行應用的真正核心。這也使得許多設計決策變得困難。

And then in recent years, like Fengshen had mentioned, right, new use cases for databases emerged. All of a sudden, the data we have no longer fits nicely in a table; we actually have a graph.

然而,最近幾年,正如封神剛提到的,出現了一些數據庫的新用例。突然間,我們獲得的已經不再是表格數據,而是圖形數據。

So we have graph databases, or we realized a lot of data is natively very structured in a document, XML or similar. So we develop document databases.

於是我們有了圖形數據庫,或者說我們意識到很多數據天生就是高度結構化的,類似於文檔數據或XML數據。所以我們開發了文檔數據庫。

So around 2000, maybe a little bit after 2000, we had this understanding that one size doesn't fit all. So we need to have different types of databases. I mentioned graph databases and document databases. But there have also been other databases, very customized databases, for scientific applications, right?

所以,大約在2000年左右,具體說是剛過2000年的時候,我們意識到不能一刀切,而是需要不同類型的數據庫。我剛提到了圖形數據庫和文檔數據庫。但是也有其他類型的數據庫,比如用於科學應用的高度客製化的數據庫。
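
To illustrate why one shape does not fit all data, the sketch below holds the same kind of information in three shapes: flat rows as in a relational table, an adjacency structure as in a graph database, and a nested record as in a document database. The sample data and the helper function are invented for illustration.

```python
import json
from collections import deque

# Relational shape: flat rows in an edge table (hypothetical data).
follows_rows = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]

# Graph shape: adjacency lists make multi-hop traversal natural.
graph = {}
for src, dst in follows_rows:
    graph.setdefault(src, []).append(dst)


def reachable_within(graph, start, max_hops):
    """Breadth-first search: everyone reachable from start in at most max_hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    seen.discard(start)
    return seen


# Document shape: one nested, self-describing record per entity.
profile_doc = json.loads(
    '{"name": "ann", "tags": ["db", "ml"], "address": {"city": "London"}}'
)

print(sorted(reachable_within(graph, "ann", 2)))
print(profile_doc["address"]["city"])
```

With flat rows, the two-hop question needs a self-join per hop, whereas the graph shape answers it with a simple traversal, and the nested document would need several tables to flatten.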

They produce massive amounts of data, like physics experiments, astronomy, and DNA experiments in biology, and they all have their own kind of database technology these days.

這些科學應用數據庫會產生大量的數據,比如物理實驗、天文學、生物實驗中的DNA實驗等。這些領域都有各自不同類型的數據庫技術。

Back in the day, we tried to fit everything into a relational database, and it didn't work really well. So each one of those now has its own type of database.

以前,我們試圖把所有數據都裝進關係型數據庫,但效果並不理想。因此,現在不同領域都有各自不同類型的數據庫。

At the same time, more and more data has been produced in different formats. And this is really where this notion of a data lake that Fengshen has been mentioning comes from: we have tons of data in different formats, and we still want to analyze the data as a whole. So we need some sort of integration between them, or some way of analyzing heterogeneous data types. And that has also changed. So we now have this capability to just produce data, throw it in a database, put simply, and then analyze it efficiently at scale. Right. So that's really how I believe things have changed.

同時,越來越多數據也在以不同的格式產生,因此我們不斷提到數據湖引擎的概念。之所以引出這個概念,是因爲我們有大量的不同格式的數據,但仍然希望將數據作爲一個整體來分析。所以我們需要某種整合技術,或者說需要採用某種方式來分析不同類型的數據。所以我們現在有這樣的能力:即把數據生產出來,簡單地扔到數據庫裏面,然後高效地進行大規模的分析。我認爲這就是數據庫所發生的演變。
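
The idea of analyzing data in different formats as a whole can be sketched with two tiny in-memory sources, one CSV and one JSON Lines, normalized into a common record shape and aggregated together. The field names and sample values are assumptions for illustration, and real data lake engines work over files in object storage rather than strings.

```python
import csv
import io
import json

# Two hypothetical sources holding the same kind of facts in different formats.
csv_source = "user,amount\nann,10\nbob,5\n"
jsonl_source = '{"user": "ann", "amount": 7}\n{"user": "cat", "amount": 3}\n'


def read_csv(text):
    """Yield normalized records from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"user": row["user"], "amount": int(row["amount"])}


def read_jsonl(text):
    """Yield normalized records from JSON Lines text."""
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            yield {"user": rec["user"], "amount": int(rec["amount"])}


def total_by_user(*record_streams):
    """Aggregate over all sources as if they were one table."""
    totals = {}
    for stream in record_streams:
        for rec in stream:
            totals[rec["user"]] = totals.get(rec["user"], 0) + rec["amount"]
    return totals


totals = total_by_user(read_csv(csv_source), read_jsonl(jsonl_source))
print(totals)
```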

There are also other trends like cloud computing in general, which has made it possible, particularly for smaller businesses, to have their own database, their own data solution, because they no longer need to own the resources. If they have a big analysis to run on their data, they can just use cloud resources to do so temporarily, without the hassle of owning them. Right. So that has also helped.

還有其他的趨勢,比如雲計算。總地來說,雲計算幫助小微企業擁有自己的數據庫和自己的數據解決方案,因爲它們不再需要擁有資源,只需要掌握數據的分析能力,或者只是暫時使用雲端資源來進行分析,而不需要擁有資源。所以雲計算對小微企業起到了助推作用。

Then we also have a huge trend in terms of hardware. Obviously we have better hardware. And every now and again, the database community tries to really optimize the database for new hardware, be it multi-core processors, which are not particularly new, or all kinds of other hardware aspects: new CPUs and new types of memory, non-volatile memory for example, change a little bit how we organize and analyze data. So a lot of hardware trends have also changed or shaped database technology.

此外,硬件方面也有很大的發展趨勢。顯然,我們有更好的硬件。而且每隔一段時間,數據庫社區就會嘗試真正的優化數據庫,針對新的硬件採用多核處理器,這也並不新奇,但是硬件各個方面的優化,比如使用新的中央處理器、新的儲存器,非易失性存儲器等,在一定程度上改變了我們組織和分析數據的方式。所以說,硬件的很多發展趨勢也改變或者塑造了數據庫技術。

And then finally, what has happened in the last couple of years is really the use of machine learning or artificial intelligence in and around data, and that has driven a lot of research and has also produced a lot of products. And when I talk about AI or artificial intelligence in databases, the database research community has really taken it up and has done a lot of different things.

最後,過去幾年間,機器學習或人工智能技術開始應用於數據領域或與數據相關的領域,這推動了很多研究,也催生了很多產品。談到人工智能或人工智能數據庫,數據庫研究社區做了很多不同的事情。

For example, artificial intelligence and machine learning require a lot of learning, which requires a lot of data, and for that we need data that is clean, has been processed, and has been manually brought into the right format. So that's a database task.

例如,人工智能和機器學習的發展需要大量的學習,這就需要大量的數據。爲此,我們需要有乾淨的數據,經過處理的數據,和經人工處理爲正確格式的的數據。這是數據庫層面的任務。

And then the learning itself is also, to some degree, a database. Right? And so we have worked on that, and that has had a tremendous impact in recent years.

此外,從一定程度上而言,學習本身也是一個數據庫。我們在這方面也下了不少功夫,這在近幾年產生了巨大的影響。

We also use artificial intelligence within the database itself to accelerate the database, to accelerate query execution and analysis. And then we also use artificial intelligence to organize the database itself.

我們還在數據庫內部使用了人工智能技術來加速數據庫的索引、執行和分析。我們也使用了人工智能技術來組織數據庫本身。
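
The speaker does not name a specific technique here. One idea from the research literature that fits the description is the learned index, where a small model predicts the position of a key in sorted data and a bounded local search corrects the guess. The sketch below is a deliberately simple version with a single linear model; it assumes unique, sorted, numeric keys, and it is an illustration rather than how any particular product works.

```python
import bisect


class LearnedIndex:
    """Linear model over sorted unique keys plus a bounded correction search."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # "Training": fit a line through the first and last key.
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # The worst prediction error on the stored keys bounds the search window.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LearnedIndex([2, 3, 5, 8, 13, 21, 34])
print(index.lookup(13))  # position of an existing key
print(index.lookup(4))   # a key that is not stored
```

The better the model fits the key distribution, the smaller the window that has to be searched, which is where the speed-up over a plain binary search would come from.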

So I would say that artificial intelligence is a megatrend, of course. We all know it has touched all aspects of life, but interestingly enough it has also touched databases, and not just touched but profoundly changed how we design and use databases.

因此,我認爲,人工智能當然是大趨勢。衆所周知,人工智能已經觸及到我們生活的方方面面,但更有趣的是,它也觸及到了數據庫領域。準確的說,不僅是觸及,而且深刻地改變了我們設計和使用數據庫的方式。


主持人:剛剛兩位老師說到了很多關於數據庫的基礎知識,如果我現在給這場直播起個名字,我會叫它“數據庫入門必看”。開玩笑,我們實際上是個前沿學術分享的直播。其實任何一門學科在學界和工業界都有2種形態,在工業界落地很重要,你能在工業界爲一門學科找到很多應用場景,比如剛剛封神老師講到的雙十一、工業大腦,而學界的探索往往非常有想象力。我們很想請2位來展望一下,數據庫的未來會如何發展?比如5年內、10年內、50年內、100年內?

Moderator: Our two guests just shared a lot of database basics with us. If I were to give this livestream a name, I would call it “Database Essentials”. Just kidding. This is actually a livestream about cutting-edge academic research. In fact, any discipline exists in two different forms, one in academia and the other in industry. In industry, putting technology into practice is what matters, and you can find many application scenarios for a discipline there, such as the Double 11 Shopping Festival and the Industrial Brain mentioned by Fengshen just now. In comparison, exploration in academia is often very imaginative. We would love to ask the two of you to look ahead at how databases will develop in the future. For example, what will databases be like in 5 years, 10 years, 50 years, or even 100 years?

封神:我關注的一些方向,未來5年,數據庫領域核心發展方向是雲原生+分佈式,具體講:Serverless、數據庫與大數據一體化、智能化、安全可信、軟硬件一體化、離在線一體化、多模數據處理,這個會對每個數據庫的每個子領域都有影響。具體在學術界研究的,我看的還相對模糊一些。按照人類發展來看,發展應該是越來越快。不過,計算機還是馮諾依曼架構,未來什麼時候會顛覆,目前我也沒有概念。10年是什麼樣,我其實壓根不知道。目前唯一的就是保持敬畏之心,保持學習。

I would like to talk about some of the directions I focus on. In the next 5 years, the core development direction of the database field is cloud native plus distributed. Specifically, I’m talking about serverless, integration of databases and big data, intelligence, security and trustworthiness, software and hardware integration, offline and online integration, and multi-model data processing. These will have an impact on every subfield of databases. As for what academia is researching, my view is still relatively vague. Judging from the history of human development, progress should get faster and faster. However, computers are still based on the von Neumann architecture, and I have no idea when that will be overturned. I actually have no idea what things will look like in 10 years. For now, the only thing to do is to stay humble and keep learning.

主持人:學術界就是Mr.Heinis的研究方向了,請 Mr.Heinis 繼續來說

Moderator: Academia is Mr. Heinis's area of research, so Mr. Heinis, please go on.

托馬斯·海尼斯:Yeah, well, what's the future? It’s difficult to predict, right? But in terms of, you know, a five-year perspective, the only thing I would add that I think will make a difference in the short term is probably, like I mentioned, AI: artificial intelligence helping us a little bit to organize the data, to accelerate analysis, etc.

未來會如何?我們很難預測。但就未來5年而言,短期內可能出現的進展,就是我剛提到的:人工智能將在一定程度上幫助我們組織數據和加速數據分析。

So I think we're kind of lucky that this is the case, because a lot of students want to work on AI. And if you can combine that with database technology, we get a lot of talented students involved. But yeah, in the short term, I think AI will also have an impact on databases.

從這點來看,我認爲我們很幸運。因爲很多學生想研究人工智能。如果能把數據庫技術結合起來,我們就能吸引大量人才參與進來。所以,我認爲在短期內,人工智能也會對數據庫產生影響。

I also think that visualization will become important. And there we move to virtual reality, right, which offers us much more, you know,

可視化技術也會變得更加重要,以及與之相關的虛擬現實技術。

we can kind of interact with the data, we can touch the data to some degree, you know,

它能幫助我們與數據互動,在某種程度上,我們將能“觸摸”到數據。

I used to do research with gloves with haptic feedback, where we can touch the data. This kind of thing, I think, will become more important, not for an individual analyzing data, but for collaborative analysis: analyzing data together to understand it together. I think that's where we also need to put in some research, to find easier ways for people to understand the impact of things.

我曾做過相關研究,戴上觸覺反饋手套,我們可以“觸摸”到數據。我認爲這類技術會變得更加重要,不是針對個人分析數據而言,而是針對數據協同分析,即團隊共同分析和理解數據。這也是我們需要投入研究的地方,以便找到更簡單的方法,幫助公衆理解數據及其帶來的影響。

And then, like Fengshen said, one important thing that's going to happen fairly soon, probably in five to 10 years, maybe a little bit more, is quantum.

另外,正如封神所言,不久的將來數字領域將發生重大突破,那便是量子科技。這也許會發生在5到10年之後,也許更久一點。

With quantum, it's difficult to fathom what it's going to do to our databases. But one thing is for sure: I believe quantum sensing, quantum sensors, will give us so much more data to deal with. And that will challenge database technology, or big data technology, in itself, right?

我們很難弄清楚量子科技會對數據庫造成什麼影響。但有一點是肯定的,我相信隨着量子傳感器的應用和普及,我們將有更多的數據需要處理。而這將對數據庫技術或大數據技術帶來挑戰。

Then, when we go a little bit further out, 20, 30, 50 years, maybe a bit more than 50 years, we come to one of my favorite topics: DNA storage, which is basically storing information, storing data, within synthetic DNA. And this is interesting because Fengshen was talking initially about the numbers, about how much data we have.

再過20、30、50年,甚至超過50年之後,就得談到我最喜歡探討的話題之一,DNA存儲,也就是在合成DNA中存儲數據。這個話題很有意思,因爲封神剛剛一直在談論我們目前擁有海量數據。

A lot of this data we don't look at every day, right? We store it for the long term because we need to: the law says we have to keep records around for hundreds of years.

很多數據我們並不是每天查看,只是長期保存而已,因爲法律規定我們必須保存數百年的數據記錄。

And we do this with traditional technology, with tape and disk. They don't last forever; they last maybe 10 to 15 years. Then we need to copy the data onto a new disk or a new tape, etc. So there is always this data migration, which, as much as it is a hassle, is also quite expensive. And a lot of companies don't want to pay for this anymore, or can't afford to do it anymore.

我們使用磁帶和光盤等傳統技術保存數據。但是磁帶和光盤無法永久保存,可能頂多保存10到15年。接着,我們就需要把數據複製到新的光盤或磁帶上。數據遷移耗時耗力耗財,很多公司要麼不想再承擔這樣的成本,要麼承擔不起這樣的成本。

So what we're looking at with DNA storage, for example, is really storing data for tens of years, maybe hundreds of years, such that we can still retrieve it.

通過DNA存儲技術,我們希望將數據存儲幾十年,甚至幾百年,以便日後檢索。

So we really can take the data, convert it into strings of nucleotides, then synthesize these and store them in the fridge, essentially. And when we need the data, we sequence the DNA and get it back. So anyway, I think that's going to happen.

我們可以把數據轉換成核苷酸串,然後合成並儲存在冰箱裏。需要時,我們再進行測序,並取回數據。這就是我所設想的未來。
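As a rough illustration of the "convert the data into strings of nucleotides" step described above, here is a minimal Python sketch. The mapping of two bits per base (00→A, 01→C, 10→G, 11→T) and the function names `encode_to_dna` and `decode_from_dna` are assumptions made for this example, not the scheme used in the speaker's research. Practical DNA storage systems typically add error correction, addressing, and constraints on the sequences, none of which is shown here.

```python
# Illustrative only: a naive 2-bits-per-base mapping (an assumption for this
# sketch), with no error correction or sequence constraints.
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11


def encode_to_dna(data: bytes) -> str:
    """Turn each byte into four nucleotides, most significant bits first."""
    strand = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)


def decode_from_dna(strand: str) -> bytes:
    """Reverse encode_to_dna: every four nucleotides become one byte."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

In this sketch, encoding stands in for what would be synthesized and decoding stands in for what would be recovered after sequencing; a round trip through both functions returns the original bytes.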

So generally, I don't want to focus too much on DNA storage itself. I think the underlying technology will change drastically.

總體來講,我不想過多地關注DNA存儲本身,我認爲其底層技術將會發生翻天覆地的變化。

In the past, when we looked at storage, at the storage medium, computing collaborated a lot with electrical engineering. Now I think we're getting to a point where computing collaborates with biology, or chemistry, etc.

過去研究存儲介質的時候,我們與計算和電氣工程領域有很多合作。而現在,我們開始在計算和生物或化學等等領域之間進行協作。

It doesn't have to be DNA; it can be another kind of storage medium. But I think that's what's going on.

不一定是DNA,也可以是另一種存儲介質。但我的設想大概就是這樣。

And what's also quite interesting, I think, when we look a little bit beyond 20 years, is that with DNA storage we can also implement some data processing, some data analytics, on top of the DNA using biological processes,

同樣有意思的是,展望20年之後,在DNA存儲方面,我們還可以通過生物過程在DNA之上實現數據處理和數據分析。

which is extremely energy efficient, and also very, very fast.

這種做法非常節能,而且速度極快。

There are limits to this technology, but we'll find out what they are over the next couple of years, or maybe the next couple of decades.

這項技術存在侷限性,我們將在未來幾年或幾十年內找到答案。

But I think generally the whole field of computing will expand into other fields, will collaborate more with other fields. And that also has implications for databases and for data analytics: we can use biological processes, chemical processes, or something similar to do computations. I think that's what's definitely going to happen. But, you know, the future is very difficult to predict.

總體而言,整個計算領域將會擴展到其它領域,與其它領域開展更多的協作。而這也將對數據庫和數據分析產生影響,我們可以利用生物過程、化學過程或其它類似的方法進行計算。我想這絕對是一個趨勢。但未來是很難預測的。

Take the implications of quantum, for example. Like I mentioned, quantum sensing will deliver tons of data. But there have got to be other implications.

例如量子技術的影響。如我之前所言,量子傳感技術的應用和普及將給我們提供大量數據。但除此之外,肯定還會產生其它影響。

For example, there's one tiny operation in a database, query optimization: you give the database a query, and it figures out how to execute it efficiently. And figuring that out takes a lot of time to compute.

例如,數據庫中有這樣一個小操作,即查詢優化,也就是說,你在數據庫裏進行一項查詢,它會找到高效執行的辦法。這一操作需要花費大量的時間。

And we've already seen in the community that somebody took query optimization and implemented it on a quantum computer, showing that optimizing the query there would be massively faster. So there's a lot there. I don't think I understand all the implications of quantum computing, but it will definitely also have an impact on databases.

而我們已經在社區裏目睹了這樣一個案例,有人在一臺量子計算機上進行查詢優化,結果表明,在量子計算機上優化查詢,速度要快得多。我無法瞭解量子技術的所有影響,但量子計算肯定會對數據庫產生影響。
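To give a sense of why query optimization "takes a lot of time to compute", here is a small classical Python sketch of one part of it, choosing a join order by exhaustive search. It is not the quantum implementation mentioned above. The cost model (sum of estimated intermediate result sizes for a left-deep plan), the function names `plan_cost` and `best_join_order`, and the inputs are all assumptions made for illustration; real optimizers use richer statistics and prune the search.

```python
from itertools import permutations


def plan_cost(order, rows, selectivity):
    """Toy cost: sum of estimated intermediate result sizes of a left-deep plan.

    rows maps table name -> row count; selectivity maps a frozenset of two
    table names -> fraction of row pairs that survive the join predicate
    (1.0, i.e. a cross product, when no predicate is listed).
    """
    joined = [order[0]]
    size = float(rows[order[0]])
    cost = 0.0
    for table in order[1:]:
        size *= rows[table]
        for prev in joined:
            size *= selectivity.get(frozenset((prev, table)), 1.0)
        joined.append(table)
        cost += size
    return cost


def best_join_order(rows, selectivity):
    """Try every ordering of the tables and keep the cheapest one."""
    return min(permutations(rows), key=lambda o: plan_cost(o, rows, selectivity))
```

Because the search visits every permutation, the number of candidate plans grows factorially with the number of tables (3 tables give 6 orderings, 10 tables give 3,628,800), which is the kind of blow-up that makes faster optimization attractive.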

So in the short term, adding to what Fengshen said, I think AI is having a tremendous impact. In the somewhat longer term, I think we really have to think about interfaces to data, virtual reality being one of them, right? Augmented reality being another. We need to think about how we can make it easy for people to interact with data and understand it. Typing a query works for an analyst, but that's not going to work for everyone, for a broad class of people. I think we all need to deal with, interpret, and analyze data, and we need to make that easy for everyone. That's a little bit more medium term. And in the long term, I think hardware will change dramatically, with quantum, with DNA storage, with other types of storage media, etc. But 100 years? I'm not going to make a prediction there; that's too far out.

短期而言,接着封神剛剛的觀點講,我認爲人工智能將在短期內產生巨大影響,而更長遠來講,我們必須思考數據界面,比如虛擬現實和增強現實。我們必須思考如何找到更簡單的方法,幫助公衆與數據互動,理解數據。輸入查詢對數據分析師而言是可行的,但並不適用於所有人。所以,我們需要解釋和分析數據,降低數據的門檻。這是針對中期而言。長遠來看,在量子技術、DNA存儲和其它類型的存儲介質影響下,硬件將發生巨大的變化。但100年後我就不做預測了,那太遙遠了。
