大数据技术发展(二):Hadoop 技术生态圈的发展

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"大家好,这里是抖码课堂,抖码课堂专注提升互联网技术人的软硬实力。在抖码课堂的公众号中可以听这篇文章的音频,体验更好~~~~"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"google 的\"三驾马车\""}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"我们在上一篇文章中知道了,google 为了解决数据量越来越大的问题,开发了分布式存储技术 GFS 和分布式计算技术 MapReduce,这两个技术奠定了大数据技术的发展。如果 google 对这两个技术不开放出来的话,它的影响力也不会很大,可能很多人就不会知道这两个技术,但是 google 分别在 2003 年和 2004 年将这两个技术以论文的方式发布出来了,从而奠定开源大数据技术的发展,也就是我们现在免费使用的大数据技术 (Hadoop)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"要了解 Hadoop 的发展史,我们得先从 google 的\"三驾马车\"开始说起,google 分别在 2003 年、2004 年以及 2006 年发布了三篇论文:"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"The Google File System,简称 GFS"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"MapReduce:Simplified Data Processing on Large Clusters"}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"Bigtable:A Distributed Storage System for Structured Data"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"这三篇论文阐述的是解决大数据处理这个问题的思想,思想的发展是缓慢的,到今天为止,我们仍然使用这三篇论文提供的思想解决大数据处理的问题,但是我们发现技术的发展是不断迭代更新的,比如大数据的技术从 Hadoop 到 Spark 再到 Flink 等。所以说,如果我们要学习一项技术的话,最好的学习方法就是彻底掌握这个技术解决问题的思想,这样一来,即使技术不断的更新发展,我们也能很容易的跟上。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"说完 google 的\"三驾马车\",继续回到我们的主题,即 Hadoop 的发展。很多人都说 Hadoop 起源于 google 的\"三驾马车\",其实这个是不严谨的,我们可以说成:Hadoop 的实现是参考 google \"三驾马车\" 中的 GFS 和 MapReduce 这 \"两驾马车\" 解决大数据处理问题的思想。如果把思想和技术分开的话,下面的说话应该更加的准确:"}]},{"type":"bulletedlist","content":[{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"从思想上,Hadoop 起源于google \"三驾马车\" 中的 GFS 和 MapReduce 这 \"两驾马车\""}]}]},{"type":"listitem","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"从技术上,Hadoop 实际上是起源于 Doug Cutting 研发的网络搜索引擎技术,而非 google 的搜索引擎技术"}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Hadoop 技术的起源"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"故事要从 1997 年的一个晴朗的下午开始说起,Doug Cutting 从这个时间开始开发第一个版本的 Lucene。Lucene 是一个全文搜索库,它要解决的问题本质上就是我们在前一篇文章提到的倒排索引。Lucene 其实是倒排索引的技术实现了,有了 Lucene ,我们就可以在海量的文章中快速的搜索包含指定关键词的文章。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"Doug Cutting 花了 3 个月的时间开发完 Lucene,然后一直优化 Lucene,为了让更多人的使用 Lucene,Doug Cutting 在开源社区 Source Forge 中开源 Lucene,使用 Lucene 的人越来越多,反响越来越好,在 2001 年的时候 Lucene 在 Apache 上开源,用的人就越来越多了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"完成了 Lucene 项目后,Cutting 开始和 Mike Cafarella(卡法雷拉) 致力于对全网的网页进行索引,从而可以快速的通过关键字搜索(其实做的事情和 google 做的事情差不多),通过他俩的努力,研发出了Apache Nutch 这个项目,这个项目是 Lucene 的子项目,它其实就是网络爬虫的技术实现。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"一开始,他们在一台服务器上部署 Nutch,然后去爬取网络上所有的网页,当爬取到 1 亿个网页的时候,单台机器存储不下了,很明显,单台机器不可能存储的下整个网络上所有的网页。然后他们将服务器的数量提升到 4 台,虽然加服务器可以存储更多的网页数据,但是带来的复杂度确实非常的大,他们需要手动的处理服务器之间的数据交换、以及手动的处理磁盘空间(因为磁盘会满),而且每次增加服务器的时候,复杂度都会呈指数级增长。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"为了解决服务器扩展的问题,他们也尝试了很多方法,直到 2003 年十月份,google 发布了 GFS 分布式存储文件系统的论文,Cutting 和 Cafarella(卡法雷拉) 两个人阅读到这篇论文后,发现这篇论文解决的就是他们现在遇到的问题。随后,他俩就按照这篇论文的思路实现了一个分布式存储文件系统,就是 Nutch Distributed File System (简称 NDFS)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"解决了存储的问题,Cutting 和 Cafarella 同样也需要面对怎么样对分布式存储的数据进行计算的问题。NDFS 分布式存储的特点,也就是数据分块存储在多台机器上,那么计算需要并行的进行计算,这样的话可以利用分布式存储的特点,从而提高计算的性能。在他们摸索的时候,同样,在 2004 年 google 发布了另一篇 MapReduce 的论文,Cutting 和 Cafarella(卡法雷拉) 看完这个论文后,发现这个解决方案也就是他们想要的,所以,他们就按照这篇论文的思想重新实现了一个分布式计算的技术,也命名为 MapReduce。并在 2005 年,将 MapReduce 集成到 Nutch 项目中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"在 2006 年的二月,Cutting 将 NDFS 和 MapReduce 从 Nutch 项目中剥离出来,单独创建一个名为 Hadoop 的项目,这个 Hadoop 项目包含了 Hadoop Common、HDFS (其实就是 NDFS,就是改了下名字而已) 和 MapReduce 三个模块。这样 Hadoop 就出世了。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Hadoop 技术的发展"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"在这个时候,Yahoo 看到了 google 的 GFS 和 MapReduce 的好处,也想使用这个分布式技术来实现他们的数据存储和处理,他们决定聘用 Doug Cutting 来帮助他们使用 Hadoop 来代替他们之前的数据存储和处理方案。事实证明,这是一个明智的选择,在 2007 年二月份的时候,Yahoo 的 Hadoop 集群的节点就已经达到了 1000 个了。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2007 年和 2008 年是 Hadoop 发展最快的两年,在这两年内,很多大公司,比如 Twitter、Facebook、LinkedIn 等都在使用 Hadoop,并且他们还贡献了很多的以 Hadoop 为核心的大数据处理分析工具,比如分布式协调工具 Zookeeper、Pig、Hive 、HBase 等。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2009 年的亚马逊开始提供 MapReduce 计算服务。同年,Cutting 离开 Yahoo,以首席架构师的身份去了 Cloudera (也就是 Hadoop 商用的组织,CDH 就是他们研发的)"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"2012 年,Yahoo 的 Hadoop 集群的节点数达到了 42000 个,同年,Hadoop 的 contributors 也达到了最多的 1200 个。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hadoop 2.x"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"以上我们聊的是 1.x 版本的 Hadoop,在 Hadoop 1.x 中的 MapReduce 有一个很大的问题,那就是分布式数据处理和资源管理是在一起的,这样当数据量特别大,或者当数据处理任务特别多的时候会存在性能瓶颈,为了解决这个问题,在 2012 年,Hadoop 团队将分布式数据处理的逻辑和资源管理的逻辑拆分开,形成了一个独立的分布式资源管理技术,也就是 Yarn,这代表着 Hadoop 进入 Hadoop 2.x 时代。"}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"Hadoop 3.x"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"为了提高 Hadoop 的性能,在 2017 年 12 月份,Hadoop 发布了 Hadoop 3.0.0 版本,预示着 Hadoop 进入到 Hadoop 3.x 时代。相对于 Hadoop 2.x,Hadoop 3.x 做了很多性能和功能上的优化。至今 Hadoop 3.x 版本仍然在迭代更新。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"size","attrs":{"size":14}}],"text":"这篇文章就分享到这里,感谢你的阅读和收听,如果你觉着这篇文章对你有所帮助的话,你可以分享给你更多的朋友哟~~"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"系统学习大数据技术入口:"},{"type":"link","attrs":{"href":"http://douma.ke.qq.com","title":""},"content":[{"type":"text","text":"http://douma.ke.qq.com"}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}}]}
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章