Performance Tuning Techniques of Hive Big Data Table

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"本文要点"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大数据应用程序开发人员在从Hadoop文件系统或Hive表读取数据时遇到了挑战。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"合并作业(一种用于将小文件合并为大文件的技术)有助于提高读取Hadoop数据的性能。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通过合并,文件的数量显著减少,读取数据的查询时间更短。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"当通过map-reduce作业读取Hive表数据时,Hive调优参数也可以帮助提高性能。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/Tutorial","title":"","type":null},"content":[{"type":"text","text":"Hive"}]},{"type":"text","text":"表是一种依赖于结构化数据的大数据表。数据默认存储在Hive数据仓库中。为了将它存储在特定的位置,开发人员可以在创建表时使用location标记设置位置。Hive遵循同样的SQL概念,如行、列和模式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在读取Hadoop文件系统数据或Hive表数据时,大数据应用程序开发人员遇到了一个普遍的问题。数据是通过"},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/","title":"","type":null},"content":[{"type":"text","text":"spark streaming"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/nifi.apache.org\/","title":"","type":null},"content":[{"type":"text","text":"Nifi streaming"}]},{"type":"text","text":"作业、其他任何流或摄入程序写入Hadoop集群的。摄入作业将大量的小数据文件写入Hadoop集群。这些文件也称为part文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"这些part文件是跨不同数据节点写入的,如果当目录中的文件数量增加时,其他应用程序或用户试图读取这些数据,就会遇到性能瓶颈,速度缓慢。其中一个原因是数据分布在各个节点上。考虑一下驻留在多个分布式节点中的数据。数据越分散,读取数据的时间就越长,读取数据大约需要“N 
*(文件数量)”的时间,其中N是跨每个名字节点的节点数量。例如,如果有100万个文件,当我们运行MapReduce作业时,mapper就必须对跨数据节点的100万个文件运行,这将导致整个集群的利用率升高,进而导致性能问题。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"对于初学者来说,Hadoop集群有多个"},{"type":"link","attrs":{"href":"https:\/\/hadoop.apache.org\/docs\/current\/hadoop-project-dist\/hadoop-hdfs\/HdfsDesign.html","title":"","type":null},"content":[{"type":"text","text":"名字节点"}]},{"type":"text","text":",每个名字节点将有多个数据节点。摄入\/流作业跨多个数据节点写入数据,在读取这些数据时存在性能挑战。对于读取数据的作业,开发人员花费相当长的时间才能找出与查询响应时间相关的问题。这个问题主要发生在每天数据量以数十亿计的用户中。对于较小的数据集,这种性能技术可能不是必需的,但是为长期运行做一些额外的调优总是好的。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在本文中,我将讨论如何解决这些问题和性能调优技术,以提高Hive表的数据访问速度。与Cassandra和Spark等其他大数据技术类似,Hive是一个非常强大的解决方案,但需要数据开发人员和运营团队进行调优,才能在对Hive数据执行查询时获得最佳性能。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"让我们先看一些Hive数据使用的用例。"}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"用例"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Hive数据主要应用于以下应用程序:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大数据分析,就交易行为、活动、成交量等运行分析报告;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"跟踪欺诈活动并生成有关该活动的报告;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"基于数据创建仪表板;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"用于审计和存储历史数据;"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"为机器学习提供数据及围绕数据构建智能"}]}]}]},{"type":"heading","attrs":{"align":null,"level":2},"content":[{"type":"text","text":"优化技术"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"有几种方法可以将数据摄入Hive表。摄入可以通过Apache 
### Organizing Hadoop data

The first step is to organize the Hadoop data. We start with the ingestion/streaming job. First, the data needs to be partitioned. The most basic way to partition data is by day or by hour; you can even have both day and hour partitions. In some cases, within a day partition you can further partition by country, region, or another dimension that fits your data and use case. Think of a library shelf where books are arranged by genre, and each genre has a children's or adult section.

![Figure 1: Organized data](https://static001.geekbang.org/infoq/a8/a8aebb755e3dc116bf3dece1defb111d.png)

**Figure 1: Organized data**

With that as the example, we would write data into Hadoop directories like this:

```
hdfs://cluster-uri/app-path/category=children/genre=fairytale OR
hdfs://cluster-uri/app-path/category=adult/genre=thrillers
```

This keeps your data better organized. Most of the time, absent any special requirement, data is partitioned by day or hour:

```
hdfs://cluster-uri/app-path/day=20191212/hr=12
```
Or only by day, if that is all you need:

```
hdfs://cluster-uri/app-path/day=20191212
```

![Figure 2: Ingestion flow into partitioned folders](https://static001.geekbang.org/infoq/0f/0f5aa3110185fd74d0095bed6544c040.png)

**Figure 2: Ingestion flow into partitioned folders**

## Hadoop data format

When creating a Hive table, it is good to provide table compression properties such as zlib and a format such as ORC. During ingestion, the data will then be written in these formats. If your application writes to a plain Hadoop file system, providing the format is likewise recommended. Most ingestion frameworks, such as Spark or Nifi, have ways to specify the format. Specifying the data format helps keep the data organized in compressed form, which saves cluster space.
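As an illustration of both points, partitioning and storage format, a minimal, hedged DDL sketch might look like the following. The table name, columns, and location path are assumptions made up for this example; only the `STORED AS ORC` clause and the `orc.compress` table property are standard Hive features.

```
-- Hypothetical day-partitioned table stored as ORC with zlib compression.
-- Table name, columns, and location are illustrative only.
CREATE EXTERNAL TABLE IF NOT EXISTS app_events (
  event_id   STRING,
  event_time TIMESTAMP,
  payload    STRING
)
PARTITIONED BY (day STRING)
STORED AS ORC
LOCATION 'hdfs://cluster-uri/app-path/'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```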
## Consolidation job

Consolidation jobs play a crucial role in improving the overall read performance of Hadoop data. There are several parts to the consolidation technique. By default, the files written into HDFS directories are small part files, and when there are too many of them, reading the data runs into performance problems. Consolidation is not a Hive-specific feature; it is a technique for merging smaller files into bigger files. The technique is also not covered much anywhere online, which makes it all the more important, especially when batch applications read the data.

### What is a consolidation job?

By default, ingestion/streaming jobs writing into Hive write many small part files into the target directories; for a high-volume application, the file count for a single day can exceed 100,000. The real problem comes when we try to read the data: it can take hours to eventually return results, or the job may simply fail. For example, suppose you have a day-partition directory holding about one million small files. Running a count gives output like this:

```
#Before:
hdfs dfs -count -v /cluster-uri/app-path/day=20191212/*
Output = 1Million
```

After running the consolidation job, the number of files drops significantly, because all the small part files are merged into large files:

```
#After:
hdfs dfs -count -v /cluster-uri/app-path/day=20191212/*
Output = 1000
```

Note: cluster-uri varies by organization; it is the Hadoop cluster URI used to connect to that particular cluster.

### How does a consolidation job help?

Consolidating files is not only good for performance; it is also good for cluster health. According to Hadoop platform guidelines, nodes should not hold this many files. Too many files means too many nodes to read from, which translates into high latency. Remember that when Hive data is read, it is scanned across the data nodes; the more files there are, the longer the read time. It is therefore essential to merge all the small files into bigger files. Likewise, it is necessary to have purge routines if the data is no longer needed after a certain number of days.

### How consolidation works

There are several ways to merge files, depending mainly on where the data is being written. Below I discuss two common use cases:

- Writing data with Spark or Nifi into a Hive table under a day-partition directory
- Writing data with Spark or Nifi into the Hadoop file system (HDFS) directly

In these cases, the files land under the day folder, and the developer needs to follow one of the options below.

![Figure 3: Consolidation logic](https://static001.geekbang.org/infoq/62/620b13413ddc6486f605b7102abce860.png)

**Figure 3: Consolidation logic**

1. Write a script that performs the consolidation. The script takes a parameter such as the day, runs a Hive select query over that partition's data, and does an insert overwrite into the same partition. When Hive rewrites data in the same partition, it runs a map-reduce job and thereby reduces the number of files.
2. Overwriting the same data in the same command can, however, lead to unexpected data loss if the command fails. In that case, select the data from the day partition into a temporary partition first. If that succeeds, move the temporary partition's data into the actual partition with the load command. The steps are illustrated in figure 3.

Of these two options, option B is better: it fits all use cases and is the most efficient, because no data is lost if any individual step fails. Developers can write a Control-M job and schedule it to run around midnight the next day, when no active users are reading the data.
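To make option B concrete, here is a hedged HiveQL sketch of that flow. The database, table, and column names and the temporary-table path are hypothetical placeholders; the article's actual consolidation script is the one linked from its GitHub repository further below.

```
-- Step 1: copy the day's data into a temporary table. This insert runs a
-- map-reduce job that writes far fewer, larger files.
CREATE TABLE IF NOT EXISTS mydb.app_events_tmp LIKE mydb.app_events;

INSERT OVERWRITE TABLE mydb.app_events_tmp PARTITION (day = '20191212')
SELECT event_id, event_time, payload
FROM mydb.app_events
WHERE day = '20191212';

-- Step 2: only if step 1 succeeded, load the consolidated files back into the
-- real partition. The INPATH below is a placeholder for wherever the temporary
-- table's partition directory actually lives.
LOAD DATA INPATH '/warehouse/mydb.db/app_events_tmp/day=20191212'
OVERWRITE INTO TABLE mydb.app_events PARTITION (day = '20191212');
```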
There is also a case where the developer does not need to write a Hive query at all: instead, submit a Spark job that selects the same partition and overwrites the data. This is recommended only when the number of files in the partition folder is not too large and Spark can still read the data without over-allocating resources. This option suits low-volume use cases, and the extra step can improve the performance of reading the data.

## How does the whole flow work?

Let's walk through all of the above pieces with an example scenario.

Say you own an e-commerce application and you track daily customer volume across different purchase categories. Your application volume is high, and you need intelligent analytics on your users' buying habits and history.

From the presentation tier to the mid tier, you want to publish these messages with [Kafka](https://kafka.apache.org/) or IBM [MQ](https://www.ibm.com/products/mq). The next step is a streaming application that consumes the Kafka/MQ data and ingests it into a Hadoop Hive table; this can be done with Nifi or Spark. Before that, the Hive table needs to be designed and created. While creating the Hive table you have to decide what the partition columns look like, whether sorting is required, and which compression algorithm to use, such as [Snappy](https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/td-p/97110) or [Zlib](https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/td-p/97110).

The design of the Hive table is a crucial factor in overall performance, and you must think about how the data will be queried when designing it. If you want to query how many customers bought items in particular categories, such as toys or furniture, on a given day, then at most two partitions are recommended, for example a day partition and a category partition. The streaming application then ingests the data accordingly.

Knowing all these usage aspects in advance lets you design a table that fits your needs. So, for the example above, once the data is ingested into this table, it should be partitioned by day and category.

It is only the ingested data that forms the small files in the Hive location, so, as discussed above, consolidating those files becomes essential.

Next, you can set up a scheduler, or use Control-M, to run the consolidation job every night, say around 1 AM; it will invoke the consolidation scripts, and those scripts will consolidate the data for you. Finally, in these Hive locations, you should see the file count come down.
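As a simple illustration of that scheduling step (the article itself uses Control-M; cron is shown here only as a stand-in, and the script path and log file are made-up placeholders), a nightly crontab entry might look like this:

```
# Run the consolidation script at 1 AM every day for the previous day's partition.
# Paths are placeholders; "%" must be escaped inside crontab entries.
0 1 * * * /app/scripts/consolidate.sh $(date -d "yesterday" +\%Y\%m\%d) >> /var/log/consolidate.log 2>&1
```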
When the actual intelligent analytics run against the previous day's data, querying becomes easier and performs better.

## Hive parameter settings

When you read Hive table data through map-reduce jobs, a few tuning parameters come in handy. For more about them, see the documentation on [Hive tuning parameters](https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hive_tuning.html).

```
set hive.exec.parallel = true;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.cbo.enable = true;
set hive.compute.query.using.stats = true;
set hive.stats.fetch.column.stats = true;
set hive.stats.fetch.partition.stats = true;
set mapred.compress.map.output = true;
set mapred.output.compress = true;
set hive.execution.engine = tez;
```

To dig further into each of these properties, see this [tutorial](https://www.hdfstutorial.com/blog/hive-performance-tuning/).

## Technical implementation

Now let's walk through an example scenario step by step. Here I consider ingesting customer-event data into a Hive table. My downstream systems or teams will use this data to run further analytics (for example: on a given day, what did customers buy, and from which city?). The data will be used to analyze the demographics of my product's users, allowing me to troubleshoot or expand business use cases and understand where my active customers come from and what more I can do to grow the business.

### Step 1: Create a sample Hive table, with code as shown below:

![Step 1: sample Hive table DDL](https://static001.geekbang.org/infoq/82/82cf37e3385908830ce82ee6e4f70b45.png)
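The actual DDL appears only in the screenshot above, so as a stand-in, here is a hedged sketch of what such a customer-events table could look like. All names and columns here are assumptions, not the article's exact code; only the location matches the paths used later in this walkthrough.

```
-- Hypothetical customer-events table, partitioned by day and stored as ORC.
CREATE EXTERNAL TABLE IF NOT EXISTS customer_events (
  customer_id STRING,
  event_type  STRING,
  category    STRING,
  city        STRING,
  event_time  TIMESTAMP
)
PARTITIONED BY (day STRING)
STORED AS ORC
LOCATION '/data/customevents/'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```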
### Step 2: Set up a streaming job to ingest data into the Hive table

This streaming job can trigger a stream from real-time Kafka data, then transform it and ingest it into the Hive table.

![Figure 4: Hive data flow](https://static001.geekbang.org/infoq/70/7070665ca2f6b1ec0a0a653446efce1f.png)

**Figure 4: Hive data flow**

This way, as real-time data is ingested, it gets written into the day partition. Let's assume today's date is 20200101.

```
hdfs dfs -ls /data/customevents/day=20200101/
/data/customevents/day=20200101/part00000-djwhu28391
/data/customevents/day=20200101/part00001-gjwhu28e92
/data/customevents/day=20200101/part00002-hjwhu28342
/data/customevents/day=20200101/part00003-dewhu28392
/data/customevents/day=20200101/part00004-dfdhu24342
/data/customevents/day=20200101/part00005-djwhu28fdf
/data/customevents/day=20200101/part00006-djwffd8392
/data/customevents/day=20200101/part00007-ddfdggg292
```

By the end of the day, this count could be anywhere from 10K to 1M, depending on the application's traffic; for large companies the volume will be high. Let's assume the total number of files is 141K.

### Step 3: Run the consolidation job

On January 2, 2020, that is, the next day, at around 1 AM, we run the consolidation job. The sample code is uploaded to git; the file is named consolidate.sh.
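The real script is the one in the linked repository; purely as an illustration of the shape such a script could take (the database and table names, columns, and the temp-table path here are assumptions, not the repository's code), a minimal sketch might be:

```
#!/bin/bash
# Hypothetical consolidate.sh: consolidates one day's partition of a Hive table.
# Usage: ./consolidate.sh 20200101
DAY="$1"

# Rewrite the partition through a temporary table so that a failed step never
# destroys the original data (option B from the consolidation section above).
hive -e "
  CREATE TABLE IF NOT EXISTS mydb.customer_events_tmp LIKE mydb.customer_events;
  INSERT OVERWRITE TABLE mydb.customer_events_tmp PARTITION (day='${DAY}')
  SELECT customer_id, event_type, category, city, event_time
  FROM mydb.customer_events
  WHERE day='${DAY}';
" || exit 1

# Move the consolidated files back into the real partition. The INPATH is a
# placeholder for wherever the temporary table's partition directory lives.
hive -e "
  LOAD DATA INPATH '/warehouse/mydb.db/customer_events_tmp/day=${DAY}'
  OVERWRITE INTO TABLE mydb.customer_events PARTITION (day='${DAY}');
"
```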
Below is the command run from the edge node/box:

```
./consolidate.sh 20200101
```

The script then consolidates the previous day's data. Once the consolidation is done, you can rerun the count:

```
hdfs dfs -count -v /data/customevents/day=20200101/*
count = 800
```

Before, it was 141K; after consolidation, it is 800. This will therefore give you a significant performance boost. The consolidation logic code is available [here](https://github.com/skoloth/Hive-Consolidation).

## Statistics

Without any tuning techniques, the query time to read Hive table data ranges from five minutes to several hours, depending on the volume of data.

![Figure 5: Statistics](https://static001.geekbang.org/infoq/d2/d2a1541d91029d214a4c50432181ea44.png)

**Figure 5: Statistics**

After consolidation, the query time drops significantly and we get results faster. The number of files is sharply reduced, and so is the query time to read the data. Without consolidation, queries run over many small files spread across the name nodes, and response times increase.

## References

- Koloth, K. S. (October 15, 2020). [The importance of big data on artificial intelligence](https://londondailypost.com/sudhish-koloth-the-importance-of-big-data-on-artificial-intelligence/).
- Apache Software Foundation. (n.d.). [Apache Hive](https://hive.apache.org/).
- Gauthier, G. L. (July 25, 2019). [Running Apache Hive 3, new features and tips and tricks](https://www.adaltas.com/en/2019/07/25/hive-3-features-tips-tricks/).

**About the Author**

[Sudhish Koloth](https://timebusinessnews.com/sudhish-koloth-played-a-key-role-2020/) is a lead developer at a banking and financial services company. He has worked in information technology for 13 years, with technologies including full-stack, big data, automation, and Android development. He also played an important role in delivering impactful [projects](https://play.google.com/store/apps/details?id=com.feedom.uandus) during the COVID-19 pandemic. Sudhish uses his expertise to solve common problems faced by humanity, volunteers his help for non-profit organizations' applications, and mentors other professionals and colleagues with his technical expertise. Mr. Sudhish is also an active evangelist and motivator on the importance of STEM education for school-age children and young college graduates. His work has been recognized both inside and outside his professional network.

Original article:

[Performance Tuning Techniques of Hive Big Data Table](https://www.infoq.com/articles/hive-performance-tuning-techniques/)