Apache Druid: Loading Data from Kafka -- A Complete Walkthrough

Table of Contents

I. Creating the Kafka Topic and Producer

II. Producing Data to Kafka

III. Configuring the Druid DataSource

1) Start

2) Connect

3) Parse Data

4) Parse Time

5) Transform [optional]

6) Filter [optional]

7) Configure Schema [key configuration]

8) Partition

9) Tune

10) Publish

11) Edit JSON spec

IV. Query Examples


I. Creating the Kafka Topic and Producer

1. Create the topic

kafka-topics.sh --create --zookeeper node-01:2181,node-02:2181,node-03:2181 --replication-factor 1 --partitions 1 --topic fast_sales
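If the topic was created successfully, it can be verified with the same CLI (a quick sanity check, using the same ZooKeeper quorum and the same ZooKeeper-based flags as the create command above):

```shell
kafka-topics.sh --describe --zookeeper node-01:2181,node-02:2181,node-03:2181 --topic fast_sales
```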

2. Create a producer

kafka-console-producer.sh --broker-list node-01:9092,node-02:9092,node-03:9092 --topic fast_sales

3. Create a consumer (to verify the data being produced)

kafka-console-consumer.sh --bootstrap-server node-01:9092,node-02:9092,node-03:9092 --topic fast_sales --group topic_test1_g1

II. Producing Data to Kafka

{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"家电","areaName":"北京","monye":"1550"}

{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"深圳","monye":"1000"}
{"timestamp":"2020-08-08T01:03:01Z","category":"手机","areaName":"深圳","monye":"2000"}
{"timestamp":"2020-08-08T01:04:01Z","category":"手机","areaName":"深圳","monye":"2200"}
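Rather than pasting each record into the console producer by hand, the sample records can be piped in from a heredoc -- a sketch that reuses the producer command from section I:

```shell
kafka-console-producer.sh --broker-list node-01:9092,node-02:9092,node-03:9092 --topic fast_sales <<'EOF'
{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"家电","areaName":"北京","monye":"1550"}
{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"深圳","monye":"1000"}
{"timestamp":"2020-08-08T01:03:01Z","category":"手机","areaName":"深圳","monye":"2000"}
{"timestamp":"2020-08-08T01:04:01Z","category":"手机","areaName":"深圳","monye":"2200"}
EOF
```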

III. Configuring the Druid DataSource

1) Start

2) Connect

3) Parse Data

4) Parse Time

5) Transform [optional]

6) Filter [optional]

7) Configure Schema [key configuration]

8) Partition

9) Tune

10) Publish

Max parse exceptions: 2147483647 (Integer.MAX_VALUE, i.e. effectively unlimited -- the ingestion task will never fail because of unparseable rows)

11) Edit JSON spec

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "fast_sales",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "iso"
        },
        "dimensionsSpec": {
          "dimensions": [
            "areaName",
            "category"
          ]
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      },
      {
        "type": "longSum",
        "name": "sum_monye",
        "fieldName": "monye",
        "expression": null
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "DAY",
      "queryGranularity": "MINUTE",
      "rollup": true,
      "intervals": null
    },
    "transformSpec": {
      "filter": null,
      "transforms": []
    }
  },
  "tuningConfig": {
    "type": "kafka",
    "maxRowsInMemory": 1000000,
    "maxBytesInMemory": 0,
    "maxRowsPerSegment": 5000000,
    "maxTotalRows": null,
    "intermediatePersistPeriod": "PT10M",
    "basePersistDirectory": "/usr/local/imply-3.0.4/var/tmp/1609509057384-0",
    "maxPendingPersists": 0,
    "indexSpec": {
      "bitmap": {
        "type": "concise"
      },
      "dimensionCompression": "lz4",
      "metricCompression": "lz4",
      "longEncoding": "longs"
    },
    "buildV9Directly": true,
    "reportParseExceptions": false,
    "handoffConditionTimeout": 0,
    "resetOffsetAutomatically": true,
    "segmentWriteOutMediumFactory": null,
    "workerThreads": null,
    "chatThreads": null,
    "chatRetries": 8,
    "httpTimeout": "PT10S",
    "shutdownTimeout": "PT80S",
    "offsetFetchPeriod": "PT30S",
    "intermediateHandoffPeriod": "P2147483647D",
    "logParseExceptions": true,
    "maxParseExceptions": 2147483647,
    "maxSavedParseExceptions": 0,
    "skipSequenceNumberAvailabilityCheck": false
  },
  "ioConfig": {
    "topic": "fast_sales",
    "replicas": 1,
    "taskCount": 1,
    "taskDuration": "PT3600S",
    "consumerProperties": {
      "bootstrap.servers": "node-01:9092,node-02:9092,node-03:9092"
    },
    "pollTimeout": 100,
    "startDelay": "PT5S",
    "period": "PT30S",
    "useEarliestOffset": false,
    "completionTimeout": "PT1800S",
    "lateMessageRejectionPeriod": null,
    "earlyMessageRejectionPeriod": null,
    "stream": "fast_sales",
    "useEarliestSequenceNumber": false,
    "type": "kafka"
  },
  "context": null,
  "suspended": false
}
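Besides saving it through the web console, the spec above can also be submitted directly to the Overlord's supervisor API. A sketch, assuming the Overlord listens on node-01:8090 and the spec has been saved as kafka-supervisor-spec.json (both are assumptions, not from the original setup):

```shell
curl -X POST -H 'Content-Type: application/json' \
  -d @kafka-supervisor-spec.json \
  http://node-01:8090/druid/indexer/v1/supervisor
```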

IV. Query Examples

1) The data source: fast_sales

2) Recall the records produced to Kafka:

{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"北京","monye":"1450"}
{"timestamp":"2020-08-08T01:03:00Z","category":"家电","areaName":"北京","monye":"1550"}

{"timestamp":"2020-08-08T01:03:00Z","category":"手机","areaName":"深圳","monye":"1000"}
{"timestamp":"2020-08-08T01:03:01Z","category":"手机","areaName":"深圳","monye":"2000"}
{"timestamp":"2020-08-08T01:04:01Z","category":"手机","areaName":"深圳","monye":"2200"}

-- Query all rows

-- Query rows within a time range

-- Count the total number of ingested records

-- Total sales amount, grouped by area and product category

-- Total sales amount, grouped by area

-- Total sales amount, grouped by product category

-- Filter by time range first, then group by area and product category to compute total sales
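The query descriptions above can be written in Druid SQL roughly as follows -- a sketch, not the original author's queries: the time bounds are illustrative, the column names come from the spec in section III, and since rollup is enabled the record count sums the "count" metric rather than using COUNT(*):

```sql
-- Query all rows
SELECT * FROM "fast_sales";

-- Query rows within a time range
SELECT * FROM "fast_sales"
WHERE "__time" >= TIMESTAMP '2020-08-08 01:03:00'
  AND "__time" <  TIMESTAMP '2020-08-08 01:04:00';

-- Count the total number of ingested records
-- (rollup is enabled, so sum the "count" metric instead of COUNT(*))
SELECT SUM("count") AS total_records FROM "fast_sales";

-- Total sales amount, grouped by area and product category
SELECT "areaName", "category", SUM("sum_monye") AS total_money
FROM "fast_sales"
GROUP BY "areaName", "category";

-- Total sales amount, grouped by area
SELECT "areaName", SUM("sum_monye") AS total_money
FROM "fast_sales"
GROUP BY "areaName";

-- Total sales amount, grouped by product category
SELECT "category", SUM("sum_monye") AS total_money
FROM "fast_sales"
GROUP BY "category";

-- Filter by time range first, then group by area and product category
SELECT "areaName", "category", SUM("sum_monye") AS total_money
FROM "fast_sales"
WHERE "__time" >= TIMESTAMP '2020-08-08 01:03:00'
  AND "__time" <  TIMESTAMP '2020-08-08 01:05:00'
GROUP BY "areaName", "category";
```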

 

