Kafka Documentation (7): 0.10.1 Quickstart

This tutorial assumes you are starting fresh and have no existing Kafka™ or ZooKeeper data. Since Kafka console scripts are different for Unix-based and Windows platforms, on Windows platforms use bin\windows\ instead of bin/, and change the script extension to .bat.

Step 1: Download the code

Download the 0.10.1.0 release and un-tar it.
> tar -xzf kafka_2.11-0.10.1.0.tgz
> cd kafka_2.11-0.10.1.0

Step 2: Start the server

Kafka uses ZooKeeper so you need to first start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with kafka to get a quick-and-dirty single-node ZooKeeper instance.


> bin/zookeeper-server-start.sh config/zookeeper.properties
[2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

Now start the Kafka server:

> bin/kafka-server-start.sh config/server.properties
[2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
[2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576 (kafka.utils.VerifiableProperties)
...
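
Both startup scripts read plain properties files. For orientation, the 0.10.1 defaults that matter for this quickstart are roughly the following (check the files under config/ in your own download, as they may differ):

config/zookeeper.properties:
    dataDir=/tmp/zookeeper
    clientPort=2181

config/server.properties:
    broker.id=0
    log.dirs=/tmp/kafka-logs
    zookeeper.connect=localhost:2181

With these defaults the broker listens on port 9092, which is why the commands below use localhost:9092 as the broker address and localhost:2181 as the ZooKeeper address.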

Step 3: Create a topic

Let's create a topic named "test" with a single partition and only one replica:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

We can now see that topic if we run the list topic command:

> bin/kafka-topics.sh --list --zookeeper localhost:2181
test
Alternatively, instead of manually creating topics you can also configure your brokers to auto-create topics when a non-existent topic is published to.
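
For reference, the broker setting behind this behavior is auto.create.topics.enable (enabled by default in 0.10.1); auto-created topics take their partition and replica counts from num.partitions and default.replication.factor, so the relevant fragment of a broker configuration would look like this:

config/server.properties:
    auto.create.topics.enable=true
    num.partitions=1
    default.replication.factor=1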


Step 4: Send some messages

Kafka comes with a command line client that will take input from a file or from standard input and send it out as messages to the Kafka cluster. By default, each line will be sent as a separate message.

Run the producer and then type a few messages into the console to send to the server.


> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
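
The console producer is a thin wrapper around the Java producer client, so the same messages can also be sent programmatically. Below is a minimal, illustrative sketch against the 0.10.1 Java client; the topic name test and the broker address come from this quickstart, while the class name is made up for the example.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class QuickstartProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Point the client at the broker started in Step 2.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        Producer<String, String> producer = new KafkaProducer<>(props);
        // Each send() becomes one message on the topic, just like one line typed into the console producer.
        producer.send(new ProducerRecord<>("test", "This is a message"));
        producer.send(new ProducerRecord<>("test", "This is another message"));
        producer.close();   // flush outstanding sends before exiting
    }
}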

Step 5: Start a consumer

Kafka also has a command line consumer that will dump out messages to standard output.

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message
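
The console consumer is likewise backed by the Java consumer client. A minimal programmatic equivalent of the command above might look like the sketch below; the group id and class name are illustrative, and setting auto.offset.reset to earliest plays the role of --from-beginning for a group with no committed offsets.

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QuickstartConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "quickstart-group");        // illustrative consumer group
        props.put("auto.offset.reset", "earliest");       // start from the beginning when no offsets exist
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("test"));
        // Poll in a loop and print each message value, mirroring the console consumer; stop with Ctrl-C.
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(100);
            for (ConsumerRecord<String, String> record : records)
                System.out.println(record.value());
        }
    }
}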

If you have each of the above commands running in a different terminal then you should now be able to type messages into the producer terminal and see them appear in the consumer terminal.

All of the command line tools have additional options; running the command with no arguments will display usage information documenting them in more detail.


Step 6: Setting up a multi-broker cluster

So far we have been running against a single broker, but that's no fun. For Kafka, a single broker is just a cluster of size one, so nothing much changes other than starting a few more broker instances. But just to get a feel for it, let's expand our cluster to three nodes (still all on our local machine).

First we make a config file for each of the brokers (on Windows use the copy command instead):

> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties

Now edit these new files and set the following properties:

config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1

config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2

The broker.id property is the unique and permanent name of each node in the cluster. We have to override the port and log directory only because we are running these all on the same machine and we want to keep the brokers from all trying to register on the same port or overwrite each other's data.

We already have Zookeeper and our single node started, so we just need to start the two new nodes:

> bin/kafka-server-start.sh config/server-1.properties &
...
> bin/kafka-server-start.sh config/server-2.properties &
...

Now create a new topic with a replication factor of three:

> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic

Okay but now that we have a cluster how can we know which broker is doing what? To see that run the "describe topics" command:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: my-replicated-topic	Partition: 0	Leader: 1	Replicas: 1,2,0	Isr: 1,2,0

Here is an explanation of the output. The first line gives a summary of all the partitions; each additional line gives information about one partition. Since we have only one partition for this topic there is only one line.

  • "leader" is the node responsible for all reads and writes for the given partition. Each node will be the leader for a randomly selected portion of the partitions.
  • "replicas" is the list of nodes that replicate the log for this partition regardless of whether they are the leader or even if they are currently alive.
  • "isr" is the set of "in-sync" replicas. This is the subset of the replicas list that is currently alive and caught-up to the leader.

Note that in my example node 1 is the leader for the only partition of the topic.

We can run the same command on the original topic we created to see where it is:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
Topic:test	PartitionCount:1	ReplicationFactor:1	Configs:
	Topic: test	Partition: 0	Leader: 0	Replicas: 0	Isr: 0

So there is no surprise there—the original topic has no replicas and is on server 0, the only server in our cluster when we created it.

Let's publish a few messages to our new topic:

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Now let's consume these messages:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Now let's test out fault-tolerance. Broker 1 was acting as the leader so let's kill it:

> ps aux | grep server-1.properties
7564 ttys002    0:15.91 /System/Library/Frameworks/JavaVM.framework/Versions/1.8/Home/bin/java...
> kill -9 7564
On Windows use:
> wmic process get processid,caption,commandline | find "java.exe" | find "server-1.properties"
java.exe    java  -Xmx1G -Xms1G -server -XX:+UseG1GC ... build\libs\kafka_2.10-0.10.1.0.jar"  kafka.Kafka config\server-1.properties    644
> taskkill /pid 644 /f

Leadership has switched to one of the slaves and node 1 is no longer in the in-sync replica set:

> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
Topic:my-replicated-topic	PartitionCount:1	ReplicationFactor:3	Configs:
	Topic: my-replicated-topic	Partition: 0	Leader: 2	Replicas: 1,2,0	Isr: 2,0

But the messages are still available for consumption even though the leader that took the writes originally is down:

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic my-replicated-topic
...
my test message 1
my test message 2
^C

Step 7: Use Kafka Connect to import/export data

Writing data from the console and writing it back to the console is a convenient place to start, but you'll probably want to use data from other sources or export data from Kafka to other systems. For many systems, instead of writing custom integration code you can use Kafka Connect to import or export data.

Kafka Connect is a tool included with Kafka that imports and exports data to Kafka. It is an extensible tool that runs connectors, which implement the custom logic for interacting with an external system. In this quickstart we'll see how to run Kafka Connect with simple connectors that import data from a file to a Kafka topic and export data from a Kafka topic to a file.

First, we'll start by creating some seed data to test with:

> echo -e "foo\nbar" > test.txt


Next, we'll start two connectors running in standalone mode, which means they run in a single, local, dedicated process. We provide three configuration files as parameters. The first is always the configuration for the Kafka Connect process, containing common configuration such as the Kafka brokers to connect to and the serialization format for data. The remaining configuration files each specify a connector to create. These files include a unique connector name, the connector class to instantiate, and any other configuration required by the connector.


> bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties config/connect-file-sink.properties

These sample configuration files, included with Kafka, use the default local cluster configuration you started earlier and create two connectors: the first is a source connector that reads lines from an input file and produces each to a Kafka topic and the second is a sink connector that reads messages from a Kafka topic and produces each as a line in an output file.
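
For reference, the two shipped sample connector files are short properties files along these lines (consult the actual files under config/ for the authoritative contents; the values shown here are what 0.10.1 distributes):

config/connect-file-source.properties:
    name=local-file-source
    connector.class=FileStreamSource
    tasks.max=1
    file=test.txt
    topic=connect-test

config/connect-file-sink.properties:
    name=local-file-sink
    connector.class=FileStreamSink
    tasks.max=1
    file=test.sink.txt
    topics=connect-test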


During startup you'll see a number of log messages, including some indicating that the connectors are being instantiated. Once the Kafka Connect process has started, the source connector should start reading lines from test.txt and producing them to the topic connect-test, and the sink connector should start reading messages from the topic connect-test and write them to the file test.sink.txt. We can verify the data has been delivered through the entire pipeline by examining the contents of the output file:


> cat test.sink.txt
foo
bar

Note that the data is being stored in the Kafka topic connect-test, so we can also run a console consumer to see the data in the topic (or use custom consumer code to process it):

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"foo"}
{"schema":{"type":"string","optional":false},"payload":"bar"}
...

The connectors continue to process data, so we can add data to the file and see it move through the pipeline:

> echo "Another line" >> test.txt

You should see the line appear in the console consumer output and in the sink file.

Step 8: Use Kafka Streams to process data

Kafka Streams is a client library of Kafka for real-time stream processing and analyzing data stored in Kafka brokers. This quickstart example will demonstrate how to run a streaming application coded in this library. Here is the gist of the WordCountDemo example code (converted to use Java 8 lambda expressions for easy reading).


KTable<String, Long> wordCounts = textLines
    // Split each text line, by whitespace, into words.
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))

    // Ensure the words are available as record keys for the next aggregate operation.
    .map((key, value) -> new KeyValue<>(value, value))

    // Count the occurrences of each word (record key) and store the results into a table named "Counts".
    .countByKey("Counts")
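
The gist above shows only the transformation; it leaves out where textLines comes from and where wordCounts goes. As a rough, illustrative sketch (not the actual WordCountDemo source), wiring it into a runnable 0.10.1 Streams application could look like this, with the topic names taken from the steps below and the application id and class name made up for the example:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Default serdes, used when the stream is repartitioned after map() changes the key.
        props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

        KStreamBuilder builder = new KStreamBuilder();
        // Read the input topic produced to in the next step.
        KStream<String, String> textLines =
            builder.stream(Serdes.String(), Serdes.String(), "streams-file-input");

        // Word-count logic from the gist above.
        KTable<String, Long> wordCounts = textLines
            .flatMapValues(value -> Arrays.asList(value.toLowerCase().split("\\W+")))
            .map((key, value) -> new KeyValue<>(value, value))
            .countByKey("Counts");

        // Write the stream of count updates to the output topic read at the end of this step.
        wordCounts.to(Serdes.String(), Serdes.Long(), "streams-wordcount-output");

        // Unlike the shipped demo, this sketch runs until it is killed.
        new KafkaStreams(builder, props).start();
    }
}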

It implements the WordCount algorithm, which computes a word occurrence histogram from the input text. However, unlike other WordCount examples you might have seen before that operate on bounded data, the WordCount demo application behaves slightly differently because it is designed to operate on an infinite, unbounded stream of data. Similar to the bounded variant, it is a stateful algorithm that tracks and updates the counts of words. However, since it must assume potentially unbounded input data, it will periodically output its current state and results while continuing to process more data because it cannot know when it has processed "all" the input data.

We will now prepare input data to a Kafka topic, which will subsequently be processed by a Kafka Streams application.


> echo -e "all streams lead to kafka\nhello kafka streams\njoin kafka summit" > file-input.txt
Or on Windows:
> echo all streams lead to kafka> file-input.txt
> echo hello kafka streams>> file-input.txt
> echo|set /p=join kafka summit>> file-input.txt

Next, we send this input data to the input topic named streams-file-input using the console producer (in practice, stream data will likely be flowing continuously into Kafka where the application will be up and running):

> bin/kafka-topics.sh --create \
            --zookeeper localhost:2181 \
            --replication-factor 1 \
            --partitions 1 \
            --topic streams-file-input
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic streams-file-input < file-input.txt

We can now run the WordCount demo application to process the input data:

> bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

There won't be any STDOUT output except log entries as the results are continuously written back into another topic named streams-wordcount-output in Kafka. The demo will run for a few seconds and then, unlike typical stream processing applications, terminate automatically.

We can now inspect the output of the WordCount demo application by reading from its output topic:


> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
            --topic streams-wordcount-output \
            --from-beginning \
            --formatter kafka.tools.DefaultMessageFormatter \
            --property print.key=true \
            --property print.value=true \
            --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
            --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer

with the following output data being printed to the console:

all     1
lead    1
to      1
hello   1
streams 2
join    1
kafka   3
summit  1

Here, the first column is the Kafka message key and the second column is the message value; the key is a java.lang.String and the value is a java.lang.Long (the running count), matching the deserializers passed to the console consumer above. Note that the output is actually a continuous stream of updates, where each data record (i.e. each line in the original output above) is an updated count of a single word, aka record key such as "kafka". For multiple records with the same key, each later record is an update of the previous one.


Now you can write more input messages to the streams-file-input topic and observe additional messages added to streams-wordcount-output topic, reflecting updated word counts (e.g., using the console producer and the console consumer, as described above).

You can stop the console consumer via Ctrl-C.

