Google Cluster Data Analysis: clusterdata-2011-2

This post walks through the clusterdata-2011-2 dataset.

Source: https://github.com/google/cluster-data

Dataset documentation: https://drive.google.com/file/d/0B5g07T_gRDg9Z0lsSTEtTWtpOW8/view

Dataset description: The clusterdata-2011-2 trace represents 29 days' worth of cell information from May 2011, on a cluster of about 12.5k machines.

After importing the CSV files into MySQL, the statistics for each table are as follows (the table structures are at the end):

job events table:

rows: 1672923, 286.86 MB (300,795,688 bytes)

index: jobid (btree), 35.56 MB (37,285,888 bytes)

machine events table:

rows: 37780, 2.99 MB (3,138,540 bytes)

machine attributes table:

rows: 10748566, 1.09 GB (1,175,642,124 bytes)

task constraints table:

rows: 28485619, 2.95 GB (3,163,127,240 bytes)

task usage table:

rows: 1232799308, 182.55 GB (196,015,089,972 bytes)

indexes: machineid (btree) and jobid (btree), 69.61 GB (74,743,799,808 bytes)

task events table (the import ran into some problems, which I am still working through):

rows: 144648292, 12.76 GB (13,700,652,148 bytes)

indexes: machineid, jobid, username, 6.90 GB (7,414,187,008 bytes)

Explanation part 1: fields

Explanation part 2: tables

Part 1. Fields

A job consists of one or more tasks. Each task represents a Linux program, which may be made up of multiple processes.

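timestamp: in microseconds, counted from 600 s before the start of the trace (for example, an event 20 s into the trace has a timestamp of 620 s).

A timestamp of 0 marks an event that happened before the trace began, since jobs may have been submitted before tracing started.

A timestamp of 2^63 - 1 marks an event that happened after the trace ended.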
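The timestamp rules above can be sketched in a few lines of Python. This is only an illustration: the constant and function names are my own, not part of the trace documentation.

```python
from typing import Optional

TRACE_OFFSET_US = 600 * 1_000_000  # timestamps start counting 600 s before the trace
BEFORE_TRACE = 0                   # sentinel: event happened before the trace began
AFTER_TRACE = 2**63 - 1            # sentinel: event happened after the trace ended


def seconds_into_trace(timestamp_us: int) -> Optional[float]:
    """Return seconds since the start of the trace, or None for a sentinel value."""
    if timestamp_us in (BEFORE_TRACE, AFTER_TRACE):
        return None
    return (timestamp_us - TRACE_OFFSET_US) / 1_000_000


print(seconds_into_trace(620_000_000))  # 20.0
```

Returning None for the two sentinel values keeps them from being mistaken for real offsets when computing durations.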

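Job and machine IDs are never reused, so they can be treated as unique identifiers. (A machine ID may reappear when a machine is removed from the cluster and later added back; a job ID may reappear when a job is stopped, reconfigured, and restarted.)

User names and job names are hashed for confidentiality, but the hashes can still be tested for equality.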

machine event types: 0 = ADD, 1 = REMOVE, 2 = UPDATE

job and task event types: 0 = SUBMIT, 1 = SCHEDULE, 2 = EVICT, 3 = FAIL, 4 = FINISH, 5 = KILL, 6 = LOST, 7 = UPDATE_PENDING, 8 = UPDATE_RUNNING
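When querying the tables it is easier to work with names than with raw codes. A minimal sketch, assuming Python enums are acceptable; the class and helper names are my own, and the codes follow the trace format documentation, where FINISH is 4 and KILL is 5.

```python
from enum import IntEnum


class MachineEvent(IntEnum):
    ADD = 0
    REMOVE = 1
    UPDATE = 2


class TaskEvent(IntEnum):
    """Event types shared by the job events and task events tables."""
    SUBMIT = 0
    SCHEDULE = 1
    EVICT = 2
    FAIL = 3
    FINISH = 4
    KILL = 5
    LOST = 6
    UPDATE_PENDING = 7
    UPDATE_RUNNING = 8


def event_name(code: int) -> str:
    """Translate a numeric job/task event type into its name."""
    return TaskEvent(code).name


print(event_name(4))  # FINISH
```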

priority: 0 is the lowest.

infrastructure (11): this is the highest (most entitled to get resources) priority in the trace and accounts for most of the recorded disk I/O, so we speculate it includes some storage services;
monitoring (10);
normal production (9): this is the lowest (and most occupied) of the priorities labeled ‘production’. The trace providers indicate that jobs at this priority and higher which are latency-sensitive should not be “evicted due to over-allocation of machine resources”;
other (2-8): we speculate that these priorities are dominated by batch jobs;
gratis (free) (0-1): the trace providers indicate that resources used by tasks at these priorities are generally not charged.
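The priority bands listed above map directly onto a small lookup function. This is a sketch only; the function name and the returned labels are my own choice.

```python
def priority_band(priority: int) -> str:
    """Map a numeric priority (0-11) to the band names listed above."""
    if not 0 <= priority <= 11:
        raise ValueError(f"priority out of range: {priority}")
    if priority <= 1:
        return "gratis"
    if priority <= 8:
        return "other"
    if priority == 9:
        return "normal production"
    if priority == 10:
        return "monitoring"
    return "infrastructure"


print(priority_band(4))  # other
```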

missing info: NULL for normal records; a value of 0-2 marks a record that was synthesized to stand in for missing data.

0. SNAPSHOT_BUT_NO_TRANSITION: we did not find a record representing the given event, but a later snapshot of the job or task state indicated that the transition must have occurred. The timestamp of the synthesized event is the timestamp of the snapshot.

1. NO_SNAPSHOT_OR_TRANSITION: we did not find a record representing the given termination event, but the job or task disappeared from later snapshots of cluster states, so it must have been terminated. The timestamp of the synthesized event is a pessimistic upper bound on its actual termination time assuming it could have legitimately been missing from one snapshot.

2. EXISTS_BUT_NO_CREATION: we did not find a record representing the creation of the given task or job. In this case, we may be missing metadata (job name, resource requests, etc.) about the job or task and we may have placed SCHEDULE or SUBMIT events later than they actually are.

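scheduling class: a rough indication of how latency-sensitive a job is. It is represented by a single number: 3 denotes a job that is relatively latency-sensitive, and 0 denotes a non-production task (for example, non-business-critical analyses).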

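comparison operator: ??

I do not fully understand how the comparison is carried out...

Less than (2) and greater than (3): the machine attribute is interpreted as an integer (or 0 if the attribute is absent) and then compared with the supplied attribute value. These comparisons are strictly less than and strictly greater than. Equal (0) and not equal (1): the machine attribute is interpreted as a string (or the empty string if it is absent) and then compared with the supplied attribute value. (Translated from the documentation.)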
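To make the comparison rule more concrete, here is how I read it, written as a small Python function. This is my interpretation, not code from the trace providers. The function name is hypothetical, and treating an empty constraint value as 0 in the numeric case is my own assumption.

```python
from typing import Optional

EQUAL, NOT_EQUAL, LESS_THAN, GREATER_THAN = 0, 1, 2, 3


def satisfies(machine_value: Optional[str], op: int, constraint_value: str) -> bool:
    """Check one task constraint against one machine attribute value.

    machine_value is None when the machine does not have the attribute.
    """
    if op in (LESS_THAN, GREATER_THAN):
        # Numeric comparison: an absent attribute counts as 0.
        left = int(machine_value) if machine_value else 0
        # Assumption: an empty constraint value is also treated as 0.
        right = int(constraint_value) if constraint_value else 0
        return left < right if op == LESS_THAN else left > right
    if op in (EQUAL, NOT_EQUAL):
        # String comparison: an absent attribute counts as the empty string.
        left = machine_value or ""
        return (left == constraint_value) == (op == EQUAL)
    raise ValueError(f"unknown comparison operator: {op}")


print(satisfies("5", LESS_THAN, "10"))   # True
print(satisfies(None, GREATER_THAN, "0"))  # False
```

Note that a machine without the attribute can still satisfy a constraint: an absent attribute equals the empty string, and counts as 0 for the numeric operators.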

Part 2. Tables

1. Machine events
Each machine is described by one or more records in the machine event table. The majority of records describe machines that existed at the start of the trace.
1. timestamp
2. machine ID
3. event type
4. platform ID
5. capacity: CPU
6. capacity: memory
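The six columns above can be read into typed records before loading them into a database. A minimal sketch, under these assumptions: the CSV has no header row, an empty field means the value is missing, and the capacities are plain decimal numbers. The class and function names are hypothetical, and the sample row is made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, Optional, TextIO


@dataclass
class MachineEventRecord:
    timestamp: int
    machine_id: int
    event_type: int
    platform_id: str
    cpu_capacity: Optional[float]
    memory_capacity: Optional[float]


def _opt_float(text: str) -> Optional[float]:
    """Parse a float, mapping an empty field to None."""
    return float(text) if text else None


def read_machine_events(stream: TextIO) -> Iterator[MachineEventRecord]:
    """Yield one record per CSV row, in the column order listed above."""
    for row in csv.reader(stream):
        ts, machine_id, event_type, platform, cpu, mem = row
        yield MachineEventRecord(
            int(ts), int(machine_id), int(event_type),
            platform, _opt_float(cpu), _opt_float(mem),
        )


# Made-up sample row, not taken from the real trace.
sample = io.StringIO("0,5,0,platformA,0.5,\n")
for record in read_machine_events(sample):
    print(record.machine_id, record.cpu_capacity, record.memory_capacity)
```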

2. Job events and task events

The two event tables describe jobs/tasks and their lifecycles. The constraints table describes task placement constraints that restrict the machines onto which tasks can schedule.

The simplest case is shown by the top path in the diagram above: a job is SUBMITted and gets put into a pending queue; soon afterwards, it is SCHEDULEd onto a machine and starts running; some time later it FINISHes successfully.

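In other words: the job is first submitted (SUBMIT, 0) and enters the pending queue, is then scheduled onto a machine (SCHEDULE, 1), and finally finishes (FINISH, 4).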
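A quick way to check whether a task followed this simplest path is to sort its events by timestamp and compare the resulting sequence of event types. A sketch only; the function name and the (timestamp, event type) tuple layout are my own assumptions.

```python
from typing import Iterable, Tuple

SUBMIT, SCHEDULE, FINISH = 0, 1, 4


def followed_simple_path(events: Iterable[Tuple[int, int]]) -> bool:
    """events holds (timestamp, event_type) pairs for a single job or task."""
    ordered_types = [event_type for _, event_type in sorted(events)]
    return ordered_types == [SUBMIT, SCHEDULE, FINISH]


print(followed_simple_path([(30, 4), (10, 0), (20, 1)]))  # True
```

Any task that was evicted, failed, or was rescheduled along the way produces a longer sequence and returns False.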

3. Task usage

This blog post explains it in detail: https://blog.csdn.net/yangss123/article/details/78298749

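I generated the following intermediate tables:

the machine IDs belonging to each platform; all medium-priority tasks (priority 2-8); and all tasks that were successfully scheduled (event type 1). Each table has the corresponding indexes. (With the intermediate tables, query time dropped from several hours to under 1 minute.)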
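The idea behind the intermediate tables can be sketched as follows. This is not the schema I actually used. It runs on an in-memory SQLite database as a stand-in for MySQL, all table, column, and index names are hypothetical, and the sample rows are made up.

```python
import sqlite3


def build_intermediate_tables(conn: sqlite3.Connection) -> None:
    """Materialize filtered copies of task_events and index them."""
    # All medium-priority task records (priority 2-8).
    conn.execute(
        "CREATE TABLE mid_priority_tasks AS "
        "SELECT * FROM task_events WHERE priority BETWEEN 2 AND 8"
    )
    # All SCHEDULE events (event type 1).
    conn.execute(
        "CREATE TABLE scheduled_tasks AS "
        "SELECT * FROM task_events WHERE event_type = 1"
    )
    conn.execute("CREATE INDEX idx_mid_job ON mid_priority_tasks (job_id)")
    conn.execute("CREATE INDEX idx_sched_job ON scheduled_tasks (job_id)")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_events ("
    "timestamp INTEGER, job_id INTEGER, task_index INTEGER, "
    "machine_id INTEGER, event_type INTEGER, priority INTEGER)"
)
# Made-up rows: job 1 has priority 4, job 2 has priority 9.
conn.executemany(
    "INSERT INTO task_events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (10, 1, 0, None, 0, 4),
        (20, 1, 0, 7, 1, 4),
        (15, 2, 0, None, 0, 9),
        (25, 2, 0, 8, 1, 9),
    ],
)
build_intermediate_tables(conn)

mid_count = conn.execute("SELECT COUNT(*) FROM mid_priority_tasks").fetchone()[0]
scheduled_jobs = [
    row[0]
    for row in conn.execute("SELECT job_id FROM scheduled_tasks ORDER BY job_id")
]
print(mid_count, scheduled_jobs)  # 2 [1, 2]
```

Later queries then read the small, indexed tables instead of scanning the full event table, which is where the speedup comes from.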
