A Neat Way to Implement Word Count with Hive and Pandas

This article requires Pandas 0.25 or later, because explode was introduced in that release. Check the installed Pandas version with:

pip show pandas

To upgrade Pandas to the latest version:

pip install pandas --upgrade

To install a specific version of Pandas instead:

pip install pandas==1.0.3
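
The version requirement can also be checked from inside Python. The snippet below is a minimal sketch: it tests for the explode method directly rather than parsing the version string, and the variable name has_explode is only illustrative.

```python
import pandas as pd

# DataFrame.explode was added in pandas 0.25, so checking for the method
# is a direct capability test that needs no version-string parsing.
has_explode = hasattr(pd.DataFrame, "explode")

print("pandas version:", pd.__version__)
print("explode available:", has_explode)
```

If explode is reported as unavailable, upgrade Pandas with one of the pip commands above.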


Results at a glance

In Hive:

hive> select s from wc_t;
s
apple apple cdh
dest cdh firend dest
english firend apple dest
girl gift hit dest girl
Time taken: 0.191 seconds, Fetched 4 row(s)


select wc.word, count(1) count
from
  (select explode(split(s, ' ')) as word from wc_t) wc
group by wc.word
order by count desc;


word    count
--------------
dest    4
apple   3
girl    2
cdh     2
firend  2
hit     1
english 1
gift    1
Time taken: 2.119 seconds, Fetched 8 row(s)

In Python:

import pandas as pd


In[1]:
df = pd.read_csv("word.txt", header=None)
df
out[1]: 
   0
0  apple apple cdh
1  dest cdh firend dest
2  english firend apple dest
3  girl gift hit dest girl


In[2]: 
df = pd.read_csv("word.txt", header=None, names=['s'])
df["s"] = df["s"].str.split(" ")
se = df.explode("s").rename(columns={"s": "word"}).groupby("word").apply(len)
se.sort_values(ascending=False, inplace=True)
se.reset_index(name="count")
out[2]:
   word    count
0  dest     4
1  apple    3
2  girl     2
3  firend   2
4  cdh      2
5  hit      1
6  gift     1
7  english  1


The Hive implementation in detail

Contents of word.txt:

apple apple cdh
dest cdh firend dest
english firend apple dest
girl gift hit dest girl

Prepare the Hive table:

create table wc_t(s string);
load data local inpath 'word.txt' into table wc_t;

First, use the split function to cut each line into an array of individual words:

hive> select split(s, ' ') from wc_t;
["apple","apple","cdh"]
["dest","cdh","firend","dest"]
["english","firend","apple","dest"]
["girl","gift","hit","dest","girl"]
Time taken: 0.36 seconds, Fetched 4 row(s)

Then use the explode function to turn each element of the array into its own row:

hive> select explode(split(s, ' ')) word from wc_t;
word
apple
apple
cdh
dest
cdh
firend
dest
english
firend
apple
dest
girl
gift
hit
dest
girl
Time taken: 0.207 seconds, Fetched 16 row(s)

Finally, group the rows by word and count them with an aggregate function:

select wc.word, count(1) count
from
  (select explode(split(s, ' ')) word from wc_t) wc
group by wc.word
order by count desc;


word    count
--------------
dest    4
apple   3
girl    2
cdh     2
firend  2
hit     1
english 1
gift    1
Time taken: 2.119 seconds, Fetched 8 row(s)
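
For readers who want to cross-check the Hive result without a cluster, the same three steps (split, explode, group and count) can be sketched in plain Python. This is only an illustration of the logic, not what Hive executes internally; the four lines of word.txt are inlined as a list, and the names lines, words, and counts are illustrative.

```python
from collections import Counter

# The four lines of word.txt, inlined so the sketch needs no file.
lines = [
    "apple apple cdh",
    "dest cdh firend dest",
    "english firend apple dest",
    "girl gift hit dest girl",
]

# split: cut each line into words; explode: flatten into one word per entry
words = [word for line in lines for word in line.split(" ")]

# group by word and count occurrences
counts = Counter(words)

# most_common() lists words from the highest count to the lowest
for word, count in counts.most_common():
    print(word, count)
```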


The Python implementation in detail

Read the data:

import pandas as pd


df = pd.read_csv("word.txt", header=None, names=['s'])
df
   s
0  apple apple cdh
1  dest cdh firend dest
2  english firend apple dest
3  girl gift hit dest girl


Split each line into a list of words:

df["s"] = df["s"].str.split(" ")
df
   s
0  [apple, apple, cdh]
1  [dest, cdh, firend, dest]
2  [english, firend, apple, dest]
3  [girl, gift, hit, dest, girl]

Expand each list element into its own row:

df = df.explode("s")
df
   s
0  apple
0  apple
0  cdh
1  dest
1  cdh
1  firend
1  dest
2  english
2  firend
2  apple
2  dest
3  girl
3  gift
3  hit
3  dest
3  girl

Rename the column:

df = df.rename(columns={"s": "word"})
df.head()
   word
0  apple
0  apple
0  cdh
1  dest
1  cdh

Group and aggregate to count how many times each word occurs (this returns a Series):

se = df.groupby("word").apply(len)
se
word
apple      3
cdh        2
dest       4
english    1
firend     2
gift       1
girl       2
hit        1
dtype: int64

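After the rename, word is the only column, so there is no separate numeric column for an aggregation such as sum to operate on. Here the size of each group is obtained by passing len to apply; groupby("word").size() returns the same counts without apply.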
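
As a minimal sketch of the size() alternative, the snippet below counts group sizes on a small hand-made word column (the seven words are made-up sample data, not word.txt) and then sorts and converts the result to a DataFrame in the same way as the following steps.

```python
import pandas as pd

# Made-up sample data: one word per row, as produced by explode + rename.
df = pd.DataFrame(
    {"word": ["apple", "apple", "cdh", "dest", "cdh", "dest", "dest"]}
)

# size() returns the number of rows in each group as a Series indexed by word.
se = df.groupby("word").size()

# Sort descending and turn the Series back into a two-column DataFrame.
result = se.sort_values(ascending=False).reset_index(name="count")
print(result)
```

size() counts rows per group directly, so it does not need a Python-level function call for every group.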

Sort by count:

se.sort_values(ascending=False, inplace=True)
se
word
dest       4
apple      3
girl       2
firend     2
cdh        2
hit        1
gift       1
english    1
dtype: int64

Finally, convert the result back to a DataFrame:

se.reset_index(name="count")
   word    count
0  dest     4
1  apple    3
2  girl     2
3  firend   2
4  cdh      2
5  hit      1
6  gift     1
7  english  1

All in one go:

df = pd.read_csv("word.txt", header=None, names=['s'])
df["s"] = df["s"].str.split(" ")
se = df.explode("s").rename(columns={"s": "word"}).groupby("word").apply(len)
se.sort_values(ascending=False, inplace=True)
se.reset_index(name="count")


   word    count
0  dest     4
1  apple    3
2  girl     2
3  firend   2
4  cdh      2
5  hit      1
6  gift     1
7  english  1
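
The same pipeline can also be written as a single method chain that does not reassign or modify df in place. The following is one possible sketch rather than the approach used above: it swaps groupby/apply for Series.value_counts, which already sorts by count in descending order, and it reads the text from an in-memory string (io.StringIO) instead of word.txt so that it runs on its own.

```python
import io

import pandas as pd

# Contents of word.txt, inlined so the sketch needs no file on disk.
text = (
    "apple apple cdh\n"
    "dest cdh firend dest\n"
    "english firend apple dest\n"
    "girl gift hit dest girl\n"
)

result = (
    pd.read_csv(io.StringIO(text), header=None, names=["s"])["s"]
    .str.split(" ")             # each line becomes a list of words
    .explode()                  # one word per row
    .value_counts()             # count per word, highest first
    .rename_axis("word")        # name the index so it becomes the word column
    .reset_index(name="count")  # back to a two-column DataFrame
)
print(result)
```

The relative order of words with equal counts may differ from the output shown above.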


A small example

Suppose there is a file gross.csv with the following contents:

Action|Adventure|Fantasy|Sci-Fi,760505847.0
Action|Adventure|Fantasy,309404152.0
Action|Adventure|Thriller,200074175.0
Action|Thriller,448130642.0
Documentary,
Action|Adventure|Sci-Fi,73058679.0
Action|Adventure|Romance,336530303.0
Adventure|Animation|Comedy|Family|Fantasy|Musical|Romance,200807262.0
Action|Adventure|Sci-Fi,458991599.0
Adventure|Family|Fantasy|Mystery,301956980.0
Action|Adventure|Sci-Fi,330249062.0
Action|Adventure|Sci-Fi,200069408.0
Action|Adventure,168368427.0
Action|Adventure|Fantasy,423032628.0
Action|Adventure|Western,89289910.0
Action|Adventure|Fantasy|Sci-Fi,291021565.0
Action|Adventure|Family|Fantasy,141614023.0

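Each row lists the genres a movie belongs to (separated by |) and that movie's total box-office gross. The task is to use Python or Hive to compute the total gross for each genre; a movie with several genres contributes its gross to every one of them.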

Hive implementation

Load the data:

CREATE TABLE movie_gross (
  genres string,
  gross bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';


load data local inpath 'gross.csv' into table movie_gross;

SQL query:

select a.genre,sum(a.gross) gross
from
  (select genre,gross from movie_gross
  lateral view explode(split(genres,'\\|')) tmp as genre) a
group by a.genre
order by gross desc;

Result:

genre        gross
-------------------------
Adventure    4284974020
Action       4230340420
Fantasy      2428342457
Sci-Fi       2113896160
Thriller     648204817
Family       644378265
Romance      537337565
Mystery      301956980
Musical      200807262
Animation    200807262
Comedy       200807262
Western      89289910
Documentary  NULL
Time taken: 2.592 seconds, Fetched 13 row(s)

Python implementation

import pandas as pd


df = pd.read_csv("gross.csv", header=None, names=["genres", "gross"])
df["genres"] = df["genres"].str.split("|")
df.explode("genres").groupby('genres').sum().sort_values("gross", ascending=False)

Result:

             gross
genres    
Adventure    4.284974e+09
Action       4.230340e+09
Fantasy      2.428342e+09
Sci-Fi       2.113896e+09
Thriller     6.482048e+08
Family       6.443783e+08
Romance      5.373376e+08
Mystery      3.019570e+08
Animation    2.008073e+08
Comedy       2.008073e+08
Musical      2.008073e+08
Western      8.928991e+07
Documentary  0.000000e+00
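
Note that the two results differ for Documentary: Hive reports NULL, while Pandas reports 0, because sum skips missing values and returns 0 for a group with no valid values. If the Hive behaviour is preferred, passing min_count=1 to sum is one way to get it. The sketch below assumes a three-row subset of gross.csv, inlined as a string so that it runs without the file.

```python
import io

import pandas as pd

# A three-row subset of gross.csv; the Documentary row has no gross value.
csv_text = (
    "Action|Thriller,448130642.0\n"
    "Documentary,\n"
    "Action|Adventure,168368427.0\n"
)

df = pd.read_csv(io.StringIO(csv_text), header=None, names=["genres", "gross"])
df["genres"] = df["genres"].str.split("|")
exploded = df.explode("genres")

# Default behaviour: a group containing only NaN sums to 0.0.
default_sum = exploded.groupby("genres")["gross"].sum()

# min_count=1 requires at least one non-NaN value; otherwise the sum is NaN.
strict_sum = (
    exploded.groupby("genres")["gross"]
    .sum(min_count=1)
    .sort_values(ascending=False)  # NaN is placed last by default
)
print(strict_sum)
```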
