Data Competition Task1

Original post

2019-04-06 15:03

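Data:

train_set.csv: This dataset is used to train the model; each row corresponds to one article. The articles have been desensitized at both the character level and the word level. There are four columns:
The first column is the article index (id); the second is the article text at the character level, i.e. space-separated character IDs (article); the third is the text at the word level, i.e. space-separated word IDs (word_seg); the fourth is the article's label (class).
Note: each number corresponds to one character, one word, or one punctuation mark. Character IDs and word IDs are numbered independently!

test_set.csv: This dataset is used for testing. Its format is the same as train_set.csv, except that it has no class column.
Note: article ids in test_set and train_set are numbered independently.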
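To make the format concrete, here is a minimal sketch of turning one row into lists of integer IDs. The tiny DataFrame, its ID values and labels, and the helper name `to_ids` are made up for illustration; only the column names come from the description above, and the sketch assumes the IDs are separated by single spaces.

```python
import pandas as pd

# Hypothetical two-row sample in the same layout as train_set.csv.
# The ID values and class labels are invented for illustration.
sample = pd.DataFrame({
    'id': [0, 1],
    'article': ['7 12 7 3', '5 9'],
    'word_seg': ['101 44', '230'],
    'class': [14, 3],
})

def to_ids(text):
    """Split a space-separated string of desensitized IDs into a list of ints."""
    return [int(tok) for tok in text.split(' ')]

# Character IDs and word IDs live in separate vocabularies,
# so they are kept in separate columns.
sample['char_ids'] = sample['article'].map(to_ids)
sample['word_ids'] = sample['word_seg'].map(to_ids)

print(sample[['char_ids', 'word_ids']])
```

Because the two numbering schemes are independent, the same number in `char_ids` and in `word_ids` does not refer to the same token.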

Code:

import pandas as pd

# Load both sets; index_col=None keeps 'id' as an ordinary column
train = pd.read_csv('./data/train_set.csv', index_col=None)
test = pd.read_csv('./data/test_set.csv', index_col=None)

# Check the size and column names of each set
print('train :', train.shape)
print('test :', test.shape)
print(train.columns)
print(test.columns)

# Length of each article in characters and in words (IDs are space-separated)
train['char_len'] = train['article'].map(lambda x: len(x.split(' ')))
train['word_len'] = train['word_seg'].map(lambda x: len(x.split(' ')))
test['char_len'] = test['article'].map(lambda x: len(x.split(' ')))
test['word_len'] = test['word_seg'].map(lambda x: len(x.split(' ')))

# Summary statistics of the lengths
print(train['char_len'].describe())
print(train['word_len'].describe())
print(test['char_len'].describe())
print(test['word_len'].describe())

Output:

train : (102277, 4)
test : (102277, 3)
Index(['id', 'article', 'word_seg', 'class'], dtype='object')
Index(['id', 'article', 'word_seg'], dtype='object')
count    102277.000000
mean       1177.100159
std        1348.431565
min          50.000000
25%         497.000000
50%         842.000000
75%        1408.000000
max       55804.000000
Name: char_len, dtype: float64
count    102277.000000
mean        716.954604
std         801.804540
min           6.000000
25%         305.000000
50%         514.000000
75%         862.000000
max       39759.000000
Name: word_len, dtype: float64
count    102277.000000
mean       1177.731865
std        1320.447219
min          50.000000
25%         499.000000
50%         842.000000
75%        1412.000000
max       31694.000000
Name: char_len, dtype: float64
count    102277.000000
mean        718.052152
std         792.131628
min           6.000000
25%         306.000000
50%         516.000000
75%         863.000000
max       19755.000000
Name: word_len, dtype: float64

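Analysis:

The training set and the test set each contain 102277 samples.

In the training set, the mean, maximum, and minimum article length per sample (in characters) are 1177, 55804, and 50.

In the training set, the mean, maximum, and minimum word_seg length per sample (in words) are 717, 39759, and 6.

In the test set, the mean, maximum, and minimum article length per sample (in characters) are 1178, 31694, and 50.

In the test set, the mean, maximum, and minimum word_seg length per sample (in words) are 718, 19755, and 6.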
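The mean, maximum, and minimum can also be collected into one table rather than read off four separate `describe()` printouts. The following is a rough sketch under stated assumptions: the helper `length_summary` and the toy DataFrame are hypothetical, and on the real data one would pass `train` and `test` instead.

```python
import pandas as pd

def length_summary(df, columns=('article', 'word_seg')):
    """Return mean, max and min token counts for each text column."""
    lengths = pd.DataFrame({
        col: df[col].map(lambda x: len(x.split(' '))) for col in columns
    })
    # One row per text column, one column per statistic
    return lengths.agg(['mean', 'max', 'min']).T

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'article': ['1 2 3 4', '5 6'],
    'word_seg': ['10 11', '12'],
})

summary = length_summary(toy)
print(summary)
```

Calling `length_summary(train)` and `length_summary(test)` would give the two sets side by side in the same layout, which makes the comparison above easier to check.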
