Python_data pre-processing

原創

2018-08-22 13:11

All of the tech are basic solving solution. In order to obtain robust predictive model, missing value imputation should be consider case by case.

如果是個小白，比如我，就可以這樣非常generally clean data.

0. Check for shape, missing value and variable types in dataset

data.dtypes
data.isnull.sum()
data.info()

1. impute numeric missing value with pandas

data['column name'].fillna(0.0, inplace=True)
data['column name'].fillna('U1', inplace=True)

2. encode categorical columns with number (otherwise may cause problem in machine learning algorithms)

cleanup_nums = {
               "stage":{"IV":0,"IIB":1,"III":2,"IIA":3,"I":4},
               "side":{"both":2,"right":1,"left":0}}

data.replace(cleanup_nums, inplace=True)

3. Check for correlation between variables

corr = data.iloc[:,1:31].corr()

mask = np.zeros_like(corr) 
mask[np.triu_indices_from(mask)] = True

sns.set(style="white") 
sns.heatmap(corr, mask = mask)

4. Check whether binary output are skewed

data['label name'].value_counts().plot(kind='bar')

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

Python_data pre-processing

Python- How to format datetime and replace value by multiple datetime conditions

兩種方法解決leetcode 153. Find Minimum in Rotated Sorted Array

三種方法解決Leetcode169. Majority Element in Python

三種方法解決Lintcode39 Recover Rotated Sorted Array in Python

兩種方法解決leetcode 53. Maximum Subarray

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結