data_extreme

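## Why remove extreme values (Why)

In regression analysis, unusually large or small data points can distort the results. Outliers can seriously bias the estimated correlation between a factor and returns, so they need to be treated before the analysis.

## What methods are there (What)

Depending on the distance criterion used, there are three methods for removing extreme values:
* MAD method
* 3σ method
* Percentile method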

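## How to remove extreme values (How)

The usual approach is to first determine an upper and a lower bound for the indicator, then find the data points that fall outside those bounds and set each of them to the bound it exceeds.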
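The idea can be illustrated on a handful of numbers before any specific method is chosen. This is a minimal sketch: the helper name `clip_to_bounds`, the sample values, and the bounds 1.0 and 3.0 are all made up for illustration and are not taken from the original data.

```python
import pandas as pd

def clip_to_bounds(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Set values below `lower` to `lower` and values above `upper` to `upper`."""
    return values.clip(lower=lower, upper=upper)

# Hypothetical factor values for five stocks (illustrative only).
sample = pd.Series([0.8, 1.2, 1.5, 2.0, 25.0], index=["A", "B", "C", "D", "E"])

clipped = clip_to_bounds(sample, lower=1.0, upper=3.0)
print(clipped.tolist())  # [1.0, 1.2, 1.5, 2.0, 3.0]
```

The three methods below differ only in how the lower and upper bounds are derived from the data.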

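1. Extreme-value removal: MAD method

The MAD method measures distance as how far a factor value lies from the median, which is why it is also known as the median absolute deviation method.

Steps:
1. Find the median of all factor values, denoted 𝐹_𝑚𝑒𝑑𝑖𝑎𝑛.

2. Compute the absolute deviation of each factor value from the median, then take the median of those absolute deviations.

3. Use the median of the factor values and the median of the absolute deviations to determine the threshold range, and adjust the factor values that fall outside it.

4. Set factor values outside the threshold range to the nearest threshold; values inside the range are left unchanged.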
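The four steps can be written compactly as follows. The symbols $F_i$ (the factor value of stock $i$), $MAD$, and $\tilde{F}_i$ (the adjusted value) are notation introduced here for illustration; $n$ is the multiplier that sets the width of the threshold range.

```latex
\begin{aligned}
F_{median} &= \operatorname{median}_i \left( F_i \right) \\
MAD &= \operatorname{median}_i \left( \left| F_i - F_{median} \right| \right) \\
\tilde{F}_i &=
\begin{cases}
F_{median} + n \cdot MAD, & F_i > F_{median} + n \cdot MAD \\
F_{median} - n \cdot MAD, & F_i < F_{median} - n \cdot MAD \\
F_i, & \text{otherwise}
\end{cases}
\end{aligned}
```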

# Code implementation
from atrader import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Fetch the factor data
factor = get_factor_by_factor(factor='PB', target_list=list(get_code_list('hs300').code), begin_date='2019-01-01', end_date='2019-03-01')
factors = factor.set_index('date').T

# MAD: median-based extreme-value removal
def extreme_MAD(dt, n):
    median = dt.quantile(0.5)                        # median of each column
    new_median = (dt - median).abs().quantile(0.5)   # median of the absolute deviations
    dt_up = median + n * new_median                  # upper bound
    dt_down = median - n * new_median                # lower bound
    return dt.clip(dt_down, dt_up, axis=1)           # values outside the bounds are set to the bounds

ex_MAD = extreme_MAD(factors,5.2)

# Before extreme-value removal
factors.tail()

# After extreme-value removal
ex_MAD.tail()

# Kernel density plot
fig = plt.figure()
factors.iloc[:,-1].plot(kind = 'kde',label='PB' )
extreme_MAD(factors,5.2).iloc[:,-1].plot(kind = 'kde',label = 'MAD')
plt.legend()
plt.show()

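### Conclusion: after MAD-based extreme-value removal, the range of the factor values is clearly narrower

2. Extreme-value removal: 3σ method

The 3σ method uses the standard deviation to set the threshold range.

Steps:
1. Compute the mean 𝐹_𝑚𝑒𝑎𝑛 and the standard deviation 𝜎 of the factor values.

2. Choose the threshold parameter 𝑛 (3 by default), then adjust the factor values that fall outside the range [𝐹_𝑚𝑒𝑎𝑛−𝑛𝜎, 𝐹_𝑚𝑒𝑎𝑛+𝑛𝜎].

3. Set factor values outside the threshold range to the nearest threshold; values inside the range are left unchanged.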
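The steps above can be traced on a single small cross-section. This is only a sketch: the five values are made up, and `n = 1` is chosen deliberately so that clipping is visible on so few points (the default in the text is 3). Note that pandas' `std()` returns the sample standard deviation (`ddof=1`) unless told otherwise.

```python
import pandas as pd

# Hypothetical cross-section of factor values (illustrative only).
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

n = 1                    # small on purpose; the text's default is 3
mean = s.mean()          # 3.0
std = s.std()            # sample standard deviation (ddof=1), sqrt(2.5)
lower, upper = mean - n * std, mean + n * std

clipped = s.clip(lower, upper)
print(round(lower, 4), round(upper, 4))
print(clipped.round(4).tolist())
```

If the population standard deviation is preferred, `s.std(ddof=0)` can be used instead; the bounds then become slightly tighter.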

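# 3sigma extreme-value removal
def extreme_3sigma(dt, n=3):
    mean = dt.mean()           # cross-sectional mean
    std = dt.std()             # cross-sectional standard deviation
    dt_up = mean + n * std     # upper bound
    dt_down = mean - n * std   # lower bound
    return dt.clip(dt_down, dt_up, axis=1)   # values outside the bounds are set to the bounds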

ex_3 = extreme_3sigma(factors)

# Kernel density plot
fig = plt.figure()
factors.iloc[:,-1].plot(kind = 'kde',label='PB' )
extreme_3sigma(factors).iloc[:,-1].plot(kind = 'kde',label = '3sigma')
plt.legend()
plt.show()

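3. Extreme-value removal: percentile method

The percentile method takes a chosen pair of percentiles of the factor values as the reasonable range and sets any value outside that range to the upper or lower bound. The reasonable range is usually set to 2.5%-97.5%.

Steps:

1. Find the 97.5% and 2.5% quantiles of the factor values.

2. Adjust the factor values that are greater than the 97.5% quantile or smaller than the 2.5% quantile.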
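The two steps can be checked on a toy cross-section where the quantiles are easy to verify by hand. This is a sketch on made-up data: 41 evenly spaced values, so the 2.5% and 97.5% quantiles fall on (approximately) the second and second-to-last points under pandas' default linear interpolation.

```python
import pandas as pd

# Hypothetical cross-section: 0.0, 1.0, ..., 40.0 (illustrative only).
s = pd.Series(range(41), dtype=float)

lower = s.quantile(0.025)   # default interpolation is linear
upper = s.quantile(0.975)
clipped = s.clip(lower, upper)

print(lower, upper)                  # approximately 1.0 and 39.0 for this sample
print(clipped.min(), clipped.max())  # the clipped range equals the two quantiles
```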

def extreme_percentile(dt, lower=0.025, upper=0.975):
    p = dt.quantile([lower, upper])                # lower and upper bounds of each column
    return dt.clip(p.iloc[0], p.iloc[1], axis=1)   # values outside the bounds are set to the bounds

ex_p = extreme_percentile(factors)

# Kernel density plot
fig = plt.figure()
factors.iloc[:,-1].plot(kind = 'kde',label='PB' )
extreme_percentile(factors).iloc[:,-1].plot(kind = 'kde',label = 'percen')
plt.legend()
plt.show()

4. Comparison of the three methods

# Kernel density plot
fig = plt.figure()
factors.iloc[:,-1].plot(kind = 'kde',label='PB' )
extreme_MAD(factors,5.2).iloc[:,-1].plot(kind = 'kde',label = 'MAD')
extreme_3sigma(factors).iloc[:,-1].plot(kind = 'kde',label = '3sigma')
extreme_percentile(factors).iloc[:,-1].plot(kind = 'kde',label = 'percen')
plt.legend()
plt.show()

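### The MAD method yields the narrowest range of factor values, followed by the 3sigma method, with the percentile method the widest. This ordering depends on the parameter settings
### In practice, investors can pick any one of the three methods to remove extreme values from factor data.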
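Because the ordering depends on both the parameters and the data, it can be worth checking the resulting ranges numerically rather than only by eye. The sketch below does this on synthetic data: the log-normal sample is an assumed stand-in for a skewed factor such as PB, and the helper names `bounds_mad`, `bounds_3sigma`, and `bounds_percentile` are hypothetical, written for this example only.

```python
import numpy as np
import pandas as pd

def bounds_mad(s, n=5.2):
    med = s.median()
    mad = (s - med).abs().median()      # median of the absolute deviations
    return med - n * mad, med + n * mad

def bounds_3sigma(s, n=3):
    mean, std = s.mean(), s.std()
    return mean - n * std, mean + n * std

def bounds_percentile(s, lower=0.025, upper=0.975):
    return s.quantile(lower), s.quantile(upper)

# Synthetic, right-skewed cross-section (assumption: not real factor data).
rng = np.random.default_rng(0)
cross_section = pd.Series(rng.lognormal(mean=0.5, sigma=0.8, size=300))

rows = {"raw": [cross_section.min(), cross_section.max()]}
methods = [("MAD", bounds_mad), ("3sigma", bounds_3sigma), ("percentile", bounds_percentile)]
for name, bounds in methods:
    lo, hi = bounds(cross_section)
    clipped = cross_section.clip(lo, hi)
    rows[name] = [clipped.min(), clipped.max()]

# One row per method, showing the value range that remains after clipping.
summary = pd.DataFrame.from_dict(rows, orient="index", columns=["min", "max"])
print(summary)
```

Changing `n` or the percentile pair and rerunning shows how strongly the ranking of the three methods depends on those choices.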
