
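I. Python packages

1. numba

numba has two compilation modes: nopython mode and object mode. The former generates much faster code, but certain limitations can force numba to fall back to the latter. To prevent the fallback and have numba raise an error instead, pass nopython=True.

from numba import jit

@jit(nopython=True)
def f(x, y):
    return x + y

numba is aimed at speeding up array-oriented computation, which can be written with the functions it supports. Note that numba also leaves many functions unsupported (a list will be compiled later).
For more on numba, see: https://www.jianshu.com/p/4eb221c9bf55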

2 pandas

2.1 Vectorized operations

Given a dataset, the task is as follows: "Time" is in date-hh:mm:ss format, and it must be split into two columns, "day" for the date and "hhmmss" for the time of day.

import pandas as pd

column = ['Time', 'val1', 'val2', 'val3', 'val4']
data = [['20190603-09:41:45', 11, 8, 17.12, 7.7],
        ['20190603-09:41:48', 12, 9.2, 12.23, 3.6],
        ['20190603-09:41:51', 12, 9.3, 15.13, 5.8],
        ['20190603-09:41:54', 13, 3.4, 11.9, 2.4],
        ['20190603-09:41:57', 14, 2.6, 9.3, 3.7],
        ['20190603-09:42:32', 15, 3.0, 6.5, 13.5],
        ['20190603-10:01:02', 11, 2.5, 2.22, 9.4]]
print(data)
df = pd.DataFrame(data=data, columns=column)

The task is implemented below with iloc, iterrows, itertuples, and apply, and their performance is compared.

2.1.1 iloc

The obvious approach is to walk the rows one by one with iloc (or loc) and split each string. The code is as follows:

def iloc_loop(df):
    # Iterate over df row by row and split each string on '-'
    day_lis = []
    time_lis = []
    for i in range(len(df)):
        str_split = df.iloc[i]['Time'].split('-')
        day_lis.append(str_split[0])
        time_lis.append(str_split[1])
    df['day'] = day_lis
    df['hhmmss'] = time_lis
    print(df)

2.1.2 iterrows

Access the rows one by one with iterrows.
The code is as follows:

def use_iterrows(df):
    day_lis = []
    time_lis = []
    # Replace the iloc row lookup with iterrows iteration
    for index, row in df.iterrows():
        str_split = row['Time'].split('-')
        day_lis.append(str_split[0])
        time_lis.append(str_split[1])
    df['day'] = day_lis
    df['hhmmss'] = time_lis
    print(df)

2.1.3 itertuples

itertuples can also be used.
The code is as follows:

def use_itertuples(df):
    day_lis = []
    time_lis = []
    # Replace the iloc row lookup with itertuples iteration
    for row in df.itertuples():
        # row[0] is the index, so row[1] is the 'Time' column
        str_split = row[1].split('-')
        day_lis.append(str_split[0])
        time_lis.append(str_split[1])
    df['day'] = day_lis
    df['hhmmss'] = time_lis
    return df

A performance comparison of the three approaches:

Clearly, iterrows and itertuples are more efficient than iloc.

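2.1.4 apply

The same task can also be done with apply.

def try_apply(df):
    df['day'] = df['Time'].apply(lambda x: x.split('-')[0])
    df['hhmmss'] = df['Time'].apply(lambda x: x.split('-')[1])

try_apply(df)

The execution time is as follows:

Using apply() makes the code more concise and readable, and the running time drops sharply to 0.0009 s. Note that apply does not parallelize the work: the function is still called once per element. The gain comes from avoiding the per-row overhead of the previous approaches, which build a row object and index into it on every iteration.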
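For this particular task there is also a fully vectorized option that avoids the Python-level lambda altogether. The following is a minimal sketch, assuming the same "date-time" string layout as above; the two-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample in the same layout as the article's data
df = pd.DataFrame({"Time": ["20190603-09:41:45", "20190603-10:01:02"]})

# Series.str.split with expand=True returns one column per piece
parts = df["Time"].str.split("-", expand=True)
df["day"] = parts[0]
df["hhmmss"] = parts[1]
print(df)
```

Whether this beats apply on a given dataset is worth measuring rather than assuming.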

2.1.5 isin()

Given a dataset, the hours must be divided into ranges, and each range multiplied by a different rate to produce a new value.

def apply_tariff_isin(df):
    # Boolean arrays for each hour range
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))
    # Apply the rates using the masks defined above
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

Run it and check the running time:

>>> apply_tariff_isin(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.010 seconds.

The .isin() method returns an array of Boolean values, such as:
[False, False, False, ..., True, True, True]
This is still a performance gain, though the improvement is becoming more marginal.

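2.1.6 cut

The logic from the isin() section above can also be implemented with cut. The code is as follows:

def apply_tariff_cut(df):
    # pd.cut() assigns a label (the cost) according to the bin each hour falls in.
    # right=False makes the bins [0, 7), [7, 17), [17, 24), matching the isin() ranges.
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']

Run it and check the running time:

>>> apply_tariff_cut(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.003 seconds.

The same operation can also be done with NumPy's digitize().
It is similar to pandas' cut(): the data is binned, but this time the result is an array of indices indicating which bin each hour belongs to. These indices are then applied to the price array.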

def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

Check the running time:

>>> apply_tariff_digitize(df)
Best of 3 trials with 100 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.002 seconds

At this point there is still a performance gain, but it is becoming more marginal.
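Since the three approaches are meant to be interchangeable, it is worth checking that they agree. The sketch below is only an illustration: the hourly energy_kwh data is synthetic, and the tariff boundaries (0-6, 7-16, 17-23) are taken from the isin() version.

```python
import numpy as np
import pandas as pd

# Synthetic data: one reading per hour over two days (an assumption for illustration)
idx = pd.Timestamp("2019-06-03") + pd.to_timedelta(np.arange(48), unit="h")
df = pd.DataFrame({"energy_kwh": np.linspace(0.5, 2.0, 48)}, index=idx)
hours = df.index.hour

# 1) Boolean masks with isin
cost_isin = pd.Series(0.0, index=df.index)
for rate, hrs in [(12, range(0, 7)), (20, range(7, 17)), (28, range(17, 24))]:
    mask = hours.isin(hrs)
    cost_isin[mask] = df.loc[mask, "energy_kwh"] * rate

# 2) pd.cut with left-closed bins [0, 7), [7, 17), [17, 24)
rates_cut = pd.cut(hours, bins=[0, 7, 17, 24], right=False,
                   labels=[12, 20, 28]).astype(int)
cost_cut = rates_cut * df["energy_kwh"]

# 3) np.digitize returns the bin index of each hour, used to look up the price
prices = np.array([12, 20, 28])
bin_idx = np.digitize(hours.values, bins=[7, 17, 24])
cost_dig = prices[bin_idx] * df["energy_kwh"].values

print(np.allclose(cost_isin, cost_cut), np.allclose(cost_isin, cost_dig))
```

Note that pd.cut with its default right=True would place hour 7 in the cheapest bin, which is why the sketch passes right=False.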

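2.2 stack

stack performs an unpivot: put simply, it turns column labels into an ordinary column.
Given the following data:

it needs to be converted into the following format:

The initial idea is to clean the strings with a regular expression first, and then use stack to turn the column labels into an ordinary column.
The code is as follows:

import re

df.set_index('value', inplace=True)
# Strip braces and brackets from the string form of each id entry
df['id'] = df['id'].apply(lambda x: re.sub(r'[{}\[\]]', '', str(x)))
df = df['id'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).reset_index(drop=False)
df.rename(columns={0: 'id'}, inplace=True)

Shihuan's version: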

from itertools import chain
tmp_df = pd.DataFrame()
df['id'] = df['id'].apply(lambda x: list(chain(*x)) if isinstance(x, list) else x)
tmp_df['id'] = list(chain(*df['id']))
tmp_df['value'] = list(chain(*[[j]*i for i,j in zip(df['id'].apply(len), df['value'])]))

Running-time comparison of the two versions:
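Because the example data did not survive in the text, here is a minimal sketch of the split-then-stack pattern on made-up data. The column names value and id follow the code above; the contents are hypothetical.

```python
import pandas as pd

# Hypothetical input: each 'value' owns a comma-separated string of ids
df = pd.DataFrame({"value": ["a", "b"], "id": ["1,2,3", "4,5"]})

long_df = (
    df.set_index("value")["id"]
      .str.split(",", expand=True)      # one column per id, padded with missing values
      .stack()                          # column labels become an inner index level
      .dropna()                         # drop the padding explicitly
      .reset_index(level=1, drop=True)  # discard the former column labels
      .rename("id")
      .reset_index()                    # 'value' becomes an ordinary column again
)
print(long_df)
```

The explicit dropna() is there because whether stack() discards missing cells on its own depends on the pandas version.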

2.3 agg

Aggregate selected columns after grouping, and rename the resulting columns.

import pandas as pd

df = pd.DataFrame({'Country': ['China', 'China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
                   'Income': [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
                   'Age': [5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
print(df)
df_agg = df.groupby('Country').agg({'Age': ['min', 'mean', 'max'], 'Income': ['min', 'max']})
# Flatten the two-level column index into names such as 'Age_min'
col = ['_'.join(col).strip() for col in df_agg.columns.values]
df_agg.columns = col
print(df_agg)

The result is as follows:
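An alternative to flattening the column index by hand is pandas' named aggregation, where each keyword names an output column and maps to a (column, function) pair. A minimal sketch on a reduced, made-up version of the data:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["China", "China", "India", "India"],
    "Income": [10000, 8000, 5000, 5002],
    "Age": [5000, 4321, 1234, 4010],
})

# Each keyword is the output column name; the tuple is (input column, aggregation)
df_agg = df.groupby("Country").agg(
    Age_min=("Age", "min"),
    Age_max=("Age", "max"),
    Income_max=("Income", "max"),
)
print(df_agg)
```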

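2.3.1 Calling multiple aggregation functions (pass a tuple to name the new column)

Group by one column and compute, for each group, the difference between the maximum and minimum values.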

lis = [[6,3,'a','one'],
       [5,8,'b','one'], 
       [8,7,'a','two'], 
       [1,2,'b','three'],  
       [4,6,'a','two'],   
       [5,4,'b','two'],  
       [1,7,'a','one'],  
       [3,9,'a','three']]
df = pd.DataFrame(data = lis,columns=['vec1','vec2','vec3','vec4'])
print(df)

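Implementation with a custom function:

def peak_range(x):
    return x.max() - x.min()

# Select the numeric columns; subtracting the strings in vec4 would raise a TypeError
tmp = df.groupby('vec3')[['vec1', 'vec2']].agg(peak_range)
print(tmp)

Implementation with a lambda:

tm = df.groupby('vec3')[['vec1', 'vec2']].agg(lambda x: x.max() - x.min())
print(tm)

Calling multiple aggregation functions; passing a (name, function) tuple sets the new column name:

tmp = df.groupby('vec3')[['vec1', 'vec2']].agg(['mean', 'std', 'count', ('pt', peak_range)])
print(tmp)

The result is as follows: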

2.4 rolling

The pandas rolling function computes statistics over a moving window.
Its signature is:

DataFrame.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None)

import numpy as np
import pandas as pd

arr = np.array([[2, 2, 2], [4, 4, 4], [6, 6, 6], [8, 8, 8], [10, 10, 10]])
df2 = pd.DataFrame(arr, columns=['one', 'two', 'three'],
                   index=pd.date_range('1/1/2018', periods=5))

print('Rolling mean with window=3:\n', df2.rolling(3).mean())

The result is as follows:

print('Rolling mean with window=3, min_periods=2:\n', df2.rolling(3, min_periods=2).mean())

The result is as follows:

Here min_periods is the minimum number of observations a window must contain: windows with fewer observations yield NaN, and windows with at least that many yield a value. With min_periods=2, as in the result above, only 2018-01-01 has no value, because its window contains a single observation.
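The effect of min_periods is easy to verify on a single column. A minimal sketch using the same values as column one above:

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10], index=pd.date_range("2018-01-01", periods=5))

full = s.rolling(3).mean()                    # needs 3 observations per window
partial = s.rolling(3, min_periods=2).mean()  # 2 observations are enough

print(pd.DataFrame({"window=3": full, "min_periods=2": partial}))
```

With the default, the first two rows are NaN; with min_periods=2 only the first row is, and the second row is the mean of 2 and 4.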

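2.5 to_datetime()

pd.to_datetime() is a handy time-conversion tool, but careless use can make it slow: for example, the running time differs depending on whether format is passed, as shown below.
The data is as follows: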

index,date_time
1,2019-08-07 23:59:59
2,2019-08-05 23:59:59
3,2019-08-07 23:59:59
4,2019-08-07 23:59:59
5,2019-08-07 23:59:59
6,2019-08-07 23:59:59
7,2019-08-05 23:59:59
8,2019-08-07 23:59:59
9,2019-08-05 23:59:59
10,2019-08-07 23:59:59
11,2019-08-05 23:59:59
12,2019-08-07 23:59:59
13,2019-08-05 23:59:59

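The code is as follows:

import time
from datetime import datetime

import pandas as pd

df = pd.read_csv('sx.csv')
print(df)

# Without format: pandas has to infer the layout
df1 = df.copy()
start = time.time()
df1['date_time'] = pd.to_datetime(df1['date_time'])
end = time.time()
print('time no use format=', end - start)

# With an explicit format
df1 = df.copy()
start = time.time()
fmt = '%Y-%m-%d %H:%M:%S'
df1['date_time'] = pd.to_datetime(df1['date_time'], format=fmt)
end = time.time()
print('time use format=', end - start)

The result is as follows:

A custom parser function can also be passed to read_csv.
The code is as follows:

start = time.time()
# pd.datetime is gone from recent pandas versions, so datetime.strptime is used directly.
# date_parser itself is deprecated since pandas 2.0 in favor of date_format.
dateparse = lambda dates: datetime.strptime(dates, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv('sx.csv', date_parser=dateparse, parse_dates=True, index_col='date_time')
end = time.time()
print('parse time=', end - start)

The time taken is:
This shows that passing a custom parser function to read_csv takes longer.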

2.6 cumsum()

cumsum computes the cumulative sum of the elements along an axis.
axis can be 0 or 1; the sum accumulates along whichever axis is given.
cumsum is used as follows.
With axis=0, the code is:

import pandas as pd

fs = pd.DataFrame([[2.0, 1.0, 3.0, 5],
                   [3.0, 4.0, 5.0, 5],
                   [3.0, 4.0, 5.0, 5],
                   [1.0, 0.0, 6.0, 5]],
                  columns=list('ABCD'))
print(fs.cumsum(axis=0))

The result is as follows:

With axis=1, the code is:

print(fs.cumsum(axis=1))

The result is as follows:

Extension: NumPy's cumsum

import numpy as np

arr = np.array([[[1, 2, 3], [8, 9, 12]], [[1, 2, 4], [2, 4, 5]]])  # shape 2*2*3
print(arr.cumsum(0))
print("\n")
print(arr.cumsum(1))
print("\n")
print(arr.cumsum(2))
print("\n")

The result is as follows:
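Since the printed results are not reproduced here, a small two-dimensional case makes the axis rule concrete. This is an illustrative sketch with made-up numbers:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

down = arr.cumsum(axis=0)    # accumulate down the rows
across = arr.cumsum(axis=1)  # accumulate across the columns

print(down)    # [[1 2 3], [5 7 9]]
print(across)  # [[1 3 6], [4 9 15]]
```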

2.7 pd.crosstab()

A cross-tabulation is a special pivot table used to count group frequencies.
**Parameters**

index : array-like, Series, or list of arrays/Series 
Values to group by in the rows
columns : array-like, Series, or list of arrays/Series 
Values to group by in the columns
values : array-like, optional 
Array of values to aggregate according to the factors
aggfunc : function, optional 
If no values array is passed, computes a frequency table
rownames : sequence, default None 
If passed, must match number of row arrays passed
colnames : sequence, default None 
If passed, must match number of column arrays passed
margins : boolean, default False 
Add row/column margins (subtotals)
dropna : boolean, default True 
Do not include columns whose entries are all NaN
df= pd.DataFrame(
    dict(departure =['SFO','SFO','LAX','LAX','JFK','SFO'],
         arrival=['ORD','DFW','DFW','ATL','ATL','ORD'],
         airlines=['Delta','JeTblue','Delta','AA','SouthWest','Delta'])
)
print(df)
df= pd.crosstab(index=[df['departure'],df['airlines']],
                columns=[df['arrival']],
                rownames=['departure','airline'],
                colnames=['arrival'],
                margins=True)
print(df)

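2.8 Filling missing values

2.8.1 Filling with a specified value

• For example, fill all missing data with 0

    import pandas as pd

    a = [[1, 2, 2], [3, None, 6], [3, 7, None], [5, None, 7]]
    data = pd.DataFrame(a)
    print(data)
    '''
       0    1    2
    0  1  2.0  2.0
    1  3  NaN  6.0
    2  3  7.0  NaN
    3  5  NaN  7.0
    '''

The result is as follows:

    print(data.fillna(0))
    '''
       0    1    2
    0  1  2.0  2.0
    1  3  0.0  6.0
    2  3  7.0  0.0
    3  5  0.0  7.0
    '''

• Fill missing data with the mean or the mode
For the dataset below, fill the NaN values with the mode of each column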

ind,Gender,Education,Load_Status
LP001155,Female,Not Graduate,Y
LP001156,Female,Not Graduate,Y
LP001157,Female,Not Graduate,Y
LP001158,Female,Not Graduate,Y
LP001159,,Graduate,Y
LP001160,Female,Graduate,N
LP001161,Female,Graduate,N
LP001162,Female,Graduate,
LP001163,Male,Not Graduate,N
LP001164,Male,Not Graduate,N
LP001165,,,N
LP0011637,Male,Graduate,N
LP001168,Male,Not Graduate,N

The code is as follows:

import pandas as pd

data = pd.read_csv('fill.csv')
print(data)

def num_missing(x):
    # Count the missing values in a column
    return x.isnull().sum()

print(data.apply(num_missing, axis=0))
# Series.mode() skips NaN and works on string columns; it is used here instead of
# scipy.stats.mode, which recent SciPy versions restrict to numeric data
print('mode==', data['Gender'].mode())
print('mode[0]=', data['Gender'].mode()[0])
for col in ['Gender', 'Education', 'Load_Status']:
    data[col] = data[col].fillna(data[col].mode()[0])
print(data)
print(data.apply(num_missing, axis=0))

Other filling methods are described at https://blog.csdn.net/pipisorry/article/details/49515215 ; the principles are much the same.

2.8.2 Different values for different columns

    print(data.fillna({1: 1, 2: 2}))
    '''
       0    1    2
    0  1  2.0  2.0
    1  3  1.0  6.0
    2  3  7.0  2.0
    3  5  1.0  7.0
    '''

    data.columns = ['a', 'b', 'c']
    print(data.fillna({'b': data['b'].mean(), 'c': 2}))
    '''
       a    b    c
    0  1  2.0  2.0
    1  3  4.5  6.0
    2  3  7.0  2.0
    3  5  4.5  7.0
    '''

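2.8.3 Forward fill and backward fill

• Forward fill
By default the value from the previous row is used; set axis=1 to fill from the previous column instead.

    # data.ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas
    print(data.ffill())
    '''
       0    1    2
    0  1  2.0  2.0
    1  3  2.0  6.0
    2  3  7.0  6.0
    3  5  7.0  7.0
    '''

• Backward fill
The value from the next row is used; if there is no next row, the cell is left unfilled.

    # data.bfill() replaces fillna(method="bfill")
    print(data.bfill())
    '''
       0    1    2
    0  1  2.0  2.0
    1  3  7.0  6.0
    2  3  7.0  7.0
    3  5  NaN  7.0
    '''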
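The axis=1 variant mentioned above is not shown in the outputs, so here is a minimal sketch on the same 4x3 data, using the ffill() and bfill() methods:

```python
import pandas as pd

data = pd.DataFrame([[1, 2, 2], [3, None, 6], [3, 7, None], [5, None, 7]])

by_row = data.ffill()        # each gap takes the value from the row above
by_col = data.ffill(axis=1)  # each gap takes the value from the column to its left
back = data.bfill()          # each gap takes the value from the row below

print(by_col)
```

With axis=1 the gap in row 1 becomes 3 (from column 0) instead of 2 (from row 0).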

2.9 add_prefix: adding a prefix to column names

df = df.add_prefix("Col:")

I want to calculate the pointwise mutual information for each skipgram, which is basically a log of skipgram probability divided by the product of its unigrams' probabilities. I wrote a function for that, which iterates through the skipgram df and it works exactly how I want, but I have huge issues with performance, and I wanted to ask if there is a way to improve my code to make it calculate the pmi faster.
unigram_df
    word            count       prob
0   we              109         0.003615
1   investigated    20          0.000663
2   the             1125        0.037315
3   potential       36          0.001194
4   of              1122        0.037215
skipgram_df
    word                      count         prob
0   (we, investigated)        5             0.000055
1   (we, the)                 31            0.000343
2   (we, potential)           2             0.000022
3   (investigated, the)       11            0.000122
4   (investigated, potential) 3             0.000033
def calculate_pmi(row):
    skipgram_prob = float(row[3])
    x_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][0]]
    ['prob'])
    y_unigram_prob = float(unigram_df.loc[unigram_df['word'] == row[1][1]]
    ['prob'])
    pmi = math.log10(float(skipgram_prob / (x_unigram_prob * y_unigram_prob)))
    result = str(str(row[1][0]) + ' ' + str(row[1][1]) + ' ' + str(pmi))
    return result 
pmi_list = list(map(calculate_pmi, skipgram_df.itertuples()))
import pandas as pd
import numpy as np
uni = pd.DataFrame([['we', 109, 0.003615], ['investigated', 20, 0.000663], ['the', 1125, 0.037315], ['potential', 36, 0.001194], ['of', 1122, 0.037215]], columns=['word', 'count', 'prob'])
skip = pd.DataFrame([[('we', 'investigated'), 5, 0.000055],
[('we', 'the'), 31, 0.000343],[('we', 'potential'), 2, 0.000022],[('investigated', 'the'), 11, 0.000122],
  [('investigated', 'potential'), 3, 0.000033]],columns=['word', 'count', 'prob'])
# first split column of tuples in skip
skip[['word1', 'word2']] = skip['word'].apply(pd.Series)
# set index of uni to 'word'
uni = uni.set_index('word')
# merge prob1 & prob2 from uni to skip
skip['prob1'] = skip['word1'].map(uni['prob'].get)
skip['prob2'] = skip['word2'].map(uni['prob'].get)
# perform calculation (log10, matching calculate_pmi above) and filter columns
skip['result'] = np.log10(skip['prob'] / (skip['prob1'] * skip['prob2']))
skip = skip[['word', 'count', 'prob', 'result']]

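3.0 value_counts vs numpy in1d

df['report_month'].value_counts()
# np.in1d is deprecated since NumPy 2.0; np.isin is its replacement
np.in1d(normal_reports['report_month'], 3).sum()

3.1 Notes

A pandas DataFrame is a two-dimensional structure, so df.values.tolist() yields a two-dimensional (nested) list.
To get a one-dimensional list, use a Series: it is one-dimensional, so its tolist() yields a flat list.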
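A quick illustration of the difference, on a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

nested = df.values.tolist()  # one inner list per row
flat = df["a"].tolist()      # a single column is a Series, so the list is flat

print(nested)  # [[1, 3], [2, 4]]
print(flat)    # [1, 2]
```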

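4 os

import os
os.getcwd()  # get the current working directory, i.e. the directory the script is running in
os.chdir("dirname")  # change the current working directory; like cd in a shell
os.curdir  # the string for the current directory: '.'
os.pardir  # the string for the parent directory: '..'
os.makedirs('dirname1/dirname2')  # create nested directories recursively
os.removedirs('dirname1')  # remove the directory if empty, then move up and remove each empty parent in turn
os.mkdir('dirname')  # create a single directory; like mkdir dirname in a shell
os.rmdir('dirname')  # remove a single empty directory; raises an error if it is not empty; like rmdir dirname
os.listdir('dirname')  # return a list of all files and subdirectories in the directory, including hidden files
os.remove('path/filename')  # delete a file
os.rename("oldname", "newname")  # rename a file or directory
os.stat('path/filename')  # get file or directory information
os.linesep  # the line terminator of the current platform: "\r\n" on Windows, "\n" on Linux
os.pathsep  # the string used to separate paths in a search path such as PATH
os.name  # a string naming the current platform: Windows -> 'nt'; Linux -> 'posix'
os.system("bash command")  # run a shell command; its output goes straight to the terminal
os.environ  # the system environment variables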

5 py_linq library

to_list
count
sum
min
max
avg
median
any -- uses count in algorithm
elementAt -- has to store data in list to allow resetting of iterator
elementAtOrDefault -- uses elementAt
first -- uses elementAt
first_or_default -- uses first
last -- uses first after sorting
last_or_default -- uses last
contains -- uses any
group_by -- due to grouped iterables having to be saved to memory when iterating through itertools.groupby result
distinct -- uses group by in algorithm
group_join -- uses group by in algorithm
union -- uses distinct in algorithm

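II. Python syntax tricks

1. profile

profile is a performance analysis tool built into Python. It describes how a program spends its running time and provides statistics that help locate time bottlenecks.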

import profile
def profileTest():
    Total = 1
    for i in range(10):
        Total = Total * (i + 1)
        print(Total)
    return Total
if __name__ == "__main__":
    profile.run("profileTest()")

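Execution result:

ncalls: the number of times the function was called
tottime: the total time spent in the function, excluding time spent in the functions it calls
percall: the average time per call, equal to tottime/ncalls
cumtime: the cumulative time spent in the function, including time spent in the functions it calls
percall: the average time per call, equal to cumtime/ncalls
filename:lineno(function): the file containing the function, its line number, and its name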
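The same columns are produced by cProfile, the C implementation of the profiler in the standard library, which has lower overhead than profile. The sketch below is one way to capture and sort the report; the function name factorial_loop is made up for illustration.

```python
import cProfile
import io
import pstats


def factorial_loop(n=10):
    # Same computation as profileTest above, without the print calls
    total = 1
    for i in range(n):
        total *= i + 1
    return total


profiler = cProfile.Profile()
result = profiler.runcall(factorial_loop)  # profile a single call

# Write the report to a string, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)  # show at most 5 entries
report = buffer.getvalue()
print(report)
```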

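2. Generators

First, look at a set of figures:

Using () produces a generator object, whose memory footprint does not depend on the number of elements, so it is more memory-efficient.
However:
set operation

The result with set() is as follows:
for operation
The result with for is as follows:

Use with care.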
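The figures are not available here, so the following sketch only illustrates two well-known properties of generators: the small, size-independent footprint, and the fact that a generator can be consumed only once. It is not a reconstruction of the original measurements.

```python
import sys

squares_list = [i * i for i in range(100_000)]
squares_gen = (i * i for i in range(100_000))

# The generator object stays small no matter how many items it will yield
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size, gen_size)

# A generator is exhausted after one pass
first_pass = sum(squares_gen)
second_pass = sum(squares_gen)  # nothing left to iterate
print(first_pass, second_pass)
```

If the values are needed more than once, a list (or recreating the generator) is the safer choice.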

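3. Ordering multiple conditions inside a for loop

For and, put the condition that is satisfied least often first; for or, put the condition that is satisfied most often first.
As follows:

The order in which the conditions are evaluated therefore has a measurable effect on running time.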
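The reason is short-circuit evaluation: with and, the second operand is evaluated only when the first is true. A minimal sketch that counts the calls; both predicate functions are made up for illustration.

```python
calls = {"rare": 0, "costly": 0}


def is_rare(x):
    # Cheap test that is true for only 1 value in 100
    calls["rare"] += 1
    return x % 100 == 0


def is_costly(x):
    # Stand-in for an expensive test that is almost always true
    calls["costly"] += 1
    return sum(range(50)) > 0


# The rarely-true condition goes first, so is_costly runs only when needed
hits = [x for x in range(1000) if is_rare(x) and is_costly(x)]
print(calls)
```

Swapping the operands would call is_costly 1000 times instead of 10 while producing the same hits.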

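4. set vs. list: intersection, union, difference

The union, intersection, and difference operations on set are faster than iterating over lists. So when you need the intersection, union, or difference of lists, convert them to sets first.
For example: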
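A minimal sketch of the conversion, with made-up lists. Note that sets discard duplicates and ordering, so the results are sorted here to get stable lists back.

```python
a = [1, 2, 3, 4]
b = [3, 4, 5]

common = sorted(set(a) & set(b))  # intersection
either = sorted(set(a) | set(b))  # union
only_a = sorted(set(a) - set(b))  # difference

print(common, either, only_a)
```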

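5 Python garbage collection

import gc
import pandas as pd

df = pd.DataFrame()
df = ...
del df        # drop the reference to the DataFrame
gc.collect()  # ask the collector to reclaim unreachable objects now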

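6 Difference between with open and open

file = open("test.txt", "r")
for line in file.readlines():
    print(line)
file.close()

and

with open("test.txt", "r") as file:
    for line in file.readlines():
        print(line)

are equivalent when no error occurs; the with form also closes the file if an exception is raised.
⚠️ Note:
• close() releases the resource.
• Without close(), the resource is released only when the object is garbage collected.
The timing of garbage collection is not deterministic and cannot be controlled. If the program is a short-lived command, the impact is probably small (though that is no guarantee that nothing goes wrong). If the program is a service, runs for a long time, or runs with high concurrency, resources may be exhausted, and deadlocks are possible.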

file = open("test.txt","r")
for line in file.readlines():
     print (line)
file = open("test.txt","w")
file.write('dsvdfbd')
file = open("test.txt","r")
print('csdfv=\n',file.readline())

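This code, for example, shows no obvious error, even though none of the files is ever closed.
• A context manager is an object that supports two methods: __enter__ and __exit__.
The with statement is in fact a very general construct that lets you use any such context manager.
• The __enter__ method takes no arguments. It is called on entering the with statement, and its return value is bound to the variable after the as keyword.
• The __exit__ method takes three arguments: the exception type, the exception object, and the traceback. It is called when the with block is left (any exception raised is passed in through these arguments). If __exit__ returns a true value, the exception is suppressed; if it returns a false value, the exception propagates.
• Files can also be used as context managers. Their __enter__ method returns the file object itself, and their __exit__ method closes the file.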

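file = open("test.txt", "r")
try:
    for line in file.readlines():
        print(line)
except Exception:
    print("error")
finally:
    file.close()

The with statement closes the file in the same way as the finally clause above; unlike the except clause, it does not swallow the exception.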
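To see that the file really is closed when the block fails, the sketch below writes a temporary file and raises an error inside the with block. The path and the simulated failure are assumptions for illustration.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as f:
    f.write("hello\n")

try:
    with open(path) as f:
        first = f.readline()
        raise RuntimeError("simulated failure")  # leave the block via an exception
except RuntimeError:
    pass

# __exit__ ran on the way out, so the file is closed despite the error
print(f.closed)
```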

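7 context manager

Defining a custom context manager class:

import math

class MyResource:
    def __init__(self, x, y):
        self.__x = x
        self.__y = y

    # The object returned by __enter__ is bound to the variable after `as`
    def __enter__(self):
        print('connect to resource')
        return self

    def __exit__(self, exc_type, exc_value, tb):
        print("Reached __exit__......")
        if exc_type is None:
            print('No problem in the program')
        else:
            print('The program has a problem; if you can read it, here it is:')
            print('Type: ', exc_type)
            print('Value:', exc_value)
            print('Traceback:', tb)
        # Returning True suppresses the exception
        return True

    def sqrt(self):
        print("Reached the square root")
        return math.sqrt(self.__x)

# Assumed usage (not in the original): a negative x triggers the error shown below
with MyResource(-1, 2) as r:
    r.sqrt()

__exit__: runs when the code block inside the with statement finishes or raises an error.
The execution result is as follows:
connect to resource
Reached the square root
Reached __exit__......
The program has a problem; if you can read it, here it is:

Type:  <class 'ValueError'>
Value: math domain error
Traceback: <traceback object at 0x10c45ca88>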

• A simpler way to define one
Python provides the contextmanager decorator

from contextlib import contextmanager
class MyResource:
    def query(self):
        print('query data')
@contextmanager
def make_myresource():
    print('start to connect')
    yield MyResource()
    print('end connect')
    pass

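A function decorated this way falls into three parts:
The code before yield runs before the block inside the with statement.
The value produced by yield is assigned to the variable after as.
The code after yield runs once the with block has finished.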
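The three parts can be observed by recording events in order. In this sketch the names managed and events are made up; the try/finally around yield is an addition that makes the cleanup run even if the with block raises.

```python
from contextlib import contextmanager

events = []


@contextmanager
def managed(name):
    events.append("enter")        # part 1: runs before the with block
    try:
        yield name.upper()        # part 2: this value is bound by `as`
    finally:
        events.append("exit")     # part 3: runs after the with block


with managed("db") as res:
    events.append(f"using {res}")

print(events)
```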

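8 Relative imports in packages

• from . import spam # import the spam module from the current directory (Python 2: modules in the current directory can be imported directly)
• from .spam import name # import the name attribute from the spam module in the current directory (Python 2: modules in the current directory can be imported directly, without the leading dot)
• from .. import spam # import the spam module from the parent of the current directory

8.1 Difference between relative imports and regular imports

• from .string import * # the string module imported here is the one in the current directory (the import fails if it does not exist), not the one on sys.path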

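IV. Processes

1 joblib's Parallel

import pandas as pd
from joblib import Parallel, delayed

def add_labels(filename, df):
    # Look up filename in df['name'] and return the matching '是否購買' ("purchased or not") value
    list_name = list(df['name'])
    if filename in list_name:
        i = list_name.index(filename)
        return df['是否購買'][i]
    else:
        return 'Nan'

def tmp_func(df1):
    # df2 is the lookup table, expected to be defined at module level
    df1['是否購買'] = df1['name'].apply(add_labels, args=(df2,))
    return df1

def apply_parallel(df_grouped, func):
    # Run func on each group with up to 10 parallel jobs, then concatenate the results
    results = Parallel(n_jobs=10)(delayed(func)(group) for name, group in df_grouped)
    return pd.concat(results)

df_grouped = df1.groupby(df1.index)
df1 = apply_parallel(df_grouped, tmp_func)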
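V. Algorithms

1. KKT conditions

Consider a constrained optimization problem, which can be written in the following form:

Here f(x) is the objective function, g(x) is the inequality constraint, and h(x) is the equality constraint.
If f(x), h(x), and g(x) are all linear functions, the problem is called linear programming. If any of them is nonlinear, it is called nonlinear programming.
If the objective function is quadratic and all constraints are linear, it is called quadratic programming.

If f(x) is convex, g(x) is convex, and h(x) is linear, the problem is called convex optimization. Note that an inequality constraint of the form g(x)<=0 requires g(x) to be convex, while one of the form g(x)>=0 requires g(x) to be concave.

With several equality constraints and several inequality constraints at once, the Lagrangian is built by adding each constraint, weighted by its multiplier, to the objective function, and the KKT conditions extend in the same way.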
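The formulas did not survive in the text, so the block below writes out the standard textbook form as a sketch. It assumes a minimization problem with m inequality and p equality constraints; the multiplier names alpha and beta are chosen to match the sympy code later in this section and are otherwise an assumption.

```latex
% General constrained problem (assumed minimization form)
\min_{x} \; f(x)
\quad \text{s.t.} \quad
g_i(x) \le 0,\; i = 1,\dots,m,
\qquad
h_j(x) = 0,\; j = 1,\dots,p

% Lagrangian: objective plus every constraint weighted by its multiplier
L(x, \alpha, \beta) = f(x) + \sum_{i=1}^{m} \alpha_i \, g_i(x) + \sum_{j=1}^{p} \beta_j \, h_j(x)

% KKT conditions at a candidate optimum x^*
\nabla_x L(x^*, \alpha, \beta) = 0                      % stationarity
g_i(x^*) \le 0, \qquad h_j(x^*) = 0                     % primal feasibility
\alpha_i \ge 0                                          % dual feasibility
\alpha_i \, g_i(x^*) = 0, \quad i = 1,\dots,m           % complementary slackness
```

For a convex problem satisfying a constraint qualification, these conditions are both necessary and sufficient; in general they are only necessary.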

References:
[1] https://www.cnblogs.com/liaohuiqiang/p/7805954.html
[2] https://www.jianshu.com/p/df10f536db20?from=timeline&isappinstalled=0
[3] https://www.zhihu.com/question/23311674

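1.1 Solving a constrained optimization problem with the sympy package

The problem is as follows:

# Import sympy, used for differentiation, solving systems of equations, and so on
from sympy import *

# Declare the variables
x1 = symbols("x1")
x2 = symbols("x2")
alpha = symbols("alpha")
beta = symbols("beta")

# Build the Lagrangian
L = 10 - x1 * x1 - x2 * x2 + alpha * (x1 * x1 - x2) + beta * (x1 + x2)

# Differentiate to build the KKT conditions
difyL_x1 = diff(L, x1)  # derivative with respect to x1
difyL_x2 = diff(L, x2)  # derivative with respect to x2
difyL_beta = diff(L, beta)  # derivative with respect to the multiplier beta
dualCpt = alpha * (x1 * x1 - x2)  # complementary slackness condition

# Solve the KKT equations
aa = solve([difyL_x1, difyL_x2, difyL_beta, dualCpt], [x1, x2, alpha, beta])

# Print the results; alpha >= 0 and the inequality constraint <= 0 still have to be checked
for i in aa:
    if i[2] >= 0:
        if (i[0] ** 2 - i[1]) <= 0:
            print(i)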
