
Machine learning for finance in Python

Preparing data and a linear model

Explore the data with some EDA

Any time we begin a machine learning (ML) project, we first need to do some exploratory data analysis (EDA) to familiarize ourselves with the data. This includes things like raw data plots, histograms, and more.
I typically begin with raw data plots and histograms. This allows us to understand our data’s distributions. If the data is normally distributed, we can use things like parametric statistics.
There are two stocks loaded for you into pandas DataFrames: lng_df and spy_df (LNG and SPY). We’ll use the closing prices and eventually volume as inputs to ML algorithms.

import matplotlib.pyplot as plt

print(lng_df.head())  # examine the LNG DataFrame
print(spy_df.head())  # examine the SPY DataFrame

# Plot the Adj_Close columns for SPY and LNG
spy_df['Adj_Close'].plot(label='SPY', legend=True)
lng_df['Adj_Close'].plot(label='LNG', legend=True, secondary_y=True)
plt.show()  # show the plot
plt.clf()  # clear the plot space

# Histogram of the daily price change percent of Adj_Close for LNG
lng_df['Adj_Close'].pct_change().plot.hist(bins=50)
plt.xlabel('adjusted close 1-day percent change')
plt.show()  # show the histogram



Correlations are nice to check out before building machine learning models, because we can see which features correlate to the target most strongly. Pearson’s correlation coefficient is often used, which only detects linear relationships. It’s commonly assumed our data is normally distributed, which we can “eyeball” from histograms. Highly correlated variables have a Pearson correlation coefficient near 1 (positively correlated) or -1 (negatively correlated). A value near 0 means the two variables are not linearly correlated.
If we use the same time periods for previous price changes and future price changes, we can see if the stock price is mean-reverting (bounces around) or trend-following (goes up if it has been going up recently).
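As a rough illustration of the Pearson coefficient described above, the sketch below computes it by hand and compares it with NumPy's `np.corrcoef`. The arrays are made-up toy data (an assumption for illustration, not the LNG prices), and `pearson_r` is a hypothetical helper name.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Toy data, assumed for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2.0 * x + 1.0       # linear in x, positive slope
y_inverse = -0.5 * x + 3.0     # linear in x, negative slope
y_parabola = (x - 3.0) ** 2    # depends on x, but not linearly

r_linear = pearson_r(x, y_linear)
r_inverse = pearson_r(x, y_inverse)
r_parabola = pearson_r(x, y_parabola)
r_numpy = np.corrcoef(x, y_linear)[0, 1]  # NumPy's built-in equivalent

print(r_linear, r_inverse, r_parabola, r_numpy)
```

The parabola case is the one to keep in mind: the two variables are clearly related, yet the coefficient is near 0 because the relationship is not linear.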

# Create 5-day % changes of Adj_Close for the current day, and 5 days in the future
lng_df['5d_future_close'] = lng_df['Adj_Close'].shift(-5)
lng_df['5d_close_future_pct'] = lng_df['5d_future_close'].pct_change(5)
lng_df['5d_close_pct'] = lng_df['Adj_Close'].pct_change(5)
# Calculate the correlation matrix between the 5d close percentage changes (current and future)
corr = lng_df[['5d_close_pct', '5d_close_future_pct']].corr()
print(corr)
# Scatter the current 5-day percent change vs the future 5-day percent change
plt.scatter(lng_df['5d_close_pct'], lng_df['5d_close_future_pct'])
plt.show()
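
The `shift` and `pct_change` combination used for the future percent change can be hard to picture, so here is a small sketch on a toy price series. The prices and the window `n = 2` are assumptions chosen to keep the table short; passing `fill_method=None` is also my choice, so that missing values are not forward-filled.

```python
import pandas as pd

# Toy closing prices that grow 10% per row (assumed data, for illustration only)
prices = pd.DataFrame({'Adj_Close': [100.0, 110.0, 121.0, 133.1, 146.41, 161.051]})
n = 2  # window length in rows; the snippet above uses 5

# Price n rows in the future, aligned with the current row
prices['future_close'] = prices['Adj_Close'].shift(-n)
# Percent change from the current row to n rows ahead.
# fill_method=None keeps the trailing NaNs instead of forward-filling them.
prices['close_future_pct'] = prices['future_close'].pct_change(n, fill_method=None)
# Percent change from n rows ago to the current row
prices['close_pct'] = prices['Adj_Close'].pct_change(n, fill_method=None)

print(prices)
```

Rows near both ends of the table come out as NaN in the future-change column: the last `n` rows have no future price, and the first `n` rows are lost to `pct_change`. That is one reason the missing rows are dropped before modeling.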


Data transforms, features, and targets

Create moving average and RSI features

The simplest indicator is the moving average; another commonly used one is the RSI.


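The moving average (MA) uses statistical analysis to average a security's price (or an index) over a given period and connects the averages from successive periods into a line. It is a technical indicator for observing the trend of a security's price. Moving average theory is one of the most widely used technical indicators today: it helps traders confirm existing trends, anticipate emerging trends, and spot overextended trends that are about to reverse.
Commonly used moving averages are the 5-, 10-, 30-, 60-, 120-, and 240-day lines. The 5- and 10-day short-term moving averages are reference indicators for short-term trading and are called daily moving average indicators. The 30- and 60-day lines are medium-term indicators, called quarterly moving average indicators. The 120- and 240-day lines are long-term indicators, called yearly moving average indicators. Moving averages are generally examined from several angles.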
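
As a minimal sketch of the moving average idea, the snippet below computes a simple moving average with pandas `rolling(...).mean()` on a toy series and then divides it by the price, the same normalization applied to the `talib.SMA` output later in this section. The prices and the 3-row window are assumptions for illustration; the result should agree with a simple (unweighted) moving average, but I have not compared it against talib here.

```python
import pandas as pd

# Toy closing prices (assumed data, for illustration only)
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], name='Adj_Close')
n = 3  # window length in rows

# Simple moving average: unweighted mean of the last n closes
sma = close.rolling(window=n).mean()

# Ratio of the moving average to the current price.
# Values below 1 mean the price is above its recent average.
sma_ratio = sma / close

print(pd.DataFrame({'close': close, 'sma': sma, 'sma_ratio': sma_ratio}))
```

Dividing by the price keeps the feature on a comparable scale over time, even if the stock price itself drifts a long way from where it started.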


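The relative strength index (RSI) is a technical curve built from the ratio of the points gained to the sum of the points gained and lost over a given period. It reflects how strong the market has been during that period. Welles Wilder first applied it to futures trading; people later found that, among the many chart-based technical analysis methods, the theory and practice of the strength index suit short-term investing in the stock market very well, so it came to be used to measure and analyze the rise and fall of stocks. The indicator is designed to show the strength of the price trend with three lines. This kind of chart gives investors a basis for their trades and is well suited to short-term trading on price differences.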
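
The ratio in the RSI description above can be sketched directly: sum the gains and the losses over a window and take gains / (gains + losses), scaled to 0-100. This is a simple-sum variant written for illustration, with a hypothetical helper name `simple_rsi` and toy prices. `talib.RSI` applies Wilder's smoothing, so its values will generally differ from this sketch.

```python
import pandas as pd

def simple_rsi(close, n):
    """RSI as 100 * gains / (gains + losses), using plain sums over n rows."""
    delta = close.diff()                # row-to-row price change
    gains = delta.clip(lower=0)         # keep rises, zero out falls
    losses = -delta.clip(upper=0)       # keep falls as positive numbers
    sum_gains = gains.rolling(window=n).sum()
    sum_losses = losses.rolling(window=n).sum()
    # A completely flat window gives 0 / 0, which shows up as NaN
    return 100 * sum_gains / (sum_gains + sum_losses)

# Toy closing prices (assumed data, for illustration only)
close = pd.Series([10.0, 11.0, 10.5, 11.5, 12.0, 11.0])
rsi = simple_rsi(close, n=3)
print(rsi)
```

Values near 100 mean almost every move in the window was upward, values near 0 mean almost every move was downward, and 50 means gains and losses were balanced.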


import talib

feature_names = ['5d_close_pct']  # a list of the feature names for later

# Create moving averages and rsi for timeperiods of 14, 30, 50, and 200
for n in [14, 30, 50, 200]:

    # Create the moving average indicator and divide by Adj_Close
    lng_df['ma' + str(n)] = talib.SMA(lng_df['Adj_Close'].values,
                              timeperiod=n) / lng_df['Adj_Close']
    # Create the RSI indicator
    lng_df['rsi' + str(n)] = talib.RSI(lng_df['Adj_Close'].values, timeperiod=n)
    # Add rsi and moving average to the feature name list
    feature_names = feature_names + ['ma' + str(n), 'rsi' + str(n)]

Create features and targets


# Drop all na values
lng_df = lng_df.dropna()

# Create features and targets
# use feature_names for features; 5d_close_future_pct for targets
features = lng_df[feature_names]
targets = lng_df['5d_close_future_pct']

# Create DataFrame from target column and feature columns
feat_targ_df = lng_df[['5d_close_future_pct'] + feature_names]

# Calculate correlation matrix
corr = feat_targ_df.corr()

Check the correlations

Before we fit our first machine learning model, let’s look at the correlations between features and targets. Ideally we want large (near 1 or -1) correlations between features and targets. Examining correlations can help us tweak features to maximize correlation (for example, altering the timeperiod argument in the talib functions). It can also help us remove features that aren’t correlated to the target.
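
One way to act on the idea of removing uncorrelated features is to filter on the absolute correlation with the target. The sketch below does this on synthetic data; the column names, the random data, and the 0.3 cutoff are all assumptions for illustration, not values from the LNG dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the target, the other is pure noise
rng = np.random.default_rng(0)
n_rows = 1000
signal = rng.normal(size=n_rows)
toy = pd.DataFrame({
    'target': signal + 0.5 * rng.normal(size=n_rows),
    'useful_feature': signal,
    'noise_feature': rng.normal(size=n_rows),
})

# Correlation of every feature with the target (drop the target's self-correlation)
target_corr = toy.corr()['target'].drop('target')

threshold = 0.3  # hypothetical cutoff; choose it for your own data
kept_features = target_corr[target_corr.abs() >= threshold].index.tolist()

print(target_corr)
print(kept_features)
```

Using the absolute value matters because a strongly negative correlation is just as informative for a linear model as a strongly positive one.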

To easily plot a correlation matrix, we can use seaborn’s heatmap() function. This takes a correlation matrix as the first argument, and has many other options. Check out the annot option – this will help us turn on annotations.

import seaborn as sns

# Plot heatmap of correlation matrix
sns.heatmap(corr, annot=True)
plt.yticks(rotation=0); plt.xticks(rotation=90)  # fix ticklabel directions
plt.tight_layout()  # fits plot area to the plot, "tightly"
plt.show()  # show the plot
plt.clf()  # clear the plot area

# Create a scatter plot of the most highly correlated variable with the target
plt.scatter(lng_df['ma200'], lng_df['5d_close_future_pct'])
plt.show()


Linear modeling

Create train and test features

# Import the statsmodels library with the alias sm
import statsmodels.api as sm

# Add a constant to the features
linear_features = sm.add_constant(features)

# Create a size for the training set that is 85% of the total number of samples
train_size = int(0.85 * features.shape[0])
train_features = linear_features[:train_size]
train_targets = targets[:train_size]
test_features = linear_features[train_size:]
test_targets = targets[train_size:]
print(linear_features.shape, train_features.shape, test_features.shape)

Fit a linear model

# Create the linear model and complete the least squares fit
model = sm.OLS(train_targets, train_features)
results = model.fit()  # fit the model

# Examine p-values
# Coefficients with p <= 0.05 are typically considered significantly different from 0
print(results.pvalues)

# Make predictions from our model for train and test sets
train_predictions = results.predict(train_features)
test_predictions = results.predict(test_features)

Printed p-values:

ma14 1.317652e-01
rsi14 4.119023e-10
ma30 2.870964e-01
rsi30 1.315491e-11
ma50 6.542888e-08
rsi50 1.598367e-12
ma200 1.087610e-02
rsi200 2.559536e-11
dtype: float64

Here ma14 and ma30 have p-values above 0.05, so their coefficients are not significantly different from 0 at that level. The other features listed fall below the threshold.

Evaluate our results

import numpy as np

# Scatter the predictions vs the targets with 80% transparency
plt.scatter(train_predictions, train_targets, alpha=0.2, color='b', label='train')
plt.scatter(test_predictions, test_targets, alpha=0.2, color='r', label='test')

# Plot the perfect prediction line
xmin, xmax = plt.xlim()
plt.plot(np.arange(xmin, xmax, 0.01), np.arange(xmin, xmax, 0.01), c='k')

# Set the axis labels and show the plot
plt.xlabel('predictions')
plt.ylabel('actual')
plt.legend()  # show the legend
plt.show()

