Stock price prediction with linear regression

Machine learning for finance in Python

Preparing data and a linear model

Explore the data with some EDA

Any time we begin a machine learning (ML) project, we first need to do some exploratory data analysis (EDA) to familiarize ourselves with the data. This includes raw data plots, histograms, and more.
I typically begin with raw data plots and histograms, which let us understand the data's distributions. If the data is normally distributed, we can use tools such as parametric statistics.
There are two stocks loaded for you into pandas DataFrames: lng_df and spy_df (LNG and SPY). We’ll use the closing prices and eventually volume as inputs to ML algorithms.

import matplotlib.pyplot as plt

print(lng_df.head())  # examine the LNG DataFrame
print(spy_df.head())  # examine the SPY DataFrame

# Plot the Adj_Close columns for SPY and LNG
spy_df['Adj_Close'].plot(label='SPY', legend=True)
lng_df['Adj_Close'].plot(label='LNG', legend=True, secondary_y=True)
plt.show()  # show the plot
plt.clf()  # clear the plot space

# Histogram of the daily price change percent of Adj_Close for LNG
lng_df['Adj_Close'].pct_change(1).plot.hist(bins=50)
plt.xlabel('adjusted close 1-day percent change')
plt.show()

The distribution is roughly normal, and the day-to-day changes are small.
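
To put a number on "roughly normal", one option is to look at the skewness and excess kurtosis of the daily returns, both of which are close to 0 for a normal distribution. The sketch below uses simulated returns as a stand-in for lng_df['Adj_Close'].pct_change(1); the seed, the sample size, and the 2% volatility are assumptions made for illustration, not values taken from the data.

```python
import numpy as np
import pandas as pd

# Simulated daily returns; a stand-in for lng_df['Adj_Close'].pct_change(1)
rng = np.random.default_rng(0)
daily_returns = pd.Series(rng.normal(loc=0.0, scale=0.02, size=10_000))

skewness = daily_returns.skew()         # close to 0 for a symmetric distribution
excess_kurtosis = daily_returns.kurt()  # close to 0 for a normal distribution

print(f"skew = {skewness:.3f}, excess kurtosis = {excess_kurtosis:.3f}")
```

Real stock returns often have fatter tails than a normal distribution, so a clearly positive excess kurtosis on the actual data would not be surprising.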

Correlations

Correlations are nice to check out before building machine learning models, because we can see which features correlate to the target most strongly. Pearson’s correlation coefficient is often used, which only detects linear relationships. It’s commonly assumed our data is normally distributed, which we can “eyeball” from histograms. Highly correlated variables have a Pearson correlation coefficient near 1 (positively correlated) or -1 (negatively correlated). A value near 0 means the two variables are not linearly correlated.
If we use the same time periods for previous price changes and future price changes, we can see if the stock price is mean-reverting (bounces around) or trend-following (goes up if it has been going up recently).
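
As a small illustration of why Pearson's coefficient only picks up linear relationships, the sketch below computes it by hand for an exactly linear and an exactly quadratic relationship. The data and the helper name pearson are made up for this example.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # symmetric around 0
y_linear = 2 * x + 1          # exact linear relationship
y_quadratic = x ** 2          # strong, but non-linear, relationship

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print(round(r_linear, 6), round(abs(r_quadratic), 6))
```

The quadratic case is fully determined by x, yet its coefficient is essentially 0, so a low Pearson correlation does not rule out a non-linear dependence.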

# Create 5-day % changes of Adj_Close for the current day, and 5 days in the future
lng_df['5d_future_close'] = lng_df['Adj_Close'].shift(-5)
lng_df['5d_close_future_pct'] = lng_df['5d_future_close'].pct_change(5)
lng_df['5d_close_pct'] = lng_df['Adj_Close'].pct_change(5)
# Calculate the correlation matrix between the 5d close percentage changes (current and future)
corr = lng_df[['5d_close_pct', '5d_close_future_pct']].corr()
print(corr)
# Scatter the current 5-day percent change vs the future 5-day percent change
plt.scatter(lng_df['5d_close_pct'], lng_df['5d_close_future_pct'])
plt.show()

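
To see how shift(-5) and pct_change(5) line up, here is a minimal sketch on a hypothetical price series (the prices 100 to 119 are invented). It checks that shifting first and then taking the 5-period change gives the same value as close[t+5] / close[t] - 1. Older pandas versions forward-fill missing values in pct_change by default, which can turn the trailing NaN rows of the shifted column into spurious values, so the sketch passes fill_method=None explicitly.

```python
import pandas as pd

# Hypothetical closing prices: 100, 101, ..., 119
close = pd.Series(range(100, 120), dtype=float)

past_pct = close.pct_change(5, fill_method=None)  # close[t] / close[t-5] - 1
future_pct = close.shift(-5) / close - 1          # close[t+5] / close[t] - 1

# The same future change, computed by shifting first and then taking pct_change
future_pct_alt = close.shift(-5).pct_change(5, fill_method=None)

aligned = pd.DataFrame({'past': past_pct,
                        'future': future_pct,
                        'future_alt': future_pct_alt}).dropna()
print(aligned.head())
```

Only rows with both a 5-day history and a 5-day future survive dropna(), which is why the first and last five rows disappear.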

Data transforms, features, and targets

Create moving average and RSI features

The simplest indicator is the moving average; another commonly used one is the RSI.

MA

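The moving average (MA) is a technical indicator that uses statistical analysis to average a security's price (or an index) over a given period and connects the averages from successive periods into a line, which is used to observe the trend of price movement. Moving-average theory is one of the most widely applied technical indicators today. It helps traders confirm existing trends, anticipate emerging ones, and spot overextended trends that are about to reverse.
Commonly used moving averages cover 5, 10, 30, 60, 120, and 240 days. The 5-day and 10-day lines are short-term moving averages, used as references for short-term trading and called daily moving averages; the 30-day and 60-day lines are medium-term indicators, called quarterly moving averages; the 120-day and 240-day lines are long-term indicators, called yearly moving averages. Moving averages are generally examined from several angles.
Calculation: N-day moving average = sum of the closing prices over N days / N
Weighted moving average
The weighting rests on the idea that, within a moving average, the most recent closing price has the greatest influence on future price movement, so it is given a larger weight.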
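
The two formulas can be sketched with pandas as follows. The prices are hypothetical, and the linear weights 1..N (oldest to newest) are one common choice assumed here for illustration; the text above does not specify a weighting scheme.

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])  # hypothetical closing prices
n = 3

# Simple moving average: sum of the last N closes divided by N
sma = close.rolling(window=n).mean()

# Linearly weighted moving average: the newest close gets weight N, the oldest weight 1
weights = np.arange(1, n + 1)
wma = close.rolling(window=n).apply(
    lambda window: np.dot(window, weights) / weights.sum(), raw=True
)

print(pd.DataFrame({'close': close, 'sma3': sma, 'wma3': wma}))
```

In a rising series like this one, the weighted average sits above the simple average because it leans toward the latest prices.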

RSI

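The Relative Strength Index (RSI) is a technical curve built from the ratio of the points gained to the sum of the points gained and lost over a given period. It reflects how bullish the market has been during that period. It was first applied to futures trading by Welles Wilder; people later found that, among the many chart-based technical analyses, the theory and practice of the strength index are especially well suited to short-term investing in the stock market, so it came to be used to measure and analyze rises and falls in stock prices. The indicator is designed to reflect the strength of the price trend with three lines; the resulting chart gives investors a basis for their trades and is well suited to short-term spread trading.

Mathematical principle
Put simply, the RSI works out the balance of power between buyers and sellers numerically. For example, suppose 100 people face one product: if more than 50 of them want to buy and bid against each other, the price must rise. Conversely, if more than 50 rush to sell, the price naturally falls.
Strength-index theory holds that however sharply the market price rises or falls, the RSI moves between 0 and 100; assuming a normal distribution, RSI values mostly range between 30 and 70. A reading of 80 or even 90 is usually taken to mean the market is overbought, after which the price tends to pull back. When the RSI falls below 30, the market is considered oversold and the price is expected to rebound.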
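
The ratio described above can be sketched in pandas as 100 * average gain / (average gain + average loss). This is a simple-average variant written for illustration, with a made-up helper name (simple_rsi) and made-up prices; talib.RSI applies Wilder's smoothing, so its values will generally differ from this sketch.

```python
import pandas as pd

def simple_rsi(close: pd.Series, n: int) -> pd.Series:
    """RSI as 100 * average gain / (average gain + average loss) over n periods."""
    delta = close.diff()
    gains = delta.clip(lower=0)      # upward moves, 0 on down days
    losses = -delta.clip(upper=0)    # downward moves as positive numbers
    avg_gain = gains.rolling(window=n).mean()
    avg_loss = losses.rolling(window=n).mean()
    return 100 * avg_gain / (avg_gain + avg_loss)

close = pd.Series([10.0, 11.0, 10.0, 11.0, 12.0])  # hypothetical closing prices
rsi4 = simple_rsi(close, n=4)
print(rsi4)
```

Over the four price changes in this series there are three gains of 1 and one loss of 1, so the last value is 100 * 0.75 / (0.75 + 0.25) = 75.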

import talib

feature_names = ['5d_close_pct']  # a list of the feature names for later

# Create moving averages and rsi for timeperiods of 14, 30, 50, and 200
for n in [14, 30, 50, 200]:

    # Create the moving average indicator and divide by Adj_Close
    lng_df['ma' + str(n)] = talib.SMA(lng_df['Adj_Close'].values,
                              timeperiod=n) / lng_df['Adj_Close']
    # Create the RSI indicator
    lng_df['rsi' + str(n)] = talib.RSI(lng_df['Adj_Close'].values, timeperiod=n)
    
    # Add rsi and moving average to the feature name list
    feature_names = feature_names + ['ma' + str(n), 'rsi' + str(n)]
    
print(feature_names)

Create features and targets


# Drop all na values
lng_df = lng_df.dropna()

# Create features and targets
# use feature_names for features; 5d_close_future_pct for targets
features = lng_df[feature_names]
targets = lng_df['5d_close_future_pct']

# Create DataFrame from target column and feature columns
feat_targ_df = lng_df[['5d_close_future_pct'] + feature_names]

# Calculate correlation matrix
corr = feat_targ_df.corr()
print(corr)

Check the correlations

Before we fit our first machine learning model, let’s look at the correlations between features and targets. Ideally we want large (near 1 or -1) correlations between features and targets. Examining correlations can help us tweak features to maximize correlation (for example, altering the timeperiod argument in the talib functions). It can also help us remove features that aren’t correlated to the target.

To easily plot a correlation matrix, we can use seaborn’s heatmap() function. This takes a correlation matrix as the first argument, and has many other options. Check out the annot option – this will help us turn on annotations.

import seaborn as sns

# Plot heatmap of correlation matrix
sns.heatmap(corr, annot=True)
plt.yticks(rotation=0); plt.xticks(rotation=90)  # fix ticklabel directions
plt.tight_layout()  # fits plot area to the plot, "tightly"
plt.show()  # show the plot
plt.clf()  # clear the plot area

# Create a scatter plot of the most highly correlated variable with the target
plt.scatter(lng_df['ma200'], lng_df['5d_close_future_pct'])
plt.show()

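
Rather than reading the strongest feature off the heatmap by eye, the target column of the correlation matrix can be sorted by absolute value. The sketch below does this on synthetic data; the column names feat_a and feat_b, the seed, and the coefficients are all invented for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 500

# Synthetic data: 'feat_a' drives the target, 'feat_b' is pure noise
feat_a = rng.normal(size=n_rows)
feat_b = rng.normal(size=n_rows)
target = 0.8 * feat_a + 0.2 * rng.normal(size=n_rows)

demo_df = pd.DataFrame({'target': target, 'feat_a': feat_a, 'feat_b': feat_b})
demo_corr = demo_df.corr()

# Correlation of each feature with the target, strongest first
ranking = demo_corr['target'].drop('target').abs().sort_values(ascending=False)
print(ranking)
```

On the real data the same pattern would apply to the '5d_close_future_pct' column of corr.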

Linear modeling

Create train and test features

# Import the statsmodels library with the alias sm
import statsmodels.api as sm

# Add a constant to the features
linear_features = sm.add_constant(features)

# Create a size for the training set that is 85% of the total number of samples
train_size = int(0.85 * features.shape[0])
train_features = linear_features[:train_size]
train_targets = targets[:train_size]
test_features = linear_features[train_size:]
test_targets = targets[train_size:]
print(linear_features.shape, train_features.shape, test_features.shape)

Fit a linear model

# Create the linear model and complete the least squares fit
model = sm.OLS(train_targets, train_features)
results = model.fit()  # fit the model
print(results.summary())

# examine pvalues
# Features with p <= 0.05 are typically considered significantly different from 0
print(results.pvalues)

# Make predictions from our model for train and test sets
train_predictions = results.predict(train_features)
test_predictions = results.predict(test_features)

ma14 1.317652e-01
rsi14 4.119023e-10
ma30 2.870964e-01
rsi30 1.315491e-11
ma50 6.542888e-08
rsi50 1.598367e-12
ma200 1.087610e-02
rsi200 2.559536e-11
dtype: float64
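Not every feature is significant at the 0.05 level: ma14 (p ≈ 0.13) and ma30 (p ≈ 0.29) are above the threshold, while the four RSI features, ma50, and ma200 are below it and are candidates for predicting the price change.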
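
Applying the 0.05 threshold to the p-values printed above can be sketched like this. The values are copied from that output; the variable name alpha is introduced here for illustration.

```python
import pandas as pd

# p-values copied from the printed output above
pvalues = pd.Series({
    'ma14': 1.317652e-01, 'rsi14': 4.119023e-10,
    'ma30': 2.870964e-01, 'rsi30': 1.315491e-11,
    'ma50': 6.542888e-08, 'rsi50': 1.598367e-12,
    'ma200': 1.087610e-02, 'rsi200': 2.559536e-11,
})

alpha = 0.05  # conventional significance threshold
significant = pvalues[pvalues <= alpha].index.tolist()
not_significant = pvalues[pvalues > alpha].index.tolist()

print('significant:', significant)
print('not significant:', not_significant)
```

With the fitted model in memory, the same filter could be applied directly to results.pvalues.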

Evaluate our results

import numpy as np

# Scatter the predictions vs the targets with 80% transparency
plt.scatter(train_predictions, train_targets, alpha=0.2, color='b', label='train')
plt.scatter(test_predictions, test_targets, alpha=0.2, color='r', label='test')

# Plot the perfect prediction line
xmin, xmax = plt.xlim()
plt.plot(np.arange(xmin, xmax, 0.01), np.arange(xmin, xmax, 0.01), c='k')

# Set the axis labels and show the plot
plt.xlabel('predictions')
plt.ylabel('actual')
plt.legend()  # show the legend
plt.show()

However, the linear model's predictions are poor; a more complex model is needed for further work.
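
One way to quantify how poor the fit is would be the coefficient of determination (R²) on the train and test sets. The sketch below is a minimal version with a hypothetical helper (r_squared) and invented return values, not the numbers from this model.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical 5-day returns and two sets of predictions
actual = np.array([0.02, -0.01, 0.03, -0.02, 0.01])
perfect = actual.copy()                             # predicts every value exactly
always_mean = np.full_like(actual, actual.mean())   # ignores the features entirely

print(r_squared(actual, perfect), r_squared(actual, always_mean))
```

A perfect prediction scores 1 and always predicting the mean scores 0. On the real model, comparing r_squared(train_targets, train_predictions) with r_squared(test_targets, test_predictions) would show whether the test score is close to, or below, 0.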
