# Machine Learning Nanodegree
## Capstone Project
### Project: Stock Price Prediction
**Disclaimer**: all historical stock price data were downloaded from Yahoo Finance.
**Disclaimer**: lstm.py was provided as part of the project files.
---
# Definition
## Problem Statement
As stated in the “Problem Statement” section of the Capstone project description, the task is to build a predictor that uses historical data from online sources to try to predict future prices. The only input to the model at prediction time should be the date range, and nothing else. The predicted prices are then compared against the actual prices for the same date range in the testing period.
## Metrics
The metric used for this project will be the R^2 score between the actual prices in the testing period and the prices predicted by the model for the same period.
A second, indicative metric is the absolute percent difference between real and predicted prices. However, for machine learning purposes (training and testing), R^2 scores are the more reliable measure.
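The two metrics described above can be sketched as follows. This is a minimal illustration, not the project's own scoring code: the function names `r2` and `mean_abs_percent_diff` and the four sample prices are assumptions made up for the example.

```python
import numpy as np

def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

def mean_abs_percent_diff(actual, predicted):
    """Mean absolute difference, as a percentage of the actual price."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# Hypothetical prices, for illustration only
actual = [100.0, 102.0, 104.0, 106.0]
predicted = [101.0, 101.0, 105.0, 105.0]

print("R^2 = {:.2f}".format(r2(actual, predicted)))                            # R^2 = 0.80
print("Percent diff = {:.2f}%".format(mean_abs_percent_diff(actual, predicted)))  # Percent diff = 0.97%
```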
---
# Analysis
## Data Exploration
First, let's explore the data by downloading stock prices for Google.
For that purpose, I have built a special class called StockRegressor, which can download the data and store it in a Pandas DataFrame.
The first step is to import the class.
```python
%matplotlib inline
import numpy as np
# Seed NumPy so that we get reproducible results, especially with Keras
np.random.seed(0)
import time
import datetime
from calendar import monthrange
import pandas as pd
from IPython.display import display
from IPython.display import clear_output
# Note: this module was removed in statsmodels 0.13; newer versions provide statsmodels.tsa.arima.model.ARIMA
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
from StockRegressor import StockRegressor
from StockRegressor import StockGridSearch
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15, 8)
```
### The First StockRegressor Object
Getting our first historical price data batch ...
After downloading the prices from the Yahoo Finance web services, the StockRegressor instance below saves the historical prices into the pricing_info DataFrame. As a first processing step, we have changed the index of the DataFrame from 'dates' to 'timeline', which is an integer index.
The reason is that an integer index is easier to process: the dates correspond to trading days and are not consecutive, since they exclude weekends and holidays. This can be seen in the gap below between 02 Sep 2016 and 06 Sep 2016, which corresponds to the Labor Day long weekend (Monday, 05 Sep 2016).
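The switch from a date index to an integer index can be sketched with plain pandas. This is an illustration under assumptions, not the code inside StockRegressor: it uses only the four 'Adj Close' values shown in the table for this section, and the variable names are made up for the example.

```python
import pandas as pd

# Four trading days around the gap (Adj Close values from the table in this section)
df = pd.DataFrame(
    {"Adj Close": [768.780029, 771.460022, 780.080017, 780.349976]},
    index=pd.to_datetime(["2016-09-01", "2016-09-02", "2016-09-06", "2016-09-07"]),
)
df.index.name = "dates"

# Calendar days between consecutive rows: trading dates are not evenly spaced
gaps = df.index.to_series().diff().dt.days
print(gaps.tolist())

# Move the dates into a regular column and use a consecutive integer index instead
df = df.reset_index()
df.index.name = "timeline"
print(df)
```

With the integer index, "the next trading day" is always `timeline + 1`, regardless of weekends or holidays.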
> **Note:** There may be a bug in the Pandas library that causes an intermittent error in the Yahoo Finance web call. The bug can be traced to the file /anaconda/envs/**your_environment**/lib/python3.5/site-packages/pandas/core/indexes/datetimes.py, at line 1050, where the statement `if this.freq is None:` raises the error. As a workaround, another condition that tests for the `freq` attribute should be inserted before it, such as `if hasattr(this, 'freq'):`.
>
> **Note:** The fixed datetimes.py file is included with the submission.
```python
stock = StockRegressor('GOOG', dates= ['2014-10-01', '2016-04-30'])
display(stock.pricing_info[484:488])
```
Getting pricing information for GOOG for the period 2014-10-01 to 2016-09-27
Found a pricing file with wide range of dates, reading ... Stock-GOOG-1995-12-27-2017-09-05.csv
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Open</th>
<th>High</th>
<th>Low</th>
<th>Close</th>
<th>Adj Close</th>
<th>Volume</th>
<th>dates</th>
<th>timeline</th>
</tr>
<tr>
<th>timeline</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>484</th>
<td>769.250000</td>
<td>771.020020</td>
<td>764.299988</td>
<td>768.780029</td>
<td>768.780029</td>
<td>925100</td>
<td>2016-09-01</td>
<td>484</td>
</tr>
<tr>
<th>485</th>
<td>773.010010</td>
<td>773.919983</td>
<td>768.409973</td>
<td>771.460022</td>
<td>771.460022</td>
<td>1072700</td>
<td>2016-09-02</td>
<td>485</td>
</tr>
<tr>
<th>486</th>
<td>773.450012</td>
<td>782.000000</td>
<td>771.000000</td>
<td>780.080017</td>
<td>780.080017</td>
<td>1442800</td>
<td>2016-09-06</td>
<td>486</td>
</tr>
<tr>
<th>487</th>
<td>780.000000</td>
<td>782.729980</td>
<td>776.200012</td>
<td>780.349976</td>
<td>780.349976</td>
<td>893700</td>
<td>2016-09-07</td>
<td>487</td>
</tr>
</tbody>
</table>
</div>
```python
stock.adj_close_price['dates'].iloc[stock.testing_end_date]
```
Timestamp('2016-07-13 00:00:00')
### The Impact of the 'Volume' Feature
The next step is to eliminate the columns that are not needed. The 'Open', 'High', 'Low' and 'Close' columns will all be discarded, because we will be working with the 'Adj Close' prices only.
For 'Volume', let's explore its relevance below.
The table and scatter plot below show that Volume has very little linear correlation with prices (a correlation coefficient of about -0.065), so we will drop it from the discussion from now on.
There may be some correlation between spikes in Volume and abrupt changes in prices. That would be logical, since higher trading volumes could lead to larger price fluctuations. However, these spikes in volume happen on the same day as the price changes, and so have little predictive power. This might be a topic for future exploration.
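The difference between a same-day relationship and a predictive one can be checked with a lagged correlation. The sketch below uses synthetic data, not the GOOG series: it is constructed by assumption so that volume spikes coincide with large same-day price moves, and all names in it are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic data: an independent daily "shock" drives both volume and the size of that day's move
shock = rng.exponential(scale=1.0, size=n)
volume = 1e6 * (1.0 + shock) + rng.normal(0, 5e4, size=n)
returns = rng.normal(0, 0.005, size=n) * (1.0 + shock)

df = pd.DataFrame({"Volume": volume, "abs_return": np.abs(returns)})

# Same-day correlation: volume on day t against the absolute return on day t
same_day = df["Volume"].corr(df["abs_return"])

# Next-day correlation: volume on day t against the absolute return on day t + 1
next_day = df["Volume"].corr(df["abs_return"].shift(-1))

print("same-day correlation:", round(same_day, 3))
print("next-day correlation:", round(next_day, 3))
```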
---
```python
from sklearn.preprocessing import MinMaxScaler
scaler_volume = MinMaxScaler(copy=True, feature_range=(0, 1))
scaler_price = MinMaxScaler(copy=True, feature_range=(0, 1))
prices = stock.pricing_info.copy()
prices = prices.drop(labels=['Open', 'High', 'Low', 'Close', 'dates', 'timeline'], axis=1)
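# scikit-learn expects 2-D input; Series.reshape is not available in recent pandas, so reshape the underlying array
scaler_volume.fit(prices['Volume'].values.reshape(-1, 1))
scaler_price.fit(prices['Adj Close'].values.reshape(-1, 1))
# Flatten the scaled (n, 1) arrays back to 1-D before assigning them to the columns
prices['Volume'] = scaler_volume.transform(prices['Volume'].values.reshape(-1, 1)).ravel()
prices['Adj Close'] = scaler_price.transform(prices['Adj Close'].values.reshape(-1, 1)).ravel()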
print("\nCorrelation between Volume and Prices:")
display(prices.corr())
prices.plot(kind='scatter', x='Adj Close', y='Volume')
```
Correlation between Volume and Prices:
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Adj Close</th>
<th>Volume</th>
</tr>
</thead>
<tbody>
<tr>
<th>Adj Close</th>
<td>1.00000</td>
<td>-0.06493</td>
</tr>
<tr>
<th>Volume</th>
<td>-0.06493</td>
<td>1.00000</td>
</tr>
</tbody>
</table>
</div>
[Figure: scatter plot of scaled 'Volume' (y-axis) against scaled 'Adj Close' (x-axis)]

## Exploratory Visualization
Now let's explore the historical pricing. For that purpose, we have built two special-purpose plotting functions into the StockRegressor class.
The first plotting function shows the "learning_df" DataFrame. This is the DataFrame used to store all "workspace" data, i.e. dates, indexes, prices, and the predictions of multiple algorithms.
The second plotting function, which will be used less frequently, plots prices with their Bollinger bands. It is intended for price exploration only.
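For readers unfamiliar with Bollinger bands, they are usually defined as a rolling mean of the price plus and minus a multiple of the rolling standard deviation. The following is a minimal sketch of that common definition, not the StockRegressor implementation: the function name `bollinger_bands`, the 20-day window, the factor of 2 and the synthetic random-walk prices are all assumptions.

```python
import numpy as np
import pandas as pd

def bollinger_bands(prices, window=20, num_std=2.0):
    """Return the rolling mean and the bands num_std rolling standard deviations around it."""
    rolling = prices.rolling(window=window)
    middle = rolling.mean()
    std = rolling.std()  # pandas uses the sample standard deviation (ddof=1) by default
    return pd.DataFrame({
        "middle": middle,
        "upper": middle + num_std * std,
        "lower": middle - num_std * std,
    })

# Synthetic random-walk prices, for illustration only (not real stock data)
rng = np.random.default_rng(0)
prices = pd.Series(750.0 + np.cumsum(rng.normal(0, 5, size=120)), name="Adj Close")

bands = bollinger_bands(prices)
# The first window - 1 rows are NaN because the rolling window is not yet full
print(bands.dropna().head())
```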
Below, we call those two functions. As we haven't trained the StockRegressor, the plot_learning_data_frame() function will show the learning_df dataframe with only the pricing, and a vertical r