Time series and classifier combination

This assignment is divided into two parts:

  • In the first exercise we will see how to decompose and recompose time series in order to forecast future values.
  • In the second exercise we will study different methods for combining classifiers.
In [262]:
import pickle

import scipy.stats
import numpy as np
import pandas as pd
import matplotlib as mpl
from sklearn import svm
from sklearn import ensemble
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
!pip install statsmodels
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.graphics.tsaplots import plot_pacf, plot_acf

%matplotlib inline

1. Time series

In this first exercise we will work with time series. We will use the AirPassengers dataset, which contains the monthly number of airline passengers over a period of several years.

We start by reading the data and plotting it. As the plot shows, this is a clear example of a time series with heteroscedasticity, trend, seasonality and noise. Throughout this exercise we will deal with each of these components.

In [263]:
data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month', header=0)
data.head()
Out[263]:
            Passengers
Month
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121
In [264]:
data.plot(figsize=(15, 5))
Out[264]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff985d21a20>
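
One way to see the trend and the heteroscedasticity in numbers, not just in the plot, is to compare a rolling mean and a rolling standard deviation over 12-month windows. The sketch below is only an illustration: it builds a synthetic monthly series (an assumption, so that the block runs without the CSV file) whose seasonal swings grow with its level. With the real data, `data['Passengers']` would take the place of the hypothetical `series`.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical stand-in for data['Passengers']):
# a linear trend multiplied by a seasonal factor, so the swings grow with the level.
months = pd.date_range('1949-01-01', periods=144, freq='MS')
t = np.arange(144)
trend = 100 + 2.5 * t
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)
series = pd.Series(trend * seasonal, index=months)

# 12-month rolling statistics: each window covers one full seasonal cycle.
rolling_mean = series.rolling(window=12).mean()
rolling_std = series.rolling(window=12).std()

# Drop the first 11 rows, which have no complete window yet.
summary = pd.DataFrame({'mean': rolling_mean, 'std': rolling_std}).dropna()

# A rising rolling mean points to a trend; a rising rolling standard
# deviation points to heteroscedasticity (non-constant variance).
print(summary.iloc[[0, -1]])
```

If both columns are clearly larger in the last window than in the first, the series has an upward trend and a variance that grows over time.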

Before dealing with the different components of a time series, we will remove the last two years of data from the original dataset. That way, when we forecast future values, we can check whether the forecast matches the real data.

In [265]:
TEST_SIZE = 24  # hold out the last 24 months (two years) for testing
train, test = data.iloc[:-TEST_SIZE], data.iloc[-TEST_SIZE:]
# Integer time indices: 0..119 for training, 120..143 for testing
x_train = np.arange(train.shape[0])
x_test = np.arange(train.shape[0], data.shape[0])
train.shape, x_train.shape, test.shape, x_test.shape
Out[265]:
((120, 1), (120,), (24, 1), (24,))
In [266]:
fig, ax = plt.subplots(1, 1, figsize=(15, 5))
ax.plot(x_train, train)
ax.plot(x_test, test)
Out[266]:
[<matplotlib.lines.Line2D at 0x7ff9ccd10fd0>]
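
To illustrate how the held-out months can be used, here is a minimal sketch that scores a forecast against them. The seasonal-naive forecast (repeating the last observed year) and the mean absolute error are illustrative choices, not the method developed in this exercise. The series is synthetic (an assumption) so that the block runs on its own.

```python
import numpy as np

TEST_SIZE = 24
SEASON = 12  # monthly data with a yearly cycle

# Synthetic stand-in for the passenger counts (hypothetical values).
t = np.arange(144)
values = (100 + 2.5 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / SEASON))

# Chronological split: the last TEST_SIZE points are not used to build the forecast.
train_values, test_values = values[:-TEST_SIZE], values[-TEST_SIZE:]

# Seasonal-naive forecast: repeat the last observed season over the horizon.
last_season = train_values[-SEASON:]
forecast = np.tile(last_season, TEST_SIZE // SEASON)

# Mean absolute error between the forecast and the held-out data.
mae = np.mean(np.abs(test_values - forecast))
print('MAE on the held-out {} months: {:.2f}'.format(TEST_SIZE, mae))
```

Because this baseline ignores the trend, its error gives a rough reference value that a model capturing trend and seasonality should improve on.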