Data preparation

Throughout this practical assignment we will see how to apply different techniques for loading and preparing data:

1. Loading the dataset
2. Data analysis
   2.1 Basic statistical analysis
   2.2 Exploratory data analysis
3. Dimensionality reduction
4. Training and test

For that we will need the following libraries:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

import matplotlib.pyplot as plt
import ssl

ssl._create_default_https_context = ssl._create_unverified_context


pd.set_option('display.max_columns', None)

%matplotlib inline

1. Loading the dataset

First, you must load the Wine recognition dataset (more information at the link https://archive.ics.uci.edu/ml/datasets/Wine). It can be downloaded from the internet or it can be loaded directly from the "scikit-learn" library, which includes a set of very well known and used datasets for data mining and machine learning https://scikit-learn.org/stable/datasets.html.

Exercise: Load the "Wine Recognition" dataset and show:
- the number and name of the attributes (variables that could be used to predict the response "wine_class")
- the number of rows in the dataset
- verify whether there are "missing values" or not and in which columns

Hint: If you use sklearn (sklearn.datasets.load_wine), explore the different 'keys' of the obtained object.

Hint: It may be useful to pass the data (attributes + target) to a pandas dataframe.

from sklearn.datasets import load_wine
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['classes'] = pd.Categorical.from_codes(data.target, data.target_names)
print("First look at the data")
df.head()

First look at the data

print("Description of the data")
df.describe()

Description of the data

print("The number of rows is: " + str(df.shape[0]) + " and the number of columns: "+ str(df.shape[1]))

The number of rows is: 178 and the number of columns: 14

# Compute the total instead of asserting it, then show the count per column
print("Total missing values: " + str(df.isnull().sum().sum()))

df.isnull().sum()

Total missing values: 0

alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
classes                         0
dtype: int64
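
As an alternative sketch of the same checks, recent versions of scikit-learn can return the dataset directly as a pandas DataFrame. This assumes scikit-learn 0.23 or later (for the `as_frame` argument); the names `wine`, `frame` and `n_missing` are only illustrative and do not appear in the assignment.

```python
from sklearn.datasets import load_wine

# as_frame=True asks scikit-learn to build the pandas objects for us
wine = load_wine(as_frame=True)
print(sorted(wine.keys()))           # explore the available keys of the returned object

frame = wine.frame                   # attributes plus a 'target' column
n_rows, n_cols = frame.shape
n_missing = int(frame.isnull().sum().sum())

print("attributes:", list(wine.feature_names))
print("rows:", n_rows, "columns:", n_cols)
print("total missing values:", n_missing)
```

Here the response column is called `target` and holds integer codes, whereas the cells above build a `classes` column with the class names.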

2. Data analysis

2.1 Basic statistical analysis

Exercise: Perform a basic statistical analysis:
- Categorical variables:
  - Calculate the frequency.
  - Make a bar plot.
- Numerical variables:
  - Calculate basic descriptive statistics: mean, median, standard deviation, ...
  - Make a histogram of the variables: alcohol, magnesium and color_intensity

Hint: you can use the 'pandas' library and its 'describe' and 'value_counts' functions

print('Frequency calculation')
cols = df.columns
# Numeric columns (everything except the categorical target), using the public pandas API
num_cols_no_categoric = df.select_dtypes(include='number').columns

col_categoric=list(set(cols) - set(num_cols_no_categoric))
print("Frequency")
# pd.value_counts is deprecated in recent pandas; wrap the values in a Series instead
frecuencia_data = pd.Series(df[col_categoric].values.flatten()).value_counts()
print(frecuencia_data)

sns.catplot(x="classes", kind="count", palette="ch:.25", data=df);

Frequency calculation
Frequency
class_1    71
class_0    59
class_2    48
dtype: int64

print("Basic descriptive statistics:")
df[num_cols_no_categoric].describe()

Basic descriptive statistics:

print("Histogram of the variable: alcohol")

# sns.distplot is deprecated; sns.histplot draws the same count histogram
graf1  = sns.histplot(df['alcohol'])

Histogram of the variable: alcohol

print("Histogram of the variable: color_intensity")
graf3  = sns.histplot(df['color_intensity'],color="red")

Histogram of the variable: color_intensity

print("Histogram of the variable: magnesium")
graf2  = sns.histplot(df['magnesium'],color="green")

Histogram of the variable: magnesium

Analysis: Comment on the results.

With respect to the target, the wine class, the samples are distributed fairly evenly among the 3 types of wine (59, 71 and 48). That is, we do not have a large number of samples of one type and very few of the rest, which would complicate the classification task.

Of the three histograms, alcohol is the one spread most evenly across its range: the samples take many different values, which leads me to think it will be a difficult variable to use for classification. In contrast, color intensity is more concentrated and magnesium is tightly grouped around 90 to 100.
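
The class balance mentioned above can also be checked numerically. The following is a minimal sketch that reloads the dataset so it runs on its own; the names `counts`, `shares` and `imbalance_ratio` are illustrative, and the ratio of the largest to the smallest class is just one simple way of summarising balance.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
# Map the integer targets (0, 1, 2) to their class names
classes = pd.Series(wine.target_names[wine.target], name="classes")

counts = classes.value_counts()                          # absolute frequencies
shares = classes.value_counts(normalize=True).round(3)   # relative frequencies
imbalance_ratio = counts.max() / counts.min()            # 1.0 would be perfectly balanced

print(counts)
print(shares)
print("largest / smallest class:", round(imbalance_ratio, 2))
```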

2.2 Exploratory data analysis

In this exercise we will explore the relationship of some of the numerical attributes with the response variable ("wine_class"), both graphically and qualitatively, and we will analyse the different correlations. To start, we will select only 3 attributes to explore: alcohol, magnesium and color_intensity.

feats_to_explore = ['alcohol', 'magnesium', 'color_intensity']

Exercise: Using a graphical library, such as "matplotlib", make a histogram plot of values for each of the selected attributes, separated by the values of the response class. The three plots have to be overlaid, that is, for example, in the histogram of the "alcohol" feature there have to be in a single plot three histograms, one for each wine class. Add a legend to know which class corresponds to each histogram. The purpose is to observe how each of the attributes is distributed depending on the class they have, in order to identify visually and quickly whether some attributes allow the different classes of wines to be differentiated clearly.

Hint: you can use the "alpha" parameter in the plots so that the three histograms can be appreciated.

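for i in [0,1,2]:
    # alpha makes the three overlaid histograms visible; 'C0', 'C1', 'C2' are matplotlib's default colours
    sns.histplot(df['alcohol'][data.target==i],
                 kde=False, alpha=0.5, color='C{}'.format(i), label='{}'.format(i))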

plt.legend()

<matplotlib.legend.Legend at 0x1246dcf98>

for i in [0,1,2]:
    sns.histplot(df['color_intensity'][data.target==i], kde=False, alpha=0.5,
                 color='C{}'.format(i), label='{}'.format(i))

plt.legend()

<matplotlib.legend.Legend at 0x105b58898>

for i in [0,1,2]:
    sns.histplot(df['magnesium'][data.target==i], kde=False, alpha=0.5,
                 color='C{}'.format(i), label='{}'.format(i))

plt.legend()

<matplotlib.legend.Legend at 0x105b6bfd0>

Analysis:
Looking at the histograms, which attribute seems to have more weight when classifying a wine? Which one seems to have less weight?

Color intensity, while the one with less weight is magnesium. In fact the color is more grouped and differentiated than the rest.
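
One way to check this visual impression numerically is a one-way ANOVA F statistic per attribute, which compares the variance between classes with the variance within classes. This is a sketch under my own assumptions: the assignment does not ask for it, `f_classif` is only one possible univariate measure, and it may not rank the attributes exactly as the eye does.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import f_classif

wine = load_wine()
frame = pd.DataFrame(wine.data, columns=wine.feature_names)
feats = ["alcohol", "magnesium", "color_intensity"]

# Larger F means the class means differ more relative to the spread inside each class
f_values, p_values = f_classif(frame[feats], wine.target)
scores = pd.Series(f_values, index=feats).sort_values(ascending=False)
print(scores.round(2))

# Per-class means, for comparison with the histograms
print(frame[feats].groupby(wine.target).mean().round(2))
```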

Exercise: Using the previous histograms, add a vertical line indicating the mean of each of the histograms (three per plot). Draw the lines in the same color as the histogram so that it is clear to which one they refer. Add to the legend the wine class and the corresponding standard deviation. The purpose is to verify numerically the differences identified previously in a visual way.

Hint: you can use "axvline", from matplotlib axis, for the vertical lines.

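for i in [0,1,2]:
    values = df['alcohol'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, stat='density', alpha=0.4, color='C{}'.format(i),
                 label='{} with std {}'.format(i,std))
    # vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color='C{}'.format(i), linestyle='--')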

plt.legend()

<matplotlib.legend.Legend at 0x125295dd8>

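for i in [0,1,2]:
    values = df['color_intensity'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, stat='density', alpha=0.4, color='C{}'.format(i),
                 label='{} with std {}'.format(i,std))
    # vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color='C{}'.format(i), linestyle='--')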

plt.legend()

<matplotlib.legend.Legend at 0x1274e4048>

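for i in [0,1,2]:
    values = df['magnesium'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, stat='density', alpha=0.4, color='C{}'.format(i),
                 label='{} with std {}'.format(i,std))
    # vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color='C{}'.format(i), linestyle='--')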

plt.legend()

<matplotlib.legend.Legend at 0x127615748>

Exercise: Calculate and show the correlation between the three variables that we are analysing.

fig, (ax) = plt.subplots(1, 1, figsize=(10,6))
corr = df[feats_to_explore].corr()

hm = sns.heatmap(corr,
                 ax=ax,           # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm", # Color Map.
                 #square=True,    # If True, set the Axes aspect to “equal” so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',       # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('Wine Attributes Correlation Heatmap',
              fontsize=14,
              fontweight='bold')

Text(0.5, 0.98, 'Wine Attributes Correlation Heatmap')

Exercise: Represent graphically the relationships between these variables (scatterplots). Differentiate the different classes with different colours. The purpose is to be able to observe and analyse the correlations graphically between some of the variables.

Hint: you can use the "pairplot" function from the 'seaborn' library with the "hue" parameter.

# Build the column list without modifying feats_to_explore, so the cell can be re-run safely
sns.pairplot(df[feats_to_explore + ["classes"]], hue="classes")

<seaborn.axisgrid.PairGrid at 0x1277a1198>

Analysis:
Observing the correlations, which variables are the ones that have the strongest correlation? Does the numerical result fit with the obtained plot?

Color intensity and alcohol have the strongest correlation. Yes, the numerical result fits the plot: the scatterplot of this pair (third row, left) shows a much clearer trend than the others, and the same pair has the highest value in the correlation heatmap, which confirms what the plot shows.
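
The strongest pair can also be read off the correlation matrix programmatically rather than from the heatmap. A minimal sketch, reloading the data so it is self-contained; `pairs` and `strongest_pair` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
frame = pd.DataFrame(wine.data, columns=wine.feature_names)
feats = ["alcohol", "magnesium", "color_intensity"]

corr = frame[feats].corr()            # Pearson correlation by default

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute correlation, strongest first
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(pairs.round(2))

strongest_pair = pairs.index[0]
print("strongest pair:", strongest_pair)
```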

3. Dimensionality reduction

In this exercise dimensionality reduction methods will be applied to the original dataset. The objective is to reduce the set of attributes to a new set with fewer dimensions. Thus instead of working with 3 variables chosen at random, we will use the information of all the attributes.

Exercise: Apply the Principal Component Analysis (PCA) dimensionality reduction method to reduce to 2 dimensions (the whole dataset with all the features). Generate a plot (in 2D) with the PCA result using different colours for each of the response classes (wine_class), with the objective of visualising whether it is possible to separate the classes efficiently with this method. NOTE: Take care not to include the objective variable "wine class" in the dimensionality reduction. We want to be able to explain the objective variable as a function of the rest of the variables reduced to two dimensions.

Hint: It is not necessary to program the algorithm, you can use the implementation available in the "scikit-learn" library.

from sklearn.preprocessing import StandardScaler

features = data.feature_names
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['classes']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])
principalDf

finalDf = pd.concat([principalDf, df[['classes']]], axis = 1)
finalDf

fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['class_0', 'class_1', 'class_2']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['classes'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()
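
A useful complement to the 2D plot is to check how much of the total variance the two components retain. The sketch below repeats the standardization and PCA steps on a fresh copy of the data so that it runs on its own; `ratios` is an illustrative name.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per attribute

pca = PCA(n_components=2).fit(X_scaled)
ratios = pca.explained_variance_ratio_         # fraction of variance per component
print("variance explained per component:", ratios.round(3))
print("total variance explained:", round(float(ratios.sum()), 3))
```

If the total is well below 1, the 2D plot hides part of the structure of the data, which is worth keeping in mind when judging how well the classes separate.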

Exercise: Repeat the dimensionality reduction, but in this case using TSNE. You can find more information about this algorithm at the link: https://distill.pub/2016/misread-tsne/ As before, generate a plot (in 2D) with the TSNE result using different colours for each of the response classes (wine_class), with the objective of visualising whether it is possible to separate the classes efficiently with this method.

Hint: It is not necessary to program the algorithm, you can use the implementation available in the "scikit-learn" library. Hint: Apart from specifying the number of components, try using the "perplexity" parameter.

features = data.feature_names
# Separating out the features
x = df.loc[:, features].values
y=data.target
from sklearn.manifold import TSNE

tsne_results = TSNE(n_components=2).fit_transform(x)

target_ids = range(len([0,1,2]))


from matplotlib import pyplot as plt
colors = 'r', 'g', 'b', 'c', 'm', 'y', 'k', 'w', 'orange', 'purple'
for i, c, label in zip(target_ids, colors, [0,1,2]):
    plt.scatter(tsne_results[y == i, 0], tsne_results[y == i, 1], c=c, label=label)
plt.legend()
plt.show()


# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=tsne_results[:,0], y=tsne_results[:,1], hue=y, legend='full')

<matplotlib.axes._subplots.AxesSubplot at 0x127690780>

Analysis:
Observing the two plots, do you think the dimensionality reduction has worked well? Has it managed to separate the classes correctly? Which of the two methods has worked better? Why do we obtain such different results?

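Yes, I think it has worked well in the first case. With PCA the three classes of wine can be seen clearly, whereas with TSNE they are more mixed and harder to tell apart.

The results differ mainly because the two methods received different inputs and work in different ways. PCA was applied to the standardized features and is a linear projection onto the directions of greatest variance. TSNE was applied to the raw, unscaled features, so its distances are dominated by the attributes with the largest values, such as proline. TSNE is also a nonlinear, stochastic method that preserves local neighbourhoods, and its result depends on parameters such as perplexity.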
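
One thing worth trying here is to give TSNE the same standardized input that PCA received and to vary the perplexity, as the hint suggests. The sketch below only computes the embeddings; the perplexity values and the `random_state` are illustrative assumptions, not tuned settings, and the resulting arrays can be plotted with the same scatter code used above.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # put all attributes on the same scale

embeddings = {}
for perplexity in (5, 30):                     # illustrative values, not tuned
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    print(perplexity, embeddings[perplexity].shape)
```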

4. Training and test

In this last exercise we will apply a supervised learning method, specifically the Random Forest classifier, to predict the class to which each wine belongs and evaluate the accuracy obtained with the model. For that we will use:

- The original dataset with all the attributes
- The dataset reduced to only 2 attributes with PCA
- The dataset reduced to only 2 attributes with TSNE

Exercise: Using the original dataset:
- Divide the dataset into train and test.
- Define a Random Forest model (setting n_estimators=10 to keep the model simple).
- Apply cross validation with the defined model and the train dataset (cv=5 is enough).
- Calculate the mean and the standard deviation of the cross validation.

Hint: To separate between train and test you can use train_test_split from sklearn. Hint: To train a random forest model you can use 'RandomForestClassifier' from sklearn. Hint: To apply cross validation you can use 'cross_val_score' from sklearn.

Exercise: Repeat the same procedure as in the previous exercise with the dataset reduced to 2 dimensions with PCA.

Exercise: Repeat the same procedure as in the previous exercise with the dataset reduced to 2 dimensions with TSNE.

# Datasets to compare: all attributes, PCA projection and TSNE projection
train_data = []

from sklearn.ensemble import RandomForestClassifier

features = data.feature_names
# Separating out the features
x = df.loc[:, features].values
from sklearn.manifold import TSNE

tsne_results = TSNE(n_components=2).fit_transform(x)

from sklearn.model_selection import train_test_split

train_data.append(('Complete',df.loc[:, features].values))
train_data.append(('2 dimensions with PCA',principalDf))
train_data.append(('2 dimensions with TSNE',tsne_results))

# Separating out the target (integer codes 0, 1, 2)
df['classes'] = data.target

y = df['classes']

# One column per class, so that 'roc_auc' can be computed for each class
from sklearn import preprocessing
y = preprocessing.label_binarize(y, classes=[0,1, 2])

# Evaluate each dataset in turn with the same model and the same split
names = []
for name, x in train_data:
    print('*******')
    print(name)
    train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)
    print('Training Features Shape:', train_features.shape)
    print('Training Labels Shape:', train_labels.shape)
    print('Testing Features Shape:', test_features.shape)
    print('Testing Labels Shape:', test_labels.shape)
    rf = RandomForestClassifier(n_estimators = 10)
    cv_results = cross_val_score(rf, train_features, train_labels, cv=5, scoring='roc_auc')
    names.append(name)
    msg = "%s:  %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

*******
Complete
Training Features Shape: (133, 13)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 13)
Testing Labels Shape: (45, 3)
Complete:  0.992782 (0.007549)
*******
2 dimensions with PCA
Training Features Shape: (133, 2)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 2)
Testing Labels Shape: (45, 3)
2 dimensions with PCA:  0.992468 (0.005299)
*******
2 dimensions with TSNE
Training Features Shape: (133, 2)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 2)
Testing Labels Shape: (45, 3)
2 dimensions with TSNE:  0.842401 (0.031264)

Analysis: With which data has it worked better? Does it make sense? Does it fit with the results that we have seen in exercise 3?

Measures	Complete	PCA	TSNE
Mean	0.992468	0.992782	0.809156
Std	0.014336	0.011073	0.036769

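In the run summarised in the table, PCA gave the best mean score, although the complete dataset is practically tied with it: the difference is far smaller than one standard deviation, and the order of the two is reversed in the printed output above because neither the random forest nor TSNE is seeded. TSNE is clearly the worst of the three. Note that these scores are ROC AUC values, not accuracies, since the cross validation uses scoring='roc_auc'. This fits with the results of the previous exercise and therefore makes sense.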
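
Since the scores change from run to run, a reproducible variant can make the comparison easier to read. The following is a minimal sketch for the complete dataset only, using accuracy on the integer labels instead of ROC AUC; the fixed `random_state` values and the `stratify` argument are my assumptions, not part of the assignment.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_wine(return_X_y=True)

# stratify keeps the class proportions similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fixing random_state makes the forest, and therefore the scores, reproducible
rf = RandomForestClassifier(n_estimators=10, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="accuracy")
print("accuracy: %f (%f)" % (scores.mean(), scores.std()))
```

The same lines can be repeated with the PCA or TSNE projections in place of `X` to compare the three datasets under identical conditions.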

Exercise: With the best model that you have obtained:
- Generate predictions on the test dataset.
- Calculate the accuracy of the obtained predictions and the associated confusion matrix.

Hint: To calculate the accuracy and the confusion matrix you can use the functions inside the sklearn "metrics" module.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Dataset chosen in the previous analysis: the 2 PCA components
x = principalDf
# Integer class labels (0, 1, 2); no binarization is needed for accuracy or the confusion matrix
y = data.target
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)

# Instantiate model with 10 decision trees
rf = RandomForestClassifier(n_estimators = 10)
# Train the model on training data
rf.fit(train_features, train_labels);

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Accuracy of the predictions on the test set
print("Accuracy:", accuracy_score(test_labels, predictions))

# Rows are the true classes, columns the predicted classes
CM = confusion_matrix(test_labels, predictions)

# Visualize it as a heatmap
sns.heatmap(CM, annot=True, fmt="d")
plt.show()

Exercise: The random forest model depends on many parameters. In this exercise we have only specified the number of estimators (n_estimators) and we have let it use the rest of the parameters by default. Two very useful parameters in random forest (and in any model that uses trees) are max_depth and min_samples_split. These parameters help to control overfitting. Train the previous models using different combinations of the parameters:
- n_estimators
- max_depth
- min_samples_split

Have you managed to improve the model? How has each parameter helped to improve it, that is, what is the purpose of each one of them?

n_estimators: Number of trees in the forest.
max_depth: Maximum depth of each tree.
min_samples_split: Minimum number of samples a node must contain before it can be split.

test = [10,50,100,200]
conf = ['n_estimators','max_depth','min_samples_split']

# Same dataset as in the previous exercise: the 2 PCA components
x=principalDf

y = preprocessing.label_binarize(data.target, classes=[0,1, 2])
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)

# Import the model we are using
from sklearn.ensemble import RandomForestClassifier

# Vary one parameter at a time, leaving the others at their default values
for param in conf:
    for num in test:
        print('*******')
        print(param + str(num))
        rf = RandomForestClassifier(**{param: num})
        cv_results = cross_val_score(rf, train_features, train_labels, cv=5, scoring='roc_auc')
        # Label each result with the parameter being varied
        msg = "%s:  %f (%f)" % (param, cv_results.mean(), cv_results.std())
        print(msg)

*******
n_estimators10
n_estimators:  0.989470 (0.012881)
*******
n_estimators50
n_estimators:  0.987855 (0.014393)
*******
n_estimators100
n_estimators:  0.990644 (0.009293)
*******
n_estimators200
n_estimators:  0.991977 (0.007641)
*******
max_depth10
max_depth:  0.992803 (0.005915)
*******
max_depth50
max_depth:  0.993425 (0.005470)
*******
max_depth100
max_depth:  0.994343 (0.004828)
*******
max_depth200
max_depth:  0.992940 (0.007179)
*******
min_samples_split10
min_samples_split:  0.989566 (0.006284)
*******
min_samples_split50
min_samples_split:  0.978533 (0.010368)
*******
min_samples_split100
min_samples_split:  0.500000 (0.000000)
*******
min_samples_split200
min_samples_split:  0.500000 (0.000000)

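The scores above are cross-validated ROC AUC values. Increasing n_estimators or max_depth changes them only slightly (they stay between 0.988 and 0.994, differences comparable to the standard deviations), at the cost of more computation. In contrast, increasing min_samples_split lowers the score while speeding up execution: with values of 100 and 200 the score falls to 0.5, the value of a classifier that does not separate the classes at all, which suggests the threshold is too high for the trees to split on a training set of only 133 samples.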
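
The loops above vary one parameter at a time. To try combinations of the three parameters, as the exercise asks, a grid search is one option. The sketch below is a minimal example under my own assumptions: the grid values are illustrative rather than recommended, accuracy is used as the score, and the complete dataset is reloaded so that the block runs on its own.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Small illustrative grid: every combination is evaluated with 5-fold cross validation
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best CV accuracy: %f" % search.best_score_)
# The best model is refitted on the whole training set and scored on the held-out test set
print("test accuracy: %f" % search.score(X_test, y_test))
```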

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline	classes
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0	class_0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0	class_0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0	class_0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0	class_0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0	class_0

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
count	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000
mean	13.000618	2.336348	2.366517	19.494944	99.741573	2.295112	2.029270	0.361854	1.590899	5.058090	0.957449	2.611685	746.893258
std	0.811827	1.117146	0.274344	3.339564	14.282484	0.625851	0.998859	0.124453	0.572359	2.318286	0.228572	0.709990	314.907474
min	11.030000	0.740000	1.360000	10.600000	70.000000	0.980000	0.340000	0.130000	0.410000	1.280000	0.480000	1.270000	278.000000
25%	12.362500	1.602500	2.210000	17.200000	88.000000	1.742500	1.205000	0.270000	1.250000	3.220000	0.782500	1.937500	500.500000
50%	13.050000	1.865000	2.360000	19.500000	98.000000	2.355000	2.135000	0.340000	1.555000	4.690000	0.965000	2.780000	673.500000
75%	13.677500	3.082500	2.557500	21.500000	107.000000	2.800000	2.875000	0.437500	1.950000	6.200000	1.120000	3.170000	985.000000
max	14.830000	5.800000	3.230000	30.000000	162.000000	3.880000	5.080000	0.660000	3.580000	13.000000	1.710000	4.000000	1680.000000

	principal component 1	principal component 2
0	3.316751	-1.443463
1	2.209465	0.333393
2	2.516740	-1.031151
3	3.757066	-2.756372
4	1.008908	-0.869831
...	...	...
173	-3.370524	-2.216289
174	-2.601956	-1.757229
175	-2.677839	-2.760899
176	-2.387017	-2.297347
177	-3.208758	-2.768920

	principal component 1	principal component 2	classes
0	3.316751	-1.443463	class_0
1	2.209465	0.333393	class_0
2	2.516740	-1.031151	class_0
3	3.757066	-2.756372	class_0
4	1.008908	-0.869831	class_0
...	...	...	...
173	-3.370524	-2.216289	class_2
174	-2.601956	-1.757229	class_2
175	-2.677839	-2.760899	class_2
176	-2.387017	-2.297347	class_2
177	-3.208758	-2.768920	class_2

Wine dataset analysis: preparation, EDA, PCA and classification

Data Mining Example.