Python · Wine dataset · Data mining

Wine dataset analysis: preparation, EDA, PCA and classification

This post turns the original notebook into a practical data-analysis workflow: load the Wine dataset, inspect variables, scale the data, reduce dimensionality and validate a classifier with a train/test split.

What problem does this analysis solve?

The Wine dataset contains chemical measurements from different wine cultivars. The goal is to understand which variables separate the classes and whether a supervised model can classify the origin of a wine from those measurements.

1. Load and validate

Check rows, columns, feature names, target classes and missing values before modeling.

2. Explore distributions

Use descriptive statistics and plots to detect scale differences, outliers and class separation.

3. Reduce dimensions

Compare PCA and t-SNE as complementary views: PCA explains variance, t-SNE helps inspect local structure.

4. Train and evaluate

Split train/test data and use metrics such as accuracy and confusion matrix to avoid judging the model by intuition.

Main takeaway

The important lesson is not only the final classifier. The real value is the complete sequence: understand the data, transform it when needed, visualize it from several angles and only then train a model.

Data Mining Example.

 

Data preparation

Throughout this practical assignment we will see how to apply different techniques for loading and preparing data:

  1. Loading a dataset
  2. Data analysis
    2.1 Basic statistical analysis
    2.2 Exploratory data analysis
  3. Dimensionality reduction
  4. Training and test

For that we will need the following libraries:

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import datasets
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score

import matplotlib.pyplot as plt
import ssl

ssl._create_default_https_context = ssl._create_unverified_context


pd.set_option('display.max_columns', None)

%matplotlib inline

1. Loading the dataset

First, you must load the Wine recognition dataset (more information at the link https://archive.ics.uci.edu/ml/datasets/Wine). It can be downloaded from the internet or it can be loaded directly from the "scikit-learn" library, which includes a set of very well known and used datasets for data mining and machine learning https://scikit-learn.org/stable/datasets.html.

Exercise: Load the "Wine Recognition" dataset and show:
- the number and name of the attributes (variables that could be used to predict the response "wine_class")
- the number of rows in the dataset
- verify whether there are "missing values" or not and in which columns
Hint: If you use sklearn (sklearn.datasets.load_wine), explore the different 'keys' of the obtained object.
Hint: It may be useful to pass the data (attributes + target) to a pandas dataframe.
In [2]:
from sklearn.datasets import load_wine
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['classes'] = pd.Categorical.from_codes(data.target, data.target_names)
print("First look at the data")
df.head()
First look at the data
Out[2]:
   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  od280/od315_of_diluted_wines  proline  classes
0    14.23        1.71  2.43               15.6      127.0           2.80        3.06                  0.28             2.29             5.64  1.04                          3.92   1065.0  class_0
1    13.20        1.78  2.14               11.2      100.0           2.65        2.76                  0.26             1.28             4.38  1.05                          3.40   1050.0  class_0
2    13.16        2.36  2.67               18.6      101.0           2.80        3.24                  0.30             2.81             5.68  1.03                          3.17   1185.0  class_0
3    14.37        1.95  2.50               16.8      113.0           3.85        3.49                  0.24             2.18             7.80  0.86                          3.45   1480.0  class_0
4    13.24        2.59  2.87               21.0      118.0           2.80        2.69                  0.39             1.82             4.32  1.04                          2.93    735.0  class_0
In [3]:
print("Description of the data")
df.describe()
Description of the data
Out[3]:
          alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity         hue  od280/od315_of_diluted_wines      proline
count  178.000000  178.000000  178.000000         178.000000  178.000000     178.000000  178.000000            178.000000       178.000000       178.000000  178.000000                    178.000000   178.000000
mean    13.000618    2.336348    2.366517          19.494944   99.741573       2.295112    2.029270              0.361854         1.590899         5.058090    0.957449                      2.611685   746.893258
std      0.811827    1.117146    0.274344           3.339564   14.282484       0.625851    0.998859              0.124453         0.572359         2.318286    0.228572                      0.709990   314.907474
min     11.030000    0.740000    1.360000          10.600000   70.000000       0.980000    0.340000              0.130000         0.410000         1.280000    0.480000                      1.270000   278.000000
25%     12.362500    1.602500    2.210000          17.200000   88.000000       1.742500    1.205000              0.270000         1.250000         3.220000    0.782500                      1.937500   500.500000
50%     13.050000    1.865000    2.360000          19.500000   98.000000       2.355000    2.135000              0.340000         1.555000         4.690000    0.965000                      2.780000   673.500000
75%     13.677500    3.082500    2.557500          21.500000  107.000000       2.800000    2.875000              0.437500         1.950000         6.200000    1.120000                      3.170000   985.000000
max     14.830000    5.800000    3.230000          30.000000  162.000000       3.880000    5.080000              0.660000         3.580000        13.000000    1.710000                      4.000000  1680.000000
In [4]:
print("The number of rows is: " + str(df.shape[0]) + " and the number of columns: " + str(df.shape[1]))
The number of rows is: 178 and the number of columns: 14
In [5]:
print("There are no null values")

df.isnull().sum()
There are no null values
Out[5]:
alcohol                         0
malic_acid                      0
ash                             0
alcalinity_of_ash               0
magnesium                       0
total_phenols                   0
flavanoids                      0
nonflavanoid_phenols            0
proanthocyanins                 0
color_intensity                 0
hue                             0
od280/od315_of_diluted_wines    0
proline                         0
classes                         0
dtype: int64
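
As a complement to the hint about exploring the keys of the loaded object, the following is a minimal sketch of the same checks done in one pass. It assumes scikit-learn 0.23 or later, where load_wine accepts as_frame=True; the names wine, wine_df, n_rows, n_cols and n_missing are illustrative and do not come from the original notebook.

```python
from sklearn.datasets import load_wine

# as_frame=True asks scikit-learn for pandas objects instead of NumPy arrays
wine = load_wine(as_frame=True)

# The returned Bunch behaves like a dictionary, so its keys can be listed
print(sorted(wine.keys()))

# 'frame' holds the 13 features plus a numeric 'target' column
wine_df = wine.frame
n_rows, n_cols = wine_df.shape
n_missing = int(wine_df.isnull().sum().sum())

print("rows:", n_rows)
print("columns (features + target):", n_cols)
print("missing values:", n_missing)
```

Note that the target arrives here as integer codes in a column called 'target', whereas the notebook builds a categorical 'classes' column with the class names.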

2. Data analysis

2.1 Basic statistical analysis

Exercise: Perform a basic statistical analysis:
- Categorical variables:
  - Calculate the frequency.
  - Make a bar plot.
- Numerical variables:
  - Calculate basic descriptive statistics: mean, median, standard deviation, ...
  - Make a histogram of the variables: alcohol, magnesium and color_intensity
Hint: you can use the 'pandas' library and its 'describe' and 'value_counts' functions
In [6]:
print('Frequency calculation')
cols = df.columns
# Numeric columns only; the categorical target 'classes' is excluded
num_cols_no_categoric = df.select_dtypes(include='number').columns

# Whatever is not numeric is treated as categorical
col_categoric = list(set(cols) - set(num_cols_no_categoric))
frecuencia_data = pd.Series(df[col_categoric].values.flatten()).value_counts()
print("Frequency")
print(frecuencia_data)

# Bar plot of the number of samples per class
sns.catplot(x="classes", kind="count", palette="ch:.25", data=df);
Frequency calculation
Frequency
class_1    71
class_0    59
class_2    48
dtype: int64
In [7]:
print("Basic descriptive statistics:")
df[num_cols_no_categoric].describe()
Basic descriptive statistics:
Out[7]:
          alcohol  malic_acid         ash  alcalinity_of_ash   magnesium  total_phenols  flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity         hue  od280/od315_of_diluted_wines      proline
count  178.000000  178.000000  178.000000         178.000000  178.000000     178.000000  178.000000            178.000000       178.000000       178.000000  178.000000                    178.000000   178.000000
mean    13.000618    2.336348    2.366517          19.494944   99.741573       2.295112    2.029270              0.361854         1.590899         5.058090    0.957449                      2.611685   746.893258
std      0.811827    1.117146    0.274344           3.339564   14.282484       0.625851    0.998859              0.124453         0.572359         2.318286    0.228572                      0.709990   314.907474
min     11.030000    0.740000    1.360000          10.600000   70.000000       0.980000    0.340000              0.130000         0.410000         1.280000    0.480000                      1.270000   278.000000
25%     12.362500    1.602500    2.210000          17.200000   88.000000       1.742500    1.205000              0.270000         1.250000         3.220000    0.782500                      1.937500   500.500000
50%     13.050000    1.865000    2.360000          19.500000   98.000000       2.355000    2.135000              0.340000         1.555000         4.690000    0.965000                      2.780000   673.500000
75%     13.677500    3.082500    2.557500          21.500000  107.000000       2.800000    2.875000              0.437500         1.950000         6.200000    1.120000                      3.170000   985.000000
max     14.830000    5.800000    3.230000          30.000000  162.000000       3.880000    5.080000              0.660000         3.580000        13.000000    1.710000                      4.000000  1680.000000
In [8]:
print("Histogram of the variable: alcohol")

# histplot replaces the deprecated distplot
graf1 = sns.histplot(df['alcohol'], kde=False)
Histogram of the variable: alcohol
In [9]:
print("Histogram of the variable: color_intensity")
graf3 = sns.histplot(df['color_intensity'], kde=False, color="red")
Histogram of the variable: color_intensity
In [10]:
print("Histogram of the variable: magnesium")
graf2 = sns.histplot(df['magnesium'], kde=False, color="green")
Histogram of the variable: magnesium
Analysis: Comment on the results.

With respect to the target, the wine class, the samples are distributed fairly evenly among the 3 types of wine (59, 71 and 48 samples). No class dominates the dataset, so class imbalance should not complicate the classification task.

In the histograms, alcohol is the most evenly spread of the three variables: its values cover the whole range from about 11 to 15 without a single dominant peak, which suggests that, taken alone, it may be a difficult variable to use for classification. In contrast, color intensity is more grouped, and magnesium is tightly concentrated around 90 to 100.
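
The class balance discussed above can also be checked numerically. The sketch below is one possible way to do it; the names classes, counts, shares and imbalance_ratio are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Map the integer targets to their class names
classes = pd.Series(wine.target_names[wine.target], name="classes")

counts = classes.value_counts()            # samples per class
shares = (counts / counts.sum()).round(3)  # proportion of each class

# Ratio between the largest and the smallest class; 1.0 would be perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print("largest / smallest class:", round(imbalance_ratio, 2))
```

A ratio close to 1 supports the reading that no class dominates; a much larger ratio would suggest using stratified splits or class weights.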

2.2 Exploratory data analysis

In this exercise we will explore the relationship of some of the numerical attributes with the response variable ("wine_class"), both graphically and qualitatively, and we will analyse the different correlations. To start, we will select only 3 attributes to explore: alcohol, magnesium and color_intensity.

In [11]:
feats_to_explore = ['alcohol', 'magnesium', 'color_intensity']
Exercise: Using a graphical library, such as "matplotlib", make a histogram plot of values for each of the selected attributes, separated by the values of the response class. The three plots have to be overlaid, that is, for example, in the histogram of the "alcohol" feature there have to be in a single plot three histograms, one for each wine class. Add a legend to know which class corresponds to each histogram. The purpose is to observe how each of the attributes is distributed depending on the class they have, in order to identify visually and quickly whether some attributes allow the different classes of wines to be differentiated clearly.
Hint: you can use the "alpha" parameter in the plots so that the three histograms can be appreciated.
In [12]:
for i in [0,1,2]:
    # One semi-transparent histogram per wine class
    sns.histplot(df['alcohol'][data.target==i],
                 kde=False, alpha=0.5, label='{}'.format(i))

plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x1246dcf98>
In [13]:
for i in [0,1,2]:
    sns.histplot(df['color_intensity'][data.target==i], kde=False, alpha=0.5, label='{}'.format(i))

plt.legend()
Out[13]:
<matplotlib.legend.Legend at 0x105b58898>
In [14]:
for i in [0,1,2]:
    # One semi-transparent histogram per wine class
    sns.histplot(df['magnesium'][data.target==i],
                 kde=False, alpha=0.5, label='{}'.format(i))

plt.legend()
Out[14]:
<matplotlib.legend.Legend at 0x105b6bfd0>
Analysis:
Looking at the histograms, which attribute seems to have more weight when classifying a wine? Which one seems to have less weight?

Color intensity seems to have the most weight, and magnesium the least. The color intensity histograms are more grouped and better differentiated by class than those of the other two variables.

Exercise: Using the previous histograms, add a vertical line indicating the mean of each of the histograms (three per plot). Draw the lines in the same color as the histogram so that it is clear to which one they refer. Add to the legend the wine class and the corresponding standard deviation. The purpose is to verify numerically the differences identified previously in a visual way.
Hint: you can use "axvline", from matplotlib axis, for the vertical lines.
In [15]:
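for i, color in zip([0,1,2], ['C0','C1','C2']):
    values = df['alcohol'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, color=color, alpha=0.5,
                 label='{} with std {}'.format(i,std))
    # Vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color=color, linestyle='--')

plt.legend()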
Out[15]:
<matplotlib.legend.Legend at 0x125295dd8>
In [16]:
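for i, color in zip([0,1,2], ['C0','C1','C2']):
    values = df['color_intensity'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, color=color, alpha=0.5,
                 label='{} with std {}'.format(i,std))
    # Vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color=color, linestyle='--')

plt.legend()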
Out[16]:
<matplotlib.legend.Legend at 0x1274e4048>
In [17]:
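for i, color in zip([0,1,2], ['C0','C1','C2']):
    values = df['magnesium'][data.target==i]
    std = round(values.std(),3)
    sns.histplot(values, kde=True, color=color, alpha=0.5,
                 label='{} with std {}'.format(i,std))
    # Vertical line at the class mean, in the same colour as its histogram
    plt.axvline(values.mean(), color=color, linestyle='--')

plt.legend()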
Out[17]:
<matplotlib.legend.Legend at 0x127615748>
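
Since the purpose of this exercise is to verify the visual differences numerically, the per-class means and standard deviations can also be tabulated directly. This is a minimal sketch assuming scikit-learn 0.23 or later (for as_frame=True); frame, feats and summary are illustrative names.

```python
from sklearn.datasets import load_wine

# 'frame' contains the 13 features plus an integer 'target' column
frame = load_wine(as_frame=True).frame
feats = ["alcohol", "magnesium", "color_intensity"]

# Mean and standard deviation of each selected feature within each class
summary = frame.groupby("target")[feats].agg(["mean", "std"]).round(3)
print(summary)
```

Features whose class means are far apart relative to their standard deviations are the ones most likely to help separate the classes.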
Exercise: Calculate and show the correlation between the three variables that we are analysing.
In [18]:
fig, (ax) = plt.subplots(1, 1, figsize=(10,6))
corr = df[feats_to_explore].corr()

hm = sns.heatmap(corr,
                 ax=ax,           # Axes in which to draw the plot, otherwise use the currently-active Axes.
                 cmap="coolwarm", # Color Map.
                 #square=True,    # If True, set the Axes aspect to “equal” so each cell will be square-shaped.
                 annot=True,
                 fmt='.2f',       # String formatting code to use when adding annotations.
                 #annot_kws={"size": 14},
                 linewidths=.05)

fig.subplots_adjust(top=0.93)
fig.suptitle('Wine Attributes Correlation Heatmap',
              fontsize=14,
              fontweight='bold')
Out[18]:
Text(0.5, 0.98, 'Wine Attributes Correlation Heatmap')
Exercise: Represent graphically the relationships between these variables (scatterplots). Differentiate the different classes with different colours. The purpose is to be able to observe and analyse the correlations graphically between some of the variables.
Hint: you can use the "pairplot" function from the 'seaborn' library with the "hue" parameter.
In [19]:
# Add the class column without mutating feats_to_explore, so the cell can be re-run safely
sns.pairplot(data=df[feats_to_explore + ["classes"]], hue="classes")
Out[19]:
<seaborn.axisgrid.PairGrid at 0x1277a1198>
Analysis:
Observing the correlations, which variables are the ones that have the strongest correlation? Does the numerical result fit with the obtained plot?

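Alcohol and color_intensity are the most strongly correlated pair: they have the highest coefficient in the correlation heatmap. The numerical result does fit with the plot: in the pair plot, the panel for these two variables (third row, first column) shows a much clearer pattern than the rest, which confirms the visualised information.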
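
To answer the question without reading values off the heatmap, the strongest pair can be extracted programmatically. The following is a small sketch, not the notebook's own method; frame, feats, pairs and strongest_pair are illustrative names.

```python
from itertools import combinations

from sklearn.datasets import load_wine

frame = load_wine(as_frame=True).frame
feats = ["alcohol", "magnesium", "color_intensity"]

# Pearson correlation matrix of the three selected features
corr = frame[feats].corr()

# Keep each unordered pair once and rank by absolute correlation
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(feats, 2)}
strongest_pair = max(pairs, key=lambda p: abs(pairs[p]))

for pair, value in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(pair, round(float(value), 3))
print("strongest pair:", strongest_pair)
```

Ranking by absolute value matters because a strong negative correlation is as informative as a strong positive one.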

3. Dimensionality reduction

In this exercise dimensionality reduction methods will be applied to the original dataset. The objective is to reduce the set of attributes to a new set with fewer dimensions. Thus instead of working with 3 variables chosen at random, we will use the information of all the attributes.

Exercise: Apply the Principal Component Analysis (PCA) dimensionality reduction method to reduce to 2 dimensions (the whole dataset with all the features). Generate a plot (in 2D) with the PCA result using different colours for each of the response classes (wine_class), with the objective of visualising whether it is possible to separate the classes efficiently with this method. NOTE: Take care not to include the objective variable "wine class" in the dimensionality reduction. We want to be able to explain the objective variable as a function of the rest of the variables reduced to two dimensions.
Hint: It is not necessary to program the algorithm, you can use the implementation available in the "scikit-learn" library.
In [20]:
from sklearn.preprocessing import StandardScaler

features = data.feature_names
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['classes']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])
principalDf
Out[20]:
     principal component 1  principal component 2
0                 3.316751              -1.443463
1                 2.209465               0.333393
2                 2.516740              -1.031151
3                 3.757066              -2.756372
4                 1.008908              -0.869831
..                     ...                    ...
173              -3.370524              -2.216289
174              -2.601956              -1.757229
175              -2.677839              -2.760899
176              -2.387017              -2.297347
177              -3.208758              -2.768920

178 rows × 2 columns

In [21]:
finalDf = pd.concat([principalDf, df[['classes']]], axis = 1)
finalDf
Out[21]:
     principal component 1  principal component 2  classes
0                 3.316751              -1.443463  class_0
1                 2.209465               0.333393  class_0
2                 2.516740              -1.031151  class_0
3                 3.757066              -2.756372  class_0
4                 1.008908              -0.869831  class_0
..                     ...                    ...      ...
173              -3.370524              -2.216289  class_2
174              -2.601956              -1.757229  class_2
175              -2.677839              -2.760899  class_2
176              -2.387017              -2.297347  class_2
177              -3.208758              -2.768920  class_2

178 rows × 3 columns

In [22]:
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['class_0', 'class_1', 'class_2']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['classes'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()
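
A 2D PCA plot is easier to interpret when we know how much of the total variance the two components retain. The sketch below computes this with the explained_variance_ratio_ attribute of scikit-learn's PCA, on standardized features as in the cell above; X_scaled, ratios and retained are illustrative names.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Standardize first so that no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_
retained = ratios.sum()

print("variance explained per component:", ratios.round(3))
print("variance retained in 2D:", round(float(retained), 3))
```

If the retained fraction is low, a clean-looking 2D plot may still hide structure that lives in the discarded components.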
Exercise: Repeat the dimensionality reduction, but in this case using TSNE. You can find more information about this algorithm at the link: https://distill.pub/2016/misread-tsne/ As before, generate a plot (in 2D) with the TSNE result using different colours for each of the response classes (wine_class), with the objective of visualising whether it is possible to separate the classes efficiently with this method.
Hint: It is not necessary to program the algorithm, you can use the implementation available in the "scikit-learn" library.
Hint: Apart from specifying the number of components, try using the "perplexity" parameter.
In [23]:
features = data.feature_names
# Separating out the features (note: not standardized here, unlike for PCA)
x = df.loc[:, features].values
y = data.target

tsne_results = TSNE(n_components=2).fit_transform(x)

# One scatter series per class so that each gets its own colour and legend entry
colors = ['r', 'g', 'b']
for i, c in zip([0,1,2], colors):
    plt.scatter(tsne_results[y == i, 0], tsne_results[y == i, 1], c=c, label=i)
plt.legend()
plt.show()

# The same embedding drawn with seaborn; x and y must be passed as keyword arguments
sns.scatterplot(x=tsne_results[:,0], y=tsne_results[:,1], hue=y, legend='full')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x127690780>
Analysis:
Observing the two plots, do you think the dimensionality reduction has worked well? Has it managed to separate the classes correctly? Which of the two methods has worked better? Why do we obtain such different results?

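Yes, I think it has worked well in the first case. With PCA the three classes of wine form clearly distinguishable groups, whereas in the TSNE plot they are more mixed and harder to tell apart.

The results differ mainly because the two methods do different things and, in this notebook, received different inputs. PCA is a linear projection onto the directions of greatest variance, and it was applied to standardized features. TSNE is a nonlinear method that tries to preserve local neighbourhoods; here it was applied to the raw, unscaled features with the default perplexity, so the distances it works with are dominated by the variables with the largest ranges, such as proline. Standardizing the data and trying several perplexity values would give a fairer comparison.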
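
The TSNE cell above passes the raw features to the algorithm, while the PCA cell standardizes them first. A quick way to see how much this matters is to run TSNE on both versions and compare how well the known classes separate in each embedding. This is only a sketch: perplexity=30 and random_state=0 are illustrative choices, the silhouette score is one of several possible separation measures, and the names X_raw, X_scaled and results are not from the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_raw = wine.data
X_scaled = StandardScaler().fit_transform(X_raw)

results = {}
for label, X in (("raw", X_raw), ("scaled", X_scaled)):
    # Fixing random_state makes the embedding reproducible between runs
    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    embedding = tsne.fit_transform(X)
    # Silhouette of the true classes in the 2D embedding (higher means better separated)
    score = silhouette_score(embedding, wine.target)
    results[label] = (embedding, score)

for label, (embedding, score) in results.items():
    print(label, embedding.shape, round(float(score), 3))
```

The same loop can be extended over several perplexity values, as the exercise hint suggests, to see how sensitive the picture is to that parameter.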

4. Training and test

In this last exercise it is a matter of applying a supervised learning method, specifically the Random Forest classifier, to predict the class to which each wine belongs and evaluate the accuracy obtained with the model. For that we will use:

- The original dataset with all the attributes
- The dataset reduced to only 2 attributes with PCA
- The dataset reduced to only 2 attributes with TSNE
Exercise: Using the original dataset:
- Divide the dataset into train and test.
- Define a Random Forest model (setting n_estimators=10 to keep the model simple).
- Apply cross validation with the defined model and the train dataset (with cv=5 it is enough).
- Calculate the mean and the standard deviation of the cross validation.
Hint: To separate between train and test you can use train_test_split from sklearn.
Hint: To train a random forest model you can use 'RandomForestClassifier' from sklearn.
Hint: To apply cross validation you can use 'cross_val_score' from sklearn.
Exercise: Repeat the same procedure as in the previous exercise with the dataset reduced to 2 dimensions with PCA.
Exercise: Repeat the same procedure as in the previous exercise with the dataset reduced to 2 dimensions with TSNE.
In [26]:
# Compare the three feature sets with the same classifier
from sklearn import preprocessing

features = data.feature_names

# Recompute the TSNE embedding from the original (unscaled) features
x = df.loc[:, features].values
tsne_results = TSNE(n_components=2).fit_transform(x)

train_data = []
train_data.append(('Complete', df.loc[:, features].values))
train_data.append(('2 dimensions with PCA', principalDf))
train_data.append(('2 dimensions with TSNE', tsne_results))

# Replace the class names with integer codes and one-hot encode them,
# which is the label format used with the 'roc_auc' scoring below
df['classes'] = data.target
y = df['classes']
y = preprocessing.label_binarize(y, classes=[0,1, 2])

# Evaluate each feature set in turn
names = []
for name, x in train_data:
    print('*******')
    print(name)
    # Hold out 25% of the rows as the test set
    train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)
    print('Training Features Shape:', train_features.shape)
    print('Training Labels Shape:', train_labels.shape)
    print('Testing Features Shape:', test_features.shape)
    print('Testing Labels Shape:', test_labels.shape)
    rf = RandomForestClassifier(n_estimators = 10)
    # 5-fold cross validation on the training split only
    cv_results = cross_val_score(rf, train_features, train_labels, cv=5, scoring='roc_auc')
    names.append(name)
    msg = "%s:  %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
*******
Complete
Training Features Shape: (133, 13)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 13)
Testing Labels Shape: (45, 3)
Complete:  0.992782 (0.007549)
*******
2 dimensions with PCA
Training Features Shape: (133, 2)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 2)
Testing Labels Shape: (45, 3)
2 dimensions with PCA:  0.992468 (0.005299)
*******
2 dimensions with TSNE
Training Features Shape: (133, 2)
Training Labels Shape: (133, 3)
Testing Features Shape: (45, 2)
Testing Labels Shape: (45, 3)
2 dimensions with TSNE:  0.842401 (0.031264)
Analysis: With which data has it worked better? Does it make sense? Does it fit with the results that we have seen in exercise 3?
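Measures  Complete  PCA       TSNE
Mean      0.992468  0.992782  0.809156
Std       0.014336  0.011073  0.036769

The values in this table differ slightly from the printed output above, presumably because they come from a separate run: neither the random forest nor TSNE has a fixed random_state, so the scores change between executions. In this table PCA has the highest mean ROC AUC, although the complete dataset also has a very good result; the gap between the two is far smaller than the standard deviation, so in practice they are tied, while TSNE is clearly worse. This fits with the results of the previous exercise, where PCA separated the classes well and TSNE did not, and therefore it makes sense.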
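
One caveat about the comparison above is that the PCA projection was fitted on the whole dataset before the train/test split, so the cross-validated folds have already "seen" information from the held-out rows. A common way to avoid this is to put the scaler and the PCA inside a scikit-learn pipeline, so that they are refitted on each training fold. The sketch below illustrates the idea; it keeps the labels as integers and scores with accuracy, which is a simplification compared with the one-hot labels and ROC AUC used in the notebook, and the stratify argument and random_state values are illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Stratified split so that each class keeps its proportion in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scaling and PCA are refitted inside every cross-validation fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=10, random_state=42),
)

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print("mean accuracy:", round(float(scores.mean()), 3))
print("std:", round(float(scores.std()), 3))
```

TSNE cannot be used this way because scikit-learn's implementation has no transform method for unseen rows, which is one more reason to treat it as a visualisation tool rather than a preprocessing step.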

Exercise: With the best model that you have obtained: - Generate predictions on the test dataset. - Calculate the accuracy of the obtained predictions and the associated confusion matrix.
Hint: To calculate the accuracy and the confusion matrix you can use the functions inside the sklearn "metrics" module.
In [27]:
# Use the PCA-reduced features; y is already one-hot encoded from the previous cell
x = principalDf
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)

# Instantiate the model with 10 decision trees
rf = RandomForestClassifier(n_estimators = 10)
# Train the model on the training data
rf.fit(train_features, train_labels)

# Use the forest's predict method on the test data
predictions = rf.predict(test_features)

# Labels are one-hot encoded, so recover the class index with argmax
# (a prediction row with no positive entry falls back to class 0)
true_classes = test_labels.argmax(axis=1)
predicted_classes = predictions.argmax(axis=1)

print('Accuracy:', accuracy_score(true_classes, predicted_classes))
CM = confusion_matrix(true_classes, predicted_classes)

# Visualize the confusion matrix as a heatmap
sns.heatmap(CM, annot=True, fmt="d")
plt.show()
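
Accuracy and the confusion matrix are closely related: the correct predictions sit on the diagonal of the matrix, so accuracy is the diagonal sum divided by the total. The toy example below uses made-up labels (y_true and y_pred are hypothetical, not results from the notebook) to show the relationship.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for seven samples and three classes
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Accuracy = correctly classified samples (diagonal) / all samples
acc_from_cm = np.trace(cm) / cm.sum()
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc_from_cm, acc)
```

The off-diagonal cells show which classes are confused with which, something a single accuracy figure cannot reveal.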
Exercise: The random forest model depends on many parameters. In this exercise we have only specified the number of estimators (n_estimators) and we have let it use the rest of the parameters by default. Two very useful parameters in random forest (and in any model that uses trees) are max_depth and min_samples_split. These parameters help to control overfitting. Train the previous models using different combinations of the parameters:
- n_estimators
- max_depth
- min_samples_split
Have you managed to improve the model? How has each parameter helped to improve it, that is, what is the purpose of each one of them?
  • n_estimators: Number of trees in the forest. More trees give more stable predictions at a higher computational cost.
  • max_depth: Maximum depth of each tree. Shallower trees are simpler and less prone to overfitting.
  • min_samples_split: Minimum number of samples a node must contain before it can be split. Higher values produce smaller, more regularised trees.
In [41]:
test = [10,50,100,200]
conf = ['n_estimators','max_depth','min_samples_split']

# PCA-reduced features; y is already one-hot encoded from the earlier cells
x = principalDf
train_features, test_features, train_labels, test_labels = train_test_split(x, y, test_size = 0.25, random_state = 42)

# Vary one parameter at a time, leaving the others at their default values
for param in conf:
    for num in test:
        print('*******')
        rf = RandomForestClassifier(**{param: num})
        cv_results = cross_val_score(rf, train_features, train_labels, cv=5, scoring='roc_auc')
        # Label each result with the parameter that was actually varied
        print("%s = %d:  %f (%f)" % (param, num, cv_results.mean(), cv_results.std()))
*******
n_estimators = 10:  0.989470 (0.012881)
*******
n_estimators = 50:  0.987855 (0.014393)
*******
n_estimators = 100:  0.990644 (0.009293)
*******
n_estimators = 200:  0.991977 (0.007641)
*******
max_depth = 10:  0.992803 (0.005915)
*******
max_depth = 50:  0.993425 (0.005470)
*******
max_depth = 100:  0.994343 (0.004828)
*******
max_depth = 200:  0.992940 (0.007179)
*******
min_samples_split = 10:  0.989566 (0.006284)
*******
min_samples_split = 50:  0.978533 (0.010368)
*******
min_samples_split = 100:  0.500000 (0.000000)
*******
min_samples_split = 200:  0.500000 (0.000000)

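On the one hand, increasing n_estimators or max_depth changes the mean ROC AUC only in the third decimal place (roughly 0.988 to 0.994), which is within the standard deviation across folds, so any improvement is marginal and comes at a greater computational cost. On the other hand, increasing min_samples_split reduces the effectiveness of the algorithm while making it faster to run. With values of 100 or more the score falls to 0.5, which is chance level, most likely because each training fold holds only about 106 samples and no node reaches the threshold needed to be split.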
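
The loops above vary one parameter at a time, while the exercise asks for combinations. A grid search is one common way to try them jointly. The sketch below uses scikit-learn's GridSearchCV on the complete dataset with integer labels and accuracy scoring, which differs from the one-hot labels and ROC AUC used in the notebook; the grid values and random_state are illustrative and deliberately small to keep the run short.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Every combination of these values is evaluated with 5-fold cross validation
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
    "min_samples_split": [2, 10, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(float(search.best_score_), 3))

# The best model is refitted on the whole training split and scored on the test split
test_accuracy = search.score(X_test, y_test)
print("test accuracy:", round(float(test_accuracy), 3))
```

On a dataset this small, differences between neighbouring parameter values are often within the noise of the cross validation, so the selected combination should be read as one reasonable setting rather than a clear winner.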