Unsupervised methods

This assignment offers a practical introduction to unsupervised learning with Python. Instead of focusing on a single dataset, the notebook moves through several scenarios to show how clustering quality depends on geometry, preprocessing, representation and the final application.

Quick guide

Clustering is useful when there is no target variable and the goal is to discover structure. The workflow followed here is deliberately practical: normalize first, test whether k-means is compatible with the geometry of the data, use dimensionality reduction as a diagnostic tool, and interpret clusters only when they map to a meaningful business or analytical question.

1. Classic clustering: k-means and the elbow rule

The first exercise works with a retail-style customer dataset described by variables such as weekly visits, monthly purchases and average monthly spending. The notebook begins by visualizing and normalizing these features so that no variable dominates the others purely because of scale.
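As a minimal sketch of that scaling step, assuming scikit-learn and a small hypothetical customer matrix (the column values below are illustrative, not taken from the notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: weekly visits, monthly purchases,
# average monthly spending (very different natural scales).
X = np.array([[3, 12, 250.0],
              [1,  4,  80.0],
              [5, 20, 600.0]])

# Zero mean, unit variance per column, so spending no longer
# dominates distances just because its numbers are larger.
X_scaled = StandardScaler().fit_transform(X)
```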

Once the data is prepared, the clustering process is studied with k-means and the elbow rule. The notebook measures the sum of squared errors (SSE) for different values of k and looks for the point where adding more clusters stops reducing the error substantially, which suggests a reasonable number of groups.

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - c_i \rVert^2$$

where $c_i$ is the centroid of cluster $C_i$.
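A short sketch of the elbow procedure, assuming scikit-learn and a random stand-in for the customer matrix (only the mechanics matter here; KMeans exposes this SSE directly as its inertia_ attribute):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for the customer features: visits, purchases, spending.
X = rng.random((200, 3)) * [10, 30, 500]

X_scaled = StandardScaler().fit_transform(X)

sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    sse.append(km.inertia_)  # inertia_ is the SSE defined above

# Plotting sse against k and looking for the "elbow" where the curve
# flattens suggests a reasonable number of clusters.
```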

2. Clustering with shapes and feature engineering

The second part shows why standard k-means is not always enough. Some datasets contain elongated, curved or density-based structures that are not well described by spherical centroids. For that reason, the notebook compares several strategies before drawing conclusions from the clusters.

  • k-means on transformed spaces: testing whether a better representation improves separation.
  • Density-based algorithms: going beyond centroid methods when clusters depend on local density.
  • Hierarchical clustering: studying nested structure and alternative grouping criteria.
  • Feature engineering: creating or reshaping variables so that the clustering objective is closer to the real structure of the data.

This section makes the main methodological point of the assignment: clustering is not just about choosing an algorithm, but also about choosing the right representation of the problem.
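To make this concrete, here is a toy comparison of the three algorithmic strategies, using scikit-learn's make_moons as a stand-in for the curved structures described above (the eps and linkage values are illustrative choices, not the notebook's):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Two interleaved half-moons: curved clusters that spherical
# centroids cannot separate cleanly.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
labels_hc = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# k-means cuts each moon in half, while DBSCAN and single-linkage
# hierarchical clustering follow the curved shapes.
```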

3. Dimensionality reduction as analytical support

The assignment also uses dimensionality-reduction techniques such as t-SNE to create lower-dimensional views of the data. These reduced spaces help assess whether clusters are actually separated and whether preprocessing decisions are helping or hurting the structure we want to recover.

In this notebook, reduction is not presented as an isolated topic. It is used as a support tool for understanding the behavior of the clustering pipeline.
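An illustrative sketch of that diagnostic use, assuming scikit-learn and the digits dataset as a stand-in for the notebook's own features (perplexity=30 is simply the library default, not a tuned value):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Any numeric feature matrix works here; digits is just a convenient
# built-in high-dimensional example.
X, y = load_digits(return_X_y=True)

X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Plotting X_2d colored by cluster labels shows whether the groups
# found in the full space are actually separated in practice.
```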

4. Application: image compression

The last exercise applies unsupervised learning to a concrete task: compressing images. Here clustering is no longer only descriptive. It becomes a practical mechanism to reduce the number of representative colors while preserving as much visual information as possible.
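A minimal sketch of this idea with k-means color quantization, assuming scikit-learn and a 16-color palette (the random image below stands in for a real photo):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in RGB image; in the notebook this would be a real photo
# loaded as an (H, W, 3) uint8 array.
image = (np.random.default_rng(0).random((64, 64, 3)) * 255).astype(np.uint8)

# Treat every pixel as a point in RGB space and learn a 16-color palette.
pixels = image.reshape(-1, 3).astype(float)
km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(pixels)

# Replace each pixel by its nearest palette color: same dimensions,
# far fewer distinct colors to store.
compressed = km.cluster_centers_[km.labels_].astype(np.uint8).reshape(image.shape)
```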

This closes the notebook with a useful real-world example: the same ideas used for grouping customers or abstract data points can also be used to simplify a signal and reduce storage requirements.

Main takeaway

The practical conclusion of this assignment is that there is no universal clustering recipe. The final result depends on the scale of the variables, the geometry of the data, the meaning of the features and the quality of the representation used before training. The full original notebook, with the complete code and figures, remains below in Spanish.