[ST1511] AI & Machine Learning

Hopefully this page can help to sort out the information in my brain... GOD Machine Learning is already difficult enough and I can't seem to grasp the idea of implementing it.

Topic 05: Model Improvement

Learning Outcomes

  • Apply Cross-Validation

    • Explain leave-one-out cross-validation

    • Explain k-fold cross-validation

  • Apply Hyper-parameter tuning

    • Apply grid search cv

    • Apply random search cv

  • Apply Regularization

    • Curse of Dimensionality (CoD)

    • Explain L1 regularization

    • Explain L2 regularization

  • Data Scaling (Feature Scaling)

  • Apply Ensemble Learning

    • Explain bagging and random forest

    • Explain boosting and gradient boosted trees

    • Explain stacking

Cross-Validation

Training/Testing data split

Training data is used to fit parameters

Testing data is used to assess how the classifier generalizes to new data

What if the classifier has "non-tunable" parameters?

A parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting.

Leave-One-Out Cross-Validation (LOOCV)

LOOCV is the case of cross-validation where just a single observation is held out for validation.

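As a rough sketch of what LOOCV looks like in scikit-learn (the Iris data and LogisticRegression estimator here are illustrative assumptions, not part of the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per observation: each model is trained on n-1 samples
# and validated on the single held-out sample.
loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)
print(scores.mean())   # average accuracy over all n folds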

K-Fold Cross Validation

Since data are often scarce, there might not be enough to set aside for a validation sample.

To work around this issue, k-fold CV works as follows:

  1. Split the sample into k subsets of equal size

  2. For each fold, estimate a model on all the subsets except one

  3. Use the left out subset to test the model, by calculating a CV metric of choice

  4. Average the CV metric across the subsets to get the CV error.

This has the advantage of using all the data for estimating the model; however, finding a good value for k can be tricky.
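A minimal k-fold sketch in scikit-learn, again using an illustrative dataset and estimator (assumptions, not from the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds: each fold is held out once while the model is fit on the other 4
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores)          # one CV metric (accuracy by default) per fold
print(scores.mean())   # the CV error/score is the average across folds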

Hyperparameter Tuning

Hyper-parameters are parameters that are not directly learned within estimators

In scikit-learn, they are passed as arguments to the constructor of the estimator classes.

Using sklearn GridSearchCV

  • GridSearchCV exhaustively considers all parameter combinations (aka Brute forcing)

Using sklearn RandomizedSearchCV

  • RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.

Tuning the hyperparameters

A search consists of:

  • an estimator

  • a parameter space

  • a method for searching or sampling candidates

  • a cross-validation scheme

  • a score function
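Putting those five pieces together, a hedged sketch using GridSearchCV and RandomizedSearchCV (the SVC estimator, parameter grid, and scoring choice are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# estimator + parameter space + search method + CV scheme + score function
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')   # exhaustive
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# RandomizedSearchCV samples a fixed number of candidates instead
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=8, cv=5, random_state=42)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)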

Regularization

Curse of Dimensionality (CoD)

Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional data.

The dimension of a dataset corresponds to the number of attributes/features that exist in a dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high dimensional data.

  1. If features > observations:

    • We run the risk of massively overfitting the model - this would result in terrible testing performance.

  2. When we have too many features, observations become harder to cluster

    • Too many dimensions can cause every observation in the dataset to appear equidistant from all the others.

Reducing Overfitting

Overfitting happens when the model learns the noise as well as the signal in the training data, and therefore does not perform well on new data on which it was not trained.

There are a few ways to avoid overfitting the model on training data, such as cross-validation, reducing the number of features, pruning, and regularization.

Regularization and CoD

Regularization is one way to avoid overfitting

In models where regularization is not applicable, such as decision trees and KNNs, we can use feature selection and dimensionality reduction techniques to help us avoid the Curse of Dimensionality.

Essentially, regularization penalizes the classifier for paying attention to features that don't help much toward answering the question at hand, while favoring features that provide relevant information. This filters out the irrelevant and sparse sections of the data, leaving the classifier to only pay attention to the robust features of the dataset.

Regularization basically adds a penalty that grows as model complexity increases.

J(\theta) = \frac{1}{2m}\left[\sum_{i = 1}^{m} (h_\theta (x^i) - y^i)^2 + \lambda \sum_{j = 1}^{n} \theta^2_j\right]

The regularization parameter (lambda) penalizes all the parameters except the intercept, so that the model generalizes to the data and does not overfit.

L1 Regularization (Lasso Regression)

A regression model that uses L1 regularization techniques is called Lasso Regression.

Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds "absolute value of magnitude" of coefficient as penalty term to the loss function.

\sum^{n}_{i=1} \Big(Y_i - \sum^{p}_{j=1} X_{ij}\beta_j\Big)^2 + \lambda\sum^{p}_{j=1}\lvert\beta_j\rvert

The key difference between L1 and L2 regularization is that Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. This works well for feature selection when we have a huge number of features.
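A small sketch of the L1 vs L2 penalty in scikit-learn (the diabetes dataset and the alpha value are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

# L1 (Lasso): some coefficients are shrunk exactly to zero -> feature selection
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)

# L2 (Ridge): coefficients are shrunk towards zero but rarely become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)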

Topic 06: Unsupervised Machine Learning

Learning Outcomes

  • Understand Unsupervised Learning

    • Discuss the different types of unsupervised learning

    • Explain the challenges of unsupervised learning

    • Explain pre-processing and scaling

    • Understand Clustering

    • Explain dimensionality reduction

    • Anomaly Detection

What is Unsupervised Learning?

In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure.

This can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized. (Dimension Reduction)

Challenges of Unsupervised Learning

Unsupervised learning is much more subjective than supervised learning, as there is no simple goal for the analysis, such as the prediction of response.

It is more difficult than supervised learning because there is no gold standard (like an expected outcome/target variable) and no single objective (such as test set accuracy).

The data points are often unlabelled. Initial values for algorithms, such as the number of centroids k in k-means, are unknown.

Scaling of variables/features matters

Pre-processing & Feature Scaling

Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms.

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they'll have the properties of a standard normal distribution with

\mu = 0 \ \text{and} \ \sigma = 1

where μ is the mean (average) and σ is the standard deviation from the mean; standard scores (also called z-score) of the sample are calculated as follows:

z = \frac{x - \mu}{\sigma}

Some examples of algorithms where feature scaling matters are:

  • K-nearest neighbors with a Euclidean distance measure if you want all features to contribute equally.

  • K-means (see K-nearest neighbors)

  • Logistic Regression, SVMs, Perceptrons, Neural Networks, etc., if you are using gradient descent/ascent-based optimization; otherwise some weights will update much faster than others.

  • Linear Discriminant Analysis, Principal Component Analysis, and Kernel Principal Component Analysis, since you want to find the directions that maximize the variance (under the constraint that those directions/eigenvectors/principal components are orthogonal); you want features on the same scale since otherwise you would emphasize variables on "larger measurement scales" more.

In addition, we'd also want to think about whether we want to "standardize" or "normalize" our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layered perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights "equally". As a rule of thumb: when in doubt, just standardize the data, it shouldn't hurt.

About Min-Max Scaling

An alternative approach to Z-score normalization (or standardization) is Min-Max scaling (often simply called "normalization", which is a common source of ambiguity).

In this approach, the data is scaled to a fixed range - usually 0 to 1.

The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

A Min-Max scaling is typically done via the following equation:

X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}

Z-Score Standardization or Min-Max Scaling?

There is no obvious answer to this question as it really depends on the application.

For example, in clustering analysis, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is Principal Component Analysis, where we usually prefer standardization over Min-Max scaling since we are interested in the components that maximize the variance.

However, that doesn't mean that Min-Max scaling is not useful at all. A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range. Also, typical neural network algorithms require data that are on a 0-1 scale.

Standardization & Scaling in Scikit-Learn

from sklearn import preprocessing

# df is assumed to be a pandas DataFrame with 'Alcohol' and 'Malic Acid' columns
# (e.g. the Wine dataset)

# Z-score standardization: rescale each feature to mean 0 and standard deviation 1
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic Acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic Acid']])

# Min-Max scaling: rescale each feature to the [0, 1] range
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic Acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic Acid']])

What is Clustering?

Clustering, in machine learning, is a method of grouping data points into similar clusters. It is also called Segmentation.

Over the years, many clustering algorithms have been developed. Almost all clustering algorithms use the features of individual items to find similar items.

For example, you might apply clustering to find similar people by demographics. You might use clustering with text analysis to group sentences with similar topics or sentiments.

Clustering is called an unsupervised learning technique because it can be used on unlabeled data. Indeed, clustering is a useful first step for discovering new patterns and requires little prior knowledge about how the data might be structured or how items are related.

Clustering is often used for exploration of data prior to analysis with other more predictive algorithms.

Comparison of different Clustering algorithms

K-Means Clustering

When you configure a clustering model using the k-means method, you must specify a target number k indicating the number of centroids you want in the model.

The centroid is a point that is representative of each cluster.

The K-means algorithm assigns each incoming data point to one of the clusters by minimizing the within-cluster sum of squares.


K-Means Algorithm

Given an initial set of k means m_1^{(1)}, \ldots, m_k^{(1)}, the algorithm proceeds by alternating between two steps:

  • Assignment step: Assign each observation to the cluster whose mean has the least squared Euclidean distance; this is intuitively the "nearest" mean.

  • Update step: Calculate the new means to be the centroids of the observations in the new clusters.

Figure: demonstration of the standard k-means algorithm. k initial "means" (in this case k = 3) are randomly generated within the data domain (shown in color).
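A minimal scikit-learn sketch of the algorithm above (the data is randomly generated, purely to show the API):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
X = rng.rand(300, 2)              # 300 two-dimensional points

# k must be specified up front; fit() alternates the assignment and update steps
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)    # the k centroids
print(kmeans.labels_[:10])        # hard cluster assignment per observation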

Gaussian Mixture Model (GMM) Clustering

GMM clustering aims to partition n observations into k clusters. K-means defines a hard assignment: each sample is associated with one and only one cluster.

GMM, however, defines a soft assignment for each sample.

Each sample has a probability to be associated with each cluster.
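A hedged sketch of the soft assignment using scikit-learn's GaussianMixture (the synthetic two-blob data is an assumption for illustration):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X[:3]))   # per-sample probability of each cluster (soft)
print(gmm.predict(X[:3]))         # most probable cluster, if a hard label is needed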


Hierarchical Clustering

Hierarchical partitions can be visualized using a tree structure (a dendrogram).

It does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity: clusters can be refined or coarsened by cutting the dendrogram at different levels, i.e. using different values of K.
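A small sketch using SciPy's hierarchical clustering, where the same dendrogram can be cut at different granularities (the data and the ward linkage choice are illustrative assumptions):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(1)
X = rng.rand(20, 2)

# Build the full hierarchy; no number of clusters is needed at this point
Z = linkage(X, method='ward')

# Cut the tree at different granularities by asking for different K
print(fcluster(Z, t=2, criterion='maxclust'))   # coarse: 2 clusters
print(fcluster(Z, t=4, criterion='maxclust'))   # finer: 4 clusters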

Dimension Reduction

We generally do not want to feed a large number of features directly into a machine learning algorithm since some features may be irrelevant or the "intrinsic" dimensionality may be smaller than the number of features.

Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA) all can be used to perform dimension reduction.

Principal Component Analysis (PCA)

PCA is an unsupervised dimensionality-reduction method that maps the original data space into a lower-dimensional space while preserving as much information as possible.

PCA basically finds a subspace that most preserves the data variance, with the subspace defined by the dominant eigenvectors of the data's covariance matrix.
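A minimal PCA sketch in scikit-learn (using the Iris data as an assumed example, standardized first as discussed earlier):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so no single feature dominates the variance
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)          # project 4 features onto 2 components
print(pca.explained_variance_ratio_)     # variance preserved by each component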

Singular Value Decomposition (SVD)

The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA.

In linear algebra, the Singular-Value decomposition (SVD) is a factorization of a real or complex matrix.

However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).
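As a hedged sketch of the latent semantic analysis use case, scikit-learn's TruncatedSVD can factorize a term-document matrix (the documents below are made up for illustration):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors sold shares in the market"]

# Term-document matrix -> low-rank factorization (latent semantic analysis)
tfidf = TfidfVectorizer().fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(tfidf)        # each document as a mix of 2 "topics"
print(topics)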

Linear Discriminant Analysis (LDA)

A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes' rule.

The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discriminative directions.

Note: Unlike PCA, LDA is not an unsupervised learning algorithm and requires labelled data.

LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets).

To capture the notion of separability, Fisher defined the following score function:

Z = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d
S(\beta) = \frac{\beta^{T} \mu_1 - \beta^{T} \mu_2}{\beta^{T} C \beta}
S(\beta) = \frac{\bar{Z}_1 - \bar{Z}_2}{\text{Variance of } Z \text{ within groups}}

Given the score function, the problem is to estimate the linear coefficients that maximize the score.

Comparison of LDA vs PCA 2D projection of Iris dataset

The Iris dataset represents 3 kinds of iris flowers (setosa, versicolour, and virginica) with 4 attributes: sepal length, sepal width, petal length, and petal width.

Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.

Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between the classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
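A rough sketch of that comparison with scikit-learn (plotting is omitted; only the projected shapes are shown):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, ignores y, keeps directions of maximum overall variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, uses class labels to find the most discriminative directions
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)   # both project the 4 features down to 2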


Anomaly Detection

Anomaly detection encompasses many important tasks in Machine Learning.

  • Identifying transactions that are potentially fraudulent

  • Learning patterns that indicate that a network intrusion has occurred

  • Finding abnormal clusters of patients

  • Checking values entered into a system

Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been specially designed to address the core challenges of building and training models by using imbalanced data sets.

One-Class SVM

In one-class SVM, the support vector model is trained on data that has only one class, which is the "normal" class.

It infers the properties of normal cases and from these properties can predict which examples are unlike the normal examples.

This is useful for anomaly detection because the scarcity of training examples is what defines anomalies: that is, typically there are very few examples of network intrusion, fraud, or other anomalous behaviors.
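A minimal one-class SVM sketch with scikit-learn (the "normal" training data is synthetic and the nu value is an illustrative assumption):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.normal(0, 1, (200, 2))      # training data: only the "normal" class

# nu roughly bounds the fraction of training points treated as outliers
clf = OneClassSVM(kernel='rbf', nu=0.05, gamma='scale').fit(X_normal)

X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
print(clf.predict(X_new))                  # +1 = looks normal, -1 = anomaly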

One-class SVM Novelty detection

PCA-based Anomaly detection

The PCA-based Anomaly Detection module solves the problem by analyzing available features to determine what constitutes a "normal" class and applying distance metrics to identify cases that represent anomalies.
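The module described here is from Azure Machine Learning, but the underlying idea can be sketched with scikit-learn's PCA by scoring points on their reconstruction error; this is a swapped-in illustration, not the module's exact algorithm:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, (200, 5))        # assumed "normal" training data

# Learn the low-dimensional subspace that captures normal behaviour
pca = PCA(n_components=2).fit(X_train)

def reconstruction_error(X):
    # Points far from the "normal" subspace reconstruct poorly
    X_proj = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - X_proj, axis=1)

X_new = np.vstack([rng.normal(0, 1, (5, 5)), [[8, -8, 8, -8, 8]]])
print(reconstruction_error(X_new))          # the anomalous last point scores far higher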
