[ST1511] AI & Machine Learning
Hopefully this page can help to sort out the information in my brain... GOD Machine Learning is already difficult enough and I can't seem to grasp the idea of implementing it.
Resources:
Topic 05: Model Improvement
Learning Outcomes
Apply Cross-Validation
Explain leave-one-out cross-validation
Explain k-fold cross-validation
Apply Hyper-parameter tuning
Apply grid search cv
Apply random search cv
Apply Regularization
Curse of Dimensionality (CoD)
Explain L1 regularization
Explain L2 regularization
Data Scaling (Feature Scaling)
Apply Ensemble Learning
Explain bagging and random forest
Explain boosting and gradient boosted trees
Explain stacking
Cross-Validation
Training/Testing data split
What if the classifier has "non-tunable" parameters?
A parameter is "non-tunable" if tuning (or training) it on the training data leads to overfitting.
Leave-One-Out Cross-Validation (LOOCV)
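Leave-one-out CV is the extreme case of k-fold cross-validation where k equals the number of observations n: the model is fit on n − 1 samples and tested on the single sample that was left out, repeated n times. A minimal sketch with scikit-learn's LeaveOneOut (the Iris data and logistic regression estimator are just placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per observation: train on n-1 samples, test on the one left out
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

# Each fold's score is 0 or 1, so the mean is the LOOCV accuracy
print("LOOCV accuracy:", scores.mean())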
K-Fold Cross Validation
Since data are often scarce, there might not be enough to set aside for a validation sample.
To work around this issue, k-fold CV works as follows:
Split the sample into k subsets of equal size
For each fold, estimate a model on all the subsets except one
Use the left out subset to test the model, by calculating a CV metric of choice
Average the CV metric across the subsets to get the CV error.
This has the advantage of using all the data for estimating the model; however, finding a good value for k can be tricky.
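A minimal k-fold CV sketch with scikit-learn (k = 5 here; the Iris data and decision tree estimator are just placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Split the sample into k equally sized folds; shuffle to avoid ordering effects
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# For each fold, fit on the other k-1 folds and score on the held-out fold
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=kfold)

# Average the per-fold metric to get the CV score
print("Fold accuracies:", scores)
print("Mean CV accuracy:", scores.mean())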
Hyperparameter Tuning
In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes (for example, C for an SVM or max_depth for a decision tree).
Using sklearn GridSearchCV
GridSearchCV exhaustively considers all parameter combinations (aka Brute forcing)
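A minimal GridSearchCV sketch (the SVC estimator and the small parameter grid are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of C and kernel is tried (2 x 2 = 4 candidates), each scored with 5-fold CV
param_grid = {'C': [1, 10], 'kernel': ['linear', 'rbf']}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)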
Using sklearn RandomizedSearchCV
RandomizedSearchCV can sample a given number of candidates from a parameter space with a specified distribution.
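A minimal RandomizedSearchCV sketch, drawing a fixed number of candidates from distributions (the random forest estimator, the distributions and the n_iter value are just illustrative):

from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Instead of an exhaustive grid, sample 10 random candidates from these distributions
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(2, 10),
    'max_features': uniform(0.1, 0.9),
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions, n_iter=10, cv=5, random_state=42)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best CV score:", search.best_score_)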
Tuning the hyperparameters
A search consists of:
an estimator
a parameter space
a method for searching or sampling candidates
a cross-validation scheme
a score function
Regularization
Curse of Dimensionality (CoD)
The dimension of a dataset corresponds to the number of attributes/features that exist in a dataset. A dataset with a large number of attributes, generally of the order of a hundred or more, is referred to as high dimensional data.
If features > observations:
We run the risk of massively overfitting the model - this would result in terrible testing performance.
When we have too many features, observations become harder to cluster
Too many dimensions can cause every observation in the dataset to appear equidistant from all the others.
Reducing Overfitting
Overfitting happens when the model learns the noise in the training data along with the signal, and as a result performs poorly on new data it wasn't trained on.
There are a few ways to avoid overfitting the model on the training data, such as cross-validation, reducing the number of features, pruning, and regularization.
Regularization and CoD
Regularization is one way to avoid overfitting
In models where regularization is not applicable, such as decision trees and KNNs, we can use feature selection and dimensionality reduction techniques to help us avoid the Curse of Dimensionality.
Essentially, regularization penalizes the classifier for paying attention to features that don't help much toward answering the question at hand, while favoring features that provide relevant information. This filters out the irrelevant and sparse sections of the data, leaving the classifier to only pay attention to the robust features of the dataset.
Regularization adds a penalty that grows with model complexity.
The regularization parameter (lambda) penalizes all the coefficients except the intercept, so that the model generalizes to unseen data rather than overfitting.
L1 Regularization (Lasso Regression)
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds "absolute value of magnitude" of coefficient as penalty term to the loss function.
The key difference from L2 (Ridge) regularization, which adds the squared magnitude of the coefficients as the penalty, is that Lasso shrinks the less important features' coefficients all the way to zero, removing some features altogether. This makes it work well for feature selection when we have a huge number of features.
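A minimal sketch contrasting Lasso (L1) with Ridge (L2) in scikit-learn; notice that Lasso drives some coefficients exactly to zero while Ridge only shrinks them (the diabetes dataset and alpha values are just illustrative):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # regularization is sensitive to feature scale

# L1 penalty: some coefficients become exactly zero (implicit feature selection)
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_)

# L2 penalty: coefficients are shrunk towards zero but rarely become exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", ridge.coef_)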
Topic 06: Unsupervised Machine Learning
Learning Outcomes
Understand Unsupervised Learning
Discuss the different types of unsupervised learning
Explain the challenges of unsupervised learning
Explain pre-processing and scaling
Understand Clustering
Explain dimensionality reduction
Anomaly Detection
What is Unsupervised Learning?
In unsupervised learning, data points have no labels associated with them. Instead, the goal of an unsupervised learning algorithm is to organize the data in some way or to describe its structure.
This can mean grouping it into clusters or finding different ways of looking at complex data so that it appears simpler or more organized. (Dimension Reduction)
Challenges of Unsupervised Learning
Unsupervised learning is much more subjective than supervised learning, as there is no simple goal for the analysis, such as the prediction of response.
It is more difficult than supervised learning because there is no gold standard (like an expected outcome or target variable) and no single objective (such as test set accuracy).
The data points are unlabelled, and initial settings for the algorithms, such as the number of centroids k in k-means, are unknown.
Scaling of variables/features matters
Pre-processing & Feature Scaling
Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms.
The result of standardization (or Z-score normalization) is that the features will be rescaled so that they have the properties of a standard normal distribution with μ = 0 and σ = 1,
where μ is the mean (average) and σ is the standard deviation from the mean; the standard score (also called the z-score) of a sample x is calculated as follows: z = (x − μ) / σ
Some examples of algorithms where feature scaling matters are:
K-nearest neighbors with a Euclidean distance measure if you want all features to contribute equally.
K-means (see K-nearest neighbors)
Logistic Regression, SVMs, Perceptrons, Neural Networks, etc., if you are using gradient descent/ascent-based optimization; otherwise some weights will update much faster than others.
Linear Discriminant Analysis, Principal Component Analysis, Kernel Principal Component Analysis since you want to find directions of maximizing the variance (under the constraints that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you'd emphasize variables on "larger measurement scales" more.
In addition, we'd also want to think about whether we want to "standardize" or "normalize" our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layered perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights "equally". As a rule of thumb: when in doubt, just standardize the data, it shouldn't hurt.
About Min-Max Scaling
An alternative approach to Z-score normalization (or standardization) is Min-Max scaling (often simply called "normalization", a common cause of ambiguity), in which the data is scaled to a fixed range, usually 0 to 1.
The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.
A Min-Max scaling is typically done via the following equation: X_norm = (X − X_min) / (X_max − X_min)
Z-Score Standardization or Min-Max Scaling?
There is no obvious answer to this question as it really depends on the application.
For example, in clustering analysis, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is Principal Component Analysis, where we usually prefer standardization over Min-Max scaling since we are interested in the components that maximize the variance.
However, that doesn't mean that Min-Max scaling is not useful at all. A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range. Also, typical neural network algorithms require data that are on a 0-1 scale.
Standardization & Scaling in Scikit-Learn
from sklearn import preprocessing

# df is assumed to be a pandas DataFrame containing the Wine features 'Alcohol' and 'Malic Acid'
# Z-score standardization: rescale each feature to mean 0 and standard deviation 1
std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic Acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic Acid']])

# Min-Max scaling: rescale each feature to the range [0, 1]
minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic Acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic Acid']])
What is Clustering?
Over the years, many clustering algorithms have been developed. Almost all clustering algorithms use the features of individual items to find similar items.
For example, you might apply clustering to find similar people by demographics. You might use clustering with text analysis to group sentences with similar topics or sentiments.
Clustering is called an unsupervised learning technique because it can be used on unlabeled data. Indeed, clustering is a useful first step for discovering new patterns, and it requires little prior knowledge about how the data might be structured or how items are related.
Clustering is often used for exploration of data prior to analysis with other more predictive algorithms.
Comparison of different Clustering algorithms
K-Means Clustering
When you configure a clustering model using the k-means method, you must specify a target number k indicating the number of centroids you want in the model.
The K-means algorithm assigns each incoming data point to one of the clusters by minimizing the within-cluster sum of squares.

K-Means Algorithm
Given an initial set of k means m1(1),...,mk(1), the algorithm proceeds by alternating between two steps:
Assignment step: assign each observation to the cluster whose mean has the least squared Euclidean distance; this is intuitively the "nearest" mean.
Update step: recalculate each cluster's mean as the centroid of the observations assigned to it.
Demonstration of the Standard K-means algorithm.
k initial "means" (in this case k=3) are randomly generated within the data domain (shown in color).
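A minimal k-means sketch with scikit-learn (k = 3 to match the demonstration above; the blob data is just a synthetic example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the target number of centroids k; fit() alternates the
# assignment and update steps until the centroids stop moving
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster centroids:\n", kmeans.cluster_centers_)
print("Within-cluster sum of squares (inertia):", kmeans.inertia_)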

Gaussian Mixture Model(GMM) Clustering
GMM clustering aims to partition n observations into k clusters. K-means makes hard assignments: each sample is associated with one and only one cluster.
GMM makes soft assignments instead: each sample has a probability of being associated with each cluster.
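A minimal GMM sketch using scikit-learn's GaussianMixture; predict_proba exposes the soft assignments (the synthetic blob data is just for illustration):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit a mixture of 3 Gaussian components
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)

# Hard labels (most likely component) vs. soft assignments (a probability per component)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)
print("First sample's cluster probabilities:", probs[0])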

Hierarchical Clustering
It does not need the number of clusters as an input, and the partitions can be viewed at different levels of granularity: the cluster tree can be cut at different levels to refine or coarsen the clusters. A sketch follows below.
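A minimal sketch with scipy's hierarchical clustering; the same linkage tree can be cut into different numbers of clusters to refine or coarsen the partition (the synthetic data and Ward linkage are just illustrative choices):

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Build the full cluster tree (dendrogram) bottom-up with Ward linkage
Z = linkage(X, method='ward')

# Cut the same tree at different levels for coarser or finer partitions
coarse = fcluster(Z, t=2, criterion='maxclust')  # 2 clusters
fine = fcluster(Z, t=5, criterion='maxclust')    # 5 clusters
print("Coarse partition labels:", coarse[:10])
print("Fine partition labels:", fine[:10])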
Dimension Reduction
We generally do not want to feed a large number of features directly into a machine learning algorithm since some features may be irrelevant or the "intrinsic" dimensionality may be smaller than the number of features.
Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA) all can be used to perform dimension reduction.
Principal Component Analysis (PCA)
PCA is an unsupervised dimensionality-reduction method that maps the original data space into a lower-dimensional space while preserving as much information as possible.
PCA basically finds a subspace that most preserves the data variance, with the subspace defined by the dominant eigenvectors of the data's covariance matrix.
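A minimal PCA sketch with scikit-learn, projecting the (standardized) Iris data onto its two dominant principal components (keeping 2 components is just for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep the 2 directions (dominant eigenvectors of the covariance matrix) with the largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_2d.shape)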
Singular Value Decomposition (SVD)
The SVD is related to PCA in the sense that SVD of the centered data matrix (features versus samples) provides the dominant left singular vectors that define the same subspace as found by PCA.
However, SVD is a more versatile technique as it can also do things that PCA may not do. For example, the SVD of a user-versus-movie matrix is able to extract the user profiles which can be used in a recommendation system. In addition, SVD is also widely used as a topic modeling tool, known as latent semantic analysis, in natural language processing (NLP).
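A minimal latent semantic analysis sketch using truncated SVD on a small term-document matrix (the toy corpus and the choice of 2 "topics" are just illustrative):

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the market",
]

# Build a document-term matrix, then keep the 2 dominant singular directions ("topics")
X = TfidfVectorizer().fit_transform(corpus)
svd = TruncatedSVD(n_components=2, random_state=42)
doc_topics = svd.fit_transform(X)

print("Document-topic representation:\n", doc_topics)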
Linear Discriminant Analysis (LDA)
The model fits a Gaussian density to each class, assuming that all classes share the same covariance matrix. The fitted model can also be used to reduce the dimensionality of the input by projecting it to the most discriminative directions.
Note: Unlike PCA, LDA is not an unsupervised learning algorithm and requires labelled data.
LDA is based upon the concept of searching for a linear combination of variables (predictors) that best separates two classes (targets).
To capture the notion of separability, Fisher defined a score function of the form S(β) = (βᵀμ₁ − βᵀμ₂)² / (βᵀCβ), where μ₁ and μ₂ are the two class means, C is the pooled within-class covariance matrix, and β is the vector of linear coefficients: the separation between the projected class means divided by the within-class variability of the projection.
Given the score function, the problem is to estimate the linear coefficients that maximize the score.
Comparison of LDA vs PCA 2D projection of Iris dataset
The Iris dataset represents 3 kinds of iris flowers (setosa, versicolour and virginica) with 4 attributes: sepal length, sepal width, petal length and petal width.
Principal Component Analysis(PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.
Linear Discriminant Analysis(LDA) tries to identify attributes that account for the most variance between the classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
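A minimal sketch of this PCA vs LDA comparison with scikit-learn (note that LDA uses the class labels y, while PCA does not):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, finds directions of maximum overall variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, finds directions that best separate the 3 classes (at most n_classes - 1 = 2)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("PCA projection shape:", X_pca.shape)
print("LDA projection shape:", X_lda.shape)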
LDA vs PCA


Anomaly Detection
Anomaly detection covers tasks such as:
Identifying transactions that are potentially fraudulent
Learning patterns that indicate that a network intrusion has occurred
Finding abnormal clusters of patients
Checking values entered into a system
Because anomalies are rare events by definition, it can be difficult to collect a representative sample of data to use for modeling. The algorithms included in this category have been specially designed to address the core challenges of building and training models by using imbalanced data sets.
One-Class SVM
It infers the properties of normal cases and from these properties can predict which examples are unlike the normal examples.
This is useful for anomaly detection because the scarcity of training examples is what defines anomalies: that is, typically there are very few examples of network intrusion, fraud, or other anomalous behaviors.
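A minimal one-class SVM sketch with scikit-learn: the model is fit on (mostly) normal data only and then flags unusual points (the synthetic data and the nu value are just illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)

# Train only on "normal" observations clustered around the origin
X_train = 0.3 * rng.randn(200, 2)

# New data: mostly normal points plus a few obvious outliers
X_test = np.vstack([0.3 * rng.randn(20, 2), rng.uniform(low=-4, high=4, size=(5, 2))])

# nu (roughly) bounds the fraction of training points treated as outliers
clf = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(X_train)
pred = clf.predict(X_test)  # +1 = normal, -1 = anomaly

print("Predictions (+1 normal, -1 anomaly):", pred)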

PCA-based Anomaly detection
The PCA-based Anomaly Detection module solves the problem by analyzing available features to determine what constitutes a "normal" class and applying distance metrics to identify cases that represent anomalies.
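One common way to implement this idea with plain scikit-learn (a sketch, not the module itself) is to project the data onto its principal components and flag points with a large reconstruction error as anomalies:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)

# Mostly "normal" 5-dimensional data plus a few injected anomalies
X = np.vstack([rng.randn(200, 5), rng.randn(5, 5) * 6])

# Model the "normal" subspace with the top 2 principal components
pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# Points that the low-dimensional model reconstructs poorly are likely anomalies
errors = np.linalg.norm(X - X_reconstructed, axis=1)
threshold = np.percentile(errors, 97)
print("Flagged as anomalies (row indices):", np.where(errors > threshold)[0])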
