Demystifying AI/ML algorithms – Part II: Supervised algorithms

About the series

This is the second part of my series on demystifying AI/ML algorithms. The series is intended for curious people who missed the buzz around AI/ML until GenAI captured their attention. Some of this content I shared years back, but I feel it is important to revisit it before plunging into GenAI. In the first part of the series, I traced the origin of AI/ML and discussed how Good-Old-Fashioned AI gave AI its real start and still remains relevant (https://ai-positive.com/2024/08/28/understanding-gofai-rules-rule-and-symbolic-reasoning-in-ai/).

Patterns and Meaning

What makes us human is our need to search for meaning. To get clarity from chaos, we try to identify patterns within it. Patterns are observations organized into meaningful categories. Charles Darwin’s theory of evolution and Gregor Mendel’s laws of heredity are outcomes of careful observation of the nature around them. Patterns can be derived from observations of numbers, people’s behaviours, musical scores, and even our thoughts. We need a large number of observations to identify patterns. Gathering data from observations and eliciting patterns from it brings clarity about the real world the data represents and enables predictability. Statistics, considered by many a boring part of mathematics, provides the methods to derive patterns from data.

Machine learning algorithms are rooted in statistics. Their statistical foundations enable them to learn from data, adapt, and generalize:

  • Learn from Data: They identify patterns and relationships in data without needing specific ‘if-then-else’ rules.
  • Adapt and Improve: They can adapt to new data and improve their performance through training and validation.
  • Generalize: They aim to generalize from the training data to make accurate predictions on unseen data inputs.

Where there is learning, there is usually a teacher; there are also self-learners. This gives rise to two subcategories of machine learning algorithms, which I refer to as ‘Seen it before’ and ‘Selfies.’ In the literature they are classified as Supervised and Unsupervised algorithms.

Seen it before Algorithms

This category of supervised algorithms revolves around:

  • Learning to see similarities between situations and thereby inferring other similarities – for example, if two patients have similar symptoms, they may have the same disease.
  • The key is judging how similar two things are, deciding which similarities to take forward, and combining them to make new predictions.

They help solve real-world problems through:

  • Regression – derives the extent of the relationship among data points reflecting the problem, to predict a new value in the problem context.
  • Classification – sorts data from the problem context into distinct groups and helps predict whether a new data point belongs to a particular group or not.

These algorithms need a label to group a set of data points during training to create a model that helps predict the group for a new set of data points, which is why they are referred to as Supervised Algorithms.

Linear Regression

Suppose you want to predict how relaxed you will feel after sleeping a particular number of hours on a given day. If you gather a good number of observations and plot hours of sleep on one axis against extent of relaxation on the other, the line drawn through those points becomes the pattern that Linear Regression models.

Used for predicting continuous values, Linear Regression models the relationship between a dependent variable and one or more independent variables in the data to elicit a pattern.
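A minimal sketch of the sleep example in Python with scikit-learn; the hours and relaxation values here are invented purely for illustration:

    # A minimal sketch of the sleep example; the numbers are invented.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    hours_slept = np.array([[4], [5], [6], [7], [8], [9]])  # independent variable
    relaxation = np.array([3.0, 4.5, 5.5, 7.0, 8.0, 8.5])   # dependent variable (1-10 scale)

    model = LinearRegression().fit(hours_slept, relaxation)
    print(model.predict([[7.5]]))  # predicted relaxation after 7.5 hours of sleep

The fitted line’s slope and intercept (model.coef_ and model.intercept_) are precisely the pattern described above.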

Typical use-cases:

  • Builders use it to predict the price to offer for apartments based on features like location, number of rooms, and other factors.
  • Businesses use it to forecast future sales based on historical sales data, marketing spend, and economic indicators.
  • I have used it for estimating effort for software testing based on the characteristics of the application under test.

Logistic Regression

If you want to predict whether your favourite IPL team will win a particular match, logistic regression helps determine the probability of that result based on factors like home advantage, team strength, and weather conditions.

Logistic Regression uses past data to give a percentage chance of an outcome and then makes a yes/no prediction based on whether the probability crosses 50%. Used for binary classification problems, it predicts the probability of a binary outcome, unlike Linear Regression, which predicts continuous values.
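A minimal sketch of the match example with scikit-learn; the features (home advantage, relative team strength) and the outcomes are invented:

    # A minimal sketch of the match example; features and outcomes are invented.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: home advantage (0/1), relative team strength (-1 to 1)
    X = np.array([[1, 0.5], [0, -0.3], [1, 0.2], [0, 0.4], [1, -0.6], [0, -0.8]])
    y = np.array([1, 0, 1, 1, 0, 0])  # 1 = win, 0 = loss

    model = LogisticRegression().fit(X, y)
    p_win = model.predict_proba([[1, 0.3]])[0, 1]  # probability of a win
    print(p_win, "-> win" if p_win > 0.5 else "-> loss")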

Typical use-cases:

  • Predicting whether a patient has a certain disease based on factors such as medical history, age, weight, and lifestyle.
  • Predicting whether a customer will buy a product or not based on past behaviours.

Decision Trees

Decision trees are used both for regression, predicting a numerical value as Linear Regression does, and for classification, producing a ‘yes/no’ answer as Logistic Regression does. They split the data into branches based on the values in the data, creating a tree structure that produces an output.

Referred to as non-parametric models, decision trees make fewer assumptions about the data distribution, unlike Linear and Logistic Regression models, which assume a particular form of relationship in the data. While decision trees are flexible enough to adapt to the pattern of the underlying data, they are more complex and require more data to achieve satisfactory results. Decision trees are the better choice when complex interactions exist between the various fields in the data and in scenarios where interpretability of the prediction process is key.
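A sketch of a tiny classification tree on invented credit-style data; scikit-learn’s export_text prints the learned if/then splits, which is exactly what makes trees easy to interpret:

    # A tiny classification tree on invented credit-style data.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Columns: income (thousands), years employed
    X = np.array([[30, 1], [80, 10], [45, 3], [90, 15], [25, 0], [60, 7]])
    y = np.array([1, 0, 1, 0, 1, 0])  # 1 = risky applicant, 0 = safe

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["income", "years_employed"]))  # human-readable rules
    print(tree.predict([[50, 4]]))  # classify a new applicant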

Typical use-cases:

  • Marketing teams use them to segment customers based on purchasing behaviour, demographics, and engagement, when the data is labelled.
  • Credit scoring agencies use them to identify riskier applicants based on income, credit history, and employment status.

Support Vector Machines (SVM)

Used for both classification and regression problems like Decision Trees, SVMs are the better option when the number of features (data fields) runs into the hundreds. Suppose the problem is to make a robot sort apples from oranges based on their various characteristics: SVM identifies the best ‘straight line’ between the two groups. If the groups overlap, SVM performs the ‘kernel trick’ to transform the data into a higher-dimensional space where a hyperplane can separate them easily.

While SVM algorithms manage high-dimensional spaces well, Decision Trees are simpler and more interpretable when the data fields number in the tens rather than the hundreds. Decision Trees can overfit as the number of features increases, in which case SVM is the better choice.
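A sketch of the apples-versus-oranges sorter; scikit-learn’s SVC with an RBF kernel applies the kernel trick automatically (the weight and colour-score features are invented):

    # Apples vs. oranges with an RBF-kernel SVM; the features are invented.
    import numpy as np
    from sklearn.svm import SVC

    # Columns: weight (grams), colour score (0 = green ... 1 = orange)
    X = np.array([[150, 0.2], [170, 0.1], [160, 0.3],   # apples
                  [140, 0.8], [130, 0.9], [150, 0.7]])  # oranges
    y = np.array([0, 0, 0, 1, 1, 1])  # 0 = apple, 1 = orange

    clf = SVC(kernel="rbf").fit(X, y)  # the RBF kernel handles overlapping classes
    print(clf.predict([[145, 0.75]]))  # -> [1], classified as an orange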

Typical use-cases:

  • SVM works well for problems that can be solved by classification, such as identifying objects in photos and detecting faces in images.
  • Sentiment analysis of social media posts, such as detecting whether a post contains hate speech, depends largely on the text-categorization capability of SVM algorithms.
  • SVM algorithms are also used in speech recognition applications, as they can help recognize spoken words and convert them into text.

k-Nearest Neighbours (kNN)

A simple, instance-based learning algorithm, k-Nearest Neighbours (kNN) can be used for both classification and regression. It classifies a data point based on the majority class among its k nearest neighbours.

Suppose there is a party for a musical awards event, attended by fans of several major composers. In general, we can expect fans of a particular composer to gather close to each other and engage in animated discussions. If a new person enters the hall and settles close to one of those groups, it is quite likely that the new person is a fan of the same composer. kNN does this mathematically, finding the ‘distance’ between data points and using the majority vote of the nearest neighbours to make predictions.
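The party analogy as a sketch in code: each guest is a point in ‘seat position’ space labelled by favourite composer, and a new guest is classified by the majority among the k nearest seats (positions and labels invented):

    # The party analogy; seat positions and labels are invented.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    seats = np.array([[1, 1], [2, 1], [1, 2],   # fans of composer A
                      [8, 8], [9, 8], [8, 9]])  # fans of composer B
    composer = np.array(["A", "A", "A", "B", "B", "B"])

    knn = KNeighborsClassifier(n_neighbors=3).fit(seats, composer)
    print(knn.predict([[7, 8]]))  # new guest settles near group B -> ['B']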

Typical use-cases:

  • Recommendation systems, such as how Netflix recommends movies based on your viewing history.
  • Anomaly detection, as in fraud detection and network security, spotting unusual data points in data sets.
  • Speech recognition applications, such as identifying and classifying speech patterns to activate voice-based systems.

kNN suits simple to moderately sized data sets. It works on the entire data set, finding the k nearest neighbours, where k is the number of neighbours considered when making a prediction. It is less complex and needs no training phase, but it is computationally expensive because it keeps the entire data set in memory, unlike other algorithms that distil the training data into a model.

Extreme Gradient Boosting (XGBoost)

All the above algorithms handle problems that can be solved by classification and regression. Choosing among them for a particular problem depends on the data set at hand.

Ensemble methods combine multiple models to improve prediction accuracy and robustness. It is like a committee of experts working cooperatively to arrive at a decision.

Considered a rock star among ensemble methods, XGBoost is one of the most powerful and efficient of them. XGBoost builds models sequentially, where each new model corrects the errors of the previous ones. It also uses smart ways to deal with missing data, reducing the preprocessing needed.
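A sketch of a churn-style classifier using the scikit-learn wrapper from the xgboost package (assuming it is installed via pip install xgboost); the data is invented, and the np.nan entry shows the native handling of missing values:

    # Invented churn-style data; XGBoost handles the np.nan natively.
    import numpy as np
    from xgboost import XGBClassifier

    # Columns: monthly usage (hours), months as a customer
    X = np.array([[10, 24], [2, 3], [15, 36], [np.nan, 2], [8, 12], [1, 1]])
    y = np.array([0, 1, 0, 1, 0, 1])  # 1 = churned

    model = XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
    print(model.predict([[3.0, 4]]))  # predict churn for a new customer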

Typical use-cases:

  • Credit scoring agencies use XGBoost to predict the probability of loan default and assess the creditworthiness of loan applicants based on age, income, existing loans, previous defaults, and other details.
  • Banks use XGBoost to detect fraudulent transactions by identifying unusual patterns in data such as the type, time, and location of a transaction.
  • Telecom companies use XGBoost to predict customer churn based on usage patterns and customer activities.

Key Take-aways

Seen-it-before algorithms

  • can be used for any prediction problem that can be solved using a classification or regression technique and that has enough underlying data from which patterns can be elicited, as in the several use-cases cited.
  • can be used individually or combined into an ensemble to improve prediction accuracy and performance.

