Deconstructed Learning System


Demystifying the seemingly complicated

Machine Learning
You’re Welcome


A Machine Learning Guide And Reference

What Is Machine Learning ?


Machine Learning, in a nutshell, is the idea of creating a program that receives some kind of input, learns from it, and then attempts to make predictions based on that input.

A Quick Deconstructed View

When you get into Machine Learning you are forced to learn a ton of concepts, mostly related to statistics. The concepts are useful but here are some things to consider.

The process of “Machine Learning”
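The exact imports depend on the libraries you choose, but a typical setup (assuming the common NumPy / pandas / scikit-learn stack) might look something like this:

  import numpy as np                                     # numerical arrays and math
  import pandas as pd                                    # loading and working with tabular data
  from sklearn.model_selection import train_test_split  # splitting data into training/testing sets
  from sklearn.linear_model import LinearRegression     # one of many ready-made models
  from sklearn import metrics                            # evaluation helpers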

The above modules already have all of the code in them for you to spin up an ML model that can take input and make predictions. This means that you're going to have to know the libraries a little and be familiar with object-oriented programming, because you're gonna access all of the library code you import using dot notation.

Ok cool. We imported our ML libraries. Now what ? Now comes the part where understanding the theory behind ML and statistics is going to help.

The Questions!

Preprocessing

  • You might need to convert the data, clean the data (it might have a bunch of crazy characters in it, so you have to get rid of all that junk first), enrich the data (pull data from multiple places and combine it so the data has more detail or is more robust), or normalize the data (rescale it into a decimal form to get better calculations or to represent the data in a certain way). A rough sketch of what this can look like follows below.
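As a rough illustration of those preprocessing steps, here is a minimal pandas sketch. The file names and column names (customers.csv, demographics.csv, income, age, customer_id) are made up for the example:

  import pandas as pd
  from sklearn.preprocessing import MinMaxScaler

  # Hypothetical raw data pulled from a file
  df = pd.read_csv("customers.csv")

  # Clean: strip junk characters out of a messy numeric column and drop incomplete rows
  df["income"] = pd.to_numeric(
      df["income"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
      errors="coerce",
  )
  df = df.dropna()

  # Enrich: pull in a second data source and combine it on a shared key
  extra = pd.read_csv("demographics.csv")
  df = df.merge(extra, on="customer_id")

  # Normalize: rescale numeric columns into the 0-1 range
  scaler = MinMaxScaler()
  df[["age", "income"]] = scaler.fit_transform(df[["age", "income"]])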

The Model

Once you have the above questions answered, and based on those answers:

  • Import the dataset first or initialize the data you’re gonna feed into this model, get it ready.
  • Decide which parts of the library code you’re going to use to create a model because the kind of model you are going to create is going to be dependent upon the answers to the above questions.

Training and Testing

You have to decide, based on the data and the model you used, how you want to go about the training and testing process.

  • You have to train the model on the data so that the model understands it, which means you need to feed it a certain amount of the data. How much data do you want to feed it? All the data you've got?
  • You also have to test the model somehow. So you trained the thing, but how do you know if it works? Maybe it's a good idea to reserve a certain portion of the data to see how well the model can make predictions based on it.

Evaluating

Ok, now that we've trained and tested our model, what are some different techniques to show how good or bad this thing is performing?

  • It will also help to generate some kind of report that gives you a visual on the model's performance based on how it evaluated.

Final Thoughts

At this point you’re going to have to make a decision.

  • Do you need to run the data through the ML process again but this time tweak the way you did things ?

Below is an example of what this process could look like in code form.
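This is a minimal sketch using scikit-learn; the file name (screen_time.csv), the column name target, and the choice of model are hypothetical placeholders, not a prescription:

  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import classification_report

  # 1. Get the data ready
  df = pd.read_csv("screen_time.csv")     # hypothetical dataset
  X = df.drop(columns=["target"])         # independent variables / features
  y = df["target"]                        # dependent variable / label

  # 2. Reserve a portion of the data for testing
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  # 3. Pick a model from the library and train it on the training portion
  model = LogisticRegression(max_iter=1000)
  model.fit(X_train, y_train)

  # 4. Test: predict on data the model has never seen
  predictions = model.predict(X_test)

  # 5. Evaluate: generate a report on how well it performed
  print(classification_report(y_test, predictions))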

To get a more in-depth understanding, read on!


Terminology


In Machine Learning there are a lot of terms, and many times there are multiple terms for the same thing. This is something that makes understanding machine learning a pain and a big barrier to entry.

In this section I do my best to organize terminology that you’re going to hear a lot in machine learning talk.

Bias-Variance: this term has an entire section dedicated to it later on.

Numerical: Numbers

  • Things that are numbers (self-explanatory).

Categorical: Categories, i.e.

  • Plants
  • Customer A
  • Customer B
  • Dangerous
  • Safe
  • Something binary, i.e. (0 or 1, yes or no, etc.)

Feature: Column(s) with the data in the column(s), including the column(s) name(s).

Observation: Just a row in a data set.

Independent Variable: (A feature)

(AKA) -> Predictor, Input Variable, Vector

Dependent Variable: The thing that gets created or is influenced by the independent variables.

(AKA) -> Response / Outcome / Target Variable, The Target, Output, Supervisory Signal, The Label.

Actual value: When the value of a dependent variable is already known from a data set.

Predicted value: When the value of a dependent variable is predicted based on the Features/Independent Variables in a data set.

Underfitting: The model has not been trained on enough data to be able to make accurate enough predictions. The model is unable to match the input data to the target data.

Overfitting: The model has been trained on too much data that is too similar to each other, and as a result the model does not do well at making predictions on outliers or on data that is too dissimilar to the data it was trained on.

  • Occurs when the model is not good at making predictions on data that it has not been trained with.
  • The model can recognize things that are very closely similar to the stuff it's been trained with, but as soon as you give it something outside of that scope it can't make accurate predictions on what its target value should be.
  • The model is overly trained to the dataset, which may capture noise and produce a non-generalized model.

Discrete data:

  • Countable, individualized, and nondivisible figures in statistics.
  • These data points exist only in set increments.
  • Data analysts and statisticians visualize discrete data using bar graphs, line charts, histograms, and pie charts.

For example, if you track the number of push-ups you do each day for a month, an underlying goal is to evaluate your progress and the rate of improvement. With that said, your daily tally is a discrete, isolated number.

Continuous data:

  • Data that can be categorized in a range.
  • Data that has the possibility of going on forever; many times it is data measured in time (days, hours, minutes, etc.).

Supervised Learning:


Supervised Learning Techniques: (There are two: regression and classification)

What is it ?

It's a machine learning approach for working with data that already shares some kind of relationship with another piece of data.

Supervised Learning is very much like teaching someone something by holding their hand. You create a kind of tether for them to cling to.

EX.

Age and health are linked (they have a relationship). As you get older your health gets worse, unless you’re not human or discovered some kind of magic you’re not sharing with the rest of us.

For the above example of age and health a data table might look something like this:

Age Health
60 Fair

Bias Variance Note:

The Bias-Variance Tradeoff is relevant for supervised machine learning, specifically for predictive modeling, so it's something you need to consider when using this technique. It's a way to diagnose the performance of an algorithm by breaking down its prediction error.

How these examples are structured

For these examples we want to view data as if it's in some kind of table structure, like Excel or something similar. Just assume it has the following layout for all the examples.

Some Column Title Some Column Title Yet Another Column Title
data for this column data for this column data for this column

Where "Some Column Title" is an actual column title like "Age" or "Height", and where "data for this column" is a value like 33 for age or 5'8" for height.

Ex.

Suppose we want to predict daily screen time usage for cell phone owners.

Screen Time
The data we want to predict

The dependent variable in a supervised learning model is the thing we want to predict.

Why is the thing called a dependent variable ?

Because the data can vary, and it will vary based on, or "depending on", other factors or other "variables", i.e. other things that vary.

Screen time could vary based on things like age, weight, sex, income, education, or marital status (the "features" in the table further below).

In order to take the supervised learning approach to predict screen time, we may need some or all of the above "features".

In Supervised Learning we need a data set that already has all of the Label outputs to start training a model. What does this look like?

Ex.

Independent variable Independent variable Dependent variable
data from variable data from variable data want predict

Label Output

Means the thing has some data. The column above that says "data want predict" has to have some kind of data in it. It can't all be blank, otherwise we can't train the model. We need data to train the model on to start with for supervised learning.

Basically, for supervised learning we have a bunch of data that's already been collected for something, and we feed that data into a specific algorithm. The algorithm can then try to predict what a new output might look like based on different independent values.

In the table below, the first six columns are the independent variables (the features), and the last column, Screen Time, is the dependent variable, aka the TARGET (the thing we want to predict).

Age Weight Sex Income Education Marital Status Screen Time
21 180lbs M 25k HighSchool Not Married 6hrs 22mins

Now imagine that same table with one hundred rows of data.

So ok what does this really mean ?

It means:

We can now use this data to train a machine learning model in an attempt to get some predictions out of it.

This is supervised learning. It's supervised because we are going to teach the machine learning model by holding its hand, in the sense that, like a baby, it doesn't already know how to connect the dots, so it's our job to show it how, based on related things, i.e. independent variables and dependent variables.

Of course we don't know how to go about that process yet, but we will…


Unsupervised Learning:

Definition

A kind of machine learning where a model must look for patterns in a dataset with no labels and with minimal human supervision. This is in contrast with supervised learning, because an unsupervised learning model won't necessarily know that the data shares any kind of similarities, and it's going to have to look for things in the data that are similar in order to group similar data together. With unsupervised learning it's more that we are trying to predict which things are similar, and with supervised learning it's more that we are trying to predict things based on the output of known data which we already have. Basically, unsupervised machine learning tries to bring order to a dataset and make sense of it.

Unsupervised process:

  1. Explore the structure of the information and detect distinct patterns;
  2. Extract valuable insights;
  3. Implement this into its operation in order to increase the efficiency of the decision-making process

A practical example of unsupervised learning:

Applicable in:

Accuracy:

Algorithms:

Use Cases:


Bias Variance Trade Off

Deeper underlying understanding of algorithms and machine learning models.

Bias:

Low Bias:

High Bias:

Variance:

Break Down:

Characteristics of a high variance model include:

Models with high bias will have low variance. Models with high variance will have a low bias.

Low variance (high bias):

Low bias (high variance):

This tradeoff in complexity is why there’s a tradeoff in bias and variance – an algorithm cannot simultaneously be more complex and less complex.

Types of regression models

Simple Linear regression

Ex. Predict CO2 emissions vs. engine size of all cars.

Multiple linear regression

Independent variables' effectiveness on prediction. Ex. Do revision time, test anxiety, lecture attendance and gender have any effect on the exam performance of the student?

Predicting impacts of changes:

  • Understanding how the dependent variable changes when we change the independent variables.


Linear Regression

SideNote: Theta is also called the parameters or weight vector of the regression equation

Prereq

There are two types of linear regression: simple linear regression (when there is only one independent variable present) and multiple linear regression (when there is more than one).

NOTE: In the line equation (y = mx + c), m is the slope and c is the y-intercept of the line. In the regression equation, theta-0 is the y-intercept and theta-1 is the slope of the regression line.

Formula of a line, aka how to draw a straight line through a sample: y = a + bx, where (a) is the intercept and (b) is the slope.

How do we get (a)? How do we get (b)?

X Y X^2 (X)(Y)
2 3 4 6
4 7 16 28
6 5 36 30
8 10 64 80
Σ 20 25 120 144

(a)

a = ((ΣY * ΣX^2) - (ΣX * ΣXY)) / (n * ΣX^2 - (ΣX)^2) = The Intercept (where n is the number of rows)

or

a = ((25*120) - (20*144)) / (4*120 - (20)^2) = 120 / 80 = 1.5 for the table above

(b)

b = ((n * ΣXY) - (ΣX * ΣY)) / (n * ΣX^2 - (ΣX)^2) = The Slope

or

b = ((4*144) - (20*25)) / (4*120 - (20)^2) = 76 / 80 = 0.95 for the table above (both values are checked in the sketch below)
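A quick check of the table above in plain Python (no particular library required):

  X = [2, 4, 6, 8]
  Y = [3, 7, 5, 10]
  n = len(X)

  sum_x  = sum(X)                             # 20
  sum_y  = sum(Y)                             # 25
  sum_x2 = sum(x * x for x in X)              # 120
  sum_xy = sum(x * y for x, y in zip(X, Y))   # 144

  a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)   # intercept
  b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)        # slope

  print(a, b)   # 1.5 0.95 -> the fitted line is y = 1.5 + 0.95x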


Model Evaluation Approaches

Train and test on the same data set:

This is taking the entire data set

  • building a training model based on it

Then to test the accuracy of the model

  • Take a small sample size from the data set without the labels
  • Build a test set with the small sample.

The labels are called the actual values of the test set. Finally, after we run the model on the test set:

  • Check the new predicted values with the actual values to get an idea of our models accuracy.
  • The error of the model is calculated by the average difference between the predicted and actual values for all of the rows.

Training accuracy:

  • is the percentage of correct predictions that the model makes when using the same dataset it was trained on.
  • However a high training accuracy is not always a good thing.

Out of Sample Accuracy: The percentage of correct predictions that the model makes on data it has not been trained on.

Train/Test Split:

Training the model on only a portion of the data and omitting a portion of the data to be used in a second test model.

  • This results in a better measure of out-of-sample accuracy, because the training set has no record of the data in the test set, which means we can get a better idea of whether the model is actually doing its job.
  • So in essence this is truly out of sample testing.

K-Fold cross validation:

Is another evaluation method which resolves a lot of the issues left behind by the train/test split evaluation method. (A code sketch follows after the table below.)

Consider that each row in the table below represents one fold in K-Fold cross validation

25% 25% 25% 25%
TESTING TRAINING TRAINING TRAINING
TRAINING TESTING TRAINING TRAINING
TRAINING TRAINING TESTING TRAINING
TRAINING TRAINING TRAINING TESTING
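A minimal scikit-learn sketch of the same idea, using a small made-up dataset so it runs on its own:

  import numpy as np
  from sklearn.model_selection import cross_val_score
  from sklearn.linear_model import LinearRegression

  # Made-up data: 100 rows, 3 features, and a target built from them plus some noise
  rng = np.random.default_rng(0)
  X = rng.random((100, 3))
  y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

  model = LinearRegression()
  scores = cross_val_score(model, X, y, cv=4)   # 4 folds, like the table above
  print(scores)                                 # one score per fold (R^2 for a regressor)
  print(scores.mean())                          # the 4 evaluations averaged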


Evaluation Metrics

(In the context of regression, the error of the model is the difference between the data points and the trend line generated by the algorithm; with multiple data points, an overall error can be calculated in multiple ways. The usual formulas are sketched after the list below.)

Mean absolute error

Mean squared error

Root mean squared error

Relative squared error
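The usual formulas for these metrics, sketched with NumPy; y holds the actual values and y_hat the predictions (the numbers are made up):

  import numpy as np

  y     = np.array([3.0, 7.0, 5.0, 10.0])   # actual values
  y_hat = np.array([3.4, 5.3, 7.2, 9.0])    # predicted values

  mae  = np.mean(np.abs(y - y_hat))                                 # Mean Absolute Error
  mse  = np.mean((y - y_hat) ** 2)                                  # Mean Squared Error
  rmse = np.sqrt(mse)                                               # Root Mean Squared Error
  rse  = np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)   # Relative Squared Error

  print(mae, mse, rmse, rse)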


Multiple Linear Regression

(Where Y, the dependent variable, is a linear combination of the independent variables (X1, X2, …).)

(A method of predicting a continuous variable. It uses multiple variables, called independent variables or predictors, that best predict the value of the target variable, 'the dependent variable'.)

Examples:

Do revision time, test anxiety, lecture attendance or gender have any effect on the exam performance of a student?

  • Consider the independent variables' effectiveness on prediction.

Question?

What is the dependent and what are the independent variables in the above example?

  • The dependent variable or label is the (performance of a student) - also called the outcome variable
  • The independent variables or features are (revision time, test anxiety, lecture attendance and gender)

Predicting impacts of changes

Estimating multiple linear regression parameters

We want to be able to find the best parameters (theta, the coefficients applied to the independent variables) to feed into our multiple linear regression model so we can generate the most accurate predictions of our outcome variable.

Question?

How do we find the parameters or coefficients for multiple linear regression?

Ordinary least squares

(used on data sets with less than 10,000 lines or smaller data sets)

  • Attempts to estimate the values of the coefficients by minimizing the mean squared error (MSE).
  • This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for theta (the coefficients). A minimal sketch of this follows below.
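A minimal sketch of that linear-algebra route (often called the normal equation), using NumPy with made-up data:

  import numpy as np

  # Made-up data: 5 rows, 2 features
  X = np.array([[1.0, 2.0],
                [2.0, 1.0],
                [3.0, 4.0],
                [4.0, 3.0],
                [5.0, 5.0]])
  y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])

  # Add a column of ones so theta_0 (the intercept) gets estimated too
  X_b = np.column_stack([np.ones(len(X)), X])

  # Ordinary least squares: solve (X^T X) theta = X^T y
  theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
  print(theta)   # [intercept, coefficient for each feature]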

Optimization approach

(used on larger data sets)

  • Some kind of optimization algorithm
  • Gradient Descent

Note: After we find the parameters of the linear equation we can move onto the prediction phase.

Making predictions with multiple linear regression

(The goal of regression is to accurately predict an unknown case; to this end we have to perform regression evaluation after building the model.)

  1. Train and test on the same data set

We compare the actual values Y with the predicted values Y hat.

  • This is the simplest evaluation approach

results:

Training accuracy

  • is the percentage of correct predictions that the model makes when using the same dataset it was trained on.

Caveats:

  • High training accuracy is not necessarily a good thing, as it can be the result of overfitting.

Over Fit:

  • The model is overly trained to the dataset, which may capture noise and produce a non-generalized model.

Out of sample accuracy

  • The percentage of accurate predictions that the model makes on data that it has not been trained on.
  • Its important to obtain high out of sample accuracy because the purpose of our model is to make correct predictions on unknown data.
  2. Train/test split

Involves splitting the data set into training and testing sets respectively, which are mutually exclusive.

  • After which you train with the training set and test with the testing set.
  • This will provide a more accurate evaluation on out of sample accuracy because the testing data set is not part of the data set that has been used to train the model.

Caveats:

  • Highly dependent on the data sets by which the data was trained and tested.

K fold cross validation (In reference to Multiple Linear Regression)

  • Resolves most of the issues with train/test split model evaluation method.

How do you fix the high variation that results from this dependency? You average it.

  • Split the data up into 4 folds:
  • 1st fold: Use the first 25% of the data for testing and the rest for training. The model is built using the training set and is evaluated using the test set.
  • 2nd fold: use the second 25% of the dataset for testing and the rest for training the model.
  • 3rd fold: use the third 25% of the dataset……..
  • 4th fold: etc……
  • Finally: The result of all 4 evaluations are averaged.

Regression evaluation methods

Accuracy metrics for model evaluation(Evaluation metrics in regression models)

Regression accuracy:

  • Evaluation metrics are used to explain the performance of a model.
  • Basically we can compare the actual values and predicted values to calculate the accuracy of a regression model.

What is an error in the context of regression ?

  • The difference between the data points and the trend line generated by the algorithm.
  • Measure of how far the data is from the fitted regression line.
  • Since multiple data points exist an error can be determined in multiple ways.

Classification

(A supervised learning approach: categorizing unknown items into a discrete set of categories or "classes". Classification attempts to learn the relationship between a set of feature variables and a target variable of interest.)

How classification and classifiers work

Given a set of training data points along with the target labels classification determines the class label for an unlabeled test case.

The goal of a loan default predictor is to use existing loan default data (info about the customers such as age, income and education) to build a classifier, pass a new customer or potential future defaulter to the model, and then label them, i.e. classify the data point as defaulter or not defaulter (0 or 1).


Multiclass classification

Example

Data collected on a group of patients that had the same illness and responded to one of three different types of medications they took during the course of their treatment.

  • This kind of labeled dataset can be used with a classification algorithm to build a classification model.
  • Then you can use it to find out which drug might be effective for future patients with the same illness.

K-Nearest Neighbors algorithm (KNN a specific type of classification)

The K-Nearest Neighbors algorithm is a classification algorithm that takes a bunch of labeled points and uses them to learn how to label other points.

How does it classify data

The labeled points, in this case, come from the example below.

These groups/categories/classes, or "labeled points", are the customer service categories described in the example.

Let's pretend:

Predicting:

Now let's say we have a new customer on the phone and we want to know what class/category this customer may fall into. In order to find out, we might first want to:

  • Look at some data/information from customers in all of the service categories
  • So lets start with age and income.

Suppose we find out our new customer is under 30 and makes 33k a year.

  • With this data can we make a guess as to which service group they may fall into?

Yes, we can! (A rough sketch of that guess follows below.)
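A rough sketch using scikit-learn's KNeighborsClassifier; the customer data and the service-category names are made up for illustration:

  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.preprocessing import StandardScaler

  # Hypothetical customers: [age, income in thousands] and the service group they chose
  X = [[25, 30], [27, 35], [45, 90], [50, 110], [33, 60], [60, 40], [62, 45], [29, 32]]
  y = ["basic", "basic", "premium", "premium", "plus", "standard", "standard", "basic"]

  # Scale the features so age and income sit on comparable scales
  scaler = StandardScaler()
  X_scaled = scaler.fit_transform(X)

  model = KNeighborsClassifier(n_neighbors=3)
  model.fit(X_scaled, y)

  # New customer on the phone: under 30, makes 33k a year
  new_customer = scaler.transform([[29, 33]])
  print(model.predict(new_customer))   # best guess given this made-up data: ['basic']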


Classification Evaluation Metrics

Jaccard Index

Steps ( J(X,Y) = |X∩Y| / |X∪Y| ) (Formula)
  1. Count the number of members which are shared between both sets.

  2. Count the total number of members in both sets (shared and un-shared).

  3. Divide the number of shared members (1) by the total number of members (2).

  4. Multiply the number you found in (3) by 100.

  • This percentage tells you how similar the two sets are (see the sketch below).
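A quick sketch of those steps in Python, using two made-up sets:

  actual    = {"cat", "dog", "bird", "fish"}
  predicted = {"cat", "dog", "fish", "horse"}

  shared  = len(actual & predicted)    # step 1: members in both sets -> 3
  total   = len(actual | predicted)    # step 2: members in either set -> 5
  jaccard = shared / total             # step 3: 3 / 5 = 0.6
  print(jaccard * 100, "% similar")    # step 4: 60.0 % similar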

Jaccard Index Caveat

Although it's easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

F1 Score

  • Combines the precision and recall scores of a model.
  • The accuracy metric computes how many times a model made a correct prediction across the entire dataset.

To understand the calculation of the F1 score, we first need to look at a Confusion Matrix.

  • (A matrix of numbers that tell us where a model gets confused)
  • a class-wise distribution of the predictive performance of a classification model
  • The confusion matrix is an organized way of mapping the predictions to the original classes to which the data belong.

For a binary class dataset (which consists of, suppose, “positive” and “negative” classes), a confusion matrix has four essential components:

  1. True Positives (TP): Number of samples correctly predicted as “positive.”

  2. False Positives (FP): Number of samples wrongly predicted as “positive.”

  3. True Negatives (TN): Number of samples correctly predicted as “negative.”

  4. False Negatives (FN): Number of samples wrongly predicted as “negative.”

The F1 score is defined based on the precision and recall scores, which are mathematically defined as follows:
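  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)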

Caveat

This can be a reliable metric only if the dataset is class-balanced, meaning each class of the dataset has the same number of samples.

Cross Entropy

A measure of the difference between two probability distributions.

Log Loss (cross entropy loss)
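Log loss measures how far each predicted probability is from the actual class (0 or 1), averaged over the whole dataset; lower is better. A small NumPy sketch with made-up values:

  import numpy as np

  y     = np.array([1, 0, 1, 1])           # actual classes
  y_hat = np.array([0.9, 0.2, 0.6, 0.8])   # predicted probabilities of class 1

  log_loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
  print(log_loss)   # roughly 0.27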


Decision Trees

How is one built based on the data?

Decision Trees are classification algorithms built using recursive partitioning (breaking up the data further and further down the line).

Ex.


Age
├──Young
├──Mid Aged
└──Senior

Decision trees are built by splitting the training set into distinct nodes.

Patients with an illness that have all received two types of medication:

  • Drug (A)
  • Drug (B)

Feature sets or categories we can start looking at:

  • Age (Young, Middle Aged, Senior)
  • Sex (M, F)
  • Blood Pressure (Normal, High, Low)
  • Cholesterol (Normal, High, Low)

Basically, all patients have all of these attributes, and our target is the drug that they responded to. Since all of the patients were given both drugs, we have a list of which patients responded to which drug, and we want to group these patients to find out how likely someone not in the sample set is to respond to either of the medications. (A code sketch of this follows after the table below.)

Some examples of things we might find:

Patient Age Sex BP Cholesterol Drug Response
1 23 F High High Drug(A)
2 47 M Low High Drug(B)
3 47 M Low High Drug(B)
4 28 F Norm High Drug(A)
5 61 F Low High Drug(A)
6 22 F Norm High Drug(A)
7 49 F Norm High Drug(B)
8 41 M Low High Drug(B)
9 60 M Norm High Drug(B)
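A rough sketch of building that kind of tree from the table above with scikit-learn; the one-hot encoding step and the new patient are just for illustration:

  import pandas as pd
  from sklearn.tree import DecisionTreeClassifier

  # The table above as a DataFrame
  data = pd.DataFrame({
      "Age":         [23, 47, 47, 28, 61, 22, 49, 41, 60],
      "Sex":         ["F", "M", "M", "F", "F", "F", "F", "M", "M"],
      "BP":          ["High", "Low", "Low", "Norm", "Low", "Norm", "Norm", "Low", "Norm"],
      "Cholesterol": ["High"] * 9,
      "Drug":        ["A", "B", "B", "A", "A", "A", "B", "B", "B"],
  })

  # The tree needs numeric features, so one-hot encode the categorical columns
  X = pd.get_dummies(data.drop(columns=["Drug"]))
  y = data["Drug"]

  tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
  tree.fit(X, y)

  # Hypothetical new patient: 30-year-old female, normal BP, high cholesterol
  new_patient = pd.DataFrame([{"Age": 30, "Sex": "F", "BP": "Norm", "Cholesterol": "High"}])
  new_patient = pd.get_dummies(new_patient).reindex(columns=X.columns, fill_value=0)
  print(tree.predict(new_patient))   # likely ['A'] for this toy data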

Regression Trees

A regression tree is a decision tree that can take continuous values as the target variable instead of a discrete value.

Use Cases

It seems like regression trees are good to use in situations where you want to be able to predict the range of something, or for problems that deal with categorical sequences.

Truly though, regression trees are used for dependent variables with continuous values and classification trees are used for dependent variables with discrete values.

A Leaf

In a regression tree each leaf represents a numeric value.

Ex.

Drug effectiveness based on different categories


Age > 50
├── [4.2% Effective]
└── Dosage >= 29ml
    ├── [29% Effective]
    └── Sex
        ├── [Male 100%]
        └── [Female 50%]


Intro to Logistic Regression

More often used in binary classification problems. Can be more effective for these cases than linear regression.

Sigmoid Function
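The sigmoid (logistic) function squashes any real number into the range 0 to 1, which is what lets logistic regression output probabilities: σ(z) = 1 / (1 + e^(-z)).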

What is logistic regression?

Note:

  • Notice that in all these examples, not only do we predict the class of each case, we also measure the probability of a case belonging to a specific Yes or No class.

What kind of problems can be solved using it?

  • Could be used in binary classification
  • Multi-class classification.

In which situations should we use it?

(Logistic Regression predicts the probability score between zero and one for a given sample of data)

Parameters of Logistic Regression ( Things it needs to work)

The training process

  1. Initialize Θ Theta
  2. Calculate ŷ = σ(Θ^T X) for a customer.
  3. Compare the output ŷ with the actual output of the customer, Y, and record the difference as the error.
  4. Calculate the error for all customers.
  5. Change the theta to reduce the cost.
  6. Go back to step 2.

How can we change the value of theta so that the cost is reduced across iterations?

  • There are different ways to change the value of theta but one of the most popular ways is gradient descent.

When should we stop the iterations?

  • By calculating the accuracy of your model and stopping the iterations when it's satisfactory.

Training a logistic regression model and how to change the parameters of the model to better estimate the outcome.

The main objective of training in logistic regression is to change the parameters of the model so as to be the best estimation of the labels of the samples in the dataset.

Question

How do we find the best weights or parameters that minimize the cost function?

Answer

We should calculate the minimum point of this cost function and it will show us the best parameters for our model.

Basically we are going to use the minus log function, -log. The idea behind this is:

  • Suppose we want a value of 1, which is our desired output.
  • This means we need a cost function that will return 0 in that case.
  • -log(ŷ) does this: if our predicted output is 1 and our actual value is 1, then our cost function is -log(1) = 0, meaning there is basically no error.
  • If our predicted value is less than 1 and our actual value is 1, then our cost function is going to give us a value greater than 0.

Minimizing the cost function of the model (recap)

How to find the best parameters for our model?

  • Minimize the cost function

How to minimize the cost function?

  • Use gradient descent.

What is gradient descent?

  • An iterative approach to finding the minimum of a function.
  • A technique that uses the derivative of a cost function to change the parameter values in order to minimize the cost.

Using Gradient descent to minimize the cost.

How can gradient descent do this?

Training algorithm recap

  1. Initialize the parameters randomly.
  2. Feed the cost function with training set, and calculate the error.
  3. Calculate the gradient of the cost function.
  4. Update weights with new values.
  5. Go back to step 2 until the cost is small enough. We continue this loop until we reach a small enough cost or hit some maximum number of iterations.
  6. Predict the new customer X using the parameters found after those iterations. (A bare-bones sketch of this loop follows below.)
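A bare-bones NumPy sketch of that loop (plain batch gradient descent on made-up data):

  import numpy as np

  def sigmoid(z):
      return 1 / (1 + np.exp(-z))

  # Made-up data: 2 features per customer, label 0 or 1
  X = np.array([[0.5, 1.2], [1.5, 0.3], [3.0, 2.5], [2.2, 3.1], [0.2, 0.4], [3.5, 3.0]])
  y = np.array([0, 0, 1, 1, 0, 1])
  X_b = np.column_stack([np.ones(len(X)), X])   # add an intercept column

  theta = np.zeros(X_b.shape[1])                # 1. initialize the parameters
  learning_rate = 0.1

  for _ in range(1000):                         # 5. loop until the cost is small enough
      y_hat = sigmoid(X_b @ theta)              # 2. prediction for every customer
      error = y_hat - y                         # 3. compare predictions with the actual labels
      gradient = X_b.T @ error / len(y)         #    gradient of the log-loss cost
      theta -= learning_rate * gradient         # 4. update the weights

  print(theta)                                  # fitted parameters
  print(sigmoid(X_b @ theta).round(2))          # 6. predicted probabilities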

Support Vector Machines


Multi Class Prediction


Clustering (Intro to clustering)

Definition

Clustering is an unsupervised machine learning method of identifying and grouping similar data points in large data sets without concern for the specific outcome.

Caveat

Applications of Clustering in different fields

Why Clustering?

Clustering is very important as it determines the intrinsic grouping among the unlabelled data present.

A Practical Example

Suppose we have a customer database and we want to find some similarities between these customers. Now suppose we create a machine learning model and apply a clustering algorithm to the data we input into our model.

The clustering algorithm might return three groups of customers that we see have been grouped by demographic data. Groups:

  • (A). Affluent Middle Aged People
  • (B). Young Educated, Mid Ranged Income People
  • (C). Young and Low Income People

Clustering is often used to make recommendations to users based on similar users' tastes or similar habits.

A clustering algorithm might recognize that you interacted with a particular ad for 2 minutes, and then, based on other people in its records who exhibited the same behaviour, it might recommend you a shampoo or something, because people who fall into the category of interacting with that specific ad for that length of time typically bought this one shampoo shortly after.

So basically clustering algorithms be like:

  • Hey bro I see you did like these other peeps, that like this one thing, who then after also like this other thing. Maybe you like this other thing too?

Uses

Generally clustering can be used for one of the following purposes:

Different Clustering Algorithms


K-Means

(K-Medians (Fuzzy c-Means))

Intro to K-Means: (Is an iterative algorithm)

K-means clustering algorithm

K-means divides the data into non-overlapping subsets (clusters) without any cluster internal structure.

  • Examples within a cluster are very similar.
  • Objects across different clusters are very different or dissimilar.
  • Note: This makes sense since the subsets are non-overlapping.

Questions:

  • How can we find the similarity of samples in clustering?
  • How do we measure how similar two customers are with regard to their demographics?

Answers:

  • In order to do this, instead of measuring how similar samples are, we can instead measure how different or dissimilar they are.

K-Means tries to minimize the differences inside of a cluster and maximize the differences between clusters.

  • ie. Intra cluster distances are minimized
  • Inter-cluster distances are maximized

In order to calculate the distance we use the Euclidean Distance or the Minkowski distance

  • We can use this same distance metric for multidimensional vectors after we normalize our feature set to get an accurate dissimilarity measurement.

Other Dissimilarity measurements that can be used

  • Euclidean
  • Cosine Similarity
  • Average Distance

The K-Means clustering process

  1. Initialize K (Decide the cluster size you want to set for K)

(Define the centroid of each cluster):

Centroids should have the same size (number of dimensions) as our feature set.

  • i.e. how many clusters or groups the algorithm is going to group the data into, or how many groups you want it to group the data into.
  • Centroids are really just central points within certain cluster groups that are going to serve as a kind of model of what similar data needs to be in order to be grouped in the same cluster as a specific centroid.

Two approaches to choose a centroid:

  • i. Randomly choose observations out of the dataset then use these observations as the initial means (averages).
  • ii. Create random points as centroids of the clusters.
  2. Distance Calculation:

Calculate the distance of each datapoint from the centroid points.

  • Ultimately this process is going to produce a matrix where each entry represents the distance of a data point from each centroid, aka the Distance Matrix.
  3. Assign each point to its closest centroid.

K-Means Error is the total distance of each point from its centroid.

  • Sum of Squares error.
  4. Compute the new centroids for each cluster (to improve the error).

In this step each centroid gets updated to the mean for data points in its cluster.

  • Each centroid is going to move according to their cluster members.

To break this down a little:

Based on all of the points within a cluster

  • The algorithm is going to get an average of them
  • Say ok this is the middle of all of those points in this neighborhood so actually this is a better centroid to use for this specific cluster.
  • This process continues until the centroids stop moving, or until the algorithm decides it has found a solid enough central point between all the data points that live in this cluster. KEEP IN MIND:
  • Every time the centroid moves each point in relation to the centroid needs to be measured again.
  5. Repeat the process until there are no more changes, i.e. iterate steps 2-4 until the algorithm converges. (A sketch using scikit-learn follows below.)
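With scikit-learn those steps are wrapped up in one class; a sketch using made-up 2-D points:

  import numpy as np
  from sklearn.cluster import KMeans

  # Made-up points that form three loose groups
  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
                 rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
                 rng.normal(loc=[0, 5], scale=0.5, size=(30, 2))])

  # n_init=10 runs the whole process 10 times with different starting centroids
  model = KMeans(n_clusters=3, n_init=10, random_state=0)
  labels = model.fit_predict(X)

  print(model.cluster_centers_)   # the final centroid of each cluster
  print(model.inertia_)           # sum-of-squares error (total distance of points to their centroids)
  print(labels[:10])              # which cluster each of the first 10 points was assigned to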

Caveat:

There is no guarantee that the algorithm is going to converge to the global optimum and the results may depend on the initial clusters.

Which means the result of this algorithm may not produce the best possible outcome.

Solution:

It is common to run the whole process multiple times with different starting conditions. This means with randomized starting centroids it may produce a better outcome.

  • Since this algorithm is usually very fast it should not be a problem to run it multiple times.

K-Means accuracy and characteristics

  1. Works by randomly placing k centroids, one for each cluster.

    The farther apart the centroids are placed, the better.

  2. Calculate the distance of each point from each centroid.
  3. Assign each data point (object) to its closest centroid, creating clusters or groups.
  4. Recalculate the positions of the K centroids.

How can we evaluate the goodness of the clusters formed by K-means?

Choosing K:

The value of K is ambiguous because it is dependent on the shape and scale of the distributions of points in a dataset.

There are some solutions to this problem

  • One of them is to run the model with different values of K and then compare the runs using some metric to see which value of K fits the best. This metric can be the mean distance between data points and their cluster's centroid.

So basically, every time you run the model with a different value of K, you then measure the distance between each cluster's centroid and the points inside of the cluster.

The problem is that with increasing the number of clusters the distance of data points to centroids will always reduce.

  • This means that increasing the value of K will always decrease the error.
  • So, the value of the metric as a function of K is plotted and the elbow point is determined where the rate of decrease sharply shifts.
  • In other words, we plot this on a graph and look for the point where the rate of decrease shifts drastically.
  • We then choose the value of K based on this. This method is called the elbow method (see the sketch below).
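A sketch of the elbow method on the same kind of made-up points, printing the error for each K so you can spot where the drop levels off:

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(30, 2)),
                 rng.normal(loc=[5, 5], scale=0.5, size=(30, 2)),
                 rng.normal(loc=[0, 5], scale=0.5, size=(30, 2))])

  # Fit K-Means for several values of K and record the error each time
  for k in range(1, 8):
      model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
      print(k, round(model.inertia_, 1))   # the error keeps dropping; the "elbow" here is around K=3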

The drawback is that we need to pre-specify the number of clusters, which is ultimately not an easy task.