Machine Learning
Machine Learning, in a nutshell, is the idea of creating a program which receives some kind of input that it learns from, then attempts to make predictions based on that input.
A Quick Deconstructed View
When you get into Machine Learning you are forced to learn a ton of concepts, mostly related to statistics. The concepts are useful but here are some things to consider.
The process of “Machine Learning”
Most of the work starts with library code, ie. modules such as:
- Scipy
- Scikit-learn
- TensorFlow
- Theano
- Keras
- PyTorch
The above modules already have all of the code in them for you to spin up an ML model that can take input and make predictions. This means that you're going to have to know the libraries a little and be familiar with object-oriented programming, because you're going to access all of the library code you import using dot notation.
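Here's a minimal sketch of what "accessing library code with dot notation" looks like, using scikit-learn (the class name is real; the toy usage is just illustration):

```python
# Import one piece of scikit-learn's library code and reach it via dot notation.
from sklearn.linear_model import LinearRegression

model = LinearRegression()       # the library code does the heavy lifting
# model.fit(...) and model.predict(...) are reached the same way -- dot notation.
```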
Ok cool. We imported our ML libraries. Now what? Now comes the part where understanding the theory behind ML and statistics is going to help.
The Questions!
Preprocessing
- You might need to convert the data, clean the data (because it might have a bunch of crazy characters in it, so you've got to get rid of all that junk first), enrich the data (pull data from multiple places and combine it in an effort to make the data have more detail or be more robust), or normalize the data (scale it into a decimal form to get better calculations or to represent the data in a certain way). A small sketch of this follows below.
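A minimal preprocessing sketch with a tiny made-up dataset (the column names and values are invented for illustration):

```python
# Clean and normalize a tiny made-up dataset.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"age": [21, 35, None, 50], "income": [25000, 60000, 40000, 90000]})
df = df.dropna()                                                     # clean: drop rows with missing values
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]])   # normalize to a 0-1 decimal form
print(df)
```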
The Model
Once you have the above questions answered, and based on those answers:
- Import the dataset first or initialize the data you're going to feed into this model, and get it ready.
- Decide which parts of the library code you're going to use to create a model, because the kind of model you are going to create is going to depend on the answers to the above questions.
Training and Testing
You have to decide, based on the data and the model you used, how you want to go about the training and testing process.
- You have to train the model on the data so that the model understands the data, so you need to feed it certain amounts of the data. How much data do you want to feed it? All the data you've got?
- You have to test the model somehow also. So you trained the thing, but how do you know if it works? Maybe it's a good idea to reserve a certain portion of the data to be used to see how well the model can make predictions based on it (see the sketch after this list).
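A minimal sketch of reserving a portion of the data for testing, using a toy feature matrix and labels:

```python
# Hold out 20% of the data for testing; train on the rest.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix (10 rows, 2 columns)
y = np.arange(10)                  # toy labels, one per row
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 rows for training, 2 held back for testing
```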
Evaluating
Ok, now that we've trained and tested our model, what are some different techniques to show how good or bad this thing is performing?
- It will also help to generate some kind of report that will give you a visual on the model's performance based on how it evaluated (see the sketch after this list).
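A minimal sketch of one such report, using scikit-learn's built-in classification report on some made-up actual vs. predicted labels:

```python
# Print a per-class precision/recall/F1 report for toy predictions.
from sklearn.metrics import classification_report

y_true = [0, 1, 1, 0, 1, 0]   # the actual labels
y_pred = [0, 1, 0, 0, 1, 1]   # what the model predicted
print(classification_report(y_true, y_pred))
```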
Final Thoughts
At this point you’re going to have to make a decision.
- Do you need to run the data through the ML process again, but this time tweak the way you did things?
To get a more in-depth understanding, read on!
In machine learning there are a lot of terms. Many times there are multiple terms for the same thing. This is something that makes understanding machine learning a pain and a big barrier.
In this section I do my best to organize terminology that you’re going to hear a lot in machine learning talk.
This term has an entire section dedicated to it.
Numerical: Numbers
- Things that are numbers (self explanatory)
Categorical: Categories ie.
- Plants
- Customer A
- Customer B
- Dangerous
- Safe
- Something binary, ie. (0 or 1, yes or no, etc.)
- Evaluation metrics provide insight into areas that might require improvement.
- Help us determine how accurate a given model is by comparing the actual values in a test set with the values predicted by the model.
- In plain English: they are ways of calculating how well the ML model is performing.
A column (or columns) of data, including the column name(s).
Just a row in a data set.
(A feature)
(AKA) -> Predictor, Input Variable, Vector
The thing that gets created or is influenced by the independent variables.
(AKA) -> Response / Outcome / Target Variables, The Target, Output Supervisory Signal, The Label.
When the value of a dependent variable is already known from a data set.
When the value of a dependent variable is predicted based on Features/Independent Variables in a data set.
Underfitting: The model has not been trained on enough data to be able to make accurate enough predictions. The model is unable to match the input data to the target data.
Overfitting: The model has been trained on too much data that is too similar to each other, and as a result the model does not do well at making predictions on outliers or data that is too dissimilar to the data it was trained on.
- Occurs when the model is not good at making predictions on data that it has not been trained with.
- The model can recognize things that are very closely similar to the stuff it's been trained with, but as soon as you give it something that is outside of that scope it can't make accurate predictions on what its target value should be.
- The model is overly trained to the dataset, which may capture noise and produce a non-generalized model.
- countable, individualized, and nondivisible figures in statistics.
- These data points exist only in set increments.
- Data analysts and statisticians visualize discrete data using bar graphs, line charts, histograms, and pie charts.
For example, if you track the number of push-ups you do each day for a month, an underlying goal is to evaluate your progress and the rate of improvement. With that said, your daily tally is a discrete, isolated number.
- Data that can be categorized in a range.
- Data that has the possibility of going on forever; many times it is data measured in time (days, hours, minutes, etc.).
Supervised Learning Techniques: (There are two)
What is it ?
It's a machine learning approach for working with data that already shares some kind of relationship with another piece of data.
Supervised Learning is very much like teaching someone something by holding their hand. You create a kind of tether for them to cling to.
EX.
Age and health are linked (they have a relationship). As you get older your health gets worse, unless you’re not human or discovered some kind of magic you’re not sharing with the rest of us.
For the above example of age and health a data table might look something like this:
Age | Health |
---|---|
60 | Fair |
Bias Variance Note:
The Bias-Variance Tradeoff is relevant for supervised machine learning, specifically for predictive modeling, meaning it's something you need to consider when using the technique. It's a way to diagnose the performance of an algorithm by breaking down its prediction error.
How these examples are structured
For these examples we want to view data as if it's in some kind of table structure like Excel or something similar. Just assume it has the following layout for all the examples.
Some Column Title | Some Column Title | Yet Another Column Title |
---|---|---|
data for this column | data for this column | data for this column |
Where “Some column title” is an actual column title like “Age” or “Height”, and where “data for this column” is 33 for age and 5'8" for height.
Ex.
Suppose we want to predict daily screen time usage for cell phone owners.
Screen Time |
---|
The data we want to predict |
The dependent variable in a supervised learning model is the thing we want to predict.
Why is the thing called a dependent variable ?
Because the data can vary, and it will vary based on, or “depending on”, other factors or other “variables”, ie. other things that vary.
Screen time could vary based on:
In order to take the supervised learning approach to be able to predict screen time, we may need some, or all, of the above “features”.
In Supervised Learning we need a data set that already has all of the Label outputs to start training a model. What does this look like?
Ex.
Independent variable | Independent variable | Dependent variable |
---|---|---|
data from variable | data from variable | data want predict |
Label Output
Means the thing has some data. The column above which says “data want predict” has to have some kind of data in it. It can't all be blank, otherwise we can't train the model. We need data to train the model on to start with for supervised learning.
Basically, for supervised learning we have a bunch of data that's already been collected for something and we feed that data into a specific algorithm. The algorithm can then try to predict what a new output might look like based on different independent values.
Dependent Variable / TARGET (the thing we want to predict): the Screen Time column in the table below.
Age | Weight | Sex | Income | Education | Marital Status | Screen Time |
---|---|---|---|---|---|---|
21 | 180lbs | M | 25k | HighSchool | Not Married | 6hrs 22mins |
Now imagine the same table with one hundred rows of data.
So ok what does this really mean ?
It means:
We can now use this data to train a machine learning model in an attempt to get some predictions out of it.
This is supervised learning. It's supervised because we are going to teach the machine learning model by holding its hand, in the sense that, like a baby, it doesn't already know how to connect the dots, so it's our job to show it how based on related things, ie. independent variables and dependent variables.
Of course we don't know how to go about that process yet, but we will…
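A minimal supervised-learning sketch in code: labeled rows go in, a prediction comes out. The feature values and labels below are made up for illustration.

```python
# Train on labeled examples, then predict the label for a new, unseen row.
from sklearn.linear_model import LinearRegression

X = [[21, 25], [35, 60], [50, 90]]   # e.g. [age, income in thousands]
y = [6.4, 3.2, 1.5]                  # e.g. daily screen time in hours (the label)

model = LinearRegression().fit(X, y)   # "hold its hand" with labeled data
print(model.predict([[28, 40]]))       # predict screen time for a new person
```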
Unsupervised process:
A practical example of unsupervised learning is social network analysis, which is conducted to make clusters of friends depending on the frequency of connections between them.
Applicable in:
Accuracy:
Algorithms:
Use Cases:
Deeper underlying understanding of algorithms and machine learning models.
Bias:
Low Bias:
High Bias:
Characteristics of a high bias model include:
- Failure to capture proper data trends
- Potential towards underfitting
- More generalized/overly simplified
- High error rate
Variance:
Break Down:
Characteristics of a high variance model include:
- Modeling the noise in the training data
- Potential towards overfitting
- More complex/overly flexible
- Performs well on training data but poorly on unseen data
Models with high bias will have low variance. Models with high variance will have a low bias.
Low variance (high bias):
Linear or parametric algorithms such as regression and naive Bayes.
Low bias (high variance):
Non-linear or non-parametric algorithms such as decision trees and nearest neighbors.
This tradeoff in complexity is why there’s a tradeoff in bias and variance – an algorithm cannot simultaneously be more complex and less complex.
Simple Linear regression
Ex. Predict CO2 emission vs. engine size of all cars.
- (x) Engine Size
- (y) CO2 emissions
Multiple linear regression
Independent variables' effectiveness on prediction. Ex. Do revision time, test anxiety, lecture attendance, and gender have any effect on the exam performance of a student?
Predicting impacts of changes:
- Understanding how the dependent variable changes when we change the independent variables.
SideNote: Theta is also called the parameters or weight vector of the regression equation
Prereq
The dependent variable must be measured on a continuous scale; however, the independent variable(s) can be measured on either a categorical or continuous measurement scale.
There are two types of linear regression: simple (when there is only one independent variable present) and multiple (when more than one is present).
Predict CO2 emission vs. Engine Size of all cars
- Independent variable (x): Engine Size
- Dependent variable (y): CO2 emission
Predict CO2 emissions vs. Engine Size and Cylinders of all cars
- Independent variables (x): Engine Size, Cylinders, etc.
- Dependent variable (y): CO2 emission
NOTE: In the line equation (y = mx + c), m is a slope and c is the y-intercept of the line In the given equation, theta-0 is the y-intercept and theta-1 is the slope of the regression line.
Formula of a line (y = a + bx), aka how to draw a straight line through a sample
- a = The intercept (Y intercept)
- b = The slope of the line
- x = The independent variable(s)
- y = The dependent variable
- Σ = The sum of multiple items
How do we get (a)? How do we get (b)?
X | Y | X^2 | (X)(Y) | |
---|---|---|---|---|
2 | 3 | 4 | 6 | |
4 | 7 | 16 | 28 | |
6 | 5 | 36 | 30 | |
8 | 10 | 64 | 80 | |
20 | 25 | 120 | 144 | Σ |
(a)
a = (((ΣY)*(ΣX^2))-((ΣX)*(ΣXY))) / n(rows)*(ΣX^2)-(ΣX)^2 = The Intercept
or
a = ((25*120) - (20*144)) / (4*120 - (20)^2) = 120 / 80 = 1.5 (for the table above)
(b)
b = ((n*(ΣXY))-((ΣX)*(ΣY))) / (n*(ΣX^2))-(ΣX)^2 = The Slope
or
b = ((4*144) - (20*25)) / (4*120 - (20)^2) = 76 / 80 = 0.95 (for the table above)
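A small sketch checking these two formulas in code, using the sample table above:

```python
# Compute the intercept (a) and slope (b) from the worked table.
X = [2, 4, 6, 8]
Y = [3, 7, 5, 10]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)  # intercept = 1.5
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)       # slope = 0.95
print(a, b)
```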
Train and test on the same data set:
This is taking the entire data set
- building a training model based on it
Then to test the accuracy of the model
- Take a small sample size from the data set without the labels
- Build a test set with the small sample.
The labels are called actual values of the test set. Finally after we run our test model:
- Check the new predicted values with the actual values to get an idea of our models accuracy.
- The error of the model is calculated by the average difference between the predicted and actual values for all of the rows.
Training accuracy:
- is the percentage of correct predictions that the model makes when using the training dataset.
- However a high training accuracy is not always a good thing.
Out of Sample Accuracy:
Doing a train and test on the same data set will likely produce a low out-of-sample accuracy due to the likelihood of being over-fit. It's important that our models have high out-of-sample accuracy because we want the model to be able to make predictions on unknown data.
Train/Test Split:
Training the model on only a portion of the data and omitting a portion of the data to be used in a second test model.
- This results in a higher level of out-of-sample accuracy because the training set has no record of the data in the test set, which means we can get a better idea of whether the model is actually doing its job by comparing its predictions against the held-out test values.
- So in essence this is truly out of sample testing.
K-Fold cross validation:
Is another evaluation method which resolves a lot of the issues which are left behind by the train/test split evaluation method.
- You take the entire dataset and split it into 4 portions or 4 folds.
- You use the first 25% of the data for testing
- The rest gets used for training.
- Then you take the next 25% of the data and do the same thing, until you are at the last 25% of the data.
- Finally, the results of all 4 evaluations are averaged, that is, the result of each fold is averaged together, keeping the data distinct so that no training data from one fold is used in another.
Consider that each row in the table below represents one fold in K-Fold cross validation
25% | 25% | 25% | 25% |
---|---|---|---|
TESTING | TRAINING | TRAINING | TRAINING |
TRAINING | TESTING | TRAINING | TRAINING |
TRAINING | TRAINING | TESTING | TRAINING |
TRAINING | TRAINING | TRAINING | TESTING |
K-Fold cross validation in its simplest form performs multiple train/test splits.
Perhaps the best approach for most accurate results in a training model.
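A minimal K-fold sketch with 4 folds, averaging the per-fold scores as described above (toy data for illustration):

```python
# Run 4-fold cross validation and average the fold scores.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

X = np.arange(40).reshape(20, 2)    # toy features
y = np.arange(20, dtype=float)      # toy target
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print(scores.mean())                # the averaged result of all 4 evaluations
```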
(In the context of regression, the error of the model is the difference between the data points and the trend line generated by the algorithm, and with multiple data points an error can be determined in multiple ways.)
Mean absolute error
Mean squared error
Root mean squared error
Relative squared error
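A minimal sketch of the error measures named above, computed on some made-up actual vs. predicted values:

```python
# Mean absolute error, mean squared error, root mean squared error, relative squared error.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(mae, mse, rmse, rse)
```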
(Where Y(dependent variable) is a linear combination of independent variables(X,X…))
(is a method of predicting a continuous variable. It uses multiple variables called independent variables or predictors that best predict the value of the target variable, ‘the dependent variable’.)
- In a one-dimensional space this is the equation of a line.
- It's what is used in simple linear regression.
- This kind of scenario is used for multiple linear regression.
Examples:
Do revision time, test anxiety, lecture attendance, and gender have any effect on the exam performance of a student?
- Consider the independent variables effectiveness on prediction
Question?
What is the dependent and what are the independent variables in the above example?
- The dependent variable or label is the (performance of a student) - also called the outcome variable
- The independent variables or features are (anxiety, lecture attendance and gender)
Predicting impacts of changes
Estimating multiple linear regression parameters
We want to be able to find the best parameters(theta, independent variables) to feed into our multiple linear regression model so we can generate the most accurate predictions in our outcome variable.
Question?
How do we find the parameter or coefficients for multiple linear regression?
Ordinary least squares
(used on data sets with less than 10,000 lines or smaller data sets)
- Attempts to estimate the values of the coefficients by minimizing the mean squared error (MSE).
- This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for theta (the coefficients); a small sketch follows below.
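A minimal sketch of the ordinary least squares idea via the normal equation, theta = (XᵀX)⁻¹Xᵀy, reusing the small table from the simple regression example:

```python
# Solve for theta = [intercept, slope] with linear algebra (normal equation).
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 6], [1, 8]], dtype=float)  # first column of 1s for the intercept
y = np.array([3, 7, 5, 10], dtype=float)

theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)   # [1.5, 0.95] -- matches the worked a/b example above
```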
Optimization approach
(used on larger data sets)
- Some kind of optimization algorithm
- Gradient Descent
Note: After we find the parameters of the linear equation we can move onto the prediction phase.
Making predictions with multiple linear regression
(The goal of regression is to accurately predict an unknown case; to this end we have to perform regression evaluation after building the model.)
We compare the actual values Y with the predicted values Y hat.
- This is the simplest evaluation approach
results:
Training accuracy
- is the percentage of correct predictions that the model makes when using the training data set.
Caveats:
- High training accuracy is not necessarily a good thing, as it can result in overfitting.
Over Fit:
- The model is overly trained to the dataset, which may capture noise and produce a non-generalized model.
Out of sample accuracy
- The percentage of accurate predictions that the model makes on data that it has not been trained on.
- It's important to obtain high out-of-sample accuracy because the purpose of our model is to make correct predictions on unknown data.
Involves splitting the data set into training and testing sets respectively, which are mutually exclusive.
- After which you train with the training set and test with the testing set.
- This will provide a more accurate evaluation on out of sample accuracy because the testing data set is not part of the data set that has been used to train the model.
Caveats:
- Highly dependent on the data sets by which the data was trained and tested.
K fold cross validation (In reference to Multiple Linear Regression)
- Resolves most of the issues with train/test split model evaluation method.
How do you fix the high variation that results from this dependency? You average it.
- Split the data up into 4 folds:
- 1st fold: Use the first 25% of the data for testing and the rest for training.The model is built using the training set and is evaluated using the test set.
- 2nd fold: use the second 25% of the dataset for testing and the rest for training the model.
- 3rd fold: use the third 25% of the dataset……..
- 4th fold: etc……
- Finally: The result of all 4 evaluations are averaged.
Regression evaluation methods
Accuracy metrics for model evaluation (evaluation metrics in regression models)
Regression accuracy:
- Evaluation metrics are used to explain the performance of a model.
- Basically we can compare the actual values and predicted values to calculate the accuracy of a regression model.
What is an error in the context of regression ?
- The difference between the data points and the trend line generated by the algorithm.
- Measure of how far the data is from the fitted regression line.
- Since multiple data points exist an error can be determined in multiple ways.
(A supervised learning approach for categorizing some unknown items into a discrete set of categories or “classes”. Classification attempts to learn the relationship between a set of feature variables and a target variable of interest.)
How classification and classifiers work
Given a set of training data points along with the target labels classification determines the class label for an unlabeled test case.
Loan default prediction:
- Previous loan default data can be used to predict which customers are likely to have problems paying loans.
- High risk customers can either be turned down or offered other products.
The goal of a loan default predictor is to use existing loan default data, which is info about the customers such as age, income, and education, to build a classifier, pass a new customer or potential future defaulter to the model, and then label that customer, ie. the data points, as defaulter or not defaulter (0 or 1).
We can also build classifier models for both binary and multi class classification.
Example
Data collected on a group of patients that had the same illness and responded to one of three different types of medications they took during the course of their treatment.
- This kind of labeled dataset can be used with a classification algorithm to build a classification model.
- Then you can use it to find out which drug might be effective for future patients with the same illness.
K-Nearest Neighbors algorithm (KNN a specific type of classification)
The K-Nearest Neighbors algorithm is a classification algorithm that takes a bunch of labeled points and uses them to learn how to label other points.
How does it classify data
The labeled points in this case are based on the example below.
These groups/categories/classes, or “labeled points”, are:
Lets pretend:
- Customers tend to be either over the age of 45 or have an income under 25k or over 125k
- Customers tend to be under 30 years old with an income of under 40k
- Customers tend to be over 30 years old and under 45 years old with an income above 40k
- Customers tend to be between the ages of 20 - 36 years old with an income above 60k
Predicting:
Now let's say we have a new customer on the phone and we want to know what class/category this customer may fall into. In order to find out we might first want to:
- Look at some data/information from customers in all of the service categories
- So lets start with age and income.
Suppose we find out our new customer is under 30 and makes 33k a year.
- With this data can we make a guess as to which service group they may fall into?
Yes, we can!
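A minimal KNN sketch of that guess; the [age, income-in-thousands] rows and the hypothetical group labels are made up to loosely mirror the example above:

```python
# Classify the new caller by looking at the 3 most similar existing customers.
from sklearn.neighbors import KNeighborsClassifier

X = [[52, 20], [28, 35], [26, 38], [40, 55], [25, 70], [61, 130]]   # [age, income in k]
y = ["A", "B", "B", "C", "D", "A"]                                  # hypothetical service groups

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[29, 33]]))   # the new caller (under 30, makes 33k) -> ['B']
```

(In practice you would normalize age and income first so one feature doesn't dominate the distance calculation.)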
Jaccard Index (Jaccard similarity coefficient)
Steps ( J(X,Y) = |X∩Y| / |X∪Y| ) (Formula)
1. Count the number of members which are shared between both sets.
2. Count the total number of members in both sets (shared and un-shared).
3. Divide the number of shared members (1) by the total number of members (2).
4. Multiply the number you found in (3) by 100.
- This percentage tells you how similar the two sets are.
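A minimal sketch of those steps in code, using two small example sets:

```python
# Jaccard index: shared members divided by total distinct members.
def jaccard_index(x, y):
    shared = len(x & y)       # step 1: members in both sets
    total = len(x | y)        # step 2: members in either set (shared and un-shared)
    return shared / total     # step 3 (multiply by 100 for step 4's percentage)

a = {"cat", "dog", "bird"}
b = {"dog", "bird", "fish"}
print(jaccard_index(a, b) * 100)   # 50.0 -> the two sets are 50% similar
```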
Jaccard Index Caveat
Although it's easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.
- Combines the precision and recall scores of a model.
- The accuracy metric computes how many times a model made a correct prediction across the entire dataset.
To understand the calculation of the F1 score, we first need to look at a Confusion Matrix.
- (A matrix of numbers that tell us where a model gets confused)
- a class-wise distribution of the predictive performance of a classification model
- The confusion matrix is an organized way of mapping the predictions to the original classes to which the data belong.
For a binary class dataset (which consists of, suppose, “positive” and “negative” classes), a confusion matrix has four essential components:
True Positives (TP): Number of samples correctly predicted as “positive.”
False Positives (FP): Number of samples wrongly predicted as “positive.”
True Negatives (TN): Number of samples correctly predicted as “negative.”
False Negatives (FN): Number of samples wrongly predicted as “negative.”
The F1 score is defined based on the precision and recall scores, which are mathematically defined as follows:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2 * (Precision * Recall) / (Precision + Recall)
Caveat
This can be a reliable metric only if the dataset is class-balanced, meaning each class of the dataset has the same number of samples.
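A minimal sketch of pulling the four confusion-matrix counts and the F1 score out of some made-up binary predictions:

```python
# Confusion-matrix components and F1 score for toy binary labels.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
print("F1:", f1_score(y_true, y_pred))   # combines precision and recall into one number
```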
Cross Entropy
The difference between two probability distributions.
Log Loss (cross entropy loss)
How is one built based on the data?
Decision Trees are classification algorithms built using recursive partitioning (breaking up the data further and further down the line).
Ex.
Age
├──Young
├──Mid Aged
└──Senior
Decision trees are built by splitting the training set into distinct nodes.
Patients with an illness that have all received two types of medication:
- Drug (A)
- Drug (B)
Feature sets or categories we can start looking at:
- Age (Young, Middle Aged, Senior)
- Sex (M, F)
- Blood Pressure (Normal, High, Low)
- Cholesterol (Normal, High, Low)
Basically, all patients have all of these attributes, and our target is the drug that they responded to. Since all of the patients were given both drugs, we have a list of which patients responded to which drug, and we want to group these patients to find out how likely someone not in the sample set is to respond to either of the medications.
Some examples of things we might find:
Patient | Age | Sex | BP | Cholesterol | Drug Response |
---|---|---|---|---|---|
1 | 23 | F | High | High | Drug(A) |
2 | 47 | M | Low | High | Drug(B) |
3 | 47 | M | Low | High | Drug(B) |
4 | 28 | F | Norm | High | Drug(A) |
5 | 61 | F | Low | High | Drug(A) |
6 | 22 | F | Norm | High | Drug(A) |
7 | 49 | F | Norm | High | Drug(B) |
8 | 41 | M | Low | High | Drug(B) |
9 | 60 | M | Norm | High | Drug(B) |
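A minimal decision-tree sketch fit on the table above, with the categorical columns encoded as integers just for illustration:

```python
# Fit a decision tree on the patient table and ask about a new patient.
from sklearn.tree import DecisionTreeClassifier

# columns: [age, sex (F=0, M=1), BP (Low=0, Norm=1, High=2), cholesterol (High=2)]
X = [[23, 0, 2, 2], [47, 1, 0, 2], [47, 1, 0, 2], [28, 0, 1, 2], [61, 0, 0, 2],
     [22, 0, 1, 2], [49, 0, 1, 2], [41, 1, 0, 2], [60, 1, 1, 2]]
y = ["Drug(A)", "Drug(B)", "Drug(B)", "Drug(A)", "Drug(A)",
     "Drug(A)", "Drug(B)", "Drug(B)", "Drug(B)"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[30, 0, 1, 2]]))   # which drug might a new patient respond to?
```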
A regression tree is a decision tree that can take continuous values as the target variable instead of a discrete value.
Use Cases
It seems like regression trees are good to use in situations where you want to be able to predict the range of something, or for problems that deal with categorical sequences.
Truly though, regression trees are used for dependent variables with continuous values and classification trees are used for dependent variables with discrete values.
A Leaf
In a regression tree each leaf represents a numeric value.
Ex.
Drug effectiveness based on different categories
Age > 50
├── [4.2% Effective]
└── Dosage >= 29ml
    ├── [29% Effective]
    └── Sex
        ├── [Male 100%]
        └── [Female 50%]
More often used in binary classification problems. Can be more effective for these cases than linear regression.
Sigmoid Function
What is logistic regression?
- Logistic Regression is analogous to linear regression but tries to predict a categorical or discrete target field instead of a numeric one.
- (Yes, No)
- (True, False)
- (Success, Failure)
- (Pregnant, Not Pregnant)
- (Probability of heart attack: Yes, No)
- (Chance of mortality in an injured patient)
- (Likelihood of a customer purchasing a product)
- (Likelihood of a customer cancelling a subscription-based service)
Note:
- Notice that in all these examples, not only do we predict the class of each case, we also measure the probability of a case belonging to a specific Yes or No class.
What kind of problems can be solved using it?
- Could be used in binary classification
- Multi-class classification.
In which situations should we use it?
(Logistic Regression predicts the probability score between zero and one for a given sample of data)
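A minimal logistic-regression sketch showing both the predicted class and the probability score between zero and one; the [age, income] rows and churn labels are made up:

```python
# Predict a yes/no class and the probability behind it.
from sklearn.linear_model import LogisticRegression

X = [[25, 30], [40, 80], [35, 45], [50, 120], [23, 25], [60, 90]]   # [age, income in k]
y = [1, 0, 1, 0, 1, 0]                # 1 = cancelled the subscription, 0 = stayed

clf = LogisticRegression().fit(X, y)
print(clf.predict([[30, 40]]))        # the predicted class
print(clf.predict_proba([[30, 40]]))  # the probability of belonging to each class
```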
Parameters of Logistic Regression ( Things it needs to work)
The training process
How can we change the value of theta so that the cost is reduced across iterations?
- There are different ways to change the value of theta but one of the most popular ways is gradient descent.
When should we stop the iterations?
- By calculating the accuracy of your model and stopping it when it's satisfactory.
Training a logistic regression model and how to change the parameters of the model to better estimate the outcome.
Is the difference between the actual values of Y and our model output or predicted values Y-hat
- The cost function is basically the error.
The main objective of training in logistic regression is to change the parameters of the model so as to be the best estimation of the labels of the samples in the dataset.
Question
How do we find the best weights or parameters that minimize the cost function?
Answer
We should calculate the minimum point of this cost function and it will show us the best parameters for our model.
Basically we are going to use the minus log function -log. So the idea behind this is
- suppose we want a value of 1 which is our desired output
- this means we need a cost function that will return 0.
- -log(ŷ): What this means is, if our predicted output is 1 and our actual value is 1, then our cost function is 0, meaning there is basically no error.
- If our predicted value is less than 1 and our actual value is 1, then our cost function is going to give us a value greater than 0.
This concept is part of the Logistic Regression cost function.
- Remember that ŷ does not return a class but rather a value between 0 and 1, which should be treated as a probability.
Minimizing the cost function of the model (recap)
How to find the best parameters for our model?
- Minimize the cost function
How to minimize the cost function?
- Use gradient descent.
What is gradient descent?
- An iterative approach to finding the minimum of a function.
- A technique that uses the derivative of a cost function to change the parameter values, in order to minimize the cost.
Using Gradient descent to minimize the cost.
How can gradient descent do this?
Training algorithm recap
- Initialize the parameters randomly.
- Feed the cost function with training set, and calculate the error.
- Calculate the gradient of the cost function.
- Update weights with new values.
- Go to step 2 until the cost is small enough. We continue this loop until we reach a small value of cost or some limited number of iterations.
- Predict the new customer X. The parameters should be roughly found after some iterations (a small sketch follows below).
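A minimal gradient-descent sketch for logistic regression that follows the recap above; the one-feature toy data, learning rate, and iteration count are arbitrary choices for illustration:

```python
# Hand-rolled gradient descent on the logistic (log-loss) cost.
import numpy as np

X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])
Xb = np.c_[np.ones(len(X)), X]              # add a bias column

theta = np.random.randn(2)                  # step 1: initialize parameters randomly
lr = 0.1
for _ in range(1000):                       # step 5: repeat for a limited number of iterations
    y_hat = 1 / (1 + np.exp(-Xb @ theta))   # step 2: feed the data through; sigmoid gives probabilities
    gradient = Xb.T @ (y_hat - y) / len(y)  # step 3: gradient of the cost function
    theta -= lr * gradient                  # step 4: update weights with new values

print(theta)                                # roughly-found parameters after the iterations
```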
Definition
Clustering is an unsupervised machine learning method of identifying and grouping similar data points in large data sets without concern for the specific outcome.
Caveat
Applications of Clustering in different fields
It can be used to characterize & discover customer segments for marketing purposes.
It can be used for classification among different species of plants and animals.
It is used in clustering different books on the basis of topics and information.
It is used in insurance to understand customers and their policies and to identify fraud.
It is used to make groups of houses and to study their values based on their geographical locations and other factors present.
By learning the earthquake-affected areas we can determine the dangerous zones.
It is used to train a model to recognize or distinguish objects from one another.
Cross selling strategies:
ie. Splitting up customer groups or finding them and labeling them. The practice of grouping customers based on similar attributes.
Note: A general segmentation process is not usually feasible for large volumes of varied data, therefore you need an analytical approach to deriving segments and groups from large datasets.
Why Clustering?
Clustering is very important as it determines the intrinsic grouping among the unlabelled data present.
For instance:
- Finding representatives for homogeneous groups (data reduction)
- Finding “natural clusters” and describe their unknown properties (“natural” data types)
- Finding useful and suitable groupings (“useful” data classes)
- Finding unusual data objects (outlier detection).
A clustering algorithm must make some assumptions about what constitutes the similarity of points, and each assumption makes different and equally valid clusters.
A Practical Example
Suppose we have a customer database and we want to find some similarities between these customers. Now suppose we create a machine learning model and apply a clustering algorithm to the data we input into our model.
The clustering algorithm might return three groups of customers that we see have been grouped by demographic data. Groups:
- (A). Affluent Middle Aged People
- (B). Young Educated, Mid Ranged Income People
- (C). Young and Low Income People
Clustering is often used to make recommendations to users based on similar users' tastes or based on similar habits.
A clustering algorithm might recognize that you interacted with a particular ad for 2 minutes, and then, based on other people who exhibited the same behaviour in its records, it might also recommend you a shampoo or something, because people that fall into the category of interacting with that specific ad for that length of time typically bought this one shampoo shortly after.
So basically clustering algorithms be like:
- Hey bro I see you did like these other peeps, that like this one thing, who then after also like this other thing. Maybe you like this other thing too?
Uses
Generally clustering can be used for one of the following purposes:
Different Clustering Algorithms
- Group of clustering algorithms that produce sphere-like clusters. These algorithms are relatively efficient and are used for medium and large sized databases.
These methods partition the objects into k clusters, and each partition forms one cluster. This method is used to optimize an objective criterion similarity function, such as when the distance is a major parameter. Examples:
- K means,CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Hierarchical clustering is a very useful way of segmentation. The advantage of not having to pre-define the number of clusters gives it quite an edge over k-Means. However, it doesn't work well when we have huge amounts of data.
- Produce Trees of Clusters. This group of algorithms are very intuitive and are generally good for use with small size datasets.
The clusters formed in this method form a tree-type structure based on the hierarchy.
- New clusters are formed using the previously formed one.
- It is divided into two categories: Agglomerative (bottom-up) and Divisive (top-down).
Examples:
- CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and using Hierarchies), etc.
Produce arbitrary shaped clusters. Especially good when dealing with spatial clusters or when there is noise in your data set.
- DBSCAN: Density-based Spatial Clustering of Applications with Noise
- These data points are clustered by using the basic concept that the data point lies within the given constraint from the cluster center.
Various distance methods and techniques are used for the calculation of the outliers.
- These methods consider the clusters as the dense region having some similarities and differences from the lower dense region of the space.
- These methods have good accuracy and the ability to merge two clusters. Examples:
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise),OPTICS (Ordering Points to Identify Clustering Structure), etc.
In this method, the data space is formulated into a finite number of cells that form a grid-like structure.
- All the clustering operations done on these grids are fast and independent of the number of data objects. Examples:
- STING (Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest), etc.
(Variants: K-Medians, Fuzzy c-Means)
Intro to K-Means: (Is an iterative algorithm)
K-means clustering algorithm
K-means algorithm partitions n observations into k clusters where each observation belongs to the cluster with the nearest mean serving as a prototype of the cluster.
K-means divides the data into non-overlapping subsets (clusters) without any cluster internal structure.
- Examples within a cluster are very similar.
- Objects across different clusters are very different or dissimilar.
- Note: This makes sense since these are non-overlapping subsets.
Questions:
- How can we find the similarity of samples in clustering?
- How do we measure how similar two customers are with regard to their demographics?
Answers:
- Sometimes, instead of measuring how similar samples are, we can instead measure how different or dissimilar they are.
K-Means tries to minimize the differences inside of a cluster and maximize the differences between clusters.
- ie. Intra cluster distances are minimized
- Inter-cluster distances are maximized
In order to calculate the distance we use the Euclidean distance or the Minkowski distance.
- We can use this same distance metric for multidimensional vectors after we normalize our feature set to get an accurate dissimilarity measurement (see the sketch after this list).
Other Dissimilarity measurements that can be used
- Euclidean
- Cosine Similarity
- Average Distance
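A minimal sketch of the Euclidean dissimilarity measurement between two customers, assuming their features have already been normalized:

```python
# Euclidean distance between two normalized feature vectors.
import numpy as np

customer_a = np.array([0.2, 0.8, 0.5])   # made-up normalized features
customer_b = np.array([0.3, 0.6, 0.9])
print(np.linalg.norm(customer_a - customer_b))   # the dissimilarity between them
```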
The K-Means clustering process
(Define the centroid of each cluster):
Centroids should be of the same size as our feature set.
- ie. How many clusters of groups the algorithm is going to group data into or how many groups you want it to group data into.
- Centroids are really just central points within certain cluster groups that are going to serve as a kind of model of what similar data needs to be in order to be grouped in the same cluster as a specific centroid.
Two approaches to choose a centroid:
- i. Randomly choose observations out of the dataset then use these observations as the initial means (averages).
- ii. Create random points as centroids of the clusters.
Calculate the distance of each datapoint from the centroid points.
- Ultimately this process is going to produce a matrix where each entry will represent the distance of a data point from each centroid, aka a distance matrix.
K-Means Error is the total distance of each point from its centroid.
- Sum of Squares error.
In this step each centroid gets updated to the mean for data points in its cluster.
- Each centroid is going to move according to their cluster members.
To break this down a little:
Based on all of the points within a cluster
- The algorithm is going to get an average of them
- Say, ok, this is the middle of all of those points in this neighborhood, so actually this is a better centroid to use for this specific cluster.
- This process continues until the centroids stop moving, or until the algorithm decides it has found a solid enough central point between all the data points that live in each cluster. KEEP IN MIND:
- Every time the centroid moves each point in relation to the centroid needs to be measured again.
Caveat:
There is no guarantee that the algorithm is going to converge to the global optimum and the results may depend on the initial clusters.
Which means the result of this algorithm may not produce the best possible outcome.
Solution:
It is common to run the whole process multiple times with different starting conditions. This means with randomized starting centroids it may produce a better outcome.
- Since this algorithm is usually very fast it should not be a problem to run it multiple times.
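A minimal K-means sketch with multiple random restarts (the n_init parameter), which is the "run the whole process multiple times" solution described above; the blob data is randomly generated for illustration:

```python
# Cluster three random blobs and inspect the final centroids and the error.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (20, 2)),
               rng.normal([6, 0], 1, (20, 2)),
               rng.normal([0, 6], 1, (20, 2))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the centroids after they stop moving
print(km.inertia_)           # sum-of-squares error (total distance of points to their centroids)
```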
K-Means accuracy and characteristics
The farther apart the clusters are placed, the better.
How can we evaluate the goodness of the clusters formed by K-means?
Compare the clusters with the ground truth if it is available. However this is usually not the case since K-Means is an unsupervised algorithm.
Average distance between data points within a cluster. Also average of the distance of data points from their cluster centroids can be used as a metric of error for the clustering algorithm.
Determining the value of K in K-Means clustering is a common problem in data clustering.
Choosing K:
The value of K is ambiguous because it is dependent on the shape and scale of the distributions of points in a dataset.
There are some solutions to this problem
- One of them is to run the model with different values for K and compare a metric across those runs to see which value of K fits best. This metric can be the mean distance between data points and their cluster centroid.
So basically, every time you run a model with a different value for K, you then measure the distance between each cluster's centroid and the points inside of that cluster.
The problem is that as the number of clusters increases, the distance of data points to centroids will always decrease.
- This means that increasing the value of K will always decrease the error.
- So, the value of the metric as a function of K is plotted, and the elbow point is determined where the rate of decrease sharply shifts.
- In other words, we repeat this process and plot the results until we see the point where the rate of decrease in error shifts drastically.
- We then choose the value of K based on this. This method is called the elbow method.
The drawback is that we need to pre-specify the number of clusters, which is ultimately not an easy task.
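A minimal sketch of the elbow method described above: run K-means for several values of K, plot the sum-of-squares error, and look for the point where the rate of decrease sharply shifts (the data is randomly generated for illustration):

```python
# Plot error vs. K and eyeball the "elbow".
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1, (30, 2)) for c in ([0, 0], [6, 0], [0, 6])])

ks = range(1, 9)
errors = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), errors, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Sum-of-squares error")
plt.show()   # the elbow in this curve suggests a good value for K
```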