Machine Learning (Big Picture)
In this blog, we will (I mean, I will) take the effort to explain the Big Picture of the whole Machine Learning pipeline.
Machine Learning is a subset of Artificial Intelligence and is a vast field in itself. It is, in turn, home to several sub-fields, each of which sees a huge amount of research.
Let's understand it with the help of an example, but before that, the picture below shows the basic pipeline of how it works.
You have a huge dataset which contains images of Koalas and Sloths, and you need to classify them.
Data Collection
As shown in the image, our first step is going to be Data Collection. What does this mean? I am pretty sure it's very intuitive. Since teaching a machine requires a substantially huge amount of data, we need to get this data from somewhere. While you are still learning, there are going to be so many websites which offer datasets; you can find them with a simple Google search. Kaggle is one famous website for accessing datasets.
Now, when you work for some company, chances are pretty high that you will have to collect the data yourself to feed your model, and one of the common ways to do that is Scraping. But for simplicity's sake, let's say you found the dataset on Kaggle. With this, you are now done with the first block of the pipeline.
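Assuming you've downloaded and unzipped the Kaggle images into a folder with one sub-folder per class (the folder names here are just an assumption for illustration), loading them with TensorFlow could look roughly like this:

```python
# A minimal sketch of loading the (already downloaded and unzipped) Kaggle images.
# The folder layout "data/koala" and "data/sloth" is an assumption for illustration.
import tensorflow as tf

dataset = tf.keras.utils.image_dataset_from_directory(
    "data",                # root folder with one sub-folder per class
    labels="inferred",     # class labels come from the sub-folder names
    image_size=(128, 128), # resize every image to a fixed size
    batch_size=32,
)
print(dataset.class_names)  # e.g. ['koala', 'sloth']
```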
Data Pre-Processing
Now this is arguably the most boring and, often, the most challenging step of the pipeline. 99% of the time, the dataset you get, be it from a website like Kaggle or data you've collected yourself, is going to be a Bad Dataset. What do I mean by that? That dataset is going to contain missing values, data entered in the wrong columns, and outliers.
Let's take a second's break to understand the term Outlier. An outlier is pretty much a data point which does not follow the most common trend. Let's say your whole family is a big fan of Cricket but you're the only one who doesn't like it and likes watching F1 instead; then you instantly become an Outlier. Or let's say that one guy in class is still getting 95% in spite of 99% of the class falling below a score of 80. That guy is an outlier too. I hope you get my point.
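If you'd like to see the idea in code, here is a minimal sketch that flags the 95% scorer using the common IQR rule; both the rule and the marks are my own illustrative choices, not something the pipeline mandates:

```python
# A minimal sketch of spotting outliers with the common IQR rule.
# The method and the example marks are illustrative assumptions.
import numpy as np

marks = np.array([62, 70, 74, 68, 71, 65, 69, 95])  # one student far above the rest

q1, q3 = np.percentile(marks, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = marks[(marks < lower) | (marks > upper)]
print(outliers)  # the 95 shows up as an outlier
```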
Other than these three things, there are many more complex ways in which data can be bad, such as the data not following a Normal Distribution, etc.
Now, other than the data being bad, there are many scenarios where we need to apply Pre-Processing to our data. To understand that, let's take another example. Say we are doing some prediction and the features (the columns) that we have are an employee's Salary and Weight. In general, we deal with weights in the range of 50kg to 100kg, right? Whereas salaries are somewhere in terms of lakhs and crores (millions and tens of millions). It is easy for the model to think that, since Salary is higher in magnitude, it should be the dominating factor, so it favours it. But this is misleading for the model, so what we often do is scale both salary and weight into the range of 0 to 1 so that the representation is fair and the model doesn't become biased towards anything. This is another aspect of Data Pre-Processing.
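Here's a minimal sketch of that scaling using scikit-learn's MinMaxScaler; the salary and weight numbers are made up for illustration:

```python
# A minimal sketch of scaling Salary and Weight into the 0-to-1 range with
# scikit-learn's MinMaxScaler. The numbers are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Each row is [salary (in rupees), weight (in kg)]
data = np.array([
    [1_200_000, 70],
    [800_000, 55],
    [10_000_000, 90],
])

scaler = MinMaxScaler()             # maps each column to [0, 1] independently
scaled = scaler.fit_transform(data)
print(scaled)                       # both features now live on a comparable scale
```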
You'll come to know more about Pre-Processing as you go forward and implement your knowledge in projects. (The best place to start is Kaggle, and I've written about the same here - https://blog.pointblank.club/ghost/#/site . The title is Kaggling.)
Choosing the right ML Model
Now comes the most interesting yet very crucial task in the ML Pipeline. Good for us that even here, frameworks like TensorFlow and PyTorch, or libraries like Sklearn (Scikit-Learn), come to our rescue.
First, decide whether your task is going to require a framework like TensorFlow or PyTorch, which are home to the algorithms used for Neural Network related tasks, or whether it can be solved by some classical algorithm, in which case we'll use Sklearn.
In our current example, we need to perform Image Classification, something which cannot be done well using classical ML models from Sklearn like Logistic Regression, so we are going to use a Neural Network.
We'll use a model built with TensorFlow known as a Convolutional Neural Network (CNN). CNNs are pretty good at Image Classification problems.
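Here is a minimal sketch of what such a CNN could look like in TensorFlow's Keras API; the layer sizes and the 128x128 input are illustrative assumptions, not a tuned architecture.

```python
# A minimal sketch of a small CNN for the Koala-vs-Sloth classifier.
# Layer sizes and the 128x128 input shape are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),  # scale pixels to [0, 1]
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single output: Koala vs Sloth
])
model.summary()
```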
Note - We have chosen a very simple problem statement here, which makes it easy to pick the right model for the task, but in general this decision is often a very time-consuming one and requires the developer to play around with multiple models before deciding which one is the best fit for that particular Problem Statement.
Training The Model
Now the next step in our journey is to teach the model to correctly differentiate between a Koala's and a Sloth's image. We simply train the model to do the required task by feeding it the Clean Dataset that we've created. Training is not hard when you consider the fact that there are so many frameworks out there which have abstracted away so much for us developers.
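To make this concrete, here is a minimal sketch of compiling and training the CNN from the earlier sketch; train_dataset and val_dataset are hypothetical names for the training and validation splits, and the optimizer, loss, and epoch count are just illustrative choices.

```python
# A minimal sketch of compiling and training the CNN sketched above.
# train_dataset and val_dataset are hypothetical names for the data splits;
# the optimizer, loss, and 10 epochs are illustrative choices, not a recipe.
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",   # two classes: Koala or Sloth
    metrics=["accuracy"],
)

history = model.fit(
    train_dataset,                # the cleaned, pre-processed training images
    validation_data=val_dataset,  # held-out images used to watch for overfitting
    epochs=10,
)
```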
Validating The Predictions
Going forward, we need to verify whether the predictions that the model makes are correct or not. So, it's crucial to understand this: we carve three sub-datasets out of the full dataset. One is the Training set, another is the Testing set, but there is one more which you should always keep in mind, the Validation Set. Let's cover the last two in brief.
We use the training set to train the model on the data that we've collected. We use the testing set to test whether the model has actually understood how to perform the classification task, which is exactly like the cruel exams we face every semester. Now, the validation set is a very crucial piece. Picture this - you are preparing for an exam, you use all the materials to make yourself well versed with the concepts, and then you take up a mock test to see how well prepared you are and whether you have merely memorized the concepts or actually understood everything. This mock test is exactly what the Validation Dataset does.
So, we first check if the model is performing well on the Validation dataset before jumping to the Test dataset.
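If you want to see what carving out these three sets could look like in code, here is a minimal sketch using scikit-learn's train_test_split; X and y are hypothetical names for the images and labels, and the 70/15/15 proportions are just an assumption.

```python
# A minimal sketch of splitting the full dataset into train / validation / test.
# X and y are hypothetical names for the images and labels; the 70/15/15 split
# is an assumption, pick what suits your data.
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42)
```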
Testing The Model
Now the penultimate task is to test the model on the test data that you've been provided with. This resembles the final exam you take after all those hours of learning the concepts and the mock tests you've attempted in solitude. Depending upon the results of this test, you calculate how much Loss (or how many errors) the model has incurred. Depending upon the Loss, we perform something known as Backpropagation to tweak the model and help it learn better. After this Backpropagation, we start again from the training step, and this loop continues until we feel the Loss that the model is incurring has reduced enough.
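In Keras terms, this testing step could look roughly like the sketch below; test_dataset is a hypothetical name for the test split.

```python
# A minimal sketch of evaluating the trained model on the held-out test set.
# test_dataset is a hypothetical name for the test split.
test_loss, test_accuracy = model.evaluate(test_dataset)
print(f"Test loss: {test_loss:.4f}, test accuracy: {test_accuracy:.4f}")
```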
Deploying The Model
The final step of building this pipeline is to deploy the model. It basically means putting the work that you've done into Production. You will learn more about this step when you start developing real-life projects.
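Just to give a tiny taste of what that can look like, here is a minimal sketch of saving a Keras model to disk and loading it back; the file name is an assumption, and real deployments involve much more than this.

```python
# A minimal sketch of saving the trained model so it can be shipped or served later.
# The file name is an assumption; Keras also supports other export formats.
model.save("koala_vs_sloth.keras")                              # persist the trained model to disk

reloaded = tf.keras.models.load_model("koala_vs_sloth.keras")   # load it back elsewhere
```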
Conclusion
I hope all of this made sense to you. If not, you can always reach out to me on my LinkedIn and X handles for any kind of clarification.
In my next blog, we'll walk through the code, understanding how the detailed part actually works.