
Overfitting and Underfitting

Overfitting and underfitting are two fundamental problems that cause a machine learning model to perform poorly. The primary objective of any machine learning model is to generalize well. Here, generalization refers to the ability of an ML model to adapt to previously unseen inputs and produce acceptable outputs; in other words, the model can generate trustworthy and accurate predictions after being trained on the dataset.

Before we move on to overfitting and underfitting, we need to be familiar with some prerequisite terms:

  • Noise: Noise refers to unnecessary or irrelevant data points, such as outliers, that do not follow the general trend of the overall dataset.

  • Bias: Bias is the error rate on the training data. It arises from the oversimplification of the machine learning algorithm, when the model makes strong assumptions in order to make the target function easier to learn.

  • Variance: Variance is defined as the difference between the model's error rate on the training data and its error rate on the testing data. We need low variance for our model to generalize well (see the short sketch after this list).
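As a rough illustration, here is a minimal sketch (using scikit-learn on a synthetic dataset; neither appears in the original post) that estimates these two quantities by comparing error rates on the training and testing splits:

```python
# Minimal sketch: training error rate as a proxy for bias, and the
# train/test error gap as a proxy for variance. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_error = 1 - model.score(X_train, y_train)   # error rate on the training data (bias)
test_error = 1 - model.score(X_test, y_test)      # error rate on the testing data
variance = test_error - train_error               # gap between the two error rates

print(f"bias (training error):   {train_error:.3f}")
print(f"variance (test - train): {variance:.3f}")
```

An unconstrained decision tree typically drives the training error close to zero (low bias) while leaving a large train/test gap (high variance), which is exactly the overfitting pattern described below.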



Overfitting

Overfitting occurs when our model tries to cover every data point in the training dataset, so much so that it takes a large amount of noise into account as well. As a result, the model performs well on the training data but falls short on the testing data. Hence, it has low bias but high variance.

Overfitting occurs when the model tries to capture too much detail; in other words, the chances of overfitting increase as we train the model more and more on a limited dataset, as the sketch below illustrates.
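For instance, here is a minimal sketch (again with scikit-learn and made-up data) in which a very flexible model fit to a small, noisy training set chases the noise and overfits:

```python
# Minimal sketch: a degree-14 polynomial fit to only 15 noisy points can pass
# through nearly every training point, yet generalizes badly. Illustrative data only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X_train = np.sort(rng.uniform(0, 1, 15)).reshape(-1, 1)        # only 15 training points
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(scale=0.2, size=15)
X_test = np.linspace(0, 1, 100).reshape(-1, 1)                  # the true, noise-free curve
y_test = np.sin(2 * np.pi * X_test).ravel()

overfit = make_pipeline(PolynomialFeatures(degree=14), LinearRegression())
overfit.fit(X_train, y_train)

print("train MSE:", mean_squared_error(y_train, overfit.predict(X_train)))  # near zero
print("test MSE: ", mean_squared_error(y_test, overfit.predict(X_test)))    # much larger
```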

We can avoid or reduce overfitting with the following measures (a short sketch of regularization and cross-validation follows the list):

  • Reducing the complexity of the model
  • Increasing the amount of training data
  • Using K-fold cross-validation (a method for determining how well the model performs on fresh data)
  • Using Regularization techniques such as Lasso and Ridge
  • Stopping early during the training phase (by keeping an eye on the validation loss and halting training as it begins to increase)
  • Removing some features
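Here is a rough sketch of two of these measures (scikit-learn again; the synthetic dataset is illustrative): Ridge regularization compared against an unregularized model using K-fold cross-validation:

```python
# Minimal sketch: Ridge regularization evaluated with 5-fold cross-validation.
# Few samples and many features make plain linear regression prone to overfitting.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=60, n_features=40, noise=10.0, random_state=0)

for name, model in [("no regularization", LinearRegression()),
                    ("ridge (alpha=10)  ", Ridge(alpha=10.0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")   # K-fold cross-validation
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```

Lasso (sklearn.linear_model.Lasso) can be swapped in for Ridge in the same way; it additionally drives some coefficients to exactly zero, which also removes features.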



Underfitting

Underfitting occurs when our model is unable to discern the underlying pattern in the training data it is given. When training is stopped too early, the model fails to learn enough from the dataset; hence its accuracy is reduced and it generates unreliable predictions. An underfitted model performs poorly on both training and testing data. Therefore, it has high bias but low variance.

When we have very little data to work with, the model cannot extract a meaningful pattern from such minimal data and ends up making errors.

We can avoid or reduce underfitting with the following measures (a short sketch of increasing model complexity follows the list):

  • Increasing the complexity of the model
  • Increasing the amount of training data
  • Removing as much noise as possible from the data
  • Increasing the number of features, e.g., by performing feature engineering
  • Increasing the duration of the training phase, e.g., by increasing the number of epochs
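Here is a rough sketch of the first measure, increasing model complexity (scikit-learn with made-up data): a straight line underfits a curved relationship, while adding polynomial features lets a linear model capture it:

```python
# Minimal sketch: a degree-1 model underfits a quadratic target; adding
# polynomial features increases the model's complexity and fixes it.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.3, size=200)      # quadratic target

underfit = LinearRegression().fit(X, y)                          # straight line only
better = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear R^2:   ", round(underfit.score(X, y), 3))   # low: a line cannot follow the curve
print("quadratic R^2:", round(better.score(X, y), 3))     # much higher after adding features
```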



A “Good Fit”

In theory, a model with a good fit produces predictions with zero error, but in practice this is very hard to achieve. A good fit lies between the underfitted and overfitted models.

As training goes on, our model continues to learn, and its error on both the training and testing data continues to drop. If the model is allowed to learn for an excessively long time, however, the presence of noise and less valuable features makes it prone to overfitting, and its performance on the testing data declines. To get a "good fit", we halt training just before the testing error begins to increase. At this point, the model performs well on both the training and testing datasets.
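A minimal sketch of this stopping rule (assuming scikit-learn's SGDRegressor and a synthetic dataset, neither of which is part of the original post) trains the model one epoch at a time, watches the validation error, and halts as soon as it starts to rise:

```python
# Minimal sketch of early stopping: train incrementally and halt when the
# validation error stops improving. A real implementation would usually allow
# some "patience" before stopping. Illustrative data only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=20, noise=15.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
best_error = np.inf

for epoch in range(200):
    model.partial_fit(X_train, y_train)                          # one more pass over the data
    val_error = mean_squared_error(y_val, model.predict(X_val))
    if val_error > best_error:                                   # validation error began to rise
        print(f"stopping early at epoch {epoch}")
        break
    best_error = val_error
```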


[Figure: A graphical example of underfitting, overfitting, and a "good fit". Picture courtesy: GeeksForGeeks]


