Overfitting and Underfitting

Overfitting and underfitting are two fundamental reasons why a machine learning model performs poorly. The primary objective of any machine learning model is to generalize well. Here, generalization refers to the ability of a model to produce acceptable output for inputs it has not seen before. A model that generalizes well gives accurate, trustworthy predictions on new data after being trained on the dataset.

Before we move on to overfitting and underfitting, we need to be familiar with some prerequisite terms:

  • Noise: Noise is random or irrelevant variation in the data, including outliers, that does not follow the general trend of the overall dataset.

  • Bias: Bias is the error introduced when the model makes simplifying assumptions so that the target function is easier to learn. A model with high bias is too simple for the data, which shows up as a high error rate even on the training data.

  • Variance: Variance measures how sensitive the model is to the particular training data it was given. In practice, it shows up as the gap between the model's error rate on the training data and its error rate on the testing data. We need a low variance for our model to generalize well.
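The link between these terms and the two error rates can be sketched in a few lines of Python. This is only an illustration: the function names and the threshold values (`high_error`, `large_gap`) are assumptions made for the example, not standard values, and sensible thresholds depend entirely on the problem.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def diagnose_fit(train_error, test_error, high_error=1.0, large_gap=1.0):
    """Label a model from its training and testing error.

    The thresholds are hypothetical and problem-specific; they only
    illustrate the reasoning.
    """
    gap = test_error - train_error
    if train_error >= high_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (underfitting)"
    if gap >= large_gap:
        # Good on training data, much worse on unseen data.
        return "high variance (overfitting)"
    return "reasonable fit"


print(diagnose_fit(train_error=0.1, test_error=5.0))
print(diagnose_fit(train_error=4.0, test_error=4.2))
print(diagnose_fit(train_error=0.2, test_error=0.3))
```

The three calls cover a large train/test gap, a high training error, and a case where both errors are low and close together.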



Overfitting

Overfitting occurs when our model tries to fit every data point in the training data set so closely that it also learns a large amount of the noise. As a result, the model performs well on the training data but falls short on the testing data. Hence, it has a low bias but a high variance.

Overfitting happens when the model captures too much detail. The chances of overfitting increase as we train the model for longer on a limited dataset, or as we make the model more complex than the data can support.
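An extreme case of this is a model that simply memorizes the training set. The sketch below uses a 1-nearest-neighbour predictor on synthetic data; the trend y = 2x, the noise level, and the sample sizes are all assumptions chosen for the example.

```python
import random


def make_data(n, rng):
    """Noisy samples of the hypothetical trend y = 2x."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 2) for x in xs]
    return xs, ys


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the target of the closest training point (pure memorization)."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]


rng = random.Random(0)
train_x, train_y = make_data(30, rng)
test_x, test_y = make_data(30, rng)

train_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in train_x]
test_pred = [nearest_neighbour_predict(train_x, train_y, x) for x in test_x]

train_mse = mse(train_y, train_pred)
test_mse = mse(test_y, test_pred)
print("training MSE:", train_mse)
print("testing MSE: ", test_mse)
```

Every training point is its own nearest neighbour, so the training error is zero: the model has reproduced the noise exactly. On the testing data that memorized noise turns into error, which is the low-bias, high-variance pattern described above.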

We can avoid/reduce overfitting by the following measures:

  • Reducing the complexity of the model
  • Increasing the amount of training data
  • Using K-fold cross-validation (a method for determining how well the model performs on fresh data)
  • Using regularization techniques such as Lasso (L1) and Ridge (L2)
  • Stopping early during the training phase (by keeping an eye on the validation loss and halting training when it begins to increase)
  • Removing some features
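As a rough illustration of the K-fold cross-validation measure listed above, the following sketch splits the data into k folds and averages the validation error over them. The helper names (`k_fold_indices`, `cross_val_mse`) and the mean-predicting model are hypothetical and chosen for brevity. In practice the data is usually shuffled before splitting, which is omitted here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


def cross_val_mse(xs, ys, fit, predict, k=5):
    """Average validation MSE over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(ys[i] - predict(model, xs[i])) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


# Hypothetical model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x for x in xs]
print("5-fold validation MSE:", cross_val_mse(xs, ys, fit_mean, predict_mean))
```

Each sample is used for validation exactly once, so the averaged score reflects how the model behaves on data it was not fitted to.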



Underfitting

Underfitting occurs when our model is unable to discern the underlying pattern in the training data. This happens when the model is too simple for the data, or when training is stopped too early for the model to learn enough from the dataset. As a result, its accuracy is reduced and it generates unreliable predictions. An underfitted model performs poorly on both training and testing data. Therefore, it has a high bias but a low variance.

Underfitting can also occur when we have very little data to work with: there are too few examples for the model to pick up the underlying pattern, so its predictions end up with large errors.

We can avoid/reduce underfitting by the following measures:

  • Increasing the complexity of the model
  • Increasing the amount of training data
  • Removing as much noise as possible from the data
  • Increasing the number of features, performing feature engineering
  • Increasing the time duration of the training phase, increasing the number of epochs
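The first measure, increasing model complexity, can be illustrated with a small comparison. The sketch assumes synthetic data that follows y = 3x + 1 plus noise (an assumption made for the example) and compares a constant model, which is too simple, with a straight-line model fitted by least squares.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(0)


def make_data(n):
    """Noisy samples of the hypothetical trend y = 3x + 1."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [3 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


train_x, train_y = make_data(50)
test_x, test_y = make_data(50)

# Too simple: ignore x and always predict the mean training target.
constant = sum(train_y) / len(train_y)
const_train = mse(train_y, [constant] * len(train_y))
const_test = mse(test_y, [constant] * len(test_y))

# More complex: a straight line, which matches the shape of the data.
slope, intercept = fit_line(train_x, train_y)
line_train = mse(train_y, [slope * x + intercept for x in train_x])
line_test = mse(test_y, [slope * x + intercept for x in test_x])

print("constant model: train", const_train, "test", const_test)
print("line model:     train", line_train, "test", line_test)
```

The constant model has a high error on both the training and the testing data, the signature of underfitting, while the slightly more complex line model brings both errors down.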



A “Good Fit”

In theory, a model with a perfect fit would produce predictions with zero error, but in practice this is rarely achievable. A good fit lies between the underfitted and overfitted models: the error is low on both the training and the testing data.

As training goes on, our model continues to learn, and at first the error on both the training and testing data drops. If the model is allowed to learn for an excessively long time, however, noise and less valuable features make it prone to overfitting: the training error keeps falling while the testing error starts to rise, so the model's performance on new data declines. To get a "good fit", we halt just before the testing error begins to increase. At this point the model performs well on both the training and testing datasets.
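The stopping rule can be sketched as a small function over the per-epoch validation losses. The function name, the `patience` parameter, and the loss values below are made up for illustration and do not come from a real training run. `patience` is the number of epochs we are willing to wait without improvement before giving up.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch (0-based) with the best validation loss.

    Scanning stops once the loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Made-up validation losses: they fall at first, then rise as the
# model starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.66]
print("keep the model from epoch", early_stopping_epoch(val_losses))
```

Here the loss bottoms out at epoch 3 and rises afterwards, so the model saved at epoch 3 is the one we would keep.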


A graphical example of Underfitting, Overfitting, and “Good Fit”

Picture courtesy: GeeksForGeeks


