Data Structures in Machine Learning

Ever used Python libraries like scikit-learn, TensorFlow, Keras, or PyTorch? Ever wondered what lies behind the one line of code that initializes the model? Ever wondered how the data is stored and processed inside it? Today, we will explore the data structures used to implement different machine learning models and see how much they matter in machine learning and deep learning.

Deep learning involves a great deal of math, and we need ways to perform that math with the lowest possible time and space complexity. One key technique is parallel computation: we replace explicit for loops with matrix multiplications that run in parallel across multiple processor cores or GPUs, which greatly increases efficiency.

Data is the most important part of any machine learning or deep learning problem. From data loading to prediction, every step uses one data structure or another to keep the time complexity as low as possible. Data loaders, in particular, need to be efficient so that they occupy less RAM. In memory, the data is stored in arrays, and the computation and math happen on matrices. Imagine if we did not have these data structures! We would be writing loops for even the most basic and frequent operations, with a single pass over the data costing O(N) and nested for loops costing far more, for instance O(N³) for a naive matrix multiplication. Clever use of these data structures lets vectorized, parallel hardware do the same work in a small fraction of the time.

Many people treat the algorithms used in machine learning as a black box: given some input data, it gives the required output. They have mastered the art of using this black box, but they don't know what goes on within it. To improve the technology, we need to explore this black box and understand what produces the output so accurately. So, today we will see an example where the tree data structure is used as part of an efficient algorithm implementation, followed by a brief discussion of graph-based models as well.
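
To make this concrete, here is a minimal sketch using NumPy (the array library that the frameworks named above build on). It compares a naive triple-loop matrix multiplication with the vectorized equivalent; the matrix sizes and timing code are illustrative choices of mine, not from any particular library internals:

```python
import time
import numpy as np

def matmul_loops(A, B):
    """Naive O(n^3) matrix multiplication using nested Python loops."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((150, 150))
B = rng.standard_normal((150, 150))

t0 = time.perf_counter()
C_loops = matmul_loops(A, B)
t1 = time.perf_counter()
C_vec = A @ B  # same math, dispatched to optimized, parallel BLAS code
t2 = time.perf_counter()

print(f"loops:      {t1 - t0:.3f} s")
print(f"vectorized: {t2 - t1:.5f} s")
print("results match:", np.allclose(C_loops, C_vec))
```

The two results are numerically identical; only the way the work is organized and dispatched to the hardware differs.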

Machine learning and deep learning rely on a whole toolbox of data structures, such as arrays, vectors, matrices, linked lists, binary trees, graphs, stacks, queues, hash tables, and sets, along with algorithmic techniques such as dynamic programming, greedy algorithms, and randomized algorithms. Without these data structures and algorithms, we would have no efficient means of storage, and implementation would become impractical. Moreover, we usually combine more than one data structure to implement a complex algorithm, because we want our models to be deployable and therefore as efficient as possible. Arrays are used almost everywhere in machine learning: they store data values while preserving order by index, and they underlie matrix multiplication, one of the most important operations in deep learning. Linked lists offer constant-time insertion and deletion at a known position, which makes them handy for collections that grow and shrink. Stacks support recursive computations such as tree traversals. Queues are used where multiple batches of data must be processed in order, and priority queues let us decide which operation should take place first. Trees and graphs are especially important in machine learning: a tree's nodes and branches give us categories and subcategories of data points.
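
As an illustrative sketch (the tiny dataset and names here are invented for this post), several of these structures often appear together in a single data pipeline:

```python
from collections import deque
import heapq
import numpy as np

# Array / matrix: features stored contiguously, order preserved by index
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])

# Hash table (dict): constant-time label encoding
label_to_id = {"setosa": 0, "versicolor": 1}

# Queue: batches processed in arrival order
batch_queue = deque([X[0:2], X[2:3]])
while batch_queue:
    batch = batch_queue.popleft()
    print("processing batch of shape", batch.shape)

# Priority queue (min-heap): try the candidate with the lowest loss first
candidates = [(0.42, "model_a"), (0.17, "model_b"), (0.33, "model_c")]
heapq.heapify(candidates)
best_loss, best_model = heapq.heappop(candidates)
print("try first:", best_model, "with loss", best_loss)
```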

Decision Tree

A decision tree is a supervised learning algorithm that is used for both regression and classification tasks. It is implemented as a tree, with a root node, branches, internal nodes, and leaf nodes arranged in a hierarchical structure. Given a data point, the tree, based on specific parameters and rules governed by the features, routes the point along a particular path through the structure to give us the output at the end. Let's explore how this happens!
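
Before digging into the mechanics, here is a minimal, runnable example using scikit-learn, one of the libraries mentioned at the top of this post (the dataset and hyperparameters are my illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A small classification dataset bundled with scikit-learn
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" makes the tree split on information gain,
# the measure discussed below
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("predicted class for first test point:", clf.predict(X_test[:1]))
```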

A decision tree uses a divide-and-conquer strategy to identify the optimal split points in the tree. The splitting is then repeated recursively from top to bottom until a decision is reached, i.e., the data point has been classified. So, basically, a decision tree is a tree with a root node, followed by decision nodes that test the features, and finally leaf nodes that represent the consequences of those decisions. The tree-building algorithm is recursive. Initially, the whole dataset is considered to be the root node. The algorithm then goes through the unused attributes and calculates the entropy and the information gain of each candidate split, choosing the split with the lowest resulting entropy and the highest information gain to create the subnodes. Information gain (based on entropy) and the Gini index are the usual splitting criteria for classification, where the target is categorical. For regression, we instead use criteria such as variance reduction, i.e., minimizing the mean squared error.

Entropy(S) = -p log2(p) - n log2(n)

where p is the proportion of data points in S labeled 'yes' and n is the proportion labeled 'no.'
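
Here is a short sketch of these two quantities in NumPy (the function names are mine, chosen for this post):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(labels, left_mask):
    """Entropy reduction from splitting `labels` by a boolean mask."""
    left, right = labels[left_mask], labels[~left_mask]
    w_left = len(left) / len(labels)
    w_right = len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
feature = np.array([1, 1, 0, 0, 1, 1])

print("entropy:", entropy(y))                      # 1.0 bit (perfectly balanced)
print("gain:", information_gain(y, feature == 1))  # higher means a better split
```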

So, since the decision tree is built on the tree data structure, it helps us solve classification problems as well as regression problems, linear or non-linear, quite effectively. Without a tree data structure, it would be very difficult to represent the data and the decision rules, and both the time and the space complexity of the algorithm would suffer.

Graph-based models

Similarly, graphs have nodes and edges. We use graph-based models when the data points are interlinked and stand in certain relationships to one another. Nodes together with edges give us much more control over the links and connections among data points. Graphs add another valuable piece of information to a machine learning model: the relationships between data points. The node carries the information about the data point itself, while the edge captures the information about the relationship. Since graphs encode both the data points and the links between them, they can be used for supervised and unsupervised learning, and for semi-supervised learning as well. Machine learning on graphs typically involves three tasks: node embedding, node classification, and link prediction. Node embedding creates a vector for each node that encodes the node together with its neighborhood; this vector is then passed into a model as a feature vector. Sometimes, instead of passing a feature vector into a model, we classify data points directly based on their links; this is called node classification. Link prediction is used when we want to predict whether, or how strongly, two nodes are related.
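
As a minimal sketch of the node embedding idea (the tiny graph and feature values are invented for illustration), one round of neighbor averaging mixes each node's features with those of its neighbors, which is the core step that graph-based models build on:

```python
import numpy as np

# Adjacency matrix of a tiny undirected 4-node graph:
# edges 0-1, 0-2, 1-2, 2-3
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# One feature vector per node (rows = nodes)
X = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.1, 0.9],
    [0.0, 1.0],
])

# Add self-loops, then normalize rows so each node averages over
# itself and its neighbors (one step of message passing)
A_hat = A + np.eye(len(A))
D_inv = np.diag(1.0 / A_hat.sum(axis=1))
embeddings = D_inv @ A_hat @ X

print(embeddings)  # each row now reflects the node's neighborhood
```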

It is rightly said that in order to understand any topic, you need to look at its roots. So here we took a look at how different data points can be arranged and fed into a machine learning or deep learning model.
