Have you ever used Python libraries like scikit-learn, TensorFlow, Keras, or PyTorch? Ever wondered what lies behind the one line of code that initializes a model, or how the data is stored and processed inside it? Today, we will explore the data structures used to implement different machine learning models and see why they matter in machine learning and deep learning.
Deep learning involves a great deal of math, and we need methods to perform it with the lowest time and space complexity possible. One key technique is parallel computation: replacing explicit for loops with matrix multiplications that run in parallel across multiple processor cores or GPUs, which greatly increases efficiency.

Data is the most important part of any machine learning or deep learning problem. From data loading to prediction, every step relies on one data structure or another to achieve the best possible time complexity, and data loaders in particular must be efficient so that they occupy as little RAM as possible. In memory, the data lives in arrays, and the computation and math happen on matrices. Imagine if we did not have these data structures: even the most basic and frequent operation, multiplying the data, would need an explicit O(N) loop, and nested loops would be worse still. With arrays and matrices, a multiplication that would otherwise take O(N^3) sequential steps in plain Python is dispatched as a single, highly optimized parallel operation.

Many people treat the algorithms used in machine learning as a black box: given some input data, it produces the required output. They have mastered the art of using this black box, but they don't know what goes on within it. To improve the technology, we need to open this black box and understand how it produces output so accurately. So, today we will see an example where the tree data structure is used as part of an efficient algorithm implementation, followed by a brief discussion of graph-based models.
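The speed-up from replacing loops with matrix operations is easy to demonstrate. Below is a minimal sketch comparing a naive triple-loop matrix multiplication in pure Python with NumPy's vectorized `@` operator, which dispatches the same arithmetic to an optimized (and often multi-threaded) BLAS routine; the matrix size here is just an illustrative choice.

```python
import numpy as np

# Two illustrative 100x100 matrices, standing in for model weights and inputs.
n = 100
a = np.random.rand(n, n)
b = np.random.rand(n, n)

def matmul_loops(a, b):
    """Naive O(n^3) matrix multiplication: every scalar step runs in Python."""
    out = np.zeros((a.shape[0], b.shape[1]))
    for i in range(a.shape[0]):
        for j in range(b.shape[1]):
            for k in range(a.shape[1]):
                out[i, j] += a[i, k] * b[k, j]
    return out

# Vectorized call: the same arithmetic in a single optimized operation.
slow = matmul_loops(a, b)
fast = a @ b

print(np.allclose(slow, fast))  # → True: identical results, vastly different speed
```

Timing the two versions (e.g. with `time.perf_counter`) typically shows the vectorized call running orders of magnitude faster, even though both perform the same number of scalar multiplications.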
Machine learning and deep learning cannot be done without data structures such as arrays, vectors, matrices, linked lists, binary trees, graphs, stacks, queues, hash tables, and sets, together with algorithmic techniques like dynamic programming, greedy algorithms, and randomized algorithms. In the absence of these structures and algorithms we would have no efficient means of storage, and implementation would become impractical. Moreover, we often combine more than one data structure when implementing complex algorithms, because we want our models to be deployable and therefore as efficient as possible.

Arrays are used almost everywhere in machine learning. They store data values while preserving order by index, and they underlie the matrix multiplications that are among the most important steps in deep learning. Stacks appear in depth-first traversals, such as walking a decision tree. Dynamic lists built on linked lists offer constant-time insertion and deletion. Queues are used where multiple arrays are being processed, and priority queues decide the order in which operations should take place. Trees and graphs are especially important in machine learning: a tree's nodes and branches give us categories and subcategories of data points.
Decision Tree
A decision tree is a supervised learning algorithm used for both regression and classification tasks. It is implemented as a tree consisting of a root node, branches, internal nodes, and leaf nodes, giving it a hierarchical structure. Given a data point, rules based on specific feature values govern the path the point takes through the tree until it reaches an output at a leaf. Let's explore how this happens!
A decision tree uses a divide-and-conquer strategy to identify the optimal split points in the tree. This splitting is repeated recursively from top to bottom until a decision is reached, i.e., the data point has been classified. So, the decision tree has a root node, followed by decision nodes that test a feature and depict a decision, and finally leaf nodes that represent the consequence of those decisions. The tree-building algorithm is recursive: initially, the whole dataset is considered the root node; then, for each unused attribute, it calculates the entropy and the information gain of the candidate split. The split that yields the lowest entropy, and hence the highest information gain, becomes the next set of subnodes. Information gain (or the related Gini index) is used as the criterion for categorical targets, that is, in classification; for regression, criteria such as variance reduction or mean squared error are used instead.
Entropy = -p log2(p) - n log2(n)

where p is the proportion of data points labeled 'yes' and n is the proportion labeled 'no.'
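These two quantities are straightforward to compute. Below is a minimal sketch, with the function names `entropy` and `information_gain` chosen for illustration, that evaluates the formula above on a small 'yes'/'no' label set and measures how much a split reduces impurity.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

labels = np.array(['yes', 'yes', 'no', 'no'])
print(entropy(labels))                                   # → 1.0 (maximally impure)
print(information_gain(labels, labels[:2], labels[2:]))  # → 1.0 (a perfect split)
```

A 50/50 label mix has entropy 1 bit, and a split that separates the classes completely recovers the full bit of information, which is exactly the kind of split the tree-building algorithm prefers.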
So, being built on the tree data structure, the decision tree lets us solve linear and non-linear regression problems, as well as classification problems, quite effectively. Without a tree data structure it would be very difficult to represent the decision rules, and both the time complexity and the space complexity of the algorithm would suffer.
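In practice, this is the tree structure hiding behind that one line of model-initialization code mentioned at the start. Here is a minimal sketch using scikit-learn's `DecisionTreeClassifier`; the iris dataset and the hyperparameters are illustrative choices, not requirements.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" makes each split maximize information gain,
# exactly the recursive procedure described above.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

Calling `sklearn.tree.plot_tree(clf)` will draw the fitted tree, making the root node, decision nodes, and leaf nodes described above directly visible.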
Graph-based models
Similarly, graphs have nodes and edges. We use graph-based models when the data is interlinked and the data points have relationships with one another: the nodes represent the data points, while the edges capture the relationships between them. Those edges give us ever finer control over the links and connections among data points, and they add another valuable piece of information to the machine learning model. Because graphs encode both the data points and the links between them, they can be used for supervised, unsupervised, and semi-supervised learning.

Machine learning with graphs centers on three tasks: node embedding, node classification, and link prediction. Node embedding creates a vector for each node that encodes its position among its neighbors; this vector is then passed into the model as a feature vector. Node classification instead classifies the data points directly, based on their links, without first passing a feature vector into a model. Link prediction is used when we want to predict whether two data points, or nodes, are related, based on their similarity.
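The semi-supervised node-classification task can be illustrated with a toy example. Below is a minimal sketch, not a production graph model, that builds a small adjacency matrix by hand and uses a simple majority-vote label propagation: each unlabeled node repeatedly adopts the most common label among its already-labeled neighbors. The graph and labels here are invented for illustration.

```python
import numpy as np

# Toy undirected graph of 6 nodes: two triangles (0-1-2 and 3-4-5)
# joined by a single bridge edge between nodes 2 and 3.
A = np.zeros((6, 6), dtype=int)
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
for i, j in edges:
    A[i, j] = A[j, i] = 1  # symmetric: the graph is undirected

# Semi-supervised setup: only four nodes carry labels; -1 means unknown.
labels = np.array([0, 0, -1, -1, 1, 1])

# Simple label propagation by neighborhood majority vote.
for _ in range(10):
    for node in range(6):
        if labels[node] == -1:
            neighbor_labels = labels[A[node] == 1]
            known = neighbor_labels[neighbor_labels != -1]
            if len(known):
                labels[node] = np.bincount(known).argmax()

print(labels)  # → [0 0 0 1 1 1]: each unlabeled node joins its own cluster
```

Each triangle's unlabeled node ends up with its cluster's label, showing how the edge structure alone, i.e., the relationship information, is enough to classify the unknown nodes.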