Dealing With Sparse Datasets In Machine Learning


 This article was published as a part of the Data Science Blogathon.

Introduction

Missing data in machine learning is a type of data that contains null values, whereas sparse data is a type of data that does not contain actual values for most of its features; instead, it contains a high proportion of zero values.

Sparse datasets with many zero values can cause problems like over-fitting in machine learning models, along with several other issues. That is why dealing with sparse data is one of the most hectic processes in machine learning.

Most of the time, sparsity in a dataset is not a good fit for machine learning problems and should be handled properly. Still, sparsity in a dataset is good in some cases, as it reduces the memory footprint of regular networks so they fit on mobile devices and shortens training time for ever-growing networks in deep learning.

In the image above, we can see a dataset with a high number of zeros, meaning that the dataset is sparse. Most of the time, this kind of sparsity is observed while working with a one-hot encoder, due to the way one-hot encoding works.
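As a quick illustration (my own sketch, not code from the original article; the column name and values below are made up), one-hot encoding even a small categorical column shows how quickly the zeros pile up:

import pandas as pd

df = pd.DataFrame({"city": ["delhi", "mumbai", "pune", "delhi", "chennai"]})
encoded = pd.get_dummies(df["city"], dtype=int)   # one 0/1 column per category
print(encoded)
# every row holds a single 1 and zeros everywhere else, so the encoded
# matrix becomes increasingly sparse as the number of categories grows
print("fraction of zeros:", (encoded == 0).to_numpy().mean())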

The Need For Sparse Data Handling

Sparse datasets cause several problems while training machine learning models, which is why they should be handled properly.

Common problems with sparse data are:

1. Over-fitting: 

If too many features are included in the training data, then while training, the model tends to follow every pattern of the training data. This results in higher accuracy on the training data but lower performance on the testing dataset.

In the above image, we can see that the model is over-fitted on the training data and tries to follow or mimic every trend of the training data. This will result in lower performance of the model on testing or unknown data.

2. Avoiding Important Data:

Some machine learning algorithms ignore the importance of sparse features and tend to train and fit only on the dense parts of the dataset, without making use of the sparse features.

The ignored sparse features can still carry predictive power and useful information, which the algorithm then neglects. So ignoring sparse data is not always the better approach.

3. Space Complexity 

If a dataset has sparse features, it takes more space to store than the equivalent dense information, so the space complexity increases. Because of this, higher computational power is needed to work with this type of data.
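To make the storage point concrete, here is a rough sketch (an assumed example, not from the article) comparing the memory footprint of a mostly-zero matrix stored densely versus in SciPy's compressed sparse row (CSR) format:

import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
dense = np.zeros((1000, 1000))
rows, cols = rng.integers(0, 1000, 10000), rng.integers(0, 1000, 10000)
dense[rows, cols] = rng.random(10000)        # roughly 1% of entries are non-zero

csr = sparse.csr_matrix(dense)
print("dense storage :", dense.nbytes, "bytes")
print("sparse storage:", csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes, "bytes")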

4. Time Complexity

If the dataset is sparse, training the model takes more time than training on a dense dataset, since the size of the sparse dataset is also larger.

5. Change in Behavior of the algorithms

Some algorithms perform poorly when trained on sparse datasets. Logistic Regression, for example, can show a flawed best-fit line when trained on a sparse dataset.

Ways to Deal with Sparse Datasets

As discussed above, sparse datasets can be proven bad for training a machine learning model and should be handled properly. There are several ways to deal with sparse datasets.

1. Convert the feature to dense from sparse

It is always good to have dense features in the dataset while training a machine learning model. If the dataset has sparse data, it would be a better approach to convert it to dense features.

There are several ways to make the features dense:

1. Use Principal Component Analysis:

PCA is a dimensionality reduction method used to reduce the dimension of the dataset and select important features only in the output.

Example:



Implementing PCA on the dataset

# Implementing PCA: df is assumed to be a DataFrame containing the feature columns plus a 'label' column
import pandas as pd
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
principalComponents = pca.fit_transform(df.drop('label', axis=1))   # fit PCA on the features only
pca_df = pd.DataFrame(data=principalComponents,
                      columns=['principal component 1', 'principal component 2'])
df = pd.concat([pca_df, df[['label']]], axis=1)

2. Use Feature Hashing:

Feature hashing is a technique used on sparse datasets in which the features are hashed into a fixed, smaller number of output columns.

from sklearn.feature_extraction import FeatureHasher

# hash dictionaries of token counts into 10 output columns
h = FeatureHasher(n_features=10)
p = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
f = h.transform(p)
f.toarray()

Output:

array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

3. Perform Feature Selection and Feature Extraction

4. Use t-Distributed Stochastic Neighbor Embedding (t-SNE)

5. Use low variance filter
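As a minimal illustration of the last item (my own sketch; the article only names the technique), a low variance filter can be as simple as dropping the columns whose variance falls below a chosen threshold:

import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 0, 0, 1],    # mostly zeros, very low variance
    "b": [3, 7, 2, 9, 4],
})
threshold = 0.2
df_filtered = df.loc[:, df.var() > threshold]
print(df_filtered.columns.tolist())   # ['b']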

2. Remove the features from the model

This is one of the easiest and quickest methods for handling sparse datasets. It involves removing some of the features from the dataset that are not important for model training.

However, it should be noted that sparse features can also contain useful and important information; removing them from the dataset can lower the model's performance or accuracy.

Dropping a whole column having sparse data:

import pandas as pd

# df is assumed to be the DataFrame that contains the sparse column
df = df.drop(['SparseColumnName'], axis=1)

Converting a column having a sparse datatype to dense:

import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0])})
df = df.sparse.to_dense()   # convert the sparse-dtype column back to a dense dtype
print(df)

3. Use methods that are not affected by sparse datasets

Some of the machine learning models are robust to the sparse dataset, and the behavior of the models is not affected by the sparse datasets. This approach can be used if there is no restriction to using these algorithms.

For example, the standard k-means algorithm is affected by sparse datasets and performs poorly, resulting in lower accuracy, while the entropy-weighted k-means algorithm is not affected by sparse data and gives reliable results. So it can be used when dealing with sparse datasets.

Conclusion

Sparse data in machine learning is a widespread problem, especially when working with one-hot encoding. Because of the problems it causes (over-fitting, lower model performance, etc.), handling this type of data properly is recommended for better model building and higher performance of machine learning models.

Some Key Insights from this blog are:

1. Sparse data is completely different from missing data. It is a form of data that contains a high amount of zero values.

2. The sparse data should be handled properly to avoid problems like time and space complexity, lower performance of the models, over-fitting, etc.

3. Dimensionality reduction, converting the sparse features into dense features and using algorithms like entropy-weighted k means, which are robust to sparsity, can be the solution while dealing with sparse datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Hyperparameters In Machine Learning Explained

Machine learning provides various concepts for improving a learning model, and hyperparameters are one of the important ones. They are generally classified as model hyperparameters because they cannot be inferred while fitting the machine to the training set; they refer to the model selection task. In deep learning and machine learning, hyperparameters are the variables that you need to set before applying a learning algorithm to a dataset.

What are Hyperparameters?

Hyperparameters are parameters that are explicitly defined by the user to improve the learning model and control the process of training the machine. Their values are set before the learning process of the model begins, which means they cannot be changed during training. Hyperparameters make it easier to control over-fitting of the training set and provide an optimal way to control the learning process.

Hyperparameters are externally applied to the training process and their values cannot be changed during the process. Most of the time, people get confused between parameters and hyperparameters used in the learning process. But parameters and hyperparameters are different in various aspects. Let us have a brief look over the differences between parameters and hyperparameters in the below section.

Parameters Vs Hyperparameters

These terms are often misunderstood by users, but hyperparameters and parameters are very different from each other. The differences are described below −

Model parameters are the variables that are learned from the training data by the model itself. On the other hand, hyperparameters are set by the user before training the model.

The values of model parameters are learned during the process whereas, the values of hyperparameters cannot be learned or changed during the learning process.

A trained model has a fixed number of parameters, which are saved as part of the model, whereas hyperparameters are not part of the trained model, so their values are not saved.
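As a small illustration of this distinction (my own sketch, not part of the original article), scikit-learn's LogisticRegression makes it visible: C is a hyperparameter chosen by the user before training, while coef_ and intercept_ are parameters learned from the data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(C=0.5)   # hyperparameter: set before fit(), fixed during training
clf.fit(X, y)

print("hyperparameter C  :", clf.C)
print("learned parameters:", clf.coef_, clf.intercept_)   # parameters: estimated from the data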

Classification of Hyperparameters

Hyperparameters are broadly classified into two categories. They are explained below −

Hyperparameter for Optimization

The hyperparameters that are used for the enhancement of the learning model are known as hyperparameters for optimization. The most important optimization hyperparameters are given below −

Learning Rate − The learning rate hyperparameter decides how strongly new updates override what the model has already learned from the data. If the learning rate is set too high, the model will be unable to optimize properly and may skip over the minima. Alternatively, if the learning rate is set very low, convergence will be very slow, which can make training and validating the learning model problematic.

Batch Size − The optimization of a learning model depends on several hyperparameters, and batch size is one of them. The batch method speeds up the learning process by dividing the training data into smaller batches; the model is trained on each batch in turn and evaluated, and its values are adjusted after every batch. Batch size affects factors like memory and time. If you increase the batch size, more memory is required and each update takes longer to compute. In the same manner, a very small batch size lowers performance and leads to more noise in the error calculation.

Number of Epochs − An epoch in machine learning is a hyperparameter that specifies one complete pass of the training data through the learning algorithm. The number of epochs is a major hyperparameter for training and is always an integer value. Epochs play a major role in learning processes where a trial-and-error procedure must be repeated, and validation error can be controlled by choosing the number of epochs well, often in combination with early stopping.
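As a hedged sketch of how these optimization hyperparameters appear in code (my example, not the article's), scikit-learn's MLPClassifier exposes all three as constructor arguments:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = MLPClassifier(
    learning_rate_init=0.01,   # learning rate
    batch_size=32,             # batch size
    max_iter=50,               # maximum number of epochs (passes over the training data)
    random_state=0,
)
clf.fit(X, y)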

Hyperparameter for Specific Models

Number of Hidden Units − Deep learning models contain hidden layers of neurons in their neural networks, and these must be defined to set the learning capacity of the model. The hyperparameter used to control this capacity is the number of hidden units. It should be large enough for the functions the model has to learn, but not so large that the model over-fits.

Number of Layers − Models with more layers can give better performance than models with fewer layers. Adding layers can enhance performance, as it makes the training model more expressive and reliable.

Conclusion

Hyperparameters are those parameters that are externally defined by machine learning engineers to improve the learning model.

Hyperparameters control the process of training the machine.

Parameters and hyperparameters are terms that sound similar but they differ in nature and performance completely.

Parameters are the variables that can be changed during the learning process but hyperparameters are externally applied to the training process and their values cannot be changed during the process.

There are various types of hyperparameters, and choosing them well enhances the performance of the learning model and helps produce more reliable models.

How Machine Learning Is Transforming Healthcare In India

The integration of machine learning in the healthcare industry of India is set to transform conventional methods

Healthcare has become one of the biggest sectors in India’s economy. According to a report from NITI Aayog, the sector has grown at a compound annual growth rate (CAGR) of 22%. Millions of jobs have been created, with millions more to come. How can a country short on trained clinical resources, with vast inequities in care distribution, grow at this pace? Machine learning is one way to help close the gaps.

Solving the problem: too much raw data, too few real insights

Healthcare settings are flooded with unprecedented volumes of complex data from clinicians’ notes, medical devices, labs, and more. Remote patient wearables are increasingly adding to the onslaught. Electronic health records are helping digitize the information, but their job is not to ease the administrative workload on the front end or provide at-a-glance decision support. All the data coming in is only as valuable as the insights that can be quickly gleaned from it and appropriately actioned to improve healthcare delivery. Machine learning can make that possible, especially for digitized data sets with clear patterns. Machine learning not only collects but also unifies data from disparate sources. It can perform the complex calculations required for doctors, nurses, and other members of the healthcare team to make quick sense of raw physiological, behavioral, and imaging information.

Automation of manual tasks

Machine learning reduces the workload of physicians, radiologists, pathologists, and other providers by employing algorithms to garner insights. Automated workflows designed around how healthcare teams work in the real world are often used in tandem for easy information sharing and collaboration. Typical applications include:

Imaging analysis leveraging widely available data sets

Precise patient monitoring in the ICU or OR

Real-time remote patient monitoring through wearables that track heart rate, activity level, and more

Streamlining tedious administrative tasks like clinical documentation

Powerful predictive capabilities 

Precise predictive analysis of what a given patient will likely need next has historically been stopped by two barriers: the burden of collecting data and the difficulty of calculation. With machine learning, data collection speed and calculation complexity no longer depend on what humans can do by hand.  Using these powerful algorithms, one can imagine treatment decisions tailored to each patient’s specific situation and better outcomes as a result.  

Digital transformation: what to expect next

India is poised for an exciting digital transformation in healthcare. The penetration of machine learning and other innovative technologies, including automation and other AI techniques like natural language processing, is surging—with 5G coming soon. A vibrant ecosystem of startup and established health-tech companies is now in-country, with a rising population to fill new roles. Healthcare providers have gained a greater awareness of tech-enabled ways to do more with less manual effort. The government has stepped up with increased spending on evolving healthcare delivery, and the general public is in support.  

Government’s mission is to transform the healthcare infrastructure

Since 2020, due to the Covid-19 pandemic, there has been a huge focus by the government on investing in India’s healthcare infrastructure. This has also enabled technology firms to dive into the healthcare segment and innovate to contribute to the improvement of healthcare facilities in the country. Under the Digital India initiative, the government has announced the launch of the Ayushman Bharat Digital Mission (ABDM), which aims at creating India’s digital health ecosystem. The initiative focuses on creating digital health records for citizens and their families to access and share digitally. Under this mission, citizens receive a randomly generated 14-digit number used for uniquely identifying persons, authenticating them, and threading their health records, only with their informed consent, across multiple systems and stakeholders.

Moreover, inclusion is one of the key principles of ABDM. The digital health ecosystem created by ABDM supports continuity of care across primary, secondary, and tertiary healthcare in a seamless manner. It aids the availability of healthcare services, particularly in remote and rural areas, through technology interventions like telemedicine.

Digital health start-ups in India provide a vast backdrop for solutions, with the government’s push to strengthen the digital healthcare infrastructure. The start-up landscape within the Indian healthcare ecosystem goes well beyond a specific disease, therapeutic area, geography, type of product, service, or business model. In a country where access to affordable healthcare is still a looming issue, the public stands to gain immensely from the development of the digital health industry. The ABDM is a one-of-a-kind strategy to unify the healthcare system in India and promote innovation in the industry. With the public interest in the minds of both the government and the innovators, it remains to be seen how digital health will be perceived in law. While there is a long way to go, the use of AI and ML has gained a strong foothold in India over the past year, and we foresee a promising future for the industry.

Author:

Punit Soni, Founder & CEO, Suki

What Role Does Machine Learning Play In Biotechnology?

ML is changing biological research. This has led to new discoveries in biotechnology and healthcare.

Machine Learning and Artificial Intelligence are changing the way people live and work. These fields have been both praised and criticized. AI and ML, as they are commonly known, have many applications and benefits across a wide variety of industries. They are changing biological research and resulting in new discoveries in biotechnology and healthcare.

What are the Applications of Machine Learning in Biotechnology?

Here are some use cases of ML in biotech:

Identifying Gene Coding Regions

Next-generation sequencing is a fast and efficient way to study genomics, and machine-learning approaches are now being used to discover gene coding regions in a genome. These machine-learning-based gene prediction techniques are more sensitive than traditional homology-based sequence analysis.

Structure Prediction

Protein–protein interaction (PPI) has been mentioned in the context of proteomics before. ML has improved structure prediction accuracy from more than 70% to over 80%. Text mining also has great potential: training sets built from journal articles and secondary databases can be used to identify new or unusual pharmacological targets.


Neural Networks

Deep learning, an extension of neural networks, is a relatively recent topic in ML. Deep learning refers to the number of layers through which data is transformed; it is therefore analogous to a multilayer neural structure. Multi-layer networks of nodes simulate the brain’s workings to help solve problems. ML already uses neural networks, and neural-network-based ML algorithms need to be able to analyze raw data. With the increasing amount of information generated by genome sequencing, it is becoming more difficult to analyze such significant volumes of data. Multiple layers of neural networks filter information and interact with one another, which allows for refined output.

Mental Illness

AI in Healthcare

Final Thoughts

Every business sector and industry has been affected by digitization, and these effects are not limited to the biotech, healthcare, and biology industries. Companies are looking for ways to combine their operations and to exchange and transmit data faster and more efficiently. Bioinformatics and biomedicine have struggled for years with processing biological data.

Feature Selection Techniques In Machine Learning (Updated 2023)

Introduction

As a data scientist working with Python, it’s crucial to understand the importance of feature selection when building a machine learning model. In real-life data science problems, it’s almost rare that all the variables in the dataset are useful for building a model. Adding redundant variables reduces the model’s generalization capability and may also reduce the overall accuracy of a classifier. Furthermore, adding more variables to a model increases the overall complexity of the model.

As per the Law of Parsimony of ‘Occam’s Razor’, the best explanation of a problem is that which involves the fewest possible assumptions. Thus, feature selection becomes an indispensable part of building machine learning models.

Learning Objectives:

Understanding the importance of feature selection.

Familiarizing with different feature selection techniques.

Applying feature selection techniques in practice and evaluating performance.

What Is Feature Selection in Machine Learning?

The goal of feature selection techniques in machine learning is to find the best set of features that allows one to build optimized models of studied phenomena.

The techniques for feature selection in machine learning can be broadly classified into the following categories:

Supervised Techniques: These techniques can be used for labeled data and to identify the relevant features for increasing the efficiency of supervised models like classification and regression. For Example- linear regression, decision tree, SVM, etc.

Unsupervised Techniques: These techniques can be used for unlabeled data. For Example- K-Means Clustering, Principal Component Analysis, Hierarchical Clustering, etc.

From a taxonomic point of view, these techniques are classified into filter, wrapper, embedded, and hybrid methods.

Now, let’s discuss some of these popular machine learning feature selection methods in detail.

Types of Feature Selection Methods in ML

Filter Methods

Filter methods pick up the intrinsic properties of the features measured via univariate statistics instead of cross-validation performance. These methods are faster and less computationally expensive than wrapper methods. When dealing with high-dimensional data, it is computationally cheaper to use filter methods.

Let’s discuss some of these techniques:

Information Gain

Information gain calculates the reduction in entropy from the transformation of a dataset. It can be used for feature selection by evaluating the Information gain of each variable in the context of the target variable.
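A minimal sketch of how this is often done in practice (assumed usage, not code from the article): scikit-learn's mutual_info_classif estimates the mutual information, an information-gain-style score, between each feature and the target.

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(load_iris().feature_names, scores.round(3))))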

Chi-square Test

The Chi-square test is used for categorical features in a dataset. We calculate Chi-square between each feature and the target and select the desired number of features with the best Chi-square scores. In order to correctly apply the chi-square test to assess the relation between various features in the dataset and the target variable, the following conditions have to be met: the variables have to be categorical and sampled independently, and values should have an expected frequency greater than 5.
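A short sketch of one common way to apply this (assumed usage, not code from the article): SelectKBest with the chi2 scoring function keeps the features with the highest Chi-square statistic. Note that chi2 requires non-negative feature values.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)   # (150, 2): only the two best-scoring features remain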

Fisher's Score

Fisher's score is one of the most widely used supervised feature selection methods. The algorithm returns the ranks of the variables based on the Fisher's score in descending order, and we can then select the variables as required.

Correlation Coefficient

Correlation is a measure of the linear relationship between two or more variables. Through correlation, we can predict one variable from the other. The logic behind using correlation for feature selection is that good variables correlate highly with the target. Furthermore, variables should be correlated with the target but uncorrelated among themselves.

If two variables are correlated, we can predict one from the other. Therefore, if two features are correlated, the model only needs one, as the second does not add additional information. We will use the Pearson Correlation here.

We need to set an absolute value, say 0.5, as the threshold for selecting the variables. If we find that two predictor variables are correlated with each other, we can drop the one that has the lower correlation coefficient with the target variable. We can also compute multiple correlation coefficients to check whether more than two variables are correlated with one another; this phenomenon is known as multicollinearity.

Variance Threshold

The variance threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features, i.e., features with the same value in all samples. We assume that features with a higher variance may contain more useful information, but note that we are not taking the relationship between feature variables or feature and target variables into account, which is one of the drawbacks of filter methods.

The get_support returns a Boolean vector where True means the variable does not have zero variance.
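The code that this sentence refers to is not reproduced in this copy of the article, but a hedged sketch of how VarianceThreshold and get_support are typically used looks like this:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

selector = VarianceThreshold()    # default threshold of 0.0 removes constant features
selector.fit(X)
print(selector.get_support())     # [False  True  True False]
print(selector.transform(X))      # only the non-constant columns remain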

Mean Absolute Difference (MAD)

‘The mean absolute difference (MAD) computes the absolute difference from the mean value. The main difference between the variance and MAD measures is the absence of the square in the latter. The MAD, like the variance, is also a scaled variant.’ [1] This means that the higher the MAD, the higher the discriminatory power.

‘Another measure of dispersion applies the arithmetic mean (AM) and the geometric mean (GM). For a given (positive) feature Xi on n patterns, the AM and GM are given by

AMi = (Xi1 + Xi2 + … + Xin) / n  and  GMi = (Xi1 · Xi2 · … · Xin)^(1/n)

respectively; since AMi ≥ GMi, with equality holding if and only if Xi1 = Xi2 = … = Xin, the ratio Ri = AMi / GMi can be used as a dispersion measure, with a higher value indicating a more relevant feature.’

Wrapper Methods

Wrappers require some method to search the space of all possible subsets of features, assessing their quality by learning and evaluating a classifier with that feature subset. The feature selection process is based on a specific machine learning algorithm we are trying to fit on a given dataset. It follows a greedy search approach by evaluating all the possible combinations of features against the evaluation criterion. The wrapper methods usually result in better predictive accuracy than filter methods.

Let’s discuss some of these techniques:

Forward Feature Selection

This is an iterative method in which we start by selecting the best-performing single feature against the target. Next, we select another variable that gives the best performance in combination with the first selected variable. This process continues until the preset criterion is achieved.

Backward Feature Elimination

This method works exactly opposite to the Forward Feature Selection method. Here, we start with all the features available and build a model. Next, we remove the variable from the model whose removal gives the best evaluation measure value. This process is continued until the preset criterion is achieved.

This method, along with the one discussed above, is also known as the Sequential Feature Selection method.
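A hedged sketch of how this is commonly done (assumed usage, not the article's code): scikit-learn's SequentialFeatureSelector implements both the forward and the backward variants described above.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",      # use "backward" for backward feature elimination
)
sfs.fit(X, y)
print(sfs.get_support())      # Boolean mask of the selected features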

Exhaustive Feature Selection

This is the most robust feature selection method covered so far. This is a brute-force evaluation of each feature subset. This means it tries every possible combination of the variables and returns the best-performing subset.

Embedded Methods

These methods encompass the benefits of both the wrapper and filter methods by including interactions of features while maintaining reasonable computational costs. Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract the features that contribute the most to the training in that iteration.

Let’s discuss some of these techniques here:

LASSO Regularization (L1)

Regularization consists of adding a penalty to the different parameters of the machine learning model to reduce the freedom of the model, i.e., to avoid over-fitting. In linear model regularization, the penalty is applied over the coefficients that multiply each predictor. Among the different types of regularization, Lasso or L1 has the property of shrinking some of the coefficients to zero. Therefore, those features can be removed from the model.
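A minimal sketch of how this is often wired together (assumed usage, not the article's code; the alpha value is arbitrary): SelectFromModel keeps only the features whose Lasso (L1) coefficients were not shrunk to zero.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
selector = SelectFromModel(Lasso(alpha=1.0))
selector.fit(X, y)
print("kept features :", selector.get_support())
print("reduced shape :", selector.transform(X).shape)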

Random Forest Importance

Random Forest is a kind of bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the node, or in other words, by the decrease in impurity (Gini impurity) over all trees. Nodes with the greatest decrease in impurity happen at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees. Thus, by pruning trees below a particular node, we can create a subset of the most important features.

Conclusion

We have discussed a few techniques for feature selection. We have purposely left out feature extraction techniques like Principal Component Analysis, Singular Value Decomposition, Linear Discriminant Analysis, etc. These methods help reduce the dimensionality of the data, or reduce the number of variables, while preserving the variance of the data.

Apart from the methods discussed above, there are many other feature selection methods. There are hybrid methods, too, that use both filtering and wrapping techniques. If you wish to explore more about feature selection techniques, great comprehensive reading material, in my opinion, would be ‘Feature Selection for Data and Pattern Recognition’ by Urszula Stańczyk and Lakhmi C. Jain.

Key Takeaways

Understanding the importance of feature selection and feature engineering in building a machine learning model.

Familiarizing with different feature selection techniques, including supervised techniques (Information Gain, Chi-square Test, Fisher’s Score, Correlation Coefficient), unsupervised techniques (Variance Threshold, Mean Absolute Difference, Dispersion Ratio), and their classifications (Filter methods, Wrapper methods, Embedded methods, Hybrid methods).

Evaluating the performance of feature selection techniques in practice through implementation.


Calibration Of Machine Learning Models

This article was published as a part of the Data Science Blogathon.

Introduction

source: iPhone Weather App

A screen image of a weather forecast must be a familiar picture to most of us. The AI model predicting the expected weather forecasts a 40% chance of rain today, a 50% chance on Wednesday, and 50% on Thursday. Here the AI/ML model is talking about the probability of occurrence, which is the interesting part. Now, the question is: is this AI/ML model trustworthy?

As learners of Data Science/Machine Learning, we would have walked through stages where we build various supervised ML models (both classification and regression models). We also look at different model parameters that tell us how well the model performs. One important but probably not so well-understood model reliability parameter is Model Calibration. Calibration tells us how much we can trust a model prediction. This article explores the basics of model calibration and its relevance in the MLOps cycle. Even though model calibration applies to regression models as well, we will exclusively look at classification examples to get a grasp of the basics.

The Need for Model Calibration

Wikipedia amplifies calibration as ” In measurement technology and metrology, calibration is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy. “

A typical classification ML model outputs two important pieces of information. One is the predicted class label (for example, classification of emails as spam or not spam), and the other is the predicted probability. In binary classification, the scikit-learn library gives a method, model.predict_proba(test_data), that gives us the probabilities of the target being 0 and 1 in array form. A model predicting rain can give us a 40% probability of rain and a 60% probability of no rain. We are interested in the uncertainty in the estimate of a classifier. There are typical use cases where the predicted probability of the model is very much of interest to us, such as weather models, fraud detection models, customer churn models, etc. For example, we may be interested in answering the question: what is the probability of this customer repaying the loan?
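A tiny sketch of the predict_proba call mentioned above (my own example model and data, not the article's):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict_proba(X_test[:3]))   # one row per sample: probability of class 0 and class 1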

Let’s say we have an ML model which predicts whether a patient has cancer-based on certain features. The model predicts a particular patient does not have cancer (Good, a happy scenario!).  But if the predicted probability is 40%, then the Doctor may like to conduct some more tests for a certain conclusion. This is a typical scenario where the prediction probability is critical and of immense interest to us. The Model Calibration helps us improve the model’s prediction probability so that the model’s reliability improves. It also helps us to decipher the predicted probability observed from the model predictions. We can’t take for granted that the model is twice as confident when giving a predicted probability of 0.8 against a figure of 0.4.

We also must understand that calibration differs from the model’s accuracy. The model accuracy is defined as the number of correct predictions divided by the total number of predictions made by the model. It is to be clearly understood that we can have an accurate but not calibrated model and vice versa.

If we have a model predicting rain with 80% predicted probability at all times, then if we take data for 100 days and find 80 days are rainy, we can say that model is well calibrated. In other words, calibration attempts to remove bias in the predicted probability.

Consider a scenario where the ML model predicts whether the user who is making a purchase on an e-commerce website will buy another associated item or not. The model predicts the user has a probability of 68% for buying Item A  and an item B probability of 73%. Here we will present Item B to the user(higher predicted probability), and we are not interested in actual figures. In this scenario, we may not insist on strict calibration as it is not so critical to the application.

The following shows details of 3 classifiers (assume that models predict whether an image is a dog image or not). Which of the following model is calibrated and hence reliable?

(a) Model 1 : 90% Accuracy, 0.85 confidence in each prediction

(b) Model 2 : 90% Accuracy, 0.98 confidence in each prediction

(c) Model 3 : 90% Accuracy, 0.91 confidence in each prediction

If we look at the first model, it is underconfident in its predictions, whereas Model 2 seems overconfident. Model 3 seems well calibrated, giving us confidence in the model's ability: it thinks it is correct 91% of the time and is actually correct 90% of the time, which shows good calibration.

Reliability Curves

The model’s calibration can be checked by creating a calibration plot or Reliability Plot. The calibration plot reveals the disparity between the probability predicted by the model and the true class probabilities in the data. If the model is well calibrated, we expect to see a straight line at 45 degrees from the origin (indicative that estimated probability is always the same as empirical probability ).

We will attempt to understand the calibration plot using a toy dataset to concretize our understanding of the subject.

Source: own-document

The resulting probability is divided into multiple bins representing possible ranges of outcomes. For example,  [0-0.1), [0.1-0.2), etc., can be created with 10 bins. For each bin, we calculate the percentage of positive samples. For a well-calibrated model, we expect the percentage to correspond to the bin center. If we take the bin with the interval [0.9-1.0), the bin center is 0.95, and for a well-calibrated model, we expect the percentage of positive samples ( samples with label 1) to be 95%.

Source: self-document

We can plot the mean predicted value (the midpoint of each bin) vs. the fraction of positive samples in each bin as a line plot to check the calibration of the model.
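A hedged sketch of how such a reliability plot is usually computed (assumed usage, not the article's code; the Gaussian naive Bayes model and synthetic data are stand-ins): scikit-learn's calibration_curve returns exactly these binned pairs, the fraction of positives and the mean predicted probability per bin, which can then be plotted against the 45-degree diagonal.

from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

y_prob = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]
frac_positives, mean_predicted = calibration_curve(y_test, y_prob, n_bins=10)
print(mean_predicted)    # mean predicted probability in each bin
print(frac_positives)    # observed fraction of positives in each bin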

We can see the difference between the ideal curve and the actual curve, indicating the need for our model to be calibrated. If the points obtained are below the diagonal, the model has overestimated (the predicted probabilities are too high). If the points are above the diagonal, the model has been underconfident in its predictions (the predicted probabilities are too small). Let's also look at a real-life Random Forest model curve in the image below.

If we look at the above plot, the S curve ( remember the sigmoid curve seen in Logistic Regression !) is observed commonly for some models. The Model is seen to be underconfident at high probabilities and overconfident when predicting low probabilities. For the above curve, for the samples for which the model predicted probability is 30%, the actual value is only 10%. So the Model was overestimating at low probabilities.

The toy dataset we have shown above is for understanding, and in reality, the choice of bin size is dependent on the amount of data we have, and we would like to have enough points in each bin such that the standard error on the mean of each bin is small.

Brier Score

We do not need to rely on visual information alone to estimate model calibration. The calibration can be measured using the Brier Score. The Brier score is similar to the mean squared error but is used in a slightly different context. It takes values from 0 to 1, with 0 meaning perfect calibration; the lower the Brier Score, the better the model calibration.

The Brier score is a statistical metric used to measure probabilistic forecasts’ accuracy. It is mostly used for binary classification.

Let’s say a probabilistic model predicts a 90% chance of rain on a particular day, and it indeed rains on that day. The Brier score can be calculated using the following formula,

Brier Score = (forecast − outcome)²

The Brier Score in the above case is calculated to be (0.90 − 1)² = 0.01.

The Brier Score for a set of observations is the average of individual Brier Scores.

On the other hand, if a model predicts with a 97%  probability that it will rain but does not rain, then the calculated Brier Score, in this case, will be,

Brier Score = (0.97 − 0)² = 0.9409. A lower Brier Score is preferable.
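A small sketch (assumed usage, not from the article) showing that scikit-learn's brier_score_loss reproduces the hand calculations above:

from sklearn.metrics import brier_score_loss

y_true = [1, 0]           # it rained on day 1 and did not rain on day 2
y_prob = [0.90, 0.97]     # forecast probabilities of rain
print(brier_score_loss(y_true, y_prob))   # (0.01 + 0.9409) / 2 = 0.47545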

Calibration Process

Now, let’s try and get a glimpse of how the calibration process works without getting into too many details.

Some algorithms, like Logistic Regression, show good inherent calibration standards, and these models may not require calibration. On the other hand, models like SVM, Decision Trees, etc., may benefit from calibration.  The calibration is a rescaling process after a model has made the predictions.

 There are two popular methods for calibrating probabilities of ML models, viz,

(a) Platt Scaling

(b) Isotonic Regression

It is not the intention of this article to get into details of the mathematics behind the implementation of the above approaches. However, let’s look at both methods from a ringside perspective.

The Platt Scaling is used for small datasets with a reliability curve in the sigmoid shape. It can be loosely understood as putting a sigmoid curve on top of the calibration plot to modify the predictive probabilities of the model.

The above images show how imposing a Platt calibrator curve on the reliability curve of the model modifies the curve. It is seen that the points in the calibration curve are pulled toward the ideal line (dotted line) during the calibration process.

It is noted that for practical implementation during model development, standard libraries like sklearn support easy model calibration (sklearn.calibration.CalibratedClassifierCV).
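A hedged sketch of that calibration wrapper (assumed usage; the LinearSVC model and synthetic data below are made up for illustration): method="sigmoid" corresponds to Platt scaling and method="isotonic" to isotonic regression.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
print(calibrated.predict_proba(X_test[:3]))   # calibrated probabilities for the first three samples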

Impact on Performance

It is pertinent to note that calibration modifies the outputs of trained ML models. It could be possible that calibration also affects the model’s accuracy. Post calibration, some values close to the decision boundary (say 50% for binary classification) may be modified in such a way as to produce an output label different from prior calibration. The impact on accuracy is rarely huge, and it is important to note that calibration improves the reliability of the ML model.

Conclusion

In this article, we have looked at the theoretical background of Model Calibration. Calibration of Machine Learning models is an important but often overlooked aspect of developing a reliable model. The following are key takeaways from our learnings:-

(a) Model Calibration gives the end-user insight into the uncertainty in the model's predictions and, in turn, the reliability of the model, especially in critical applications.

(b) Model calibration is extremely valuable to us in cases where predicted probability is of interest.

(c) Reliability curves and the Brier Score give us an estimate of the calibration levels of the model.

(d) Platt scaling and isotonic regression are popular methods to scale the calibration levels and improve the predicted probability.

Where do we go from here? This article aims to give you a basic understanding of Model Calibration. We can further build on this by exploring actual implementation using standard python libraries like scikit Learn for use cases.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

