
 

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

by Jason Brownlee on August 19, 2015 in Machine Learning Process

 

Has this happened to you?

You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn!

This is an example of an imbalanced dataset and the frustrating results it can cause.

In this post you will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data.

 

Find some balance in your machine learning.
Photo by MichaEli, some rights reserved.

Coming To Grips With Imbalanced Data

I get emails about class imbalance all the time, for example:

I have a binary classification problem and one class is present with a 60:1 ratio in my training set. I used logistic regression and the result seems to just ignore one class.

And this:

I am working on a classification model. In my dataset I have three different labels to be classified, let them be A, B and C. But in the training dataset, A makes up 70% of the volume, B 25% and C 5%. Most of the time my results are overfit to A. Can you please suggest how I can solve this problem?

I write long lists of techniques to try and think about the best ways to get past this problem. I finally took the advice of one of my students:

Perhaps one of your upcoming blog posts could address the problem of training a model to perform against highly imbalanced data, and outline some techniques and expectations.

Frustration!

Imbalanced data can cause you a lot of frustration.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie.

The next wave of frustration hits when the books, articles and blog posts don’t seem to give you good advice about handling the imbalance in your data.

Relax, there are many options and we’re going to go through them all. It is possible; you can build predictive models for imbalanced data.

What is Imbalanced Data?

Imbalanced data typically refers to classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20, or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.

The remaining discussion will assume a two-class classification problem because it is easier to think about and describe.

Imbalance is Common

Most classification datasets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced. The vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class.

Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class).

Even a modest class imbalance, like the 4:1 ratio in the example above, can cause problems.

Accuracy Paradox

The accuracy paradox is the name for the exact situation in the introduction to this post.

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

It is very common, because classification accuracy is often the first measure we use when evaluating models on our classification problems.

Put it All On Red!

What is going on in our models when we train on an imbalanced dataset?

As you might have guessed, the reason we get 90% accuracy on an imbalanced dataset (with 90% of the instances in Class-1) is because our models look at the data and cleverly decide that the best thing to do is to always predict “Class-1” and achieve high accuracy.

This is best seen when using a simple rule-based algorithm. If you print out the rule in the final model you will see that it is very likely predicting one class regardless of the data it is asked to predict.
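As a minimal sketch of this behavior (assuming scikit-learn; the data and 90:10 split below are synthetic), a baseline that always predicts the majority class already scores 90% accuracy, so a real model is “rewarded” for doing the same:

# A baseline that always predicts the majority class still scores ~90%
# accuracy on a 90:10 dataset; the data and labels here are synthetic.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)
X = rng.rand(1000, 5)                        # 1,000 instances, 5 attributes
y = np.array([0] * 900 + [1] * 100)          # 90% Class-1 (0), 10% Class-2 (1)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # 0.9, despite learning nothing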

8 Tactics To Combat Imbalanced Training Data

We now understand what class imbalance is and why it provides misleading classification accuracy.

So what are ouroptions?

1) Can You Collect More Data?

You might think it’s silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more data on your problem.

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

More examples of the minority classes may be useful later when we look at resampling your dataset.

2) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

I give more advice on selecting different performance measures in my post “Classification Accuracy is Not Enough: More Performance Measures You Can Use”.

In that post I look at an imbalanced dataset that characterizes the recurrence of breast cancer in patients.

From that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned to).
  • Precision: A measure of a classifier’s exactness.
  • Recall: A measure of a classifier’s completeness.
  • F1 Score (or F-score): A weighted average (the harmonic mean) of precision and recall.

I would also advise you to take a look at the following:

  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity, and models can be chosen based on the balance thresholds of these values.

You can learn a lot more about using ROC Curves to compare classification accuracy in our post “Assessing and Comparing Classifier Performance with ROC Curves”.

Still not sure? Start with kappa; it will give you a better idea of what is going on than classification accuracy.
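As a minimal sketch (assuming scikit-learn, which implements all of the measures above), here is how you might compute them; the toy labels and scores are made up purely for illustration and would come from your own model in practice:

# Toy labels/scores for illustration only; in practice use your model's output.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]                      # imbalanced ground truth
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]                      # hard predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.6, 0.9, 0.4]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))                # correct vs. incorrect predictions
print("precision:", precision_score(y_true, y_pred))   # exactness
print("recall:   ", recall_score(y_true, y_pred))      # completeness
print("F1:       ", f1_score(y_true, y_pred))          # combines precision and recall
print("kappa:    ", cohen_kappa_score(y_true, y_pred)) # accuracy normalized by imbalance
print("ROC AUC:  ", roc_auc_score(y_true, y_score))    # needs scores, not hard labels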

3) Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset and there are two main methods that you can use to even up the classes:

  1. You can add copies of instances from the under-represented class, called over-sampling (or more formally, sampling with replacement), or
  2. You can delete instances from the over-represented class, called under-sampling.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

You can learn a little more in the Wikipedia article titled “Oversampling and undersampling in data analysis”.
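As a minimal sketch of both methods, assuming pandas and scikit-learn and a hypothetical DataFrame df with a binary “label” column where class 1 is the minority:

import pandas as pd
from sklearn.utils import resample

majority = df[df["label"] == 0]   # assumed: df with a binary "label" column
minority = df[df["label"] == 1]

# Over-sampling: copy minority rows (sampling with replacement) up to the majority count.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
oversampled = pd.concat([majority, minority_up])

# Under-sampling: keep only a random subset of the majority rows.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
undersampled = pd.concat([majority_down, minority])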

Some Rules of Thumb

  • Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more).
  • Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or less).
  • Consider testing random and non-random (e.g. stratified) sampling schemes.
  • Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem; try other ratios).

4) Try Generating Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

You could sample them empirically within your dataset or you could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are systematic algorithms that you can use to generate synthetic samples. The most popular of these is SMOTE, the Synthetic Minority Over-sampling Technique.

As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

To learn more about SMOTE, see the original 2002 paper titled “SMOTE: Synthetic Minority Over-sampling Technique”.

There are a number of implementations of the SMOTE algorithm, for example:

  • In Python, take a look at the “UnbalancedDataset” module. It provides a number of implementations of SMOTE as well as various other resampling techniques that you could try (see the sketch after this list).
  • In R, the DMwR package provides an implementation of SMOTE.
  • In Weka, you can use the SMOTE supervised filter.
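As a minimal sketch of the Python route, assuming the imbalanced-learn package (the modern successor to the UnbalancedDataset module mentioned above) and a synthetic 10:1 dataset purely for illustration:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, roughly 10:1 binary problem purely for illustration.
X, y = make_classification(n_samples=1100, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority instances between existing neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))   # the classes are now balanced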

5) Try Different Algorithms

As always, I strongly advise you not to use your favorite algorithm on every problem. You should at least be spot-checking a variety of different types of algorithms on a given problem.

For more on spot-checking algorithms, see my post “Why you should be Spot-Checking Algorithms on your Machine Learning Problems”.

That being said, decision trees often perform well on imbalanced datasets. The splitting rules that look at the class variable in the creation of the trees can force both classes to be addressed.

If in doubt, try a few popular decision tree algorithms like C4.5, C5.0, CART, and Random Forest.

For some example R code using decision trees, see my post titled “Non-Linear Classification in R with Decision Trees”.

For an example of using CART in Python and scikit-learn, see my post titled “Get Your Hands Dirty With Scikit-Learn Now”.
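As a minimal sketch of spot-checking a couple of tree-based algorithms in scikit-learn (CART and Random Forest; C4.5/C5.0 are not available there), assuming a feature matrix X and labels y already exist:

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Assumed: feature matrix X and labels y for your imbalanced problem.
models = {
    "CART":          DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the class ratio intact in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    # Score with F1 on the minority class rather than plain accuracy.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(name, round(scores.mean(), 3))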

6) Try Penalized Models

You can use the same algorithms but give them a different perspective on the problem.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.

Often the handling of class penalties or weights is specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.

It is also possible to have generic frameworks for penalized models. For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification.

Using penalization is desirable if you are locked into a specific algorithm and are unable to resample or you’re getting poor results. It provides yet another way to “balance” the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.
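As a minimal sketch in scikit-learn, where the class_weight parameter plays the role of the penalty (the 1:10 weighting below is an arbitrary example you would tune, and X, y are assumed to exist):

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# "balanced" weights each class inversely proportional to its frequency,
# i.e. mistakes on the minority class cost more during training.
penalized_svm = SVC(kernel="rbf", class_weight="balanced")

# Or hand-tune the penalties, e.g. a 10x cost for minority class 1.
penalized_lr = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)

penalized_lr.fit(X, y)   # assumed: feature matrix X and labels y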

7) Try a Different Perspective

There are fields of study dedicated to imbalanced datasets. They have their own algorithms, measures and terminology.

Taking a look and thinking about your problem from these perspectives can sometimes shake loose some ideas.

Two you might like to consider are anomaly detection and change detection.

Anomaly detection is the detection of rare events. This might be a machine malfunction indicated through its vibrations or a malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation.

This shift in thinking considers the minority class as the outlier class, which might help you think of new ways to separate and classify samples.

Change detection is similar to anomaly detection except that rather than looking for an anomaly it is looking for a change or difference. This might be a change in the behavior of a user as observed by usage patterns or bank transactions.

Both of these shifts take a more real-time stance to the classification problem that might give you some new ways of thinking about your problem and maybe some more techniques to try.
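As a minimal sketch of the anomaly-detection framing, assuming scikit-learn and arrays X and y where class 0 is the majority (“normal”) class; the Isolation Forest here is just one possible one-class model, not something prescribed by the post:

from sklearn.ensemble import IsolationForest

X_normal = X[y == 0]     # assumed: fit only on the majority ("normal") class

detector = IsolationForest(contamination=0.05, random_state=42)
detector.fit(X_normal)

pred = detector.predict(X)            # +1 = looks normal, -1 = looks like an outlier
candidate_minority = X[pred == -1]    # treat flagged outliers as the rare class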

8) Try Getting Creative

Really climb inside your problem and think about how to break it down into smaller problems that are more tractable.

For inspiration, take a look at the very creative answers on Quora in response to the question “In classification, how do you handle an unbalanced training set?”

For example:

Decompose your larger class into a smaller number of other classes…

…use a One Class Classifier… (e.g. treat like outlier detection)

…resampling the unbalanced training set into not one balanced set, but several. Running an ensemble of classifiers on these sets could produce a much better result than one classifier alone

These are just a few of the interesting and creative ideas you could try.

For more ideas, check out these comments on the reddit post “Classification when 80% of my training set is of one class”.
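As a minimal sketch of the last idea quoted above (several balanced resamples plus an ensemble), assuming numpy arrays X and y where class 1 is the minority; the logistic regression members and the count of 10 resamples are arbitrary choices:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed: feature matrix X and labels y, with class 1 as the minority.
rng = np.random.RandomState(42)
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

members = []
for _ in range(10):
    # Pair the full minority class with a fresh random subset of the majority class.
    sample_idx = np.concatenate([
        minority_idx,
        rng.choice(majority_idx, size=len(minority_idx), replace=False),
    ])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[sample_idx], y[sample_idx])
    members.append(clf)

# Average the members' predicted probabilities and threshold at 0.5.
proba = np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)
ensemble_pred = (proba >= 0.5).astype(int)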

Pick a Method and Take Action

You do not need to be an algorithm wizard or a statistician to build accurate and reliable models from imbalanced datasets.

We have covered a number of techniques that you can use to model an imbalanced dataset.

Hopefully there are one or two that you can take off the shelf and apply immediately, for example changing your accuracy metric and resampling your dataset. Both are fast and will have an impact straight away.

Which method are you going to try?

A Final Word, Start Small

Remember that we cannot know which approach is going to best serve you and the dataset you are working on.

You can use some expert heuristics to pick this method or that, but in the end, the best advice I can give you is to “become the scientist” and empirically test each method and select the one that gives you the best results.

Start small and build upon what you learn.

Want More? Further Reading…

There are resources on class imbalance if you know where to look, but they are few and far between.

I’ve looked and the following are what I think are the cream of the crop. If you’d like to dive deeper into some of the academic literature on dealing with class imbalance, check out some of the links below.

Books

  • Imbalanced Learning: Foundations, Algorithms, and Applications

Papers

  • Data Mining for Imbalanced Datasets: An Overview
  • Learning from Imbalanced Data
  • Addressing the Curse of Imbalanced Training Sets: One-Sided Selection (PDF)
  • A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data

Did you find this post useful? Still have questions?

Leave a comment and let me know about your problem and any questions you still have about handling imbalanced classes.

 
