Crucial Metrics for Assessing Classification Model Performance

Basith

Measuring the performance of a classification model is essential for ensuring its effectiveness in real-world applications, and in this article, we’ll explore the most important metrics to understand how well your model is performing.

1. Confusion Matrix

A confusion matrix is a table often used to describe the performance of a classification model on a set of test data for which the true values are known. For a binary classifier, it is a table of the four possible combinations of predicted and actual values.

The confusion matrix for a multi-class classification problem can help you determine mistake patterns.

For a binary classifier:

A true positive is an outcome where the model correctly predicts the positive class. Similarly, a true negative is an outcome where the model correctly predicts the negative class.

False Positive and False Negative

The terms false positive and false negative describe the two ways a classifier can be wrong. A false positive is an outcome where the model incorrectly predicts the positive class, and a false negative is an outcome where the model incorrectly predicts the negative class. The more predictions that fall on the main diagonal of the confusion matrix (true positives and true negatives), the better the model; the off-diagonal cells (false positives and false negatives) represent the classification errors.
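As a minimal sketch, assuming scikit-learn is available and using made-up labels, the four cells of a binary confusion matrix can be extracted like this:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted labels (1 = positive class)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() unpacks the 2x2 matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```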

False Positive

False positive (type I error): when you reject a true null hypothesis.

A false positive is a case in which the model mistakenly predicts the positive class. For example, the model inferred that a particular email message was spam (the positive class), but that email message was actually not spam. It acts as a warning sign: the mistake should be rectified, though it is often less costly than a false negative.

False Negative

False negative (type II error): when you fail to reject a false null hypothesis.

A false negative is a case in which the model mistakenly predicts the negative class. For example, the model inferred that a particular email message was not spam (the negative class), but that email message actually was spam. It acts as a danger sign: in many settings this mistake is costlier than a false positive and should be caught early.

Accuracy, Precision, Recall, and F-1 Score

From the confusion matrix, we can infer accuracy, precision, recall, and F-1 score.

Accuracy

Accuracy is the fraction of predictions our model got right:

Accuracy = Number of correct predictions / Total number of predictions

In terms of the confusion matrix, accuracy can also be written as

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy alone doesn’t tell the full story when working with a class-imbalanced data set, where there is a significant disparity between the number of positive and negative labels. Precision and recall are better metrics for evaluating class-imbalanced problems.
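To see how accuracy can mislead on an imbalanced data set, here is a small sketch with made-up confusion-matrix counts for a model that never predicts the positive class:

```python
# Hypothetical counts: 990 true negatives and 10 false negatives,
# i.e. the model predicts "negative" for every example.
tp, tn, fp, fn = 0, 990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.1%}")  # 99.0%, yet not a single positive was caught
```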

Precision

Out of all the instances the model predicted as positive, precision is the fraction that are actually positive:

Precision = TP / (TP + FP)

Precision should be as high as possible.

Recall

Out of all the instances that are actually positive, recall is the fraction the model predicted correctly. It is also called sensitivity or the true positive rate (TPR):

Recall = TP / (TP + FN)

Recall should be as high as possible.
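As a minimal sketch with hypothetical confusion-matrix counts, both metrics follow directly from their definitions:

```python
# Hypothetical counts taken from a confusion matrix
tp, fp, fn = 80, 20, 40

precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
print(f"Precision: {precision:.2f}  Recall: {recall:.2f}")  # 0.80 and 0.67
```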

F-1 Score

It is often convenient to combine precision and recall into a single metric called the F-1 score, particularly if you need a simple way to compare two classifiers. The F-1 score is the harmonic mean of precision and recall:

F-1 Score = 2 * (Precision * Recall) / (Precision + Recall)

The regular mean treats all values equally, while the harmonic mean gives much more weight to low values, so a low precision or a low recall drags the score down. As a result, a classifier will only get a high F-1 score if both precision and recall are high.
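Continuing the sketch above (the precision and recall values are the hypothetical ones computed earlier), the harmonic mean shows how one low value drags the score down:

```python
precision, recall = 0.80, 0.67  # hypothetical values from the previous sketch

# Harmonic mean: stays low unless both precision and recall are high
f1 = 2 * precision * recall / (precision + recall)
print(f"F-1 score: {f1:.2f}")  # 0.73

# Compare with a lopsided classifier: high recall but poor precision
f1_lopsided = 2 * 0.10 * 0.95 / (0.10 + 0.95)
print(f"F-1 score (lopsided): {f1_lopsided:.2f}")  # 0.18
```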

2. Receiver Operating Characteristic (ROC) Curve & Area Under the Curve (AUC)

The ROC curve is an important classification evaluation metric. It shows how well the model separates the two classes by plotting the true positive rate (sensitivity) against the false positive rate at different classification thresholds. For an outstanding classifier, the true positive rate rises quickly and the area under the curve (AUC) is close to one. For a classifier that is no better than random guessing, the true positive rate increases linearly with the false positive rate and the AUC is around 0.5. The higher the AUC, the better the model.
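A hedged sketch, assuming scikit-learn and using made-up labels and scores, that computes points on the ROC curve and the AUC:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70, 0.30, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(f"AUC: {auc:.2f}")  # 1.0 is perfect, ~0.5 is random guessing
```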

3. Cumulative Accuracy Profile (CAP) Curve

The CAP of a model plots the cumulative number of positive outcomes on the y-axis against the cumulative number of cases considered, ranked from highest to lowest predicted probability, on the x-axis. The CAP is distinct from the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate. The CAP curve is rarely used compared to the ROC curve.

Consider a model that predicts whether a customer will purchase a product. If a customer is selected at random, there is a 50 percent chance they will buy the product. Under random selection, the cumulative number of buyers rises linearly, reaching the total number of buyers only when every customer has been selected; this straight line is the "random" CAP. A perfect prediction, on the other hand, identifies exactly which customers will buy the product, so the maximum number of buyers is reached with the minimum number of customers selected. This produces a curve that rises steeply and then stays flat once the maximum is reached, which is the "perfect" (or "ideal") CAP.

A real model's CAP lies between the random line and the perfect line; the closer it gets to the perfect line, the better the model's predictions.
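scikit-learn has no built-in CAP helper, so below is a minimal hand-rolled sketch (with made-up labels and scores) that builds the curve by ranking customers from highest to lowest predicted probability and accumulating the buyers found:

```python
import numpy as np

# Hypothetical labels (1 = customer bought) and model scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.90, 0.30, 0.80, 0.75, 0.20, 0.40, 0.65, 0.10, 0.70, 0.55])

order = np.argsort(-y_score)                     # best-scored customers first
cum_buyers = np.cumsum(y_true[order])            # y-axis: buyers captured so far
frac_selected = np.arange(1, len(y_true) + 1) / len(y_true)  # x-axis: fraction contacted

# The random CAP would be the straight line frac_selected * y_true.sum();
# the perfect CAP captures all buyers within the first y_true.mean() fraction.
for x, y in zip(frac_selected, cum_buyers):
    print(f"contacted {x:.0%} of customers -> {y} buyers captured")
```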
