In this part, we elaborate on model evaluation metrics for multi-class classification problems. Accuracy, precision, recall, F1 score, and the confusion matrix are discussed below for our facies problem. This post is the third part of the series, following __part1__ and __part2__. You can find the Jupyter notebook for this part __here__.

When I was new to machine learning, I considered building the model to be the most important step of an ML task. Now I see it differently: model evaluation skill is the key to modeling success. We need to make sure that our model works well on new data. We also have to be able to interpret various evaluation metrics to understand our model's strengths and weaknesses, which gives us hints for improving the model. Since we are dealing with a multi-class problem in this tutorial, we will focus on the related evaluation metrics, but first we need to get familiar with some definitions.

## 3–1 Model Metrics

When we work on classification problems, each model prediction falls into one of four possible outcomes:

**A) True Positive (TP)** is an outcome where the model *correctly* predicts the *positive* class. In our dataset, the positive class is the label whose prediction we are currently evaluating. For example, if we are analyzing the 'Dolomite' class prediction, TP is the number of test samples that are truly Dolomite and that the model predicted as Dolomite.

**B) True Negative (TN)** is an outcome where the model *correctly* predicts the *negative* class. For Dolomite prediction, the negative class covers all other facies, so TN counts the samples that are truly not Dolomite and were predicted as one of the non-Dolomite classes.

**C) False Positive (FP)** is an outcome where the model *incorrectly* predicts the *positive* class. When we evaluate the Dolomite class, FP counts the samples of other facies classes that were incorrectly predicted as Dolomite.

**D) False Negative (FN)** is an outcome where the model *incorrectly* predicts the *negative* class. Again for Dolomite prediction, FN counts the true Dolomite samples that were predicted as non-Dolomite classes.
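The four outcomes can be counted directly from the true and predicted labels by treating one class as positive and all others as negative. The snippet below is a minimal sketch of that one-vs-rest counting. The function name `one_vs_rest_counts` and the short label lists are made up for illustration and are not taken from the facies dataset.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN with `positive` as the positive class
    and every other label treated as negative."""
    tp = tn = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        if pred == positive and true == positive:
            tp += 1  # predicted positive, truly positive
        elif pred == positive:
            fp += 1  # predicted positive, truly another class
        elif true == positive:
            fn += 1  # truly positive, predicted as another class
        else:
            tn += 1  # neither true nor predicted label is positive
    return tp, tn, fp, fn


# Toy labels, invented for illustration only ("D" stands in for Dolomite)
y_true = ["D", "D", "SS", "BS", "D", "SS"]
y_pred = ["D", "SS", "SS", "D", "D", "BS"]

tp, tn, fp, fn = one_vs_rest_counts(y_true, y_pred, positive="D")
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```

Note that the last sample (true SS, predicted BS) still counts as a true negative for Dolomite, because neither label is Dolomite.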

**1. Accuracy**: the fraction of correct predictions over the total number of predictions.

Accuracy = (TP+TN) / (TP+TN+FP+FN)
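In a multi-class setting, overall accuracy is usually computed as the share of samples whose predicted label equals the true label. A minimal sketch, assuming plain Python lists of labels (the `accuracy` helper and the toy labels are hypothetical):

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(true == pred for true, pred in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, invented for illustration only
y_true = ["D", "D", "SS", "BS", "D", "SS"]
y_pred = ["D", "SS", "SS", "D", "D", "BS"]

acc = accuracy(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
```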

**2. Precision**: this metric answers the question: *what proportion of positive predictions is actually correct?*

Precision = TP / (TP+FP)

Looking at the equation, we can see that if a model has zero False Positive predictions, the precision will be 1. Again, in Dolomite prediction, this metric shows what proportion of the samples predicted as Dolomite are truly Dolomite (rather than other facies misclassified as Dolomite).

**3. Recall**: recall answers the question: *what proportion of actual positives is classified correctly?*

Recall = TP / (TP+FN)

Looking at the equation, we can see that if a model has zero False Negative predictions, the recall will be 1. In our example, recall shows the proportion of true Dolomite samples that the model correctly identified.
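Both formulas are easy to check by hand from the outcome counts. The sketch below uses a hypothetical helper, `precision_recall`, with made-up counts. Returning 0.0 when a denominator is empty is a convention chosen for this sketch, not something stated in the post.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from outcome counts.

    A metric whose denominator is zero is reported as 0.0
    (a convention assumed for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts: 8 correct Dolomite predictions, 2 false alarms, 4 misses
precision, recall = precision_recall(tp=8, fp=2, fn=4)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```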

**Note**: to evaluate the model's performance, we need to consider precision and recall together. Unfortunately, these two metrics often pull against each other: improving one frequently comes at the cost of the other. The ideal case is that both are close to 1.

**4. F1 score (f1_score)**: the F1 score is the harmonic mean of precision and recall. It reaches its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)
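To see why F1 is more informative than a plain average, the sketch below compares a balanced model with a lopsided one. The helper name `f1_from` and the input values are made up for illustration. Returning 0.0 when both inputs are zero is an assumption of this sketch.

```python
def f1_from(precision, recall):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0  # convention assumed for this sketch
    return 2 * (precision * recall) / (precision + recall)


# Made-up values: one balanced model and one with perfect precision but poor recall
balanced = f1_from(0.8, 0.8)
lopsided = f1_from(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

print(f"balanced F1 = {balanced:.2f}")
print(f"lopsided F1 = {lopsided:.2f}, arithmetic mean = {arithmetic_mean:.2f}")
```

The lopsided case scores far lower under F1 than under the arithmetic mean, so a model cannot reach a high F1 by maximizing only one of the two metrics.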

Let's look at one example: the performance of a Logistic Regression classifier.

Run:

```
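from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# X_train, X_test, y_train, y_test and facies_labels are assumed to be defined as in the earlier parts
model_log = LogisticRegression(C=10, solver='lbfgs', max_iter=200)
model_log.fit(X_train, y_train)
y_pred_log = model_log.predict(X_test)
# Per-class precision, recall, f1-score and support on the test set
print(classification_report(y_test, y_pred_log, target_names=facies_labels))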
```

To evaluate the Logistic Regression classifier, let's look at the first facies class, Sandstone (SS). When this model predicts a facies as SS, it is correct 75% of the time (precision). On the other hand, the model correctly identifies 89% of all SS facies members (recall). The f1_score lies between these two values. Support is the number of test samples that truly belong to each class.
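As a quick sanity check, the F1 score for SS can be recomputed from the two rounded values quoted above. This is only a sketch: the variable names are made up, and the value in the actual report may differ slightly in the last digit because it is computed from unrounded precision and recall.

```python
# Rounded SS values quoted from the classification report
precision_ss = 0.75
recall_ss = 0.89

# F1 = 2 * (precision * recall) / (precision + recall)
f1_ss = 2 * (precision_ss * recall_ss) / (precision_ss + recall_ss)
print(f"F1 for SS is approximately {f1_ss:.2f}")
```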

Let's use a __block of code__ to apply the procedure above to all models in turn and plot the averaged results. Up to line 15, we define the model objects with the hyper-parameters that we obtained earlier from the grid search. Then (lines 16 to 25) the models are appended to a list so that we can iterate over them to fit and cross-validate each one in order. After cross-validation, we store the metric results for each model in a list. In lines 37 to 52, a for loop calculates the average value of each metric for each model. The rest of the code handles plotting.
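The averaging step can be sketched without any modeling library. The snippet below is not the linked code. It assumes the cross-validation scores have already been collected into a nested dictionary, and the model names, metric names, and fold scores are placeholders invented for illustration.

```python
from statistics import mean

# Placeholder fold scores, invented for illustration (not results from this post)
results = {
    "model_a": {"accuracy": [0.70, 0.72, 0.68], "f1": [0.66, 0.70, 0.68]},
    "model_b": {"accuracy": [0.60, 0.58, 0.62], "f1": [0.55, 0.57, 0.59]},
}

# Average each metric over the cross-validation folds, per model
averages = {
    model: {metric: mean(scores) for metric, scores in metrics.items()}
    for model, metrics in results.items()
}

for model, metrics in averages.items():
    summary = ", ".join(f"{name}={value:.2f}" for name, value in metrics.items())
    print(f"{model}: {summary}")
```

Keeping the per-fold scores in one structure makes it easy to plot the averages afterwards or to drop a metric before re-running.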

This plot shows the average value of each evaluation metric (y-axis) for each of the models used. Here, we wanted to compare the overall performance of all models. It seems that Extra Trees and Random Forest gave the best predictions, while Gaussian Naive Bayes was not as effective a predictor.

If we are interested in the prediction of an individual facies, we should consider removing the other metrics from the 'results' list and running the program again.

## 3â€“2 Confusion matrix

The confusion matrix shows the predicted class labels against the true labels. It is a very useful visualization tool: for each facies class, we can see how many samples were predicted correctly and how many were misclassified into each of the other classes.

In the __lines of code__, we first define a function that wraps the confusion matrix function provided by sklearn to produce a nicer plot. After defining the function, we fit and run the Logistic Regression classifier. Now we have the predicted facies labels alongside the true facies labels for the test data. Calling the function with the required input parameters creates a plot of the confusion matrix.

Looking at the first row of the plot, we see that this algorithm predicted 151 SS samples correctly, while 18 true SS samples were incorrectly classified as CSiS. From the previous section, we are familiar with the concept of recall. Of all true SS members (169), the classifier recognized 151 correctly; 151/169 is 89% (the same number we saw in the classification report above). So, if we read the matrix along a row (true labels), we are dealing with recall. As you may guess, if we read it along a column (predicted labels), we are dealing with precision. For SS, precision is 75%, that is 151/201.
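The row and column reading can be reproduced on a small example. The sketch below builds a confusion matrix by hand (rows are true labels, columns are predicted labels) and derives recall from row totals and precision from column totals. The helper `confusion_counts` and the toy labels are made up for illustration. They are not the facies test data.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts samples with
    true label labels[i] that were predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true, pred in zip(y_true, y_pred):
        matrix[index[true]][index[pred]] += 1
    return matrix


# Toy labels, invented for illustration only
labels = ["SS", "CSiS", "D"]
y_true = ["SS", "SS", "SS", "CSiS", "CSiS", "D", "D", "D"]
y_pred = ["SS", "SS", "CSiS", "SS", "CSiS", "D", "D", "CSiS"]

matrix = confusion_counts(y_true, y_pred, labels)

per_class = {}
for i, label in enumerate(labels):
    row_total = sum(matrix[i])                  # all samples truly in this class
    col_total = sum(row[i] for row in matrix)   # all samples predicted as this class
    recall = matrix[i][i] / row_total if row_total else 0.0
    precision = matrix[i][i] / col_total if col_total else 0.0
    per_class[label] = (precision, recall)
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```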

In the picture below, we see how poorly the Naive Bayes classifier performed. This classifier heavily over-predicts the BS class.

Up to now, we have some metrics that help us evaluate model performance, but we still can *not guarantee* which model is the best. Some models memorize the training data and follow its complexity too closely, so their performance is poor when they face a new dataset. This is called **over-fitting** (a model with high variance). A model with high variance changes a lot with small changes to the training dataset.
On the other hand, when a model generalizes too much, it cannot capture the complexity of the dataset; this is called **under-fitting** (a model with high bias).
Our ideal model lies somewhere between these two, which leaves us with the bias-variance trade-off.

*The question is*: how can we recognize whether our model is over-fit or under-fit?
We will cover this in the next part of this tutorial.

## Conclusion:

Model evaluation is the most important task in ML model production. We usually start with simple evaluation metrics and then narrow down to more specific and detailed metrics to understand our model's strengths and weaknesses.
