I have 7 classification models (based on different algorithms) that were trained on the same training set and tested on the same test set. I'm measuring their performance on the test set using the measures listed below:
['Cohen Kappa','Accuracy','F1 Micro','F1 Macro','F1 Weighted','FBETA Micro(b=0.25)','FBETA Macro(b=0.25)','FBETA Weighted(b=0.25)','FBETA Micro(b=0.50)','FBETA Macro(b=0.50)','FBETA Weighted(b=0.50)','FBETA Micro(b=0.75)','FBETA Macro(b=0.75)','FBETA Weighted(b=0.75)','Hamming Loss','Recall Micro','Recall Macro','Recall Weighted','Jaccard Similarity','Precision Micro','Precision Macro','Precision Weighted','Matthews Correlation Coefficient']
My dataset is for multiclass classification. I know that some of the measures listed above should have close or identical values because the problem is multiclass. What bothers me is that almost all of the measures are very close to each other. I can see differences between the classification models, but for any single model the measures differ only in the second or third decimal place, if they differ at all. Is this normal? I'm using Python with the scikit-learn library to train and test the models. My goal was to see whether the measures differ, but my results suggest they don't.
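A minimal sketch of the kind of equivalence meant here, assuming single-label multiclass predictions. The labels are synthetic (4 classes, 500 samples, roughly 30% of predictions perturbed), not the real data or models, and the names `y_true`, `y_pred`, and `scores` are made up for illustration. In this setting the micro-averaged precision, recall, F1, and F-beta scores, as well as weighted recall, should all equal accuracy, and Hamming loss should equal one minus accuracy:

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    fbeta_score,
    hamming_loss,
    precision_score,
    recall_score,
)

# Synthetic single-label multiclass data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n_classes, n_samples = 4, 500
y_true = rng.integers(0, n_classes, size=n_samples)

# Start from perfect predictions, then redraw about 30% of them at random.
y_pred = y_true.copy()
flip = rng.random(n_samples) < 0.3
y_pred[flip] = rng.integers(0, n_classes, size=flip.sum())

acc = accuracy_score(y_true, y_pred)

# Measures expected to coincide with accuracy for single-label multiclass.
scores = {
    "Precision Micro": precision_score(y_true, y_pred, average="micro"),
    "Recall Micro": recall_score(y_true, y_pred, average="micro"),
    "Recall Weighted": recall_score(y_true, y_pred, average="weighted"),
    "F1 Micro": f1_score(y_true, y_pred, average="micro"),
    "FBETA Micro(b=0.50)": fbeta_score(y_true, y_pred, beta=0.5, average="micro"),
    "1 - Hamming Loss": 1.0 - hamming_loss(y_true, y_pred),
}

print(f"{'Accuracy':<22} {acc:.4f}")
for name, value in scores.items():
    print(f"{name:<22} {value:.4f}")
```

The macro and weighted variants of precision, F1, and F-beta are not covered by this identity; how far they move away from accuracy depends on class imbalance and on how unevenly the errors are spread across classes.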