Choosing datasets for your experiments

Which of these datasets should you choose for your experiments? Often, datasets are chosen to be “diverse” based on characteristics like sample size or dimensionality. However, datasets that look “similar” by these characteristics may behave very differently in terms of classifier performance.

In our new paper, “Characterizing Multiple Instance Datasets”, we propose to describe datasets by the performance of different multiple instance classifiers. The distance between two datasets can be defined by the differences in performance (in our case, area under the ROC curve), or even by comparing the ROC curves directly. The distances can then be visualized by embedding the distance matrix into a 2-dimensional space.
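To make the idea concrete, here is a minimal sketch in Python with NumPy. It is not the code from the paper or the MIL toolbox. It assumes that each dataset is described by a vector of AUC values (one per classifier), that the dataset distance is the Euclidean distance between those vectors, and that the 2-dimensional embedding is done with classical multidimensional scaling. The AUC values are made up for illustration, and the function names `auc_distance_matrix` and `classical_mds` are hypothetical.

```python
import numpy as np


def auc_distance_matrix(auc):
    """Pairwise Euclidean distances between the rows of a
    (datasets x classifiers) table of AUC values."""
    diff = auc[:, None, :] - auc[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, n_components=2):
    """Embed a distance matrix with classical MDS (an assumed choice;
    other embedding methods would work here too)."""
    n = dist.shape[0]
    # Double-center the squared distances to get a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the directions with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Made-up AUC values: 4 datasets (rows) x 3 classifiers (columns).
auc = np.array([
    [0.90, 0.85, 0.70],
    [0.88, 0.86, 0.72],
    [0.60, 0.75, 0.95],
    [0.55, 0.70, 0.90],
])

dist = auc_distance_matrix(auc)
embedding = classical_mds(dist)  # one 2-D point per dataset
```

Datasets on which the classifiers behave alike (the first two rows here) end up close together in the embedding, regardless of how similar their sample size or dimensionality is.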


I made a demo app to illustrate this process. Select a few datasets and classifiers (all are from the MIL toolbox) and see how the embedding changes. I am new to Shiny, but as I learn more about it I will add more functionality to the app, like being able to see the dataset characteristics when you click on a point. Or, if you are a Shiny expert, drop me a line so I get there faster 🙂


PRTools 5

To use the MIL datasets with the newest version of PRTools (version 5), please use the prload function. This will load the dataset and convert it to the PRTools format.