Biocreative is a text categorization problem. From the README file: the task is to decide whether a given (protein, document) pair should be annotated with some Gene Ontology (GO) code. The input consists of paragraphs from documents, each paragraph described by a feature vector. The features are word occurrence frequencies and some statistics about the nature of the protein-GO code interaction in each paragraph. Each document corresponds to a bag and each paragraph to an instance in that bag. The hypothesis is that a bag should be annotated with a GO code if and only if at least one of its paragraphs supports this annotation. Conversely, if no paragraph supports the annotation, the document should not be annotated.
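The bag-level hypothesis described above can be illustrated with a minimal Python sketch. The function name, the toy document names, and the per-paragraph support flags are all hypothetical and are not taken from the dataset; the sketch only shows how a bag label would follow from instance labels under the stated assumption.

```python
from typing import Sequence


def bag_label(instance_supports: Sequence[bool]) -> bool:
    """Return True if at least one paragraph (instance) supports the annotation."""
    return any(instance_supports)


# Hypothetical toy bags: one list of per-paragraph support flags per document.
bags = {
    "doc_a": [False, True, False],  # one supporting paragraph -> annotate
    "doc_b": [False, False],        # no supporting paragraph  -> do not annotate
}

labels = {name: bag_label(flags) for name, flags in bags.items()}
print(labels)  # {'doc_a': True, 'doc_b': False}
```

In the actual dataset the per-paragraph labels are not observed; only the bag-level label is given, which is what makes this a multiple-instance problem.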

Original source

The original data that this dataset is based on can be found here: This dataset has been represented as a multiple-instance learning (MIL) problem (in C4.5 format) by Dr. Soumya Ray. For more details about the creation of the dataset, please refer to:

Ray, Soumya and Craven, Mark. "Learning statistical models for annotating proteins with function information using biomedical text." BMC Bioinformatics, Suppl 1. BioMed Central Ltd.

Ray, Soumya and Craven, Mark. "Supervised versus multiple instance learning: An empirical comparison." In Proceedings of the 22nd International Conference on Machine Learning.

The data was then converted to Matlab format with a parser by Gary Doran.


Files – This file contains three .MAT files, one for each task in the dataset (component, function, process). Each .MAT file contains a training set and a test set.
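Since the variable names stored inside the .MAT files are not documented here, one way to get started is to list them before loading anything. The following is a rough sketch using SciPy's `scipy.io.whosmat` and `scipy.io.loadmat`; the helper name `summarize_mat` and the file name in the usage comment are hypothetical, and the files are assumed to be in a MATLAB format that SciPy can read.

```python
from scipy.io import loadmat, whosmat


def summarize_mat(path):
    """Return a dict mapping each variable name in a .MAT file to its shape."""
    # whosmat lists (name, shape, class) tuples without loading the arrays.
    return {name: shape for name, shape, _cls in whosmat(path)}


def load_variables(path):
    """Load a .MAT file and drop the metadata entries loadmat adds."""
    data = loadmat(path)
    return {k: v for k, v in data.items() if not k.startswith("__")}


# Usage (the file name is a placeholder, not the actual name in the download):
#   print(summarize_mat("component.mat"))
#   variables = load_variables("component.mat")
```

Inspecting the printed names and shapes should make it clear which variables hold the training and test bags for each task.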
