Two years ago, Amber Boydstun approached me to develop an R package for semi-automated topic classification. The goal of the RTextTools project, a collaborative effort of researchers from the University of California, Davis; the University of Washington; Sciences Po Paris; and Vrije Universiteit Amsterdam, was to simplify the supervised learning process and make machine learning more accessible to political scientists. The package has been a huge success within the social science community, but we've seen applications in the natural sciences as well, including the classification of protein sequences, stars, and tumors.
One such application is the classification of breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics including clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. Trained on a randomly sampled set of 200 patients and evaluated on a test set of 400 patients, the classifiers achieved up to 96% recall accuracy, nearly matching the best predictive accuracy documented by Wolberg et al. When filtering for 6-algorithm agreement, the ensemble reaches 98% accuracy, exceeding that benchmark, while still covering 95% of the test sample.
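To make the trade-off between agreement, accuracy, and coverage concrete, here is a minimal Python sketch of an agreement filter. It is not the RTextTools script and does not use the package's API. The function name `ensemble_agreement`, the parameter `min_agree`, and the toy votes are all hypothetical, and the sketch assumes that "6-algorithm agreement" means at least six classifiers assign the same label.

```python
from collections import Counter


def ensemble_agreement(votes_by_case, truth, min_agree):
    """Return (coverage, accuracy) for an agreement-filtered ensemble.

    votes_by_case: one sequence of labels per case, one label per classifier.
    truth:         the true label for each case.
    min_agree:     minimum number of classifiers that must share the top label
                   (assumed meaning of "n-algorithm agreement").
    """
    covered = 0
    correct = 0
    for votes, actual in zip(votes_by_case, truth):
        # Most frequent label for this case and how many classifiers chose it.
        label, count = Counter(votes).most_common(1)[0]
        if count >= min_agree:
            covered += 1
            if label == actual:
                correct += 1
    coverage = covered / len(truth)
    # Accuracy is measured only on the cases that pass the filter.
    accuracy = correct / covered if covered else float("nan")
    return coverage, accuracy


# Toy data (made up): 5 cases, 8 classifiers each, B = benign, M = malignant.
votes = ["BBBBBBBB", "MMMMMMMB", "BBBBBBMM", "BBBBBMMM", "MMMMMBBB"]
truth = "BMBMB"

print(ensemble_agreement(votes, truth, min_agree=5))  # (1.0, 0.6)
print(ensemble_agreement(votes, truth, min_agree=6))  # (0.6, 1.0)
```

In this toy example, raising the threshold from five to six classifiers drops the two cases where the ensemble is split and wrong, so accuracy rises while coverage falls. That is the same trade-off as the 98% accuracy at 95% coverage reported above, though the toy numbers themselves carry no empirical meaning.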
The source code is available below, and the dataset is automatically downloaded from UC Irvine's servers.