Sample Size in Machine Learning and Artificial Intelligence

The lack of sample size determination in reports of machine learning models is a sad state of affairs. However, it also presents an opportunity. By including at least post hoc sample-size calculations in articles we submit, our Department can lead the charge for more rigorous machine learning and artificial intelligence methodologies.

If you’ve talked with me about starting a machine learning project, you’ve probably heard me quote the rule of thumb that we need at least 1,000 samples per class. I recently found out that many data scientists quote this rule only when repeatedly pushed for a sample size estimate (see How Much Training Data is Required for Machine Learning? (machinelearningmastery.com)). As a result, I decided to do a literature search to see what better or newer answers to the question of sample size for machine learning might look like.

The problem is even bigger than I suspected! A recent review article [1] examined 167 articles related to machine learning and found that only 4 attempted any kind of pre hoc sample size determination. Another 18 attempted some kind of post hoc sample size determination.

Annoyingly, many of the pre hoc methods referenced work only for neural networks (just one of many kinds of machine learning). For example, a group from MIT derived a worst-case calculation that guarantees at least a specified fraction of images is correctly classified given a sample size N [2]. This calculation is helpful for deep learning, but it yields a prohibitively large required sample size even for a simple neural network (around 4,000 samples per class in an example calculation for a relatively simple network structure). Other methods borrow from classical statistics and essentially assume that Cohen’s effect-size-based power calculation generalizes to machine learning. For example, see this online article from Boston University School of Public Health (Power and Sample Size Determination (bu.edu)). However, that calculation ignores the model type entirely. Would we really expect all machine learning model types to have the exact same sample size needs? That seems unlikely.
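
To make the scale concrete, here is a minimal sketch of the commonly quoted simplification of that worst-case bound: with the constants dropped, a network with W weights needs roughly W / ε training examples to generalize with error at most ε. The layer sizes, error tolerance, and `num_weights` helper below are illustrative assumptions of mine, not values taken from [2].

```python
# A minimal sketch of the simplified Baum–Haussler rule of thumb:
# with constants dropped, roughly m >= W / epsilon training examples
# are needed for a network with W weights to reach generalization
# error at most epsilon. All layer sizes below are purely illustrative.

def num_weights(layer_sizes):
    """Count weights and biases in a fully connected network."""
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

layers = [16, 20, 2]   # hypothetical input/hidden/output widths
epsilon = 0.1          # target generalization error

W = num_weights(layers)
print(f"Weights: {W}, rough sample size: {W / epsilon:,.0f}")
# Weights: 382, rough sample size: 3,820
```

Even this toy network, with a few hundred weights and a 10% error tolerance, lands near the ~4,000-samples-per-class figure mentioned above.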

One researcher has rigorously derived a way to bound the probability that a model’s error rate exceeds some threshold, given a sample size and a theoretical property of the model called its “Vapnik–Chervonenkis (VC) dimension” [3] (also see Vapnik–Chervonenkis dimension – Wikipedia). However, the VC dimension is indeed theoretical: it can only be bounded from above for some model types, and there is no general formula for computing it exactly for a given model (at least not yet).
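
For a sense of what such a bound implies in practice, here is a hedged sketch that plugs hypothetical numbers into the standard form of Vapnik’s generalization bound; the VC dimension, confidence level, and target gap below are illustrative assumptions, not values from [3].

```python
import math

# Standard form of Vapnik's bound: with probability at least 1 - delta,
#   true error <= training error
#                 + sqrt((d * (ln(2N/d) + 1) + ln(4/delta)) / N),
# where d is the VC dimension and N is the sample size.

def vc_bound_gap(n, d, delta):
    """Width of the confidence term added to the training error."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

d = 10           # hypothetical VC dimension
delta = 0.05     # 95% confidence
target_gap = 0.05

# Grow N until the bound's gap shrinks below the target.
n = d
while vc_bound_gap(n, d, delta) > target_gap:
    n = int(n * 1.1) + 1
print(f"VC dimension {d}: roughly {n:,} samples for a {target_gap} gap")
```

Even for a modest VC dimension, the implied sample sizes run into the tens of thousands, which reflects the worst-case flavor of these bounds.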

From the review article [1], it seems the most popular systematic approach to sample size determination is the post hoc method of fitting a learning curve. Essentially, you train on increasingly large subsets of your data and measure the error each time. For example, training on 10% of the data gives error y1; training on 20% gives error y2; and so on. You then plot the errors {y} as a function of the number of observations in each subsample and fit a power-law curve [1, 4, 5]. Extrapolating along the fitted curve gives an estimate of the sample size needed to reach a desired error rate, as in the sketch below.
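
As a rough illustration, here is a minimal sketch of that fit using an inverse power law of the form error(n) = a·n^(−b) + c; the subsample sizes, error values, and target error rate below are made up for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical learning-curve data: error measured on models trained
# with increasingly large subsamples of the available data.
n_obs = np.array([100, 200, 400, 800, 1600, 3200])       # subsample sizes
errors = np.array([0.38, 0.31, 0.25, 0.21, 0.18, 0.16])  # observed error rates

# Inverse power-law learning curve: error(n) = a * n**(-b) + c,
# where c is the irreducible (asymptotic) error.
def power_law(n, a, b, c):
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, n_obs, errors, p0=[1.0, 0.5, 0.1], maxfev=10000)

# Invert the fitted curve to estimate the sample size needed for a
# desired error rate (which must lie above the fitted asymptote c).
target_error = 0.14
n_needed = ((target_error - c) / a) ** (-1 / b)
print(f"Fitted curve: error(n) = {a:.2f} * n^(-{b:.2f}) + {c:.3f}")
print(f"Estimated samples for error {target_error}: {n_needed:,.0f}")
```

The usual caveat applies: the further the target error sits beyond the largest subsample you actually trained on, the less reliable the extrapolation becomes.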

The lack of sample size determination in reports of machine learning models is a sad state of affairs. However, it also presents an opportunity. By including at least post hoc sample-size calculations in the articles we submit, we can lead the charge for more rigorous machine learning and artificial intelligence methodologies.

References

[1] Balki, I., Amirabadi, A., Levman, J., Martel, A. L., Emersic, Z., Meden, B., … & Tyrrell, P. N. (2019). Sample-size determination methodologies for machine learning in medical imaging research: a systematic review. Canadian Association of Radiologists Journal, 70(4), 344-353.

[2] Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151-160.

[3] Vapnik, V. (2000). The nature of statistical learning theory. Springer.

[4] Rokem, A., Wu, Y., & Lee, A. (2017). Assessment of the need for separate test set and number of medical images necessary for deep learning: a sub-sampling study. bioRxiv, 196659.

[5] Cho, J., Lee, K., Shin, E., Choy, G., & Do, S. (2015). How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348.