Alexandru G. Floares, Marius Ferisgan, Daniela Onita, Andrei Ciuparu, George A. Calin, Florin B. Manolache
cancer, microRNA, next generation sequencing, machine learning, predictive models, power law
High-quality omics tests can be developed using machine learning. Because high-throughput molecular determinations are costly, we want to build the best models from the minimum number of samples. Here, we specify a set of criteria for high-quality models and select the algorithms that best satisfy them. Boosted C5, Random Forest, and Stochastic Gradient Boosting reach accuracies above 95%, and in some cases above 99%, in discriminating between breast cancer and normal samples on the miRNA NGS TCGA data; they also generalize well to new cases and are relatively transparent. For these algorithms, we investigate the relationship between accuracy and sample size, and between the number of features (here, miRNAs) and sample size. We propose power law formulas for all these relationships, which allow the number of samples required for a desired accuracy to be computed. These algorithms dramatically lower the sample size needed for the highest accuracies and thus reduce the corresponding costs.
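To illustrate how a power law relationship between accuracy and sample size can be inverted to give a required sample size, here is a minimal sketch. It assumes a learning curve of the form error(n) = b * n^(-c) with accuracy = 1 - error; this functional form, the function names, and the parameter values are assumptions made for illustration and are not taken from the paper. The data points are synthetic, generated from the assumed curve, not TCGA results.

```python
import math


def fit_power_law(sizes, accuracies):
    """Fit error(n) = b * n**(-c) by least squares in log-log space.

    Assumes accuracy tends to 1.0 as n grows (an illustrative assumption).
    Returns the pair (b, c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(1.0 - acc) for acc in accuracies]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = log(b) - c * log(n), so b = exp(intercept), c = -slope
    return math.exp(intercept), -slope


def required_samples(b, c, target_accuracy):
    """Smallest integer n with 1 - b * n**(-c) >= target_accuracy."""
    target_error = 1.0 - target_accuracy
    return math.ceil((b / target_error) ** (1.0 / c))


# Synthetic learning-curve points from hypothetical parameters b = 0.5, c = 0.6.
sizes = [10, 20, 50, 100, 200]
accuracies = [1.0 - 0.5 * n ** -0.6 for n in sizes]

b, c = fit_power_law(sizes, accuracies)
n_needed = required_samples(b, c, 0.99)
print(f"b = {b:.3f}, c = {c:.3f}, samples needed for 99% accuracy: {n_needed}")
```

In practice, the accuracies would come from cross-validated models trained on subsamples of increasing size, and a curve with a fitted asymptote below 1.0 may describe real data better than this simplified form.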
Cite this paper
Alexandru G. Floares, Marius Ferisgan, Daniela Onita, Andrei Ciuparu, George A. Calin, Florin B. Manolache. (2017) The Smallest Sample Size for the Desired Diagnosis Accuracy. International Journal of Oncology and Cancer Therapy, 2, 13-19