
International Journal of Information Technology & Computer Science ( IJITCS )

Abstract :

Clinical outcome prediction from high-dimensional data is problematic in the common setting where only a relatively small number of samples is available. This imbalance between dimension and sample size causes overfitting, and outcome prediction becomes computationally expensive or even impossible. We propose a Bayesian outcome prediction method that can be applied to data of arbitrary dimension d from two outcome classes, and that reduces overfitting without any approximations at the parameter level. This is achieved by solving the Bayesian integrals analytically rather than by numerical integration or approximation. We thereby reduce the dimension of the numerical integrals from 2d to 4, for any d. For large d this reduces further to 3, and for very large d we obtain, to leading order, a simple outcome prediction formula without integrals. We compare our method to the mclustDA method (Fraley and Raftery 2002) using simulated and real data sets. Our method performs as well as or better than mclustDA in low dimensions d. In high dimensions d, mclustDA breaks down due to computational limitations, while our method provides a feasible and computationally efficient alternative.
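
The small-sample problem described above can be illustrated with a minimal sketch. It is not the method proposed in the paper; it only shows, under the assumption of a full-covariance Gaussian class model, why plug-in estimation degenerates when the dimension d exceeds the number of samples n: the sample covariance matrix then has rank at most n - 1 and cannot be inverted. The function names `sample_covariance` and `matrix_rank` and the sizes used are hypothetical choices made for this illustration.

```python
import random


def sample_covariance(samples):
    """Unbiased sample covariance (d x d) of n samples given as rows."""
    n = len(samples)
    d = len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in samples]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [list(r) for r in matrix]  # work on a copy
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # column is (numerically) dependent on earlier ones
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


if __name__ == "__main__":
    random.seed(0)
    n_samples, dimension = 5, 20  # illustrative sizes with d > n
    data = [[random.gauss(0.0, 1.0) for _ in range(dimension)]
            for _ in range(n_samples)]
    cov = sample_covariance(data)
    # Centering removes one degree of freedom, so rank <= n - 1 < d.
    print("dimension d:", dimension)
    print("rank of sample covariance:", matrix_rank(cov))
```

Because the estimated covariance is singular whenever d > n - 1, a classifier that needs its inverse either fails or must restrict the covariance structure, which is one way to read the computational breakdown in high dimensions mentioned in the abstract.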

Keywords :

Discriminant analysis; Bayesian outcome prediction; Overfitting; Curse of dimensionality; Bayesian integration in high dimensions; Binary-class prediction.

References :

  1. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B 1996; 58:155–176.
  2. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 2002; 97(458):611–631, doi:10.1198/016214502760047131. URL http://www.tandfonline.com/doi/abs/10.1198/016214502760047131.
  3. Dean N, Murphy TB, Downey G. Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2006; 55(1):1–14, doi:10.1111/j.1467-9876.2005.00526.x. URL http://dx.doi.org/10.1111/j.1467-9876.2005.00526.x.
  4. Fraley C, Raftery AE. Model-based methods of classification: Using the mclust software in chemometrics. Journal of Statistical Software Jan 2007; 18(6).
  5. Iverson AA, Gillett C, Cane P, Santini CD, Vess TM, Kam-Morgan L, Wang A, Eisenberg M, Rowland CM, Hessling JJ, et al. A single-tube quantitative assay for mRNA levels of hormonal and growth factor receptors in breast cancer specimens. The Journal of Molecular Diagnostics 2009; 11(2):117–130, doi:10.2353/jmoldx.2009.080070. URL http://www.sciencedirect.com/science/article/pii/S1525157810602176.
  6. Murphy TB, Dean N, Raftery AE. Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. The Annals of Applied Statistics 2010; 4(1):396.
  7. Andrews J, McNicholas P. Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions. Statistics and Computing 2012; 22(5):1021–1029, doi:10.1007/s11222-011-9272-x. URL http://dx.doi.org/10.1007/s11222-011-9272-x.
  8. Lachenbruch PA, Mickey MR. Estimation of error rates in discriminant analysis. Technometrics 1968; 10(1):1–11. URL http://www.jstor.org/stable/1266219.
  9. Stone M. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological) 1974; 36(2):111–147. URL http://www.jstor.org/stable/2984809.
  10. Meek C, Thiesson B, Heckerman D, Kaelbling P. The learning-curve sampling method applied to model-based clustering. Also in AI and Statistics, 2001; 397–418.
  11. Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y. The properties of high-dimensional data spaces: implications for exploring gene and protein data. Nat Rev Cancer Jan 2008; 8(1):37–49. URL http://dx.doi.org/10.1038/nrc2294.
  12. Michiels S, Kramar A, Koscielny S. Multidimensionality of microarrays: Statistical challenges and (im)possible solutions. Mol Oncol Apr 2011; 5(2):190–196. URL http://linkinghub.elsevier.com/retrieve/pii/S1574789111000184?showall=true.
  13. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977; 39(1):1–38.
  14. Bouveyron C, Girard S, Schmid C. High-dimensional discriminant analysis. Communications in Statistics - Theory and Methods 2007; 36(14):2607–2623, doi:10.1080/03610920701271095. URL http://www.tandfonline.com/doi/abs/10.1080/03610920701271095.
  15. McNicholas PD. On model-based clustering, classification, and discriminant analysis. Journal of the Iranian Statistical Society 2011; 10(2):181–199.
  16. Jolliffe IT. Principal Component Analysis (2nd edn). Springer, 2002.
  17. Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. 2012.
  18. Scott DW, Thompson JR. Probability density estimation in higher dimensions. Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface, vol. 528, North-Holland, Amsterdam, 1983; 173–179.
  19. Chang WC. On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society. Series C (Applied Statistics) 1983; 32(3):267–275. URL http://www.jstor.org/stable/2347949.
  20. Bouveyron C, Brunet-Saumard C. Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis 2014; 71:52–78, doi:10.1016/j.csda.2012.12.008. URL http://www.sciencedirect.com/science/article/pii/S0167947312004422.
  21. Bishop CM. Pattern Recognition and Machine Learning. Springer, 2006.
  22. Berge L, Bouveyron C, Girard S. HDclassif: An R package for model-based clustering and discriminant analysis of high-dimensional data. Journal of Statistical Software 2012; 46(6):1–29. URL http://www.jstatsoft.org/v46/i06.
  23. Bellman R. Dynamic Programming. Princeton University Press, 1957.
  24. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria 2013. URL http://www.R-project.org/.
  25. Ripley BD. Pattern Recognition and Neural Networks. Cambridge University Press, 1996.
  26. Duda R, Hart P, Stork D. Pattern Classification (2nd edn). Wiley, 2001.
  27. McLachlan G, Peel D, Bean R. Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics & Data Analysis 2003; 41(3–4):379–388, doi:10.1016/S0167-9473(02)00183-4. URL http://www.sciencedirect.com/science/article/pii/S0167947302001834, Recent Developments in Mixture Model.
  28. Wang Y, Miller DJ, Clarke R. Approaches to working in high-dimensional data spaces: gene expression microarrays. Br J Cancer Feb 2008; 98(6):1023–1028. URL http://dx.doi.org/10.1038/sj.bjc.6604207.
  29. de Rinaldis E, Gazinska P, Mera A, Modrusan Z, Fedorowicz G, Burford B, Gillett C, Marra P, Grigoriadis A, Dornan D, et al. Integrated genomic analysis of triple-negative breast cancers reveals novel microRNAs associated with clinical and molecular phenotypes and sheds light on the pathways they control. BMC Genomics 2013; 14(1):643, doi:10.1186/1471-2164-14-643. URL http://www.biomedcentral.com/1471-2164/14/643.
  30. The Cancer Genome Atlas Research Network. Comprehensive molecular portraits of human breast tumours. Nature Oct 2012; 490(7418):61–70, doi:10.1038/nature11412.
  31. The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature Jun 2011; 474(7353):609–615. URL http://dx.doi.org/10.1038/nature10166.
  32. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehar J, Kryukov GV, Sonkin D, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature Mar 2012; 483(7391):603–607. URL http://dx.doi.org/10.1038/nature11003.
  33. Dasgupta A, Raftery AE. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association 1998; 93(441):294–302. URL http://www.jstor.org/stable/2669625.
  34. Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal 1998; 41:578–588.
  35. Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics 1993; 49(3):803–821.
  36. Schwarz G. Estimating the Dimension of a Model. The Annals of Statistics 1978; 6(2):461–464, doi:10.2307/2958889. URL http://dx.doi.org/10.2307/2958889.
  37. Wehrens R, Buydens LM, Fraley C, Raftery AE. Model-based clustering for image segmentation and large datasets via sampling. Journal of Classification 2004; 21(2):231–253, doi:10.1007/s00357-004-0018-8. URL http://dx.doi.org/10.1007/s00357-004-0018-8.
  38. Fraley C, Raftery A, Wehrens R. Incremental model-based clustering for large datasets with small clusters. Journal of Computational and Graphical Statistics 2005; 14(3):529–546, doi:10.1198/106186005X59603. URL http://amstat.tandfonline.com/doi/abs/10.1198/106186005X59603.
  39. Rodriguez CC. Entropic priors for discrete probabilistic networks and for mixtures of Gaussians models. arXiv preprint physics/0201016 2002.
  40. Caticha A, Preuss R. Maximum entropy and Bayesian data analysis: Entropic prior distributions. Physical Review E 2004; 70(4):046127.
  41. Neumann T. Bayesian inference featuring entropic priors. 27th Int. Work. on Bayesian Inf. and Max. Ent 2007; 954(1):283–292.
  42. Gerds TA, Cai T, Schumacher M. The performance of risk prediction models. Biometrical Journal 2008; 50(4):457–479, doi:10.1002/bimj.200810443. URL http://dx.doi.org/10.1002/bimj.200810443.
  43. van ’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AAM, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature Jan 2002; 415(6871):530–536. URL http://dx.doi.org/10.1038/415530a.
  44. van de Vijver MJ, He YD, van ’t Veer LJ, Dai H, Hart AA, Voskuil DW, Schreiber GJ, Peterse JL, Roberts C, Marton MJ, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine 2002; 347(25):1999–2009, doi:10.1056/NEJMoa021967. URL http://www.nejm.org/doi/full/10.1056/NEJMoa021967, PMID: 12490681.
  45. Raftery AE, Dean N. Variable selection for model-based clustering. Journal of the American Statistical Association 2006; 101(473):168–178, doi:10.1198/016214506000000113. URL http://www.tandfonline.com/doi/abs/10.1198/016214506000000113.

  Copyright © 2013 IJITCS.  All rights reserved. IISRC® is a registered trademark of IJITCS Properties.