International Journal of Information Technology & Computer Science (IJITCS)
Advances in information and communication technology make it possible to publish news from any region of the world to distant audiences. The growing volume of web-based news text makes it difficult to monitor and follow the news on a particular subject of interest. For this reason, web-based text mining is often used to classify and monitor web-based news text. In this study, a classification algorithm is proposed that identifies news related to Turkey in the RSS (Really Simple Syndication) feeds of international news channels. The algorithm is based on term weighting and the k-nearest neighbor (K-NN) algorithm. A total of 175 texts related to Turkey and 150 texts related to other countries were collected manually from the RSS feeds of news channels in different parts of the world (Aljazeera English, China.org English, CNN.com World, Irna, The Moscow Times News, Reuters World News, Spiegel Online International, UKWorldNews). A glossary and a database were created from the roots of the words in these texts, extracted with the Porter Stemmer algorithm, and some additional words such as TBMM (Turkish Grand National Assembly) and Çankaya were added to the glossary. The database is used for classification in the subsequent stages. For testing, 100 different texts were obtained at random from the news channels. The words in each news text are checked against the glossary, and a feature vector is created from the result. The K-NN algorithm is then used to determine which class the feature vector belongs to. The proposed system classifies the news with 90% accuracy.
Keywords: Web-based text mining; K-NN; Turkish news
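The pipeline summarized in the abstract (stem the words, build a glossary, turn each text into a feature vector, classify with K-NN) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: `crude_stem` is a very rough stand-in for the Porter stemmer, the glossary is built from all training texts, the features are binary term presence (the abstract does not specify the term-weighting scheme), the distance is Euclidean, k is 3, and the six training headlines are invented toy data.

```python
from collections import Counter
import math
import re


def crude_stem(word):
    """Rough suffix stripper; an assumed stand-in, NOT the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, keep runs of letters only, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[^\W\d_]+", text.lower())]


def build_glossary(texts, extra_terms=()):
    """Collect the stems of all texts plus manually added terms."""
    terms = set()
    for text in texts:
        terms.update(tokenize(text))
    for term in extra_terms:
        terms.update(tokenize(term))
    return sorted(terms)


def to_vector(text, glossary):
    """Binary feature vector: 1 if the glossary term occurs in the text."""
    tokens = set(tokenize(text))
    return [1 if term in tokens else 0 for term in glossary]


def knn_classify(vector, labeled_vectors, k=3):
    """Majority vote among the k training vectors closest to `vector`."""
    nearest = sorted(labeled_vectors, key=lambda item: math.dist(vector, item[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set (invented headlines, not the study's data).
TRAIN = [
    ("Turkish parliament TBMM votes on new budget in Ankara", "turkey"),
    ("Istanbul hosts talks as Turkey weighs trade deal", "turkey"),
    ("Ankara court rules on Turkish election dispute", "turkey"),
    ("Flooding hits northern China after heavy rain", "other"),
    ("Moscow announces new rail link to Siberia", "other"),
    ("Berlin museum opens exhibition on ancient Rome", "other"),
]

GLOSSARY = build_glossary([text for text, _ in TRAIN], extra_terms=("TBMM", "Çankaya"))
TRAIN_VECTORS = [(to_vector(text, GLOSSARY), label) for text, label in TRAIN]


def classify(text, k=3):
    """Classify a news text as 'turkey' or 'other' using the toy model above."""
    return knn_classify(to_vector(text, GLOSSARY), TRAIN_VECTORS, k)


if __name__ == "__main__":
    print(classify("Turkey parliament debates budget in Ankara"))
    print(classify("Heavy rain floods northern China rail link"))
```

With so few training texts, binary vectors overlap very little and many distances tie, so the outcome is sensitive to the choice of k and of distance measure; a real system would need a training set on the scale described in the abstract and a proper stemmer.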
- R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: information and pattern discovery on the World Wide Web,” IEEE International Conference on Tools with Artificial Intelligence, pp. 558–567, November 1997.
- S. Tongchim, V. Sornlertlamvanich, and H. Isahara, “Classification of news web documents based on structural features,” Advances in Natural Language Processing, 5th Int. Conf. FinTAL 2006, Turku, Finland, pp. 153–160, August 2006.
- X. Qi, and B.D. Davison, “Web page classification: features and algorithms,” Technical Report LU-CSE-07-010, Dept. of Comp. Sci. and Eng., Lehigh University, Bethlehem, PA, 18015, pp. 1–31, June 2007.
- Z. Zhu, G-Q Wu, X Wu, X-G Hu, and F-Y Wang, “Automatic recognition of news web pages,” Intelligence and Security Informatics, IEEE ISI 2008 Workshop: PAISI, PACCF, and SOCO 2008, Taipei, Taiwan, pp. 496–501, June 2008.
- R.C. Chen, and C.H. Hsieh, “Web page classification based on a support vector machine using a weighted vote schema,” Expert Systems with Applications, vol. 31, pp. 427–435, 2006
- A. Selamat, and S. Omatu, “Web page feature selection and classification using neural networks,” Information Sciences, vol. 158, pp. 69–88, January 2004
- A. Ribeiro, V. Fresno, M.C. Garcia-Alegre, and D. Guinea, “Web page classification: A soft computing approach,” Lecture Notes in Artificial Intelligence, vol. 2663, pp. 103–112, May 2003
- S.A. Özel, “A Web page classification system based on a genetic algorithm using tagged-terms as features,” Expert Systems with Applications, vol. 38, pp. 3407–3415, April 2011
- H. Uğuz, “A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm,” Knowledge-Based Systems, vol. 24, pp. 1024–1032, October 2011
- D. Xhemali, C.J. Hinde, and R.G. Stone, “Naïve Bayes vs. decision trees vs. neural networks in the classification of training web pages,” Int. J. of Com. Sci., vol. 4 (1), pp. 16–23, 2009
- M.F. Porter, “An algorithm for suffix stripping,” Program (Automated Library and Information Systems), vol. 14 (3), pp. 130–137, 1980
- T.M. Cover, and P.E. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13 (1), pp. 21–27, 1967.