International Journal of Information Technology & Computer Science (IJITCS)
The growth of the Internet has led to the rapid adoption of XML for data representation and exchange over the Web. Measuring the similarity of XML documents is an important research task for effectively managing and retrieving information on the Web. In this paper, we propose a new approach for determining the similarity of XML documents that considers both their content and their structure. The similarity is computed using the Sørensen–Dice coefficient and fuzzy intersection. We experimentally demonstrate the accuracy of the similarity method on real data sets.
Keywords: XML document similarity, fuzzy set, string matching
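
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal sketch of each in isolation. It assumes character bigrams as the unit for the Sørensen–Dice coefficient and the standard minimum operator for fuzzy intersection; the function names are illustrative, and the abstract does not state how the paper combines the two measures, so this is not the authors' method.

```python
from collections import Counter


def bigrams(text):
    """Return the multiset of character bigrams of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a, b):
    """Sørensen–Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Assumption: strings too short to have bigrams match only if equal.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection keeps minimum counts
    return 2.0 * overlap / total


def fuzzy_intersection(mu_a, mu_b):
    """Standard fuzzy intersection: membership is the minimum of the two grades.

    Elements absent from either set have grade 0 and are therefore omitted.
    """
    return {k: min(mu_a[k], mu_b[k]) for k in mu_a.keys() & mu_b.keys()}
```

For example, `dice("night", "nacht")` shares one bigram (`ht`) out of eight in total, giving 0.25, and intersecting `{"title": 0.8, "author": 0.3}` with `{"title": 0.5, "year": 0.9}` keeps only `title` with grade 0.5.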