Duplicate Detection Of Records in Queries using Clustering

M. Anitha, A. Srinivas, T.P. Shekhar, D. Sagar
Published Date: February 29, 2012
Volume 2, Issue 2
Pages 29-32

Keywords: data cleaning, duplicate data, data warehouse, data mining
M. Anitha, A. Srinivas, T.P. Shekhar, D. Sagar, "Duplicate Detection Of Records in Queries using Clustering". International Journal of Research in Computer Science, 2 (2): pp. 29-32, February 2012. doi:10.7815/ijorcs.22.2012.019


Detecting and eliminating duplicate data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. The same logical real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses. This entails matching inexact duplicate records, which are records that refer to the same real-world entity while not being syntactically equivalent. This paper focuses on efficient detection and elimination of duplicate data. The main objective of this research is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is used to improve the efficiency of the data.
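
The distinction between exact and inexact duplicates can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not specify the detection rules or the clustering step, so the sketch substitutes a simple pairwise comparison using Python's standard difflib module. The function names, the sample records, and the 0.85 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse whitespace so trivial differences vanish."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Mean per-field similarity ratio in [0, 1] between two normalized records."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Return (exact, inexact) lists of index pairs.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    normalized = [normalize(r) for r in records]
    exact, inexact = [], []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                exact.append((i, j))      # identical after normalization
            elif similarity(normalized[i], normalized[j]) >= threshold:
                inexact.append((i, j))    # similar but not syntactically equal
    return exact, inexact


# Hypothetical sample data: three representations of one entity, plus one other.
records = [
    ("John Smith", "12 Main Street", "Boston"),
    ("john smith", "12 Main Street", "Boston"),   # differs only in letter case
    ("Jon Smith", "12 Main Stret", "Boston"),     # typographical variant
    ("Mary Jones", "4 Oak Avenue", "Chicago"),    # different entity
]
exact, inexact = find_duplicates(records)
print("exact:", exact)
print("inexact:", inexact)
```

Comparing every pair of records is quadratic in the number of records, which is one reason approaches such as the clustering named in the title are used to restrict comparisons to likely matches.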

