Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification ?

Download Full Text
Author(s):
S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan
Published Date:
November 05, 2012
Issue:
Volume 2, Issue 6
Page(s):
33 - 37
DOI:
10.7815/ijorcs.26.2012.053
Views:
6412
Downloads:
403

Keywords:
aho-corasick, commentz-walter, peptide identification
Citation:
S.M. Vidanagamachchi, S.D. Dewasurendra, R.G. Ragel, M. Niranjan, "Commentz-Walter: Any Better than Aho-Corasick for Peptide Identification ?". International Journal of Research in Computer Science, 2 (6): pp. 33-37, November 2012. doi:10.7815/ijorcs.26.2012.053 Other Formats

Abstract

An algorithm for locating all occurrences of a finite number of keywords in an arbitrary string, also known as multiple strings matching, is commonly required in information retrieval (such as sequence analysis, evolutionary biological studies, gene/protein identification and network intrusion detection) and text editing applications. Although Aho-Corasick was one of the commonly used exact multiple strings matching algorithm, Commentz-Walter has been introduced as a better alternative in the recent past. Comments-Walter algorithm combines ideas from both Aho-Corasick and Boyer Moore. Large scale rapid and accurate peptide identification is critical in computational proteomics. In this paper, we have critically analyzed the time complexity of Aho-Corasick and Commentz-Walter for their suitability in large scale peptide identification. According to the results we obtained for our dataset, we conclude that Aho-Corasick is performing better than Commentz-Walter as opposed to the common beliefs.

  1. Kal S., “Bioinformatics: Sequence alignment and Markov models”. McGraw-Hill Professional, 2008.
  2. Vidanagamachchi, S. M. Dewasurendra, S. D. Ragel R. G. Niranjan, M. "Tile optimization for area in FPGA based hardware acceleration of peptide identification". Industrial and Information Systems (ICIIS) 2011, 6th IEEE International Conference, 16-19 Aug, pp: 140-145. doi: 10.1109/ICIINFS.2011.6038056
  3. A.V. AHO, M.J. CORASICK, “Efficient string matching: an aid to bibliographic search”. Communications of the ACM, Vol. 18, No. 6, 1979, pp: 333-340. doi: 10.1145/360825.360855
  4. Commentz-Walter, B. "A String Matching Algorithm Fast on the Average". Automata, Languages and Programming, Springer Berlin / Heidelberg, 1979, pp: 118-132
  5. Boyer Robert S., Moore J. STROTHER, "A fast string searching algorithm". Communications of the ACM, Vol. 20, No. 10, 1977, pp: 762-772. doi: 10.1145/359842.359859
  6. Komodia (2012), "Aho-Corasick source code". Available: http://www.komodia.com/aho-corasick
  7. Uniprot (2012), "Uniprot", Available: http://www.uniprot.org/
  8. Zeeshan A. KHAN, R.K. PATERIYA, " Multiple Pattern String Matching Methodologies: A Comparative Analysis". International Journal of Scientific and Research Publications (IJSRP), Vol. 2, No. 7, 2012, pp: 498-504
  9. Yoginder S. DANDASS, Shane C. BURGESS, Mark LAWRENCE and Susan M. BRIDGES, “Accelerating String Set Matching in FPGA Hardware for Bioinformatics Research”. BMC Bioinformatics 2008, Vol. 9, No. 197, 2008
  10. Jung, H. J. Baker, Z. K and Prasanna, V. K. “Performance of FPGA Implementation of Bit-split Architecture for Intrusion Detection Systems”. Proceedings of the 20th international conference on Parallel and distributed processing, ser. IPDPS’06, 2006, Washington, DC, USA: IEEE Computer Society, pp. 189–189.
  11. Liu, Y. Yang, Y. Liu, P. and Tan, J. “A table compression method for extended aho-corasick automaton”. Proceedings of the 14th International Conference on Implementation and Application of Automata, ser. CIAA ’09. 2009, Berlin, Heidelberg: Springer-Verlag, pp. 84–93. doi: 10.1007/978-3-642-02979-0_12
  12. Soewito, B. and Weng, N. "Methodology for Evaluating DNA Pattern Searching Algorithms on Multiprocessor". Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, 2007, pp. 570 – 577. doi: 10.1109/BIBE.2007.4375618
  13. Tumeo, A. Villa, O. Chavarrı´a-Miranda, D.G. "Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures". IEEE Transactions on Parallel and Distributed Systems, 2012, pp. 436-443. doi: 10.1109/TPDS.2011.181
  14. Kouzinopoulos, C. S., Michailidis, P. D. and Margaritis, K. G. “Experimental Results on Multiple Pattern Matching Algorithms for Biological Sequences”. Bioinformatics'11, 2011, pp. 274-277.
  15. Dan Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology”. Cambridge University Press, 1997.
  16. B. RAJU, S. DYLN, “Exact Multiple Pattern Matching Algorithm using DNA Sequence and Pattern Pair”. International Journal of Computer Applications, Vol. 17, No. 8, 2011, pp. 32-38
  17. Fisk, M. and Varghese, G. (2004) “Applying Fast String Matching to Intrusion Detection”. University of California San Diego, 2004
  18. S. BENFANO, M. ATU, W. NING and W. HAIBO, "High-speed string matching for network intrusion detection". International Journal of Communication Networks and Distributed Systems, Vol. 3, No. 4, 2009, pp. 319-339
  19. Leena Salmela, "Improved Algorithms for String Searching Problems", Helsinki University of Technology, 2009
  20. Textolution (2012), “FuzzySearch Library”. Available: http://www.textolution.com/fuzzysearch.asp
  21. K. RICHARD and R. MICHAEL, "Efficient randomized pattern-matching algorithms". IBM Journal of Research and Development, Vol. 31, No. 2, 1987, pp. 249-260
  22. String searching algorithm, "String searching algorithm". Available: http://en.wikipedia.org/wiki /String_searching_algorithm#Finite_state_automaton_based_search
  23. D. KNUTH, J. MORRIS and V. PRATT, "Fast pattern matching in strings". SIAM Journal on Computing, Vol. 6, No. 2, 1977, pp.323–350. doi: 10.1137/0206024

  • Dayarathne, Nayomi, and Roshan Ragel. "Accelerating Rabin Karp on a Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA)." Information and Automation for Sustainability (ICIAfS), 2014 7th International Conference on. IEEE, 2014.