Research Article

Biological Sequence Clustering with Symbol Table Data Structure

by Barilee Baridam
International Journal of Applied Information Systems
Foundation of Computer Science (FCS), NY, USA
Volume 7 - Number 10
Year of Publication: 2014
Authors: Barilee Baridam
10.5120/ijais14-451243

Barilee Baridam. Biological Sequence Clustering with Symbol Table Data Structure. International Journal of Applied Information Systems. 7, 10 (October 2014), 1-6. DOI=10.5120/ijais14-451243

@article{ 10.5120/ijais14-451243,
author = { Barilee Baridam },
title = { Biological Sequence Clustering with Symbol Table Data Structure },
journal = { International Journal of Applied Information Systems },
issue_date = { October 2014 },
volume = { 7 },
number = { 10 },
month = { October },
year = { 2014 },
issn = { 2249-0868 },
pages = { 1-6 },
numpages = {9},
url = { https://www.ijais.org/archives/volume7/number10/683-1243/ },
doi = { 10.5120/ijais14-451243 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2023-07-05T18:55:36.278773+05:30
%A Barilee Baridam
%T Biological Sequence Clustering with Symbol Table Data Structure
%J International Journal of Applied Information Systems
%@ 2249-0868
%V 7
%N 10
%P 1-6
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Clustering is the identification of interesting distribution patterns and similarities (natural groupings, or clusters) within a collection of objects in a dataset, based on user-defined criteria. As an unsupervised learning problem, clustering can be distance-based or conceptual. In distance-based clustering, the similarity criterion is distance: objects belong to the same cluster if they are close according to a given distance measure. Conceptual clustering instead defines a concept common to all the objects in a cluster; objects are grouped by how well they fit some descriptive concept, not by a distance or similarity measure. This paper extends the use of the common symbol table to the clustering of biological sequences. The method does not depend on a concept, as conceptual clustering does, nor does it use a distance measure. Instead, it uses data structures (a hash table or list) and detects the occurrence of codons by comparing sequence to sequence (pattern-element-wise) using the codon-based scoring method. The results obtained indicate the usefulness of the symbol table in biological sequence clustering.
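
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the codon-based scoring method, so the `codon_score` function, the greedy grouping in `cluster`, the `threshold` parameter, and the toy sequences are all assumptions introduced here for illustration. The only element taken from the abstract is the use of a hash-table symbol table keyed by codon.

```python
from collections import Counter


def codon_table(seq):
    """Build a symbol table (hash table) mapping each codon to its count.

    Assumption: codons are read as non-overlapping triplets starting at
    position 0; a trailing partial codon is ignored.
    """
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def codon_score(table_a, table_b):
    """Hypothetical score: shared codon occurrences over the larger codon count.

    This stands in for the paper's codon-based scoring method, which the
    abstract does not specify.
    """
    shared = sum((table_a & table_b).values())  # multiset intersection
    total = max(sum(table_a.values()), sum(table_b.values()))
    return shared / total if total else 0.0


def cluster(sequences, threshold=0.5):
    """Greedy grouping (an assumed strategy, not the paper's).

    Each sequence joins the first cluster whose representative table scores
    at or above the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative table, member names)
    for name, seq in sequences.items():
        table = codon_table(seq)
        for rep_table, members in clusters:
            if codon_score(table, rep_table) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((table, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy RNA sequences, invented for this example.
    toy = {"s1": "AUGGCUUAA", "s2": "AUGGCUUGA", "s3": "CCCGGGAAA"}
    print(cluster(toy))
```

With these toy inputs, `s1` and `s2` share two of their three codons and fall into one group, while `s3` shares none and forms its own. A hash table keeps each codon lookup at expected constant time, which is presumably the appeal of the symbol table here; a list-based table, the other structure the abstract mentions, would trade that for simpler ordered traversal.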

Index Terms

Computer Science
Information Sciences

Keywords

Clustering, Sequence, Symbol Table, Codon, Similarity measures