Electronic Theses and Dissertation Database

Title page for ETD etd-04122005-155139


Type of Document Master's Thesis
Author Atlas, Mourad
Author's Email Address matlas1@student.gsu.edu
URN etd-04122005-155139
Title Parallel Computing in Statistical-Validation of Clustering Algorithm for the Analysis of High Throughput Data
Degree Master of Science
Department Mathematics and Statistics
Advisory Committee
  Advisor Name          Title
  Dr. Susmita Datta     Committee Chair
  Dr. Gengsheng Qin     Committee Member
  Dr. Saied Belkasim    Committee Member
Keywords
  • Statistical Validation
  • Clustering Algorithms
  • High Throughput Data
  • Parallel computing
Date of Defense 2005-04-08
Availability restricted
Abstract
Clustering applications currently use classical methods to partition a set of data (or objects) into a set of meaningful sub-classes, called clusters. A cluster is therefore a collection of objects that are “similar” to one another, so that they can be treated collectively as one group, and “dissimilar” to the objects belonging to other clusters. However, clustering poses a number of problems. Among them, as mentioned in [Datta03], handling a large number of dimensions and a large number of data items can be problematic because of the computational time required.

In this thesis, we investigate all of the clustering algorithms used in [Datta03] and present a parallel solution to minimize the computational time. We apply parallel programming techniques to the statistical algorithms as a natural extension of the sequential programming technique using R.

The proposed parallel model has been tested on a high throughput dataset: microarray data on the transcriptional profile during sporulation in budding yeast, containing more than 6,000 genes. Our evaluation covers the scalability of the clustering algorithms on datasets of varying dimensions, the speedup factor, and the efficiency of the parallel model relative to the sequential implementation. Our experiments show that the gene expression data follow the pattern predicted in [Datta03]: Diana appears to be a solid performer, and the group means for each cluster coincide with those in [Datta03].

We show that our parallel model is applicable to the clustering algorithms and is especially useful in applications that deal with high throughput data, such as gene expression data.
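
The abstract evaluates the parallel model by its speedup factor and efficiency but this page does not state how they are defined. The sketch below assumes the conventional definitions (speedup as sequential run time divided by parallel run time, efficiency as speedup divided by the number of workers); the thesis may define them differently. The function names and the timings are hypothetical, and Python is used only for illustration, whereas the thesis itself works in R.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup S = T_sequential / T_parallel (assumed conventional definition)."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, n_workers: int) -> float:
    """Efficiency E = S / p, where p is the number of parallel workers."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return speedup(t_sequential, t_parallel) / n_workers


if __name__ == "__main__":
    # Hypothetical timings in seconds, not results from the thesis.
    t_seq, t_par, workers = 120.0, 40.0, 4
    print("speedup:", speedup(t_seq, t_par))
    print("efficiency:", efficiency(t_seq, t_par, workers))
```

Under these assumed definitions, an efficiency close to 1 would mean that the workers are almost fully used, while lower values point to communication or scheduling overhead.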

Files
  Filename: [GSU] atlas_mourad_200504_mast.pdf
  Size: 910.19 Kb
  Approximate Download Time (Hours:Minutes:Seconds)
    28.8 Modem             00:04:12
    56K Modem              00:02:10
    ISDN (64 Kb)           00:01:53
    ISDN (128 Kb)          00:00:56
    Higher-speed Access    00:00:04
[GSU] indicates that a file or directory is accessible from the Georgia State University campus network only.
