Electronic Theses and Dissertation Database

Title page for ETD etd-11262007-015841


Type of Document Dissertation
Author Tan, Feng
URN etd-11262007-015841
Title IMPROVING FEATURE SELECTION TECHNIQUES FOR MACHINE LEARNING
Degree Ph.D.
Department Computer Science
Advisory Committee
  Advisor Name        Title
  Anu G. Bourgeois    Committee Chair
  Robert Harrison     Committee Member
  Yanqing Zhang       Committee Member
  Yichuan Zhao        Committee Member
Keywords
  • Feature selection
  • Gene selection
  • Text categorization
  • Text classification
  • Genetic algorithm
  • Dimension Reduction
  • Term selection
Date of Defense 2007-10-23
Availability unrestricted
Abstract
As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant, or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy, and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization.

Researchers have introduced many feature selection algorithms with different selection criteria. However, no single criterion has been found to be best for all applications. We proposed a hybrid feature selection framework based on genetic algorithms (GAs), which we call the hybrid genetic feature selection (HGFS) framework. It is a wrapper method: it employs a target learning algorithm to evaluate features. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. Experiments on genomic data demonstrate that it is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size than each individual feature selection algorithm.
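
The wrapper idea described above can be illustrated with a minimal sketch. It is not the HGFS implementation from the dissertation: the nearest-centroid classifier standing in for the target learning algorithm, the size penalty, the GA operators, and every function and parameter name (`fitness`, `genetic_feature_selection`, `seeds`, and so on) are assumptions made for illustration. The optional `seeds` argument shows one plausible way to feed in subsets chosen by other selection criteria; how HGFS actually combines criteria is not stated in this abstract.

```python
import random


def nearest_centroid_accuracy(X, y, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier that only
    sees the features whose bit in `mask` is 1 (stand-in target learner)."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    correct = 0
    for i in range(len(X)):
        # Per-class feature sums and counts from every sample except i.
        sums, counts = {}, {}
        for k in range(len(X)):
            if k == i:
                continue
            acc = sums.setdefault(y[k], [0.0] * len(cols))
            for c, j in enumerate(cols):
                acc[c] += X[k][j]
            counts[y[k]] = counts.get(y[k], 0) + 1

        def dist(label):
            # Squared Euclidean distance from sample i to a class centroid.
            return sum((X[i][j] - sums[label][c] / counts[label]) ** 2
                       for c, j in enumerate(cols))

        if min(sums, key=dist) == y[i]:
            correct += 1
    return correct / len(X)


def fitness(X, y, mask, size_penalty=0.01):
    """Reward accuracy, lightly penalize subset size (assumed trade-off)."""
    return nearest_centroid_accuracy(X, y, mask) - size_penalty * sum(mask)


def genetic_feature_selection(X, y, seeds=(), pop_size=20, generations=30,
                              mutation_rate=0.05, rng=None):
    """Search bitmask feature subsets with a simple elitist GA.
    Assumes at least 2 features and pop_size >= 4."""
    rng = rng or random.Random(0)
    n = len(X[0])
    # Start from any seed masks, then fill up with random masks.
    population = [list(s) for s in seeds][:pop_size]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n)])
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(X, y, m),
                        reverse=True)
        parents = ranked[:pop_size // 2]
        children = [ranked[0][:]]  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=lambda m: fitness(X, y, m))
```

Calling `genetic_feature_selection(X, y, seeds=[mask_a, mask_b])` with a list-of-lists `X` and a label list `y` returns a 0/1 mask over the feature columns. Because the best mask is carried over unchanged in each generation, the result never scores worse under `fitness` than the best seed.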

A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We proposed a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and are distributed most differently among the classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates equal or better performance than the others in many cases.

Files
  Filename                 Size       Approximate Download Time (Hours:Minutes:Seconds)
                                      28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  tan_feng_200712_phd.pdf  806.09 Kb  00:03:43    00:01:55   00:01:40      00:00:50       00:00:04
