A Survey of Feature Selection
Shi Zheng/6529992
CMPSMC24
Abstract
Feature selection is an important topic in data mining, especially for high-dimensional datasets. This report analyzes and summarizes the definition of feature selection and then introduces high-dimensional data. It divides feature selection into three classes according to the selection strategy and categorizes the methods into three styles by the evaluation function. By analyzing the factors that influence feature selection, this report also introduces some principles for choosing a feature selection method.
1 Introduction
Currently, with the rapid development of information technology and database technology, we are able to obtain and store large amounts of data. Figures from the global data center industry illustrate that data centers have already entered the zettabyte era, with traffic expected to reach nearly 6.6 zettabytes per year by 2016 (Cisco, 2012). There must therefore be a great deal of important information hidden in this massive volume of data, yet traditional data analysis (e.g. a Management Information System) can only perform basic operations such as searching, statistics and sorting, rather than uncovering internal relationships and underlying messages. Recently, data mining technology has attracted a great deal of attention in the information industry and in society, and it is regarded as a result of the natural evolution of information technology. Han and Kamber (2006) treat data mining as knowledge mining from data, which means that data mining is an essential step in the process of knowledge discovery.
2 The importance of feature selection
Data available for mining is raw data. It may be in different formats as it comes from different sources, and it may contain noisy data, irrelevant attributes, missing values, etc. Data therefore needs to be preprocessed before applying any kind of data mining algorithm, and feature selection is an essential step in this preprocessing. In the real world, data sets for analysis may contain many kinds of features, a number of which can be irrelevant or redundant to the mining task. For example, if the task is to classify customers as to whether or not they are likely to purchase the latest VW cars, features such as the customers' telephone numbers and addresses are likely to be irrelevant or redundant, while attributes like age, brand preference and income have a vital impact on the customers' behaviour.
Thus, feature selection aims at finding a minimum set of attributes, which helps to make the pattern easier to understand by reducing the number of attributes appearing in the discovered pattern (Han and Kamber, 2006). For example, if a data set has three features A1, A2, A3, there is a total of 2^3 = 8 feature subsets, and if it has N features the total number of subsets is 2^N. The problem of feature selection is relatively trivial if the number of features is small, since we can analyze all subsets in any order and the search will be completed in a short time (Kantardzic, 2003). However, in many applications the size of a dataset is so large (N > 20) that exhaustive search becomes inefficient and time-consuming. Feature selection methods therefore try to pick a subset of features that are relevant to the requirement. That is to say, given a feature set of size M, the feature selection task is to find a feature subset of size n (n << M) that maximizes the system's ability to classify object instances (Bins, 2001). By discarding irrelevant or misleading factors, techniques that learn from data can benefit a great deal.
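As a minimal illustration of this combinatorial growth, the sketch below (plain Python, with hypothetical feature names) enumerates every subset of a three-feature set and shows how the number of candidate subsets grows as 2^N:

```python
from itertools import combinations

features = ["A1", "A2", "A3"]  # hypothetical feature names

# Enumerate every possible feature subset (including the empty set): 2^3 = 8 in total.
subsets = [list(combo)
           for r in range(len(features) + 1)
           for combo in combinations(features, r)]
print(len(subsets))   # 8
print(subsets)        # [[], ['A1'], ['A2'], ['A3'], ['A1', 'A2'], ...]

# The search space doubles with every added feature, so exhaustive
# enumeration quickly becomes impractical (N > 20 already means over a million subsets).
for n in (3, 10, 20, 30):
    print(n, 2 ** n)
```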
3 High-dimensional data
As mentioned above, feature selection is effective in reducing dimensionality, removing irrelevant data and increasing learning accuracy. However, the recent increase in the dimensionality of data poses a severe challenge to many existing feature selection methods (Yu and Liu, 2003), and the more dimensions the data have, the more complex the task becomes. Some high-dimensional tasks, such as multi-class text categorization (Bo and Yan, 2010) and genetic research data (Xiang, 2010), involve large amounts of data whose characteristics differ from those of low-dimensional data. The outcomes can be completely different if feature selection methods designed for low-dimensional data are used to analyze high-dimensional data. Thus, choosing appropriate FS methods, or reducing the dimensionality first, can to some extent lead to more accurate learning from the data.
4 Feature selection methods
In principle, feature selection methods search through the 2^n subsets of features and try to find the best one according to the requirement. Dash and Liu (1997) divided a typical feature selection process into four basic steps (see Figure 1). The first is a generation procedure, which can also be regarded as a search procedure; it generates the subset used in the next stage. The second is an evaluation function, which allows different hypotheses to be compared and guides the search process. The third is the stopping criterion, which decides when to stop, and the last is a validation procedure, which serves as an opportunity to check whether the subset is valid. If the outcome satisfies the criterion, the process finishes; otherwise the previous steps are repeated until the demand is met.
Figure 1: Feature selection process
4.1 Generation procedure
In terms of the generation procedure, if the original data contain n features, the total number of feature subsets is 2^n, which is a large number even when n is only slightly bigger than ten. The generation procedure addresses this issue, to some extent, and includes three main strategies: complete, heuristic and random (Wang, 2005).
4.1.1 Complete
This strategy searches all possible subset combinations through the whole feature space and chooses the optimal one. If the number of features is N, the search space is 2^N. With this method an optimal subset is guaranteed to be found. However, it can be time-consuming and computationally complex if the feature space is very large, which makes this method less practical (Wang, 2005). The complete strategy can be implemented using various techniques, such as branch and bound, best-first search and beam search (Dash and Liu, 1997).
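A minimal sketch of a complete search is given below; it exhaustively scores every non-empty subset with scikit-learn cross-validation. The classifier choice and the small synthetic dataset are my own assumptions for illustration, not part of the survey:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset so that the 2^N - 1 non-empty subsets stay tractable (N = 6 here).
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, n_redundant=1, random_state=0)

def score(feature_idx):
    """Mean cross-validated accuracy using only the given feature columns."""
    return cross_val_score(KNeighborsClassifier(), X[:, feature_idx], y, cv=5).mean()

best_subset, best_score = None, -1.0
n = X.shape[1]
for r in range(1, n + 1):
    for subset in combinations(range(n), r):
        s = score(list(subset))
        if s > best_score:
            best_subset, best_score = subset, s

print("best subset:", best_subset, "accuracy:", round(best_score, 3))
```

With six features this loop evaluates only 63 subsets, but the same code becomes hopeless for N > 20, which is exactly the practicality problem noted above.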
4.1.2 Heuristic
This strategy renews the feature subset by repeated iteration, and its computational complexity is no more than 2^N. Heuristic search can be implemented in a relatively simple and convenient way, so it is in widespread use in practice (Wang, 2005). Typical techniques are Relief, forward selection, backward selection and decision trees (Dash and Liu, 1997). The drawback of this strategy is that it cannot guarantee an optimal subset, only an approximate one.
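The sketch below illustrates sequential forward selection, one of the heuristic techniques named above: it greedily adds, at each iteration, the single feature that improves the score most. The dataset, classifier and stopping rule are illustrative assumptions; any evaluation function from Section 4.2 could be plugged in instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=5, random_state=0)

def evaluate(cols):
    """Illustrative evaluation function: cross-validated accuracy on the chosen columns."""
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, cols], y, cv=5).mean()

selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    # Greedily add the single feature that improves the score the most.
    scores = {f: evaluate(selected + [f]) for f in remaining}
    f, s = max(scores.items(), key=lambda kv: kv[1])
    if s <= best:               # stop when no remaining feature helps
        break
    selected.append(f)
    remaining.remove(f)
    best = s

print("selected features:", selected, "score:", round(best, 3))
```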
4.1.3 Random
This generation procedure is comparatively new compared with the other two categories. Although the search space is also 2^N, this strategy examines fewer subsets by setting a maximum number of iterations (Dash and Liu, 1997). Common methods are LVF and genetic algorithms. The chosen iteration limit therefore has a vital impact on whether an optimal subset is found, and how to set it efficiently is still an open issue.
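A rough sketch of the random strategy, in the spirit of LVF, is shown below: it draws random subsets for a fixed number of iterations and keeps the best one seen, preferring smaller subsets on ties. The evaluation function and the iteration limit are illustrative assumptions.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
rng = random.Random(0)
n = X.shape[1]

def evaluate(cols):
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, cols], y, cv=5).mean()

best_cols, best_score = list(range(n)), evaluate(list(range(n)))
MAX_ITER = 200  # the iteration limit controls the accuracy/efficiency trade-off

for _ in range(MAX_ITER):
    # Draw a random non-empty subset of feature indices.
    k = rng.randint(1, n)
    cols = sorted(rng.sample(range(n), k))
    s = evaluate(cols)
    # Keep strictly better subsets; on ties, prefer the smaller subset (LVF-style preference).
    if s > best_score or (s == best_score and len(cols) < len(best_cols)):
        best_cols, best_score = cols, s

print("best subset:", best_cols, "accuracy:", round(best_score, 3))
```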
Overall, only the complete strategy is guaranteed to find an optimal subset, but it costs a large amount of time and has relatively high computational complexity. The other two strategies sacrifice accuracy in return for producing results quickly. In practice, several methods are often combined to balance accuracy and efficiency. For example, the three-step algorithm proposed by Draper and Bins (2001) first uses a Relief algorithm to remove irrelevance, then clusters features using K-means to remove redundancy, and finally applies a standard combinatorial feature selection algorithm in the last stage; a sketch of this idea follows below. This three-step combination is shown to be more effective than standard feature selection algorithms for large data sets with many irrelevant and redundant features.
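The following is only a loose sketch of that filter-then-cluster-then-search idea, not the authors' actual algorithm: scikit-learn has no Relief implementation, so a univariate mutual-information filter stands in for it, and all sizes and thresholds are arbitrary assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=5, n_redundant=10, random_state=0)

# Step 1: filter out presumably irrelevant features with a univariate score
# (standing in for Relief).
relevance = mutual_info_classif(X, y, random_state=0)
keep = np.argsort(relevance)[-15:]                 # keep the 15 most relevant

# Step 2: cluster the surviving feature vectors and keep one representative
# per cluster to remove redundancy.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X[:, keep].T)
reps = [keep[np.where(km.labels_ == c)[0][0]] for c in range(8)]

# Step 3: exhaustive (combinatorial) search over the small remaining set.
def score(cols):
    return cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()

best = max((list(c) for r in range(1, len(reps) + 1)
            for c in combinations(reps, r)), key=score)
print("final subset:", best, "accuracy:", round(score(best), 3))
```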
4.2 Evaluation procedure
Feature selection can be regarded as an optimization problem, and the key is to establish an evaluation criterion that identifies which feature subsets contain irrelevant or misleading factors. Different evaluation methods lead to different outcomes. According to the relationship between the feature selection algorithm and the inducer chosen to evaluate the usefulness of the feature selection process, the evaluation procedure can be divided into three broad classes: the filter method, the wrapper method and the embedded method (Tan, 2006).
4.2.1 Filter approaches
These methods select features before the data mining algorithm is run, using criteria that are independent of the data mining task (Tan, 2006) (see Figure 2). In most cases a feature relevance score is calculated and low-scoring features are removed; the subset of features that remains is then used as the input to the classification algorithm. Information gain, correlation-based feature selection and fast correlation-based feature selection are among the most popular filter methods. Because of this design, filter techniques can handle high-dimensional datasets in a simple, fast and inexpensive way (Ji and Hu, 2011). As the filter method is independent of the learning algorithm, feature selection only needs to be performed once, after which a variety of classifiers can be evaluated (Beniwal and Arora, 2012). However, the disadvantages are obvious as well: because these approaches ignore the interaction with the classifier, feature dependencies are ignored, which can lead to worse classification performance (ibid).
Figure 2: Filter methods
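As a minimal filter-style sketch, the example below uses scikit-learn's univariate mutual-information score as the relevance criterion; the synthetic dataset and the choice of k are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=0)

# Score every feature against the class label, independently of any classifier,
# and keep the 8 highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)   # (500, 8)
```

Because the scores are computed once, any number of different classifiers can subsequently be trained on `X_reduced` without repeating the selection, which is exactly the reusability advantage noted above.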
4.2.2 Wrapper approaches
Wrapper methods use the classifier as a black box to evaluate subsets of features and determine how good a given attribute subset is (see Figure 3). The major characteristic of the wrapper approach is that the quality of an attribute subset is directly measured by the performance of the learning algorithm applied to that subset. Because the data mining algorithm is applied to each feature subset considered by the search, the wrapper method tends to be much slower than the filter method. Advantages of wrapper approaches include the interaction between feature subset search and model selection, the ability to take feature dependencies into account, and their conceptual simplicity compared with designing a separate filter criterion. They still have some common drawbacks: a higher risk of overfitting and a very high computational cost (Ladha, 2011). Some typical techniques are Plus-L Minus-R, beam search and randomized hill climbing.
Figure 3: Wrapper methods
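The sketch below shows a wrapper in scikit-learn: SequentialFeatureSelector runs a greedy forward search like the one in Section 4.1.2, but scores every candidate subset by cross-validating the wrapped classifier. The classifier, the target subset size and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# The KNN classifier is treated as a black box: each candidate subset is judged
# purely by the classifier's cross-validated accuracy on that subset.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("chosen features:", sfs.get_support(indices=True))
X_reduced = sfs.transform(X)
```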
4.2.3 Embedded approaches
In embedded methods, feature selection happens as part of the data mining algorithm: the algorithm itself decides which features are selected and which are discarded (Tan, 2006) (see Figure 4). Traditional tools such as decision trees and artificial neural networks fall into this category. Embedded methods have the strength that they interact with the classification model while being far less computationally intensive than wrapper methods (Beniwal and Arora, 2012).
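Since decision trees are named above as a typical embedded tool, the sketch below trains a tree and keeps only the features it actually found useful, via scikit-learn's SelectFromModel; the threshold and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=6, random_state=0)

# The tree is both the learner and the selector: features that contribute
# little to its splits receive near-zero importance and are dropped.
selector = SelectFromModel(DecisionTreeClassifier(random_state=0),
                           threshold="mean")
X_reduced = selector.fit_transform(X, y)

print("kept features:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)
```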
In this section we analyzed the two major steps of feature selection: the generation procedure and the evaluation function, each of which includes three main strategies. There are therefore at least nine (3 × 3) possible combinations of generation procedures and evaluation functions. Among these methods there is no robust comparative study showing that some are consistently superior to others, which means there is no single feature selection method that can be applied to all applications. Although a branch and bound algorithm was developed to select the best feature subset based on an efficient subset enumeration scheme (Narendra and Fukunaga, 1977), it still depends on some specific conditions. Every method can produce the optimal subset only if it meets particular criteria concerning, for example, data type, data size and noise.
5 The principles of choosing a method
The choice of a feature selection method depends on various data characteristics: (i) data types, (ii) the data mining task, and (iii) data size. Once these characteristics are understood, we can select the optimal method for the particular application.
5.1 Data type
There are three basic data types: continuous, discrete and nominal (including Boolean). Each FS method has its own range of supported data types. For instance, the branch and bound algorithm does not support the Boolean type (Dash and Liu, 1997), and the Koller-Sahami method does not support continuous data (Koller and Sahami, 1996).
5.2 Data mining task
In general, the task falls into one of two categories: two-class or multi-class. Kira and Rendell (1992) indicated that some methods, such as Relief, do not support multiple classes. However, this kind of problem can be handled by decomposing it into several two-class problems, as sketched below.
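A minimal sketch of that decomposition, using scikit-learn's one-vs-rest wrapper purely for illustration (the base classifier and the synthetic dataset are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# A three-class problem, decomposed into three two-class (one-vs-rest) sub-problems.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of binary sub-problems:", len(clf.estimators_))   # 3

# A two-class-only FS method such as Relief could be applied inside each binary
# sub-problem and the selected features combined afterwards.
```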
5.3Data size
As we mentioned above,heuristic and random method are more efficient in terms of handle large data size than the complete method in the generation process.While the complete approach could perform well for small data size as it could get the optimal subset.In the evaluation process,Filter is much more
efficient than wrapper,although the latter do better in accuracy. Thus thefilter methods are considered to be more practical in this information age,which datasets are commonly large in size.
6 Conclusion
Feature selection is a necessary step in preprocessing and has received widespread attention in the data mining field. In the generation process of feature selection there are typically three strategies: complete, heuristic and random; in terms of the evaluation process, the methods can be separated into another three classes: filter, wrapper and embedded. Each approach is associated with several specific algorithms. In addition, by analyzing the factors that affect the performance of feature selection, such as data type, data mining task and data size, this report has outlined some principles for choosing an appropriate feature selection method.
