A Survey of Feature Selection
Shi Zheng/6529992
CMPSMC24
Abstract
Feature selection is an important topic in data mining, especially for high-dimensional datasets. This report analyzes and summarizes the definition of feature selection and then introduces high-dimensional data. It divides feature selection into three classes according to the selection strategy and categorizes the methods into three styles by the evaluation function. By analyzing the factors that influence feature selection, this report also introduces some principles for choosing a feature selection method.
1 Introduction
Currently, with the rapid development of information technology and database technology, we are able to obtain and store large amounts of data. Figures from the global data center industry illustrate that data centers have already entered the zettabyte era, with traffic expected to reach nearly 6.6 zettabytes per year by 2016 (Cisco, 2012). There must therefore be a great deal of important information hidden in this massive volume of data, yet traditional data analysis (e.g. a Management Information System) can only perform basic operations such as searching, statistics and sorting, rather than uncovering internal relationships and underlying messages. Recently, data mining technology has attracted a great deal of attention in the information industry and in society, and it is regarded as a result of the natural evolution of information technology. Han and Kamber (2006) treat data mining as knowledge mining from data, which means that data mining is an essential step in the process of knowledge discovery.
2 The importance of feature selection
Data available for mining is raw data. It may be in different formats as it comes from different sources, and it may contain noisy data, irrelevant attributes, missing values, etc. Data therefore needs to be preprocessed before applying any kind of data mining algorithm, and feature selection is an essential step in this preprocessing. In the real world, data sets for analysis may contain many kinds of features, a number of which can be irrelevant or redundant to the mining task. For example, if the task is to classify customers as to whether or not they are likely to purchase the latest VW cars, features such as the customers' telephone numbers and addresses are likely to be irrelevant or redundant, while attributes like age, brand preference and income have a vital impact on the customers' behaviour.
Thus, feature selection aims at finding a minimum set of attributes, which helps to make the pattern easier to understand by reducing the number of attributes appearing in the discovered pattern (Han and Kamber, 2006). For example, if a data set has three features A1, A2, A3, there is a total of 2^3 = 8 feature subsets, and if it has N features the total number of subsets is 2^N. The problem of feature selection is relatively trivial if the number of features is small, since we can analyze all subsets in any order and the search will be completed in a short time (Kantardzic, 2003). However, in many applications the size of a dataset is so large (N > 20) that exhaustive search becomes inefficient and time-consuming. Feature selection methods therefore try to pick a subset of features that are relevant to the requirement. That is to say, given a feature set of size M, the feature selection task is to find a feature subset of size n (n << M) that maximizes the system's ability to classify object instances (Bins, 2001). By discarding irrelevant or misleading factors, techniques that learn from data can benefit a great deal.
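As a minimal illustration of this combinatorial growth, the sketch below (plain Python, with hypothetical feature names) enumerates every subset of a three-feature set and shows how the number of candidate subsets grows as 2^N:

```python
from itertools import combinations

features = ["A1", "A2", "A3"]  # hypothetical feature names

# Enumerate every possible feature subset (including the empty set): 2^3 = 8 in total.
subsets = [list(combo)
           for r in range(len(features) + 1)
           for combo in combinations(features, r)]
print(len(subsets))   # 8
print(subsets)        # [[], ['A1'], ['A2'], ['A3'], ['A1', 'A2'], ...]

# The search space doubles with every added feature, so exhaustive
# enumeration quickly becomes impractical (N > 20 already means over a million subsets).
for n in (3, 10, 20, 30):
    print(n, 2 ** n)
```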
3 High-dimensional data
As mentioned above, feature selection is effective in reducing dimensionality, removing irrelevant data and increasing learning accuracy. However, the recent increase in the dimensionality of data poses a severe challenge to many existing feature selection methods (Yu and Liu, 2003), and the more dimensions the data have, the more complex the task becomes. Some high-dimensional tasks, such as multi-class text categorization (Bo and Yan, 2010) and genetic research data (Xiang, 2010), involve large amounts of data whose characteristics differ from those of low-dimensional data. The outcomes can be completely different if feature selection methods designed for low-dimensional data are used to analyze high-dimensional data. Thus, choosing appropriate FS methods, or reducing the dimensionality first, can to some extent lead to more accurate learning from the data.
4 Feature selection methods
In principle, feature selection methods search through the 2^n subsets of features and try to find the best one according to the requirement. Dash and Liu (1997) divided a typical feature selection process into four basic steps (see Figure 1). The first is a generation procedure, which can also be regarded as a search procedure; it generates the subset used in the next stage. The second is an evaluation function, which allows different hypotheses to be compared and guides the search process. The third is the stopping criterion, which decides when to stop, and the last is a validation procedure, which serves as an opportunity to check whether the subset is valid. If the outcome satisfies the criterion, the process finishes; otherwise the previous steps are repeated until the demand is met.
Figure 1: Feature selection process
4.1 Generation procedure
In terms of the generation procedure, if the original data contain n features, the total number of feature subsets is 2^n, which is a large number even when n is only slightly bigger than ten. The generation procedure addresses this issue, to some extent, and includes three main strategies: complete, heuristic and random (Wang, 2005).
4.1.1 Complete
This strategy searches all possible subset combinations through the whole feature space and chooses the optimal one. If the number of features is N, the search space is 2^N. With this method an optimal subset is guaranteed to be found. However, it can be time-consuming and computationally complex if the feature space is very large, which makes this method less practical (Wang, 2005). The complete strategy can be implemented using various techniques, such as branch and bound, best-first search and beam search (Dash and Liu, 1997).
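A minimal sketch of a complete search is given below; it exhaustively scores every non-empty subset with scikit-learn cross-validation. The classifier choice and the small synthetic dataset are my own assumptions for illustration, not part of the survey:

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small synthetic dataset so that the 2^N - 1 non-empty subsets stay tractable (N = 6 here).
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=3, n_redundant=1, random_state=0)

def score(feature_idx):
    """Mean cross-validated accuracy using only the given feature columns."""
    return cross_val_score(KNeighborsClassifier(), X[:, feature_idx], y, cv=5).mean()

best_subset, best_score = None, -1.0
n = X.shape[1]
for r in range(1, n + 1):
    for subset in combinations(range(n), r):
        s = score(list(subset))
        if s > best_score:
            best_subset, best_score = subset, s

print("best subset:", best_subset, "accuracy:", round(best_score, 3))
```

With six features this loop evaluates only 63 subsets, but the same code becomes hopeless for N > 20, which is exactly the practicality problem noted above.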
4.1.2 Heuristic
This strategy renews the feature subset by repeated iteration, and its computational complexity is no more than 2^N. Heuristic search can be implemented in a relatively simple and convenient way, so it is in widespread use in practice (Wang, 2005). Typical techniques are Relief, forward selection, backward selection and decision trees (Dash and Liu, 1997). The drawback of this strategy is that it cannot guarantee an optimal subset, only an approximate one.
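The sketch below illustrates sequential forward selection, one of the heuristic techniques named above: it greedily adds, at each iteration, the single feature that improves the score most. The dataset, classifier and stopping rule are illustrative assumptions; any evaluation function from Section 4.2 could be plugged in instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=15,
                           n_informative=5, random_state=0)

def evaluate(cols):
    """Illustrative evaluation function: cross-validated accuracy on the chosen columns."""
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, cols], y, cv=5).mean()

selected, remaining, best = [], list(range(X.shape[1])), 0.0
while remaining:
    # Greedily add the single feature that improves the score the most.
    scores = {f: evaluate(selected + [f]) for f in remaining}
    f, s = max(scores.items(), key=lambda kv: kv[1])
    if s <= best:               # stop when no remaining feature helps
        break
    selected.append(f)
    remaining.remove(f)
    best = s

print("selected features:", selected, "score:", round(best, 3))
```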
4.1.3 Random
This generation procedure is comparatively new compared with the other two categories. Although the search space is also 2^N, this strategy examines fewer subsets by setting a maximum number of iterations (Dash and Liu, 1997). Common methods are LVF and genetic algorithms. The chosen iteration limit therefore has a vital impact on whether an optimal subset is found, and how to set it efficiently is still an open issue.
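A rough sketch of the random strategy, in the spirit of LVF, is shown below: it draws random subsets for a fixed number of iterations and keeps the best one seen, preferring smaller subsets on ties. The evaluation function and the iteration limit are illustrative assumptions.

```python
import random

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
rng = random.Random(0)
n = X.shape[1]

def evaluate(cols):
    return cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, cols], y, cv=5).mean()

best_cols, best_score = list(range(n)), evaluate(list(range(n)))
MAX_ITER = 200  # the iteration limit controls the accuracy/efficiency trade-off

for _ in range(MAX_ITER):
    # Draw a random non-empty subset of feature indices.
    k = rng.randint(1, n)
    cols = sorted(rng.sample(range(n), k))
    s = evaluate(cols)
    # Keep strictly better subsets; on ties, prefer the smaller subset (LVF-style preference).
    if s > best_score or (s == best_score and len(cols) < len(best_cols)):
        best_cols, best_score = cols, s

print("best subset:", best_cols, "accuracy:", round(best_score, 3))
```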
Overall, only the complete strategy is guaranteed to find an optimal subset, but it costs a large amount of time and has relatively high computational complexity. The other two strategies sacrifice accuracy in return for producing results quickly. In practice, several methods are often combined to balance accuracy and efficiency. For example, the three-step algorithm proposed by Draper and Bins (2001) first uses a Relief algorithm to remove irrelevance, then clusters features using K-means to remove redundancy, and finally applies a standard combinatorial feature selection algorithm in the last stage; a sketch of this idea follows below. This three-step combination is shown to be more effective than standard feature selection algorithms for large data sets with many irrelevant and redundant features.
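The following is only a loose sketch of that filter-then-cluster-then-search idea, not the authors' actual algorithm: scikit-learn has no Relief implementation, so a univariate mutual-information filter stands in for it, and all sizes and thresholds are arbitrary assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=30,
                           n_informative=5, n_redundant=10, random_state=0)

# Step 1: filter out presumably irrelevant features with a univariate score
# (standing in for Relief).
relevance = mutual_info_classif(X, y, random_state=0)
keep = np.argsort(relevance)[-15:]                 # keep the 15 most relevant

# Step 2: cluster the surviving feature vectors and keep one representative
# per cluster to remove redundancy.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X[:, keep].T)
reps = [keep[np.where(km.labels_ == c)[0][0]] for c in range(8)]

# Step 3: exhaustive (combinatorial) search over the small remaining set.
def score(cols):
    return cross_val_score(KNeighborsClassifier(), X[:, cols], y, cv=5).mean()

best = max((list(c) for r in range(1, len(reps) + 1)
            for c in combinations(reps, r)), key=score)
print("final subset:", best, "accuracy:", round(score(best), 3))
```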
4.2 Evaluation procedure
Feature selection can be regarded as an optimization problem, and the key is to establish an evaluation criterion that identifies which feature subsets contain irrelevant or misleading factors. Different evaluation methods lead to different outcomes. According to the relationship between the feature selection algorithm and the inducer chosen to evaluate the usefulness of the feature selection process, the evaluation procedure can be divided into three broad classes: the filter method, the wrapper method and the embedded method (Tan, 2006).
4.2.1 Filter approaches
These methods select features before the data mining algorithm is run, using criteria that are independent of the data mining task (Tan, 2006) (see Figure 2). In most cases a feature relevance score is calculated and low-scoring features are removed; the subset of features that remains is then used as the input to the classification algorithm. Information gain, correlation-based feature selection and fast correlation-based feature selection are among the most popular filter methods. Because of this design, filter techniques can handle high-dimensional datasets in a simple, fast and inexpensive way (Ji and Hu, 2011). As the filter method is independent of the learning algorithm, feature selection only needs to be performed once, after which a variety of classifiers can be evaluated (Beniwal and Arora, 2012). However, the disadvantages are obvious as well: because these approaches ignore the interaction with the classifier, feature dependencies are ignored, which can lead to worse classification performance (ibid).
Figure 2: Filter methods
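As a minimal filter-style sketch, the example below uses scikit-learn's univariate mutual-information score as the relevance criterion; the synthetic dataset and the choice of k are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=50,
                           n_informative=8, random_state=0)

# Score every feature against the class label, independently of any classifier,
# and keep the 8 highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=8)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)   # (500, 8)
```

Because the scores are computed once, any number of different classifiers can subsequently be trained on `X_reduced` without repeating the selection, which is exactly the reusability advantage noted above.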
4.2.2 Wrapper approaches
Wrapper methods use the classifier as a black box to evaluate subsets of features and determine how good a given attribute subset is (see Figure 3). The major characteristic of the wrapper approach is that the quality of an attribute subset is directly measured by the performance of the learning algorithm applied to that subset. Because the data mining algorithm is applied to each feature subset considered by the search, the wrapper method tends to be much slower than the filter method. Advantages of wrapper approaches include the interaction between feature subset search and model selection, the ability to take feature dependencies into account, and their conceptual simplicity compared with designing a separate filter criterion. They still have some common drawbacks: a higher risk of overfitting and a very high computational cost (Ladha, 2011). Some typical techniques are Plus-L Minus-R, beam search and randomized hill climbing.
Figure 3: Wrapper methods
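The sketch below shows a wrapper in scikit-learn: SequentialFeatureSelector runs a greedy forward search like the one in Section 4.1.2, but scores every candidate subset by cross-validating the wrapped classifier. The classifier, the target subset size and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)

# The KNN classifier is treated as a black box: each candidate subset is judged
# purely by the classifier's cross-validated accuracy on that subset.
sfs = SequentialFeatureSelector(KNeighborsClassifier(),
                                n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)

print("chosen features:", sfs.get_support(indices=True))
X_reduced = sfs.transform(X)
```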
4.2.3 Embedded approaches
In embedded methods, feature selection happens as part of the data mining algorithm: the algorithm itself decides which features are selected and which are discarded (Tan, 2006) (see Figure 4). Traditional tools such as decision trees and artificial neural networks fall into this category. Embedded methods have the strength that they interact with the classification model while being far less computationally intensive than wrapper methods (Beniwal and Arora, 2012).
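Since decision trees are named above as a typical embedded tool, the sketch below trains a tree and keeps only the features it actually found useful, via scikit-learn's SelectFromModel; the threshold and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=6, random_state=0)

# The tree is both the learner and the selector: features that contribute
# little to its splits receive near-zero importance and are dropped.
selector = SelectFromModel(DecisionTreeClassifier(random_state=0),
                           threshold="mean")
X_reduced = selector.fit_transform(X, y)

print("kept features:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)
```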
In this section we analyzed the two major steps of feature selection: the generation procedure and the evaluation function, each of which includes three main strategies. There are therefore at least nine (3 × 3) possible combinations of generation procedures and evaluation functions. Among these methods there is no robust comparative study showing that some are consistently superior to others, which means there is no single feature selection method that can be applied to all applications. Although a branch and bound algorithm was developed to select the best feature subset based on an efficient subset enumeration scheme (Narendra and Fukunaga, 1977), it still depends on some specific conditions. Every method can produce the optimal subset only if it meets particular criteria concerning, for example, data type, data size and noise.
5 The principles of choosing a method
The choice of a feature selection method depends on various data characteristics: (i) data types, (ii) the data mining task, and (iii) data size. Once these characteristics are understood, we can select the optimal method for the particular application.
5.1 Data type
There are three basic data types: continuous, discrete and nominal (including Boolean). Each FS method has its own range of supported data types. For instance, the branch and bound algorithm does not support the Boolean type (Dash and Liu, 1997), and the Koller-Sahami method does not support continuous data (Koller and Sahami, 1996).
5.2 Data mining task
In general, the task falls into one of two categories: two-class or multi-class. Kira and Rendell (1992) indicated that some methods, such as Relief, do not support multiple classes. However, this kind of problem can be handled by decomposing it into several two-class problems, as sketched below.
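A minimal sketch of that decomposition, using scikit-learn's one-vs-rest wrapper purely for illustration (the base classifier and the synthetic dataset are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# A three-class problem, decomposed into three two-class (one-vs-rest) sub-problems.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("number of binary sub-problems:", len(clf.estimators_))   # 3

# A two-class-only FS method such as Relief could be applied inside each binary
# sub-problem and the selected features combined afterwards.
```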
5.3Data size
As we mentioned above,heuristic and random method are more efficient in terms of handle large data size than the complete method in the generation process.While the complete approach could perform well for small data size as it could get the optimal subset.In the evaluation process,Filter is much more
efficient than wrapper,although the latter do better in accuracy. Thus thefilter methods are considered to be more practical in this information age,which datasets are commonly large in size.
6 Conclusion
Feature selection is a necessary step in preprocessing and has received widespread attention in the data mining field. In the generation process of feature selection there are typically three strategies: complete, heuristic and random; in terms of the evaluation process, the methods can be separated into another three classes: filter, wrapper and embedded. Each approach is associated with several specific algorithms. In addition, by analyzing the factors that affect the performance of feature selection, such as data type, data mining task and data size, this report has outlined some principles for choosing an appropriate feature selection method.
