CRAFT: A Framework for Evaluating Software Clustering Results in the Absence of Benchmark Decompositions
Brian S. Mitchell and Spiros Mancoridis
Department of Mathematics & Computer Science
Drexel University, Philadelphia, PA, USA
{bmitchel,smancori}@mcs.drexel.edu
Abstract
Software clustering algorithms are used to create high-level views of a system's structure using source code-level artifacts. Software clustering is an active area of research that has produced many clustering algorithms. However, we have seen very little work that investigates how the results of these algorithms can be evaluated objectively in the absence of a benchmark decomposition, or without the active participation of the original designers of the system.

Ideally, for a given system, an agreed-upon reference (benchmark) decomposition of the system's structure would exist, allowing the results of various clustering algorithms to be compared against it. Since such benchmarks seldom exist, we seek alternative methods to gain confidence in the quality of results produced by software clustering algorithms.

In this paper we present a tool that supports the evaluation of software clustering results in the absence of a benchmark decomposition.
1. Introduction & Background
Maintaining large and complex software systems is a daunting task. Poor software maintenance practices often lead to a system's premature retirement, or necessitate significant restructuring in order to keep the source code in manageable condition. There are many causes for these poor maintenance practices. Little can be done about practical issues such as limited maintenance budgets, lack of access to the original developers and designers of the system, and short development schedules. However, if software maintainers have a thorough understanding of the underlying system structure, it is more likely that changes to the source code can be performed without deviating from the system design and architecture. Additionally, if the source code structure is documented, new maintenance developers will require less time to become productive.
The best way to gain insight into a system's structure is to have access to accurate documentation. Unfortunately, system documentation is often not up to date, if it exists at all. One way that researchers have addressed this problem is by developing software clustering algorithms and tools that produce high-level views of a system's structure directly from its source code. Because clustering techniques use a variety of criteria to decompose a software system into clusters, it is not surprising that these algorithms produce different results when applied to the same system.
The above issues further complicate things for software practitioners who are trying to understand the structure of complex systems. A manual analysis of the source code is tedious, and may not even be possible, because of its large number of components and dependencies. Clustering algorithms automate this analysis.
Many of the clustering techniques published in the literature can be categorized by the way they establish clusters. Hutchens and Basili [7] developed an algorithm that clusters procedures into modules by measuring the interaction between pairs of procedures. Schwanke et al. [16, 17] introduced the notion of using design principles such as low coupling and high cohesion to create clusters. Choi and Scacchi [4] describe a clustering technique based on maximizing the cohesiveness of clusters by evaluating the exchange of resources between modules. Hausi Müller et al. [15] implemented several software clustering heuristics in the Rigi tool that (a) measure the relative strength between interfaces, (b) identify omnipresent modules, and (c) use similarity of module names. Clustering based on similar patterns in implementation information (e.g., module file names) has been investigated by Anquetil et al. [2, 3]. Concept analysis [10, 19, 1] has also been explored in software clustering research. Our research is based on using optimization techniques [12, 5, 11, 14] to determine clusters.
Now that a plethora of clustering approaches exist, the validation of clustering results is starting to attract interest from the Software Engineering research community. Many of the clustering techniques published in the literature present case studies where the results are evaluated by the authors or by the developers of the system being studied. This evaluation technique is very subjective. Moreover, the literature often does not describe the types of systems for which a clustering technique does not perform well. For example, our clustering technique, named Bunch [12, 5, 11, 14], forms clusters based on maximizing the cohesiveness of the clusters, while minimizing the inter-cluster coupling. While we think that this technique is good for many types of systems, it is not suitable for all systems. As an example, Bunch may not provide good results for systems whose architecture is event-driven (e.g., UI code), or whose architecture loads system resources dynamically.¹

¹ This problem is interesting and has received very little formal attention by researchers. We are investigating how dynamic program behavior can be integrated into our clustering techniques.
The issues described above should be considered, especially when the decompositions produced by different clustering algorithms differ dramatically. Recently, researchers have begun developing infrastructure to evaluate clustering techniques by proposing similarity measurements [2, 3, 13]. These measurements enable the results of clustering algorithms to be compared to each other, and preferably to be compared to an agreed-upon "benchmark" standard. A high similarity value should provide confidence that the clustering algorithm is producing good decompositions.
The remainder of this paper is dedicated to techniques for evaluating the results of clustering algorithms when no standard benchmark exists. We present a tool we developed that provides a framework for evaluating clustering techniques. We hope that our tool will be of value to researchers who are creating or evaluating clustering algorithms.
2. Establishing Confidence in Software Clustering Results
In this section we address the problem of how to gain confidence in a software clustering result when a reference benchmark standard is not available for comparison. Although significant research emphasis has been placed on clustering, we have seen little work on measuring the effectiveness of these techniques. The following list outlines the current state of practice for evaluating software clustering results:
• Software clustering research often revisits the same set of test cases for validation (e.g., linux, mosaic, etc.). This is appropriate because the structure of these systems is well understood. However, for industry adoption of software clustering technology, researchers must show that these tools can be applied to a variety of systems successfully.
• There are many clustering tools and techniques. There is value to this diversity, as certain clustering techniques produce better results for some types of systems than others. However, no work has been done to help guide end-users to those techniques that work best for their type of system.

• The interpretation of software clustering results tends to focus on the overall result (i.e., the entire system), and not on partial results (e.g., a subsystem), which may also be of value.

• Software Engineering researchers have been investigating similarity measurements [2, 3, 13] to quantify the level of agreement between pairs of clustering results. Researchers tend to focus on using these measurements to compare clustering results directly; however, we know of no work that tries to find common patterns in clustering results that are produced by a family of different clustering algorithms.
• The importance of evaluating clustering techniques has been investigated by Koschke and Eisenbarth [9]. Their paper describes several techniques for comparing the results of a clustering algorithm to a reference decomposition. The authors also describe a consensus-driven approach for creating a reference decomposition. However, their approach for establishing the reference decomposition is manual, requiring teams of software engineers.
To address some of the problems mentioned above, we propose a framework that supports a process for automatically creating views of a system's structure that are based on common patterns produced by various clustering algorithms. The initial step in our process clusters a system many times using different clustering algorithms. For each clustering run, all of the modules that appear in the same cluster are recorded. Using this information, our framework presents consolidated views of the system structure to the user. By using a collection of clustering algorithms we "average" the different clustering results and present views based on common agreement across the various clustering algorithms. In the next section we introduce our framework, which we have implemented and made available to researchers on the world wide web at s.drexel.edu/bunch/CRAFT.

Figure 1. The User Interface of the CRAFT Tool
3. Our Framework: CRAFT
In Figure 1 we illustrate the user interface of our clustering analysis tool, which we call CRAFT (Clustering Results Analysis Framework and Tools). The main window is organized into two sections. The section at the top of the window collects general parameters, and the section at the bottom of the window accepts thresholds that apply to a particular analysis service. Our tool currently supports two such services, Confidence Analysis and Impact Analysis, which are discussed later in this section.
The overall goal of the CRAFT framework is to expose common patterns in the results produced by different clustering algorithms. By highlighting the common patterns, we gain confidence that agreement across a collection of clustering algorithms is likely to reflect the underlying system structure. We are also interested in identifying modules/classes that tend to drift across many clusters, as modifications to these modules/classes may have a large impact on the overall system structure.
As shown in Figure 2, the CRAFT architecture consists of the following subsystems:

• The User Interface: CRAFT obtains information from the user to guide the collection, processing and visualization of clustering results.

• Clustering Services: The actual clustering process is externalized to a dynamically loadable clustering driver. The clustering driver is responsible for partitioning a graph that represents the components and dependencies of a software system into a set of non-overlapping clusters. The CRAFT framework invokes the clustering driver multiple times based on the value provided by the user in the Number of Runs entry field.

• Data Analysis Services: The results of the clustering activity are stored in a relational database. For each clustering run, every pair of modules in the system is recorded to indicate if the modules in the pair are in the same or different clusters. The repository of clustering results supports two types of analyses: Confidence Analysis and Impact Analysis. Confidence Analysis produces a decomposition of the system based on common trends in the clustering data. Impact Analysis examines each module in the system to determine how it relates to all of the other modules in the system with respect to its placement into clusters. Confidence Analysis helps users gain confidence in a clustering result when no reference decomposition exists. Impact Analysis helps users perform local analyses, which is much simpler than analyzing the entire decomposition of a system.

• Visualization: Once the results of the clustering analysis have been produced, our tool presents the results in diagrammatic form.
We next present a small example illustrating the capabilities of the CRAFT framework, followed by a detailed description of the above-mentioned services.
Figure 2. The Architecture of the CRAFT Tool
3.1. A Small Example
In Figure 3 we illustrate a small example to help explain the usage of the CRAFT framework. The Module Dependency Graph (MDG), which is illustrated at the upper-left corner of the figure, shows a system consisting of 5 modules {M1, M2, M3, M4, M5} with 6 dependencies between the modules. The MDG is a generic, language-independent representation of the structure of the system's source code components. This representation includes all of the modules (classes) in the system and the set of dependencies that exist between the modules (classes).
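For concreteness, an MDG is typically just a small text file. The snippet below sketches what the input for a five-module system of this shape might look like, assuming a Bunch-style MDG format of one dependency per line (source module, then target module, optionally followed by a weight); the exact edge set shown is illustrative and is not claimed to match Figure 3.

    M1 M2
    M1 M3
    M2 M3
    M3 M4
    M3 M5
    M4 M5

With an input of this form, the clustering drivers described in Section 3.3 can partition the graph repeatedly without any language-specific knowledge of the system being analyzed.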
Intuitively, based on the dependencies between the modules, we expect that {M1, M2} and {M4, M5} should appear in the same cluster. We also expect that module M3 is equally likely to appear in the {M1, M2} or {M4, M5} cluster, since it is isomorphic to both of them.
We executed CRAFT with a setting that clustered the example system 100 times. The data from the clustering results repository is shown on the bottom-left corner of Figure 3. Each relation in the repository indicates the frequency that a pair of modules appears in the same cluster. Visual inspection of this data confirms that modules M1 and M2, along with modules M4 and M5, appear in the same cluster 100% of the time. The data also shows that module M3 appears in the {M1, M2} cluster 57% of the time and in the {M4, M5} cluster 43% of the time.
The results of the Impact Analysis are shown in the center of Figure 3. To conserve space, we have configured the CRAFT user interface to show only the modules that exceed an upper threshold (as shown by the ↑ icon) that was selected by the user. In Section 3.4.2 we describe additional features of the Impact Analysis service, inclusive of the thresholds that can be specified by the user. The Impact Analysis results clearly show that modules M1 and M2 always appear together, modules M4 and M5 always appear together, and that module M3 does not appear in any particular cluster consistently (the icon for M3 cannot be expanded).
On the right side of Figure 3 we show the results of the Confidence Analysis. The partition of the system shown in the Confidence Analysis result is consistent with our intuitive decomposition, as well as with the view produced by the Impact Analysis service. Note that the edges shown in Figure 3 do not depict dependencies in the system being studied. The edges instead represent the frequency percentage that the two modules appear in the same cluster based on all of the clustering runs. The cluster containing only module M3 does not necessarily indicate that this module appears alone in any of the clustering results. In the Confidence Analysis visualization, singleton clusters should be interpreted as modules that do not appear in any particular cluster consistently.
Given such a small example, the value of having multiple views of the clustering results data is not apparent. In Section 4, where we present the results of a case study on a non-trivial example, the benefits of the two views become more obvious. Now that a simple example has been presented, we return to discussing the architecture and services of CRAFT.
3.2. The User Interface
The user interface of our tool, which is shown in Figure 1, collects information that is necessary to analyze the results of a collection of clustering algorithms. The key information collected on the user interface is the text file name of the MDG, the number of times the clustering step will be performed (described in Section 3.3), the name of a Java Bean that will perform the clustering activities, and the thresholds that are used by the data analysis and visualization services. The tab at the bottom of the window is used to select the analysis service to be performed, and to collect the parameters associated with each analysis service.
Figure 3. A Small Example. The figure shows the MDG graph, the Clustering Results Repository, the Impact Analysis results visualization, and the Confidence Analysis visualization. The repository data for the example is: {M1, M2, 100}, {M1, M3, 57}, {M1, M4, 0}, {M1, M5, 0}, {M2, M3, 57}, {M2, M4, 0}, {M2, M5, 0}, {M3, M4, 43}, {M3, M5, 43}, {M4, M5, 100}.
3.3. The Clustering Service
Prior to using our tool, a clustering driver class must be created and packaged as a Java Bean. The responsibility of this driver is to accept as input: (a) the file name of the MDG, (b) the name of an output file that the clustering driver is expected to produce, (c) the current run number, and (d) the collection of the parameters specified on the CRAFT user interface. The clustering driver encapsulates the algorithm(s) used to cluster the system. The input file is provided in our MDG file format, and the output file must adhere to our SIL file format. Furthermore, the clustering driver must extend the AnalysisTool.IClusterHelper interface, which is included in the distribution of our tool. The clustering analysis tool, programmer and user documentation, along with associated file format specifications, can be obtained online at the Drexel University Software Engineering Research Group (SERG) web page at s.drexel.edu/bunch/CRAFT.

The need to create an external clustering driver is consistent with our philosophy of developing tools that can be used by other researchers in the Software Engineering community. Our approach allows researchers to create clustering drivers in the Java programming language. Clustering tools not developed in Java can also be integrated into our framework using Java's JNI [8] capability to wrap tools developed in other programming languages. For the convenience of the user, we have created and included two drivers with our tool; a sketch of what a minimal custom driver might look like appears after the list below. Both drivers are based on algorithms supported by Bunch. Specifically:
• class bunchexper.ClusterHelper: This clustering driver uses the Bunch NAHC clustering algorithm [12, 11, 14]. All of the algorithms supported by Bunch use a randomized optimization approach to form clusters. Thus, repeated runs of Bunch, with the same algorithm, rarely produce the exact same result.
• class bunchexper.ClusterHelperExt: Based on the number of clustering runs specified by the user, this driver applies the Bunch NAHC clustering algorithm 25% of the time, the Bunch SAHC algorithm 25% of the time, our generic hill-climbing algorithm 40% of the time, and our Genetic Algorithm [5] 10% of the time. The generic hill-climbing algorithm can be tuned using configuration parameters to behave like any algorithm on the continuum between our NAHC and SAHC hill-climbing algorithms [12]. Overall, 10 of the Bunch clustering algorithms are included in the bunchexper.ClusterHelperExt driver. In addition to the NAHC, SAHC and Genetic algorithms, 7 different configurations of the generic hill-climbing algorithm are used.
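The following is a minimal sketch of a custom clustering driver. The exact methods of the AnalysisTool.IClusterHelper interface are not given in this paper, so we model a hypothetical stand-in with the four inputs listed above; the SIL output syntax used here is likewise an assumption, not the authoritative format.

import java.io.*;
import java.util.*;

// Hypothetical stand-in for CRAFT's AnalysisTool.IClusterHelper interface;
// the real interface ships with the CRAFT distribution and may differ.
interface IClusterHelper {
    void cluster(String mdgFile, String silFile, int runNumber,
                 Map<String, String> uiParams) throws IOException;
}

// A toy driver that assigns modules to random clusters. A real driver
// would encapsulate an actual clustering algorithm at step 2, and could
// rotate between several algorithms based on runNumber, as
// bunchexper.ClusterHelperExt does with its 25/25/40/10 split.
public class RandomPartitionDriver implements IClusterHelper {
    private final Random rnd = new Random();

    public void cluster(String mdgFile, String silFile, int runNumber,
                        Map<String, String> uiParams) throws IOException {
        // Step 1: collect module names from the MDG file
        // (assumed format: one "source target [weight]" line per dependency).
        Set<String> modules = new TreeSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(mdgFile));
        String line;
        while ((line = in.readLine()) != null) {
            String[] tok = line.trim().split("\\s+");
            if (tok.length >= 2) { modules.add(tok[0]); modules.add(tok[1]); }
        }
        in.close();

        // Step 2: partition the modules (placeholder: random assignment).
        int k = Math.max(1, modules.size() / 2);
        Map<Integer, List<String>> clusters = new TreeMap<Integer, List<String>>();
        for (String m : modules) {
            int c = rnd.nextInt(k);
            if (!clusters.containsKey(c)) clusters.put(c, new ArrayList<String>());
            clusters.get(c).add(m);
        }

        // Step 3: write the partition; the SIL syntax below is assumed.
        PrintWriter out = new PrintWriter(new FileWriter(silFile));
        for (Map.Entry<Integer, List<String>> e : clusters.entrySet()) {
            StringBuilder sb = new StringBuilder("SS(cluster" + e.getKey() + ") = ");
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) sb.append(", ");
                sb.append(e.getValue().get(i));
            }
            out.println(sb.toString());
        }
        out.close();
    }
}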
3.4. The Data Analysis Services
Once the clustering activity finishes, the data associated with each clustering run is stored in the Clustering Results Repository. Next, the data is processed by our analysis tools to support the construction of useful views of the data. Our goal is to consolidate the diversity in the results produced by the set of clustering tools so that common patterns can be identified. This approach has shown that good decompositions can be produced in the presence of a set of clustering algorithms. Instead of relying on a benchmark decomposition to evaluate the result, the result produced by CRAFT is likely to be good because the set of clustering algorithms "reached a consensus".
As mentioned above, the first step performed during data analysis is the consolidation of the data that has been saved in the clustering results repository. Let S = {M1, M2, ..., Mn} be the set of all modules in the system. For each distinct pair of modules (Mi, Mj), where 1 ≤ i, j ≤ |S| and Mi ≠ Mj, we calculate the frequency (α) with which this pair appears in the same cluster. Given that the clustering driver executes Π independent runs, the number of times that a pair of modules can appear in the same cluster is bounded by 0 ≤ α ≤ Π. The principle that guides our analysis is that the closer α is to Π, the more likely it is that the pair of modules belongs to the same cluster. During the first step of the data analysis, the consolidated data is saved in the Clustering Results Repository so that it can be used by our data analysis tools.
Given a system consisting of n modules that has been clustered Π times, let αij be the number of times that modules Mi and Mj appear in the same cluster. We also define C[m1, m2, α] as the schema² for the data in the clustering repository that includes all of the {Mi, Mj, αij} triples. Each {Mi, Mj, αij} relation represents the number of times that module Mi and module Mj appear in the same cluster.

² Given a relation c for the schema C, we use the notation c[m1, m2] to represent the projection of m1 and m2 on C. For example, given the relation c = {parser, scanner, 10}, c[m1, m2] = {parser, scanner}.
For convenience we also define D to be a view on C such that all of the triples in D are sorted in descending order based on D[α].
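To make this bookkeeping concrete, the sketch below computes the {Mi, Mj, αij} triples from a list of clustering runs and sorts them into the view D. This is our own illustrative code, not CRAFT's implementation; each run is represented as a map from module name to cluster id, and pairs that never co-occur (α = 0) are simply omitted, which does not affect any analysis thresholded at β > 0.

import java.util.*;

public class CoOccurrence {
    // One {Mi, Mj, alpha} triple from the schema C described above.
    public static final class Triple {
        public final String m1, m2;
        public final int alpha;
        public Triple(String m1, String m2, int alpha) {
            this.m1 = m1; this.m2 = m2; this.alpha = alpha;
        }
        public String toString() { return "{" + m1 + ", " + m2 + ", " + alpha + "}"; }
    }

    // runs: one map per clustering run, sending each module to its cluster id.
    // Returns the triples sorted in descending order of alpha (the view D).
    public static List<Triple> buildD(List<Map<String, Integer>> runs) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map<String, Integer> run : runs) {
            List<String> mods = new ArrayList<String>(run.keySet());
            Collections.sort(mods);  // canonical order so (Mi, Mj) == (Mj, Mi)
            for (int i = 0; i < mods.size(); i++)
                for (int j = i + 1; j < mods.size(); j++)
                    if (run.get(mods.get(i)).equals(run.get(mods.get(j)))) {
                        // Module names are identifiers, so a space-joined key is safe.
                        String key = mods.get(i) + " " + mods.get(j);
                        Integer prev = counts.get(key);
                        counts.put(key, prev == null ? 1 : prev + 1);
                    }
        }
        List<Triple> d = new ArrayList<Triple>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String[] pair = e.getKey().split(" ");
            d.add(new Triple(pair[0], pair[1], e.getValue()));
        }
        Collections.sort(d, new Comparator<Triple>() {  // descending by alpha
            public int compare(Triple a, Triple b) { return b.alpha - a.alpha; }
        });
        return d;
    }
}

For the small example of Figure 3, running buildD over the 100 stored partitions would reproduce the repository triples shown in the figure, such as {M1, M2, 100} and {M1, M3, 57}.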
Now that the above definitions have been presented, we turn our attention to describing the two data analysis services provided by CRAFT.
3.4.1. Confidence Analysis Tool
The Confidence Analysis Tool (CAT) processes the tuples in the Clustering Results Repository to produce a reference decomposition of a software system. Let β represent a user-defined threshold value, which is set on the user interface (see the left side of Figure 1). Given the ordered set of tuples in the Clustering Results Repository (i.e., set D), the user-specified threshold β, and the set of all of the modules in the system (i.e., set S), we create the reference decomposition by applying Algorithm 1.

As Algorithm 1 shows, the first step in the CAT process involves forming an initial cluster by taking the relation from D with the largest α value. The closure of the modules in this relation is then calculated, and new modules are added to the cluster by searching for additional relations in D such that one of the modules in the triple is already contained in the newly formed cluster, and the α value exceeds the user-defined threshold β.

Algorithm 1: Confidence Analysis Algorithm
ConfidenceAnalysis(Set: D, S; Threshold: β)
    Let P be a new partition of the system.
    foreach relation d ∈ D, sorted decreasing by d[α], where d[α] ≥ β do
        if (d[M1] or d[M2]) is in S then
            1. Create a new cluster C and add it to partition P. Add the
               modules from relation d (d[M1] and/or d[M2]) to C that are
               in S. Remove the modules that have just been added to the
               new cluster C from S.
            2. CloseCluster: Search for additional relations r ∈ D that
               have r[α] ≥ β. Select the relation with the highest r[α]
               value such that one module from r is in C and the other
               module from r is in S. Add the module from r that is in S
               to C. Once added to C, remove this module from S. Repeat
               this process until no new modules can be added to C.
        end
    end
    foreach module s remaining in S do
        Create a new cluster C and add it to P. Add module s to cluster C.
    end
    return (partition P)
When adding modules to clusters, care is taken so that each module in set S is assigned to no more than one cluster; otherwise we would not have a valid partition of the MDG. After the closure of the cluster is calculated, a new cluster is created and the process repeats. Eventually, all modules that appear in relations in D that have an α value that exceeds the user-defined threshold β will be assigned to a cluster. In some cases, however, there may be modules for which no relation in D exists with a large enough α value to be assigned to a cluster. In these instances, for each remaining unassigned module in S, a new cluster is created containing a single module. This condition is not an indication that the module belongs to a singleton cluster, but instead is meant to indicate that the module is somewhat "unstable" in its placement, as it tends not to get assigned to any particular cluster.
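Read as code, Algorithm 1 is straightforward. The sketch below is our own transcription into Java, reusing the Triple type from the co-occurrence sketch above (both assumed to live in the same package); it is illustrative, not CRAFT's actual implementation.

import java.util.*;

public class ConfidenceAnalysisTool {
    // Transcription of Algorithm 1: d is the view D (sorted descending by
    // alpha), s is the module set S, beta is the user-defined threshold β.
    public static List<Set<String>> analyze(List<CoOccurrence.Triple> d,
                                            Set<String> s, int beta) {
        Set<String> unassigned = new HashSet<String>(s);
        List<Set<String>> p = new ArrayList<Set<String>>();  // the partition P
        for (CoOccurrence.Triple t : d) {
            if (t.alpha < beta) break;  // d is sorted, so we can stop here
            if (!unassigned.contains(t.m1) && !unassigned.contains(t.m2)) continue;
            // Step 1: seed a new cluster with the unassigned endpoints of t.
            Set<String> c = new HashSet<String>();
            if (unassigned.remove(t.m1)) c.add(t.m1);
            if (unassigned.remove(t.m2)) c.add(t.m2);
            // Step 2 (CloseCluster): repeatedly pull in the highest-alpha
            // relation bridging the cluster and the unassigned set.
            boolean grew = true;
            while (grew) {
                grew = false;
                for (CoOccurrence.Triple r : d) {
                    if (r.alpha < beta) break;
                    String candidate = null;
                    if (c.contains(r.m1) && unassigned.contains(r.m2)) candidate = r.m2;
                    else if (c.contains(r.m2) && unassigned.contains(r.m1)) candidate = r.m1;
                    if (candidate != null) {
                        unassigned.remove(candidate);
                        c.add(candidate);
                        grew = true;
                        break;  // rescan from the top to keep highest-alpha-first
                    }
                }
            }
            p.add(c);
        }
        // Modules never assigned become singleton clusters ("unstable" modules).
        for (String m : unassigned)
            p.add(new HashSet<String>(Collections.singletonList(m)));
        return p;
    }
}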
The final step performed by the CAT tool is to compare the decomposition that it produced with each of the decompositions that were generated during the clustering process. This is accomplished by measuring the similarity between the CAT result and each individual clustering result.