Uncovering the overlapping community structure--688IT编程网

a r X i v :p h y s i c s /0506133v 1 [p h y s i c s .s o c -p h ] 15 J u n 2005Uncovering the overlapping

community structure of complex

networks in nature and society

Gergely Palla †‡,Imre Der´e nyi ‡,Ill´e s Farkas †,and Tam´a s Vicsek †‡

†Biological Physics Research Group of HAS,P´a zm´a ny P.stny.1A,H-1117Budapest,Hungary,‡Dept.of Biological Physics,E¨o tv¨o s University,P´a zm´a ny P.stny.1A,H-1117Budapest,Hungary.Many complex systems in nature and society can be described in terms of networks capturing the intricate web of connections among the units they are made of [1,2,3,4].A question of interest is how to interpret the global organisation of such networks as the coexistence of their structural sub-units (communities)associated with more highly interconnected parts.Identifying these a pri-ori unknown building blocks (functionally related proteins [5,6],industrial sectors [7],groups of people [8,9],etc .)is crucial to the understanding of the structural and functional properties of networks.The existing deterministic methods used for large networks ﬁnd separated communi-ties,while most of the actual networks are made of highly overlapping cohesive groups of nodes.Here we introduce an approach to analyse the main statistical features of the interwoven sets of overlapping communities making a step to

wards the uncovering of the modular structure of com-plex systems.After deﬁning a set of new characteristic quantities for the statistics of communities,we apply an efﬁcient technique to explore overlapping communities on a large scale.We ﬁnd that overlaps are signiﬁcant,and the distributions we introduce reveal universal features of networks.Our studies of collaboration,word association,and protein interaction graphs demonstrate that the web of communities has non-trivial correlations and speciﬁc scaling properties.Most real networks typically contain parts in which the nodes (units)are more highly connected to each other than to the rest of the network.The sets of such nodes are usually called clusters,communities,cohesive groups,or modules [8,10,11,12,13],having no widely accepted,unique deﬁnition.In spite of this ambiguity the presence of communities in networks is a signature of the hierarchical nature of complex systems [5,14].The existing methods for ﬁnding communities in large networks are useful if the community structure is such that it can be interpreted in terms of separated sets of communities (see Fig.1b and Refs.[10,15,16,17,18]).However,most real networks are characterised by well deﬁned statistics of overlapping and nested communities.Such a statement can be demonstrated by the numerous communities each of us belongs to,including those related to our scientiﬁc activities or personal life (school,hobby,family)and so on,as illustrated in Fig.1a.Furthermore,members of our communities

have their own communities,resulting in an extremely complicated web of the communities themselves.This has long been appreciated by sociologists [19],but has never been studied systematically for large networks.Another,biological example is that a large fraction of proteins belong to several protein complexes simultaneously [20].

In general,each node i of a network can be characterised by a membership number m i ,which is the number of communities the node belongs to.In turn,any two communities αand βcan share s ov

α,β

nodes,which we deﬁne as the overlap size between these communities.Naturally,the communities also constitute a network with the overlaps being their links.The number of such links of community

friends, schoolmates,

family members, etc.

Figure1:Illustration of the concept of overlapping communities.a)The black dot in the middle rep-resents either of the authors of this Letter,with several of his communities around.Zooming into the scientiﬁc community demonstrates the nested and overlapping structure of the communities,while de-picting the cascades of communities starting from some members exempliﬁes the interwoven structure of the network of communities.b)Divisive and agglomerative methods grossly fail to identify the com-munities when overlaps are signiﬁcant.c)An example of overlapping k-clique-communities at k=4. The yellow community overlaps with the blue one in a single node,whereas it shares two nodes and a link with the green one.These overlapping regions are emphasised in red.Notice that any k-clique (complete subgraph of size k)can be reached only from the k-cliques of the same community through a series of adjacent k-cliques.Two k-cliques are adjacent if they share k−1nodes.

αcan be called as its community degree,d com

.Finally,the size s com

characterise

of any communityαcan most naturally be deﬁned as the number of its nodes.To characterise the community structure of a large network we introduce the distributions of these four basic quantities.In particular,we will focus on their cumulative distribution functions denoted by P(s com),P(d com),P(s ov),and P(m),respectively.For the overlap ,P(s ov)means the proportion of those overlaps that are larger than s ov.Further relevant statistical features will be introduced later.

The basic observation on which our community deﬁnition relies is that a typical community consists of several complete(fully connected)subgraphs that tend to share many of their nodes.Thus,we deﬁne a community,or more precisely,a k-clique-community as a union of all k-cliques(complete subgraphs of size k)that can be reached from each other through a series of adjacent k-cliques(where adjacency means sharing k−1nodes)[21,23,24].This deﬁnition is aimed at representing the fact that it is an essential feature of a community that its members can be reached through well connected subsets of nodes.There are other parts of the whole network that are not reachable from a particular k-clique,but they potentially contain further k-clique-communities.In turn,a single node can belong to several communities.All these can be explored systematically and can result in a large number of overlapping communities(see

Fig.1c for illustration).Note that in most cases relaxing this deﬁ,by allowing incomplete k-cliqu

es)is practically equivalent to lowering the value of k.Forﬁnding meaningful communities,the way they are identiﬁed is expected to satisfy several basic requirements:it cannot be too restrictive, should be based on the density of links,is required to be local,should not yield any cut-node or cut-link(whose removal would disjoin the community)and,of course,should allow overlaps.We employ the community deﬁnition speciﬁed above,because none of the others in the literature satisfy all these requirements simultaneously[21,22].

Although the numerical determination of the full set of k-clique-communities is a polynomial prob-lem,we use an algorithm which is exponential,because it is signiﬁcantly more efﬁcient for the graphs corresponding to actual data.This method is based onﬁrst locating all cliques(maximal complete subgraphs)of the network and then identifying the communities by carrying out a standard component analysis of the clique-clique overlap matrix[21].For more details about the method and its speed see the Supplementary Information.

We use our method for binary ,with undirected and unweighted links).An arbitrary network can always be transformed into a binary one by ignoring any directionality in the links and keeping only those that are stronger than a threshold weight w∗.Changing the threshold is like changing the resolution(as in a microscope)with which the community structure is investigated:by increasing w∗the c

ommunities start to shrink and fall apart.A very similar effect can be observed by changing the value of k as well:increasing k makes the communities smaller and more disintegrated,but at the same time,also more cohesive.

When we are interested in the community structure around a particular node,it is advisable to scan through some ranges of k and w∗,and monitor how its communities change.As an illustration,in Fig.2 we are depicting the communities of three selected nodes of three large networks:(i)the social network of scientiﬁc collaborators[25],(ii)the network of word associations[26]related to cognitive sciences, and(iii)the molecular-biological network of protein-protein interactions[27].These pictures can serve as tests or validations of the efﬁciency of our algorithm.In particular,the communities of the author G. Parisi(who is well known to have been making signiﬁcant contributions in differentﬁelds of physics) shown in Fig.2a are associated with hisﬁelds of interest,as it can be deduced from the titles of the papers involved.The4-clique-communities of the word“bright”(Fig.2b)correspond to the various meanings of this word.An important biological application isﬁnding the communities of proteins,based on their interactions.Indeed,most proteins in the communities shown in Figs.2c and3can be associated with either protein complexes or certain functions,as can be looked up by using the GO-TermFinder package [28]and the online tools of the Saccharomyces Genome Database(SGD)[29].For some protei

ns no function is available yet.Thus,the fact that they show up in our approach as members of communities can be interpreted as a prediction for their functions.One such example can be seen in the enlarged portion of Fig.3.For the protein Ycr072c,which is required for the viability of the cell and appears in the dark green community on the right,SGD provides no biological process(function).By far the most signiﬁcant GO term for the biological process of this community is“ribosome biogenesis/assembly”. Thus,we can infer that Ycr072c is likely to be involved in this process.Also,new cellular processes can be predicted if yet unknown communities are found with our method.

These examples(and further ones included in the Supplementary Information)clearly demonstrate the advantages of our approach over the existing divisive and agglomerative methods recently used for large real networks.Divisive methods cut the network into smaller and smaller pieces,each node is

Figure2:The community structure around a particular node in three different networks.The commu-nities are colour coded,the overlapping nodes and links between them are emphasised in red,and the volume of the balls and the width of the links are proportional to the total number of communities they belong to.For each network the value of k has been set to4.a)The communities of G.Parisi in the co-authorship network of the Los Alamos cond-mat archive(for threshold weight w∗=0.75)can be associated with hisﬁelds of interest.b)The communities of the word“bright”in the South Florida Free Association norms list(for w∗=0.025)represent the different meanings of this word.c)The commu-nities of the protein ZDS1in the DIP core list of the protein-protein interactions visiae can be associated with either protein complexes or certain functions.

forced to remain in only one community and be separated from its other communities,most of which then necessarily fall apart and disappear.This ,with the word“bright”when we apply the method described in Ref.[16]:it tends to stay together mostly with the words of the community related to“light”,while most of its other ,those related to“colors”,see Fig.2b)completely disintegrate(“green”gets to the vegetables,“orange”to the fruits,etc.).Agglomerative methods do the same,but in the reverse direction.For example,when we applied the agglomerative method of Ref.[18], at some point“bright”,as a single word,joined a“community”of890other words.In addition,such methods

inevitably lead to a tree-like hierarchical rendering of the communities,while our approach allows the construction of an unconstrained network of communities.

The networks chosen above have been constructed in the following ways.In the co-authorship net-

work of the Los Alamos e-print archives[25]each article contributes the value1/(n−1)to the weight of the link between every pair of its n authors.In the South Florida Free Association norms list[26] the weight of a directed link from one word to another indicates the frequency that the people in the survey associated the end point of the link with its start point.For our purposes these directed links have been replaced by undirected ones with a weight equal to the sum of the weights of the corresponding two oppositely directed links.In the DIP core list of the protein-protein interactions visiae[27] each interaction represents an unweighted link between the interacting proteins.These networks are very large,consisting of30739,10617,and2609nodes and136065,63788,and6355links,respectively.

Although different values of k and w∗might be optimal for the local community structure around different nodes,we should set some global criterion toﬁx their values if we want to analyse the statistical properties of the community structure of the entire network.The criterion we use is based onﬁnding a community structure as highly structured as possible.In the related percolation phenomena

[24]a giant component appears when the number of links is increased above some critical point.Therefore, to approach this critical point from below,for each selected value of k(typically between3and6)we lower the threshold w∗until the largest community becomes twice as big as the second largest one.In this way we ensure that weﬁnd as many communities as possible,without the negative effect of having a giant community that would smear out the details of the community structure by merging many smaller communities.We denote by f∗the fraction of links stronger than w∗,and use only those values of k for which f∗is not too small(not smaller than0.5).This has led us to k=6and5with f∗=0.93and 0.75,respectively,for the collaboration network,and k=4with f∗=0.67for the word association network.For the former network both sets of parameters result in very similar communities(see the Supplementary Information).Since for unweighted networks no threshold weight can be set,for these we simply select the smallest value of k for which no giant community appears.In case of the protein interaction network this gives k=4,resulting in82communities.Due to this relatively low number,we can depict the entire network of protein communities as presented in Fig.3.

The four distributions characterising the global community structure of these networks are displayed in Fig.4.Although the scaling of the size of non-overlapping communities has already been demon-strated for social networks[18,17],it is striking to observe how this aspect of large real networks is pres

erved even when a more complete picture(allowing overlaps)is investigated.In Fig.4a the power law dependence P(s com)∼(s com)−τwith an exponent ranging betweenτ=1and1.6is well pro-nounced and is valid nearly over the entire range of community sizes.

It is well known[2,3,4]that the nodes of large real networks have a power law degree distri-bution.Will the same kind of distribution hold when we move to the next level of organisation and consider the degrees of the communities?Remarkably,weﬁnd that it is not the case.The community degrees(Fig.4b)have a very unique distribution,consisting of two distinct parts:an exponential decay

P(d com)∼exp(−d com/d com

0)with a characteristic community degree d com

(which is in the order of

d com shown in Table1),followed by a power law tail∼(d com)−τ.This new kind of behaviour is

consistent with the community size distribution assuming that on average each node of a community has a contributionδto the community degree.The tail of the community degree distribution is,therefore, simply proportional to that of the community size distribution.At theﬁrst part of P(d com),on the other h

and,a characteristic scale d com

∼kδappears,because the majority of the communities have a size of

the order of k(see Fig.4a)and their distribution around d com

dominates this part of the curve.Thus, the degree to which P(d com)deviates from a simple scaling depends on k or,in other words,on the

Figure3:Network of the82communities in the DIP core list of the protein-protein interactions of S. cerevisiae for k=4.The area of the circles and the width of the links are proportional to the size of

)and to the size of the overlaps(s ovα,β),respectively.The coloured the corresponding communities(s com

communities are cut out and magniﬁed to reveal their internal structure.In this magniﬁed picture the nodes and links of the original network have the same colour as their communities,those that are shared by more than one community are emphasised in red,and the grey links are not part of these communities. The area of the circles and the width of the links are proportional to the total number of communities they belong to.

prescribed minimum cohesiveness of the communities.

The extent to which different communities overlap is also a relevant property of a network.Although the range of overlap sizes is limited,the behaviour of the cumulative overlap size distribution P(s ov), shown in Fig.4c,is close to a power law for each network,with a rather large exponent.We can conclud

e that there is no characteristic overlap size in the networks.Finally,in Fig.4d we display the cumulative distribution of the membership number,P(m).These plots demonstrate that a node may belong to a number of communities.In the collaboration and the word association network there seems to be no characteristic value for the membership number:the data are close to a power law dependence,with

a large exponent.In the protein interaction network,however,the largest membership number is only

688IT编程网

Uncovering the overlapping community structure

发表评论

推荐文章

应用程序的安全检测方法、装置、电子设备和存储介质

nginx map用法正则

VBA之正则表达式(1)--基础篇

Prometheus监控学习笔记之初识PromQL

关于PHP中的webshell

热门文章

m函数数字提取

jest断言方法大全

中兴ZXSEC US 管理员手册

keras系列(一):参数设置

Qt从QString中提取出数字

element input 金额千分位格式化

freemaker 参数解析正则

C#正则验证数字

form表单验证正则

scanf正则表达式用法

grafana value的正则表达式

Android平台浮点数运算应用

js-(JS正则表达式验证数字)

判断Python输入是否是整数,字符,或浮点数

c语言 sscanf 正则规则

从文本中提取数值技巧

js将整数转换成两位浮点数的方法

vue正则限制浮点数

8到20的结尾的正则

shell 正则表达式最后一行

最新文章

应用程序的安全检测方法、装置、电子设备和存储介质

VBA之正则表达式(1)--基础篇

代码编辑的辅助方法、装置及电子设备

SHELL查字符串中包含字符的命令

String方法中replace和replaceAll的区别详解(源码分析)

双字节符号正则

标签列表

688IT编程网

Uncovering the overlapping community structure

发表评论

推荐文章

应用程序的安全检测方法、装置、电子设备和存储介质

nginx map用法 正则

VBA之正则表达式(1)--基础篇

Prometheus监控学习笔记之初识PromQL

关于PHP中的webshell

热门文章

m函数数字提取

jest断言方法大全

中兴ZXSEC US 管理员手册

keras系列(一):参数设置

Qt从QString中提取出数字

element input 金额千分位格式化

freemaker 参数解析正则

C#正则验证数字

form表单验证正则

scanf正则表达式用法

grafana value的正则表达式

Android平台浮点数运算应用

js-(JS正则表达式验证数字)

判断Python输入是否是整数,字符,或浮点数

c语言 sscanf 正则规则

从文本中提取数值技巧

js将整数转换成两位浮点数的方法

vue正则限制浮点数

8到20的结尾的正则

shell 正则表达式 最后一行

最新文章

应用程序的安全检测方法、装置、电子设备和存储介质

VBA之正则表达式(1)--基础篇

代码编辑的辅助方法、装置及电子设备

SHELL查字符串中包含字符的命令

String方法中replace和replaceAll的区别详解(源码分析)

双字节符号正则

标签列表

nginx map用法正则

shell 正则表达式最后一行