
Workshop Report: Data Analysis Working Group

Consortium for the Barcode of Life

Muséum National d’Histoire Naturelle, Paris, 6-8 July 2006

Michel Veuille1, Javier Cabrera2 and David E. Schindel3

Introduction

The Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL) held a 2½-day workshop hosted by the Muséum National d’Histoire Naturelle (MNHN) in Paris on 6-8 July 2006 (see Appendix 1: Call for Participation). Thirty-eight participants from 10 countries attended (see Appendix 2: List of Participants), the majority of whom were doctoral students, postdoctoral fellows or young researchers.

The overall goal of DAWG is to develop protocols, techniques and software that the barcoding community can use to sample, analyze, interpret and display barcode data. The purpose of the Paris workshop was to allow presenters to describe their preliminary results and plans for the coming year, and to receive feedback from the other workshop participants. They will continue their work with the goal of presenting finished results at an international conference in June 2007. The final results of their work will be published in a proceedings volume of the June 2007 conference, and their protocols and software will be made available on a Data Portal being developed by CBOL for the Barcode of Life Initiative (BOLI).

Acknowledgments

Funding for European participants was provided by a grant from the Conservation Genetics Programme of the European Science Foundation. Support for American participants was provided by the Division of Biological Infrastructure, the Office of International Science and Engineering, and the Division of Information and Intelligent Systems of the National Science Foundation. Financial and in-kind support was also provided by the Muséum National d’Histoire Naturelle, Paris; the École Pratique des Hautes Études (EPHE); the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), Rutgers University; the Alfred P. Sloan Foundation; and CBOL.

Background

CBOL’s Executive Committee created DAWG in late 2004 for the purpose of developing protocols, methods, and software for the analysis, interpretation and display of barcode data. Michel Veuille was asked to chair DAWG, and DIMACS was invited to be a principal partner in the Working Group. DAWG met for the first time at the First International Barcode Conference at the Natural History Museum, London, on 9 February 2005. Planning meetings were held at DIMACS on 26 September 2005 and at MNHN on 15 October 2005. A Steering Committee4 was formed at this second planning meeting.

1 Chairman, Département de Systématique et Evolution, Muséum National d'Histoire Naturelle, Paris

2 Department of Statistics and DIMACS, Rutgers University, Piscataway, NJ

3 Executive Secretary, Consortium for the Barcode of Life, National Museum of Natural History, Smithsonian Institution, Washington, DC

4 The DAWG Steering Committee includes M. Veuille, Chair (MNHN); Javier Cabrera (DIMACS); Rob DeSalle (American Museum of Natural History); Brian Golding (McMaster University); D. Hickey (Concordia University); and D. Schindel (CBOL).

Based on discussions during the two planning meetings and interactions with CBOL, the Steering Committee formulated a Program of Work5 whose goals are to catalyze development of the new techniques and tools that will be needed to analyze, interpret and display barcode data. The Program of Work included a workshop at which preliminary results would be presented and discussed, and presentation of final results at the Second International Barcode Conference in 2007. The Committee issued a Call for Participation (Appendix 1) to statisticians, computer scientists, taxonomists and others interested in data analysis. The Call included a set of Technical Challenges that were developed by participants in the two planning meetings. The Steering Committee selected 15 abstracts for presentation at the workshop (see Appendix 3, agenda; Appendix 4, abstracts of presentations).

Workshop Structure and Content6

The workshop began with three introductory presentations: M. Veuille greeted participants; D. Schindel described the workshop’s goals; and V. Loeschke and K. Bijlsma described the European Science Foundation’s Conservation Genetics Programme. The balance of the workshop was devoted to presentations of preliminary results by 15 participants who had submitted abstracts. Each presentation lasted for 30 minutes, after which all participants engaged in open discussion. Table 1 indicates the techniques and datasets used in each study. Appendix 4 contains the abstracts submitted by the presenters and Appendix 5 presents brief summaries of each presentation.

The presentations included five categories of techniques, and many presenters used techniques from several categories and compared their effectiveness (see Table 1). The five categories are:

1. Character-based classifications. A number of techniques and computer programs are available for classifying objects in a way that is not limited to biological species. They generally rely on partitioning sets into subsets based on shared properties (Classification and Regression Trees, CART, is one such approach presented at the workshop). In systematics, so-called "informative characters", as used in cladistics, belong to this category. Since the barcode is not concerned with phylogeny, a simplified form of this approach is used by the Character Attribute Organization System (CAOS, also presented at the workshop). However, homoplasy and the segregation of ancestral polymorphism limit the use of this approach in closely related species, which is the level of differentiation that matters most in barcoding.

Phylogenetic analysis also uses gene sequence data analyzed as a series of discrete attributes. CBOL has stressed that barcode data, by themselves, are an inadequate basis on which to reconstruct phylogenetic relationships. However, phylogenetic methods can be used to determine affinities among specimens and between specimens and known taxonomic categories (at the species level and higher in the taxonomic hierarchy). These methods use a variety of parsimony algorithms to build trees.
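As a rough illustration of the character-based idea described above, the following Python sketch scans an alignment for "pure" diagnostic sites, that is, positions where one species is fixed for a base that no other species carries. It is not the CAOS or CART algorithm as presented at the workshop; the function name `diagnostic_sites`, the species labels and the toy sequences are all invented for this example, and the sequences are assumed to be pre-aligned and of equal length.

```python
from collections import defaultdict


def diagnostic_sites(aligned):
    """Find simple diagnostic characters in an alignment.

    aligned: dict mapping a species label to a list of equal-length,
             pre-aligned sequences (hypothetical input format).
    Returns a dict mapping species to a list of (position, base) pairs where
    the base is fixed within that species and absent from every other species.
    """
    length = len(next(iter(aligned.values()))[0])
    result = defaultdict(list)
    for pos in range(length):
        # Set of bases observed at this position, per species.
        states = {sp: {seq[pos] for seq in seqs} for sp, seqs in aligned.items()}
        for sp, observed in states.items():
            if len(observed) != 1:
                continue  # polymorphic within the species: not diagnostic
            base = next(iter(observed))
            shared = any(base in other for name, other in states.items() if name != sp)
            if not shared:
                result[sp].append((pos, base))
    return dict(result)


# Toy data, invented for illustration only.
toy = {
    "sp_A": ["ACGTAC", "ACGTAT"],
    "sp_B": ["ACCTAC", "ACCTAC"],
    "sp_C": ["TCGTAC", "TCGTAC"],
}
print(diagnostic_sites(toy))
```

In this toy alignment sp_A has no diagnostic site: its only variable position is polymorphic within the species, which is the kind of situation (shared ancestral polymorphism) that limits character-based methods among closely related species.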

5 DAWG Program of Work is posted at barcoding.si.edu/PDF/Program%20of%20Work%20-%20DAWG%20-%20FINAL.pdf

6 See meeting agenda, Appendix 3. Presentations linked to the agenda are available at www.barcoding.si.edu/DAWG_Paris_Workshop.html

2. Distance-based clustering methods. When there is no simple discriminating character between species, distance-based clustering methods can be used. The most popular method in the barcode community appears to be neighbor-joining (NJ), an algorithm that starts from the most closely related clusters of sequences and then proceeds stepwise to the rest of the sample. It is generally calculated using the K2P distance (Kimura 2-parameter model), the simplest way to deal with nucleotide change when transitions and transversions have very different mutation rates, as is the case in mtDNA. The accuracy of these methods matters only for recent nodes, since barcoding is mostly interested in identifying species. This method of "clustering" sequences does not provide a tree of species, but a tree of genes.
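The K2P distance mentioned above can be computed from the proportions of transitions (P) and transversions (Q) between two aligned sequences as d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The Python sketch below applies that formula; the function name `k2p_distance` and the choice to skip gaps and ambiguous bases are assumptions made for this example, not conventions taken from the workshop.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Positions where either sequence has a gap or ambiguous base are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    a_term = 1 - 2 * p - q
    b_term = 1 - 2 * q
    if a_term <= 0 or b_term <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(a_term * math.sqrt(b_term))


# One transition in ten sites: the corrected distance slightly exceeds 0.1.
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

A matrix of such pairwise distances is the usual input to a neighbor-joining implementation; the tree-building step itself is not sketched here.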

3. Coalescent theory. Coalescent theory provides a tool for studying the ancestry of a sample of sequences by looking backwards in time. In contrast to phylogenetic methods, which are based on parsimony principles or on assumptions of the constancy of evolutionary rates (the “molecular clock”), coalescent theory is based on our present understanding of the actual mechanism of evolutionary change within species. Models based on coalescent theory include parameters that represent forces such as random drift and natural selection. Coalescent theory lends itself easily to computer simulation, allowing one to run a series of simulations (classically between 1,000 and 10,000) to assess the probability that a given assumption leads to the observed state of the dataset. Its applications are not limited to the classical mutation-drift equilibrium neutral model. It is thus possible to explore the parameter space along individual axes (e.g., panmixia vs. population structuring, changes vs. constancy in population size). When there is no diagnostic character that separates species, it may be counterintuitive to obtain a result in the form of a probability that an accession belongs to some species. However, such outputs may be useful in further research. For instance, they may also allow one to estimate the optimum sample size, based on prior information and assuming some population model. Applications of coalescent theory may thus be intervening steps in a research protocol.
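The simulation approach described above can be sketched in a few lines of Python. The example assumes the simplest case only: a single panmictic population of constant size under the standard neutral coalescent, where the waiting time while k lineages remain is exponential with rate k(k-1)/2 in coalescent time units. It simulates the time to the most recent common ancestor (TMRCA) of a sample and estimates, from repeated runs, how often a value at least as large as an "observed" one arises. The function names, the observed value 1.8 and the seeds are hypothetical.

```python
import random


def simulate_tmrca(n, rng):
    """Simulate the TMRCA of n sampled lineages under the standard
    neutral coalescent, in coalescent time units."""
    t = 0.0
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # each pair of lineages coalesces at rate 1
        t += rng.expovariate(rate)
    return t


def tail_probability(n, observed, replicates=10000, seed=1):
    """Fraction of simulated genealogies whose TMRCA is >= observed."""
    rng = random.Random(seed)
    hits = sum(simulate_tmrca(n, rng) >= observed for _ in range(replicates))
    return hits / replicates


# Hypothetical example: 10 sequences, observed tree height of 1.8 units.
print(tail_probability(10, 1.8, replicates=10000))
```

The same pattern (simulate many genealogies under an assumed model, then compare a summary of the real data with the simulated distribution) extends to structured or size-changing populations by changing how the waiting times are drawn.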

4. Bayesian statistics and maximum likelihood. These are statistical methods that can be used in a wide range of applications, including those referred to above, such as coalescent theory. They are very powerful, but their use assumes some preliminary knowledge of the model being applied (maximum likelihood), or of the distribution of one of the parameters given some knowledge of another one (Bayesian). The main difficulty with these methods is their high computation time. A minor problem is that it is generally difficult to say which character causes the distinction between two species, which is always counterintuitive. ABC (approximate Bayesian computation) methods, which were referred to in the meeting, are much less demanding in computer time.
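To make the ABC idea concrete, here is a minimal rejection-sampling sketch in Python, reading "ABC" as approximate Bayesian computation. It is a toy, not a method presented at the workshop: the model (pairs of sequences at independent loci, a uniform prior on the scaled mutation rate theta, the mean number of pairwise differences as the summary statistic), the tolerance and all function names are assumptions chosen for brevity. Parameter values are drawn from the prior, data are simulated, and a draw is kept only when the simulated summary falls close to the observed one, so no likelihood is ever evaluated.

```python
import random


def poisson(mean, rng):
    """Draw a Poisson variate by counting unit-rate arrivals before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_mean_differences(theta, loci, rng):
    """Toy model: at each independent locus two sequences coalesce after an
    Exp(1) waiting time T and accumulate Poisson(theta * T) differences."""
    total = 0
    for _ in range(loci):
        t = rng.expovariate(1.0)
        total += poisson(theta * t, rng)
    return total / loci


def abc_rejection(observed_mean, loci, draws=5000, tolerance=0.5,
                  theta_max=10.0, seed=1):
    """Keep prior draws of theta whose simulated summary is within
    `tolerance` of the observed summary."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.0, theta_max)  # uniform prior (an assumption)
        simulated = simulate_mean_differences(theta, loci, rng)
        if abs(simulated - observed_mean) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical observation: 3.0 differences on average across 20 loci.
posterior = abc_rejection(observed_mean=3.0, loci=20)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted values approximate the posterior distribution of theta; a tighter tolerance gives a better approximation at the cost of more rejected draws.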

5. Miscellaneous points. As the barcode dataset grows larger, it may become difficult to identify the reference sequences closest to a query sequence. This question was addressed at the meeting by a proposal to use the Google search engine, and by another aiming to identify the sister clade of a query at the appropriate taxonomic level. Two groups (working with CART and the coalescent, respectively) have identified an error in the Astraptes dataset.
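One generic way to keep the search for the closest reference sequences tractable as a library grows is to index the references by short subsequences (k-mers), in the manner of a text search engine, and rank them by the number of k-mers they share with the query. The Python sketch below illustrates that idea only; it is not the proposal made at the meeting, and the function names, the value k = 4 and the toy sequences are invented for the example.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of all overlapping substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=4):
    """Inverted index: k-mer -> names of the reference sequences containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def closest_references(query, index, k=4, top=3):
    """Rank references by the number of distinct k-mers shared with the query."""
    hits = Counter()
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    return hits.most_common(top)


# Toy reference library, invented for illustration only.
references = {
    "ref1": "ACGTTGCAAG",
    "ref2": "TTTTGGGGCC",
    "ref3": "ACGTTGCCCT",
}
index = build_index(references)
print(closest_references("ACGTTGCAAT", index))
```

Only references that share at least one k-mer with the query are scored, so the shortlist can then be passed to a slower, more exact comparison such as a pairwise distance calculation.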

Meeting Results

In addition to providing the presenters with feedback on their preliminary results, the workshop participants agreed on the need to:

• Develop standard methods for comparing results of competing techniques (e.g., common sample sizes, effective population sizes, mutation rates, other population genetics parameters). Javier Cabrera agreed to develop a draft standard for comment by the workshop participants.

• Provide additional online datasets with different characteristics and smaller minimum sample sizes.

• Develop consensus recommendations to the barcoding community concerning:

- Adequate sample sizes. Many presenters had recommendations on sample sizes and DAWG will need a mechanism to compile them, promote comparison, and facilitate discussion leading to a consensus.

- Standard treatment and presentation of cluster diagrams. Many presenters showed cluster diagrams with a variety of filters on branch nodes based on bootstrapping. DAWG could provide a valuable service by developing recommendations to the barcoding community on standard presentations.

- Standard vocabulary and usage of statistical terms in discussions of barcode data (e.g., accuracy, precision, error rates, false positives/negatives).

• Identify and engage specialists in data visualization and display. Several participants mentioned software programs that might be applicable to barcode data, and visualization specialists who might be interested.

• Determine the best way to disseminate the results of the DAWG initiative. In addition to posting software and protocols on the BOLI Data Portal being developed by CBOL, there will be a proceedings volume based on the Second International Barcode Conference.

Participants discussed whether it would be best to publish data analysis papers in the proceedings volume or in another journal, such as Systematic Biology. The Steering Committee needs to facilitate this discussion and promote a consensus.

Next Steps

The DAWG Steering Committee agreed to use the NBII Portal as a platform for sharing information and conducting electronic discussions of the issues described above. CBOL will probably call for submission of proposals for sessions at the Second International Barcode Conference around October 2006. The Committee will apply for a half-day session on data analysis. A Call for submission of abstracts will probably be published in December 2006 or January 2007.

Table 1. Classification of presentations according to techniques

Technique columns: Simulation/Coalescent Model; Clustering; Character-based algorithms; Bayesian Analysis; Phylogenetic

Presenter | Techniques marked | Datasets
Bautista | X | European fish
Hajibabaei | X X X | illustrations with primates and Lepidoptera; no analysis
Hickerson | X X | simulations and marine snails
Munch | X X | plants
Bazin | | broad data compilation
Pasaniuc | X | DIMACS test data, cowries
Austerlitz | X X X | simulated data, Litoria, cowries, Astraptes
Sarkar | X | Mopalia
Barraclough | X | Australian tiger beetles, rotifers, land plants
Abdo | X X | Astraptes, simulated data
Rach | X | dragonflies, ND2 and COI
Little | X X X | cycads, nuclear, plastid, mitochondrial regions; DAWG training set
Gemeinholzer | X | Asteraceae, ITS region
Hickey | X | fungi, various gene regions
Cabrera (for Ching Ray Yu) | X | DAWG training set
Cabre ra-Lo

Ver. 31 July 2006

APPENDIX 1: Call for Participation

Data Analysis Challenges Arising from the DNA Barcode Initiative

The Challenge: The Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL) has developed interdisciplinary research challenge problems in statistics and computer science arising from DNA barcoding, a method proposed as a tool for differentiating species. Students, postdocs, and researchers from all over the world are challenged to develop new approaches to these problems. Compelling solutions to these challenges will require collaboration among taxonomists, population geneticists, and evolutionary and systematic biologists, so DAWG encourages the formation of multidisciplinary teams.

Presenting Preliminary Ideas at a Workshop in Paris: Preliminary ideas for approaches to these problems will be discussed at a workshop at the National Museum of Natural History in Paris on 6-8 July 2006 (see dimacs.rutgers.edu/Workshops/DNABarcode/). Participation in this workshop will be limited to approximately 40 presenters of preliminary results and attendees who can offer useful feedback to the presenters. Space will therefore be limited and all those wishing to participate in the workshop should register at dimacs.rutgers.edu/Workshops/DNABarcode/registnew.html no later than 29 June 2006. However, you are urged to register early as we will close registration when all spaces are filled.

Travel awards for a limited number of Europeans who would like to give presentations at this workshop will be available through funding from the Conservation Genetics Programme of the European Science Foundation. Travel awards for US presenters will also be available, pending funding agency approval. Travel support will focus primarily on increasing the participation of students, postdocs and junior faculty.

Presenting More Advanced Results at a Conference in Southeast Asia: The preliminary workshop will be followed by an international conference in southeast Asia in February 2007, during which the most promising approaches to these challenge problems will be presented. Travel awards will also be available (pending funding agency approval).

For the full Call for Participation, including the statement of the research challenges, see:

dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges/.

For instructions on how to submit an abstract for the Paris workshop, see

dimacs.rutgers.edu/Workshops/DNABarcode/abstractsubmissionform.html.

To apply for travel funds to give a presentation at the Paris workshop, see

dimacs.rutgers.edu/Workshops/DNABarcode/travelsupport.html

To register for the workshop, see dimacs.rutgers.edu/Workshops/DNABarcode/registnew.html

For information about the DNA Barcode Initiative, see: dimacs.rutgers.edu/Workshops/DNAInitiative/.

Important dates:

Deadline for submission of abstracts: 2 June 2006

Deadline for submission of requests for travel support: 2 June 2006

Deadline for registration: 29 June 2006

Announcement of final agenda of presenters, awards of travel support: as early as possible after 2 June 2006
