Workshop Report: Data Analysis Working Group
Consortium for the Barcode of Life
Muséum National d’Histoire Naturelle, Paris, 6-8 July 2006
Michel Veuille1, Javier Cabrera2 and David E. Schindel3
Introduction
The Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL) held a 2½-day workshop hosted by the Muséum National d’Histoire Naturelle (MNHN) in Paris on 6-8 July 2006 (see Appendix 1: Call for Participation). Thirty-eight participants from 10 countries attended (see Appendix 2: List of Participants), the majority of whom were doctoral students, postdoctoral fellows or young researchers.
The overall goal of DAWG is to develop protocols, techniques and software that the barcoding community can use to sample, analyze, interpret and display barcode data. The purpose of the Paris workshop was to allow presenters to describe their preliminary results and plans for the coming year, and to receive feedback from the other workshop participants. They will continue their work with the goal of presenting finished results at an international conference in June 2007. The final results of their work will be published in a proceedings volume of the June 2007 conference, and their protocols and software will be made available on a Data Portal being developed by CBOL for the Barcode of Life Initiative (BOLI).
Acknowledgments
Funding for European participants was provided by a grant from the Conservation Genetics Programme of the European Science Foundation. Support for American participants was provided by the Division of Biological Infrastructure, the Office of International Science and Engineering, and the Division of Information and Intelligent Systems of the National Science Foundation. Financial and in-kind support was also provided by the Muséum National d’Histoire Naturelle, Paris; the École Pratique des Hautes Études (EPHE); the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS), Rutgers University; the Alfred P. Sloan Foundation; and CBOL.
Background
CBOL’s Executive Committee created DAWG in late 2004 for the purpose of developing protocols, methods, and software for the analysis, interpretation and display of barcode data. Michel Veuille was asked to chair DAWG and DIMACS was invited to be a principal partner in the Working Group. DAWG met for the first time at the First International Barcode Conference at the Natural History Museum, London, on 9 February 2005. Planning meetings were held at DIMACS on 26 September 2005 and at MNHN on 15 October 2005. A Steering Committee4 was formed at this second planning meeting.
1 Chairman, Département de Systématique et Evolution, Muséum National d'Histoire Naturelle, Paris
2 Department of Statistics and DIMACS, Rutgers University, Piscataway, NJ
3 Executive Secretary, Consortium for the Barcode of Life, National Museum of Natural History, Smithsonian Institution, Washington, DC
4 The DAWG Steering Committee includes M. Veuille, Chair (MNHN); Javier Cabrera (DIMACS); Rob DeSalle (American Museum of Natural History); Brian Golding (McMaster University); D. Hickey (Concordia University); and D. Schindel (CBOL).
Based on discussions during the two planning meetings and interactions with CBOL, the Steering Committee formulated a Program of Work5 whose goals are to catalyze development of the new techniques and tools that will be needed to analyze, interpret and display barcode data. The Program of Work included a workshop at which preliminary results would be presented and discussed, and presentation of final results at the Second International Barcode Conference in 2007. The Committee issued a Call for Participation (Appendix 1) to statisticians, computer scientists, taxonomists and others interested in data analysis. The Call included a set of Technical Challenges that were developed by participants in the two planning meetings. The Steering Committee selected 15 abstracts for presentation at the workshop (see Appendix 3, agenda; Appendix 4, abstracts of presentations).
Workshop Structure and Content6
The workshop began with three introductory presentations: M. Veuille greeted participants; D. Schindel described the workshop’s goals; and V. Loeschke and K. Bijlsma described the European Science Foundation’s Conservation Genetics Programme. The balance of the workshop was devoted to presentations of preliminary results by 15 participants who had submitted abstracts. Each presentation lasted for 30 minutes, after which all participants engaged in open discussion. Table 1 indicates the techniques and datasets used in each study. Appendix 4 contains the abstracts submitted by the presenters and Appendix 5 presents brief summaries of each presentation.
The presentations included five categories of techniques, and many presenters used techniques from several categories and compared their effectiveness (see Table 1). The five categories are:
1. Character-based classifications. Many techniques and computer programs are available for classifying objects of any kind, not only biological species. They generally partition sets into subsets based on shared properties (Classification and Regression Trees, CART, is one such approach presented at the workshop). In systematics, the so-called "informative characters" used in cladistics belong to this category. Since barcoding is not concerned with phylogeny, a simplified form of this approach is used by the Character Attribute Organization System (CAOS, also presented at the workshop). However, homoplasy and the segregation of ancestral polymorphism limit the use of this approach in closely related species, which is the level of differentiation that matters most in barcoding.
Phylogenetic analysis also uses gene sequence data analyzed as a series of discrete attributes. CBOL has stressed that barcode data, by themselves, are an inadequate basis on which to reconstruct phylogenetic relationships. However, phylogenetic methods can be used to determine affinities among specimens and between specimens and known taxonomic categories (at the species level and higher in the taxonomic hierarchy). These methods use a variety of parsimony algorithms to build trees.
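The idea of a diagnostic character in category 1 can be illustrated with a minimal Python sketch. It scans an alignment for positions where every sequence of one species carries the same base and no other species carries that base. This is not the CAOS or CART implementation presented at the workshop; the function name diagnostic_sites and the toy sequences are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from collections import defaultdict


def diagnostic_sites(aligned, labels):
    """Return {species: [(position, base), ...]} for alignment columns where
    all sequences of a species share one base that no other species shows."""
    by_species = defaultdict(list)
    for seq, species in zip(aligned, labels):
        by_species[species].append(seq)

    result = {species: [] for species in by_species}
    for pos in range(len(aligned[0])):
        # Set of bases observed at this column, per species.
        states = {sp: {s[pos] for s in seqs} for sp, seqs in by_species.items()}
        for sp, bases in states.items():
            if len(bases) != 1:
                continue  # polymorphic within the species: not diagnostic
            base = next(iter(bases))
            others = set().union(*(b for o, b in states.items() if o != sp))
            if base not in others:
                result[sp].append((pos, base))
    return result


# Toy alignment (hypothetical data): five sequences from three species.
seqs = ["ACGTA", "ACGTA", "ACCTA", "ACCTG", "TCGTA"]
labels = ["sp1", "sp1", "sp2", "sp2", "sp3"]
sites = diagnostic_sites(seqs, labels)
```

In the toy data, sp1 has no diagnostic site at all, which mirrors the limitation noted above: closely related species may share every character state.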
5 DAWG Program of Work is posted at barcoding.si.edu/PDF/Program%20of%20Work%20-%20DAWG%20-%20FINAL.pdf
6 See meeting agenda, Appendix 4. Presentations linked to the agenda are available at www.barcoding.si.edu/DAWG_Paris_Workshop.html
2. Distance-based clustering methods. When there is no simple discriminating character between species, distance-based clustering methods can be used. The most popular method in the barcode community appears to be neighbor-joining (NJ), an algorithm that starts from the most closely related clusters of sequences and then proceeds stepwise to the rest of the sample. It is generally calculated using the K2P distance (Kimura 2-parameter model), the simplest way to deal with nucleotide change when transitions and transversions have very different mutation rates, as is the case in mtDNA. The accuracy of these methods matters only for recent nodes, since barcoding is mostly interested in identifying species. This method of "clustering" sequences does not provide a tree of species, but a tree of genes.
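The K2P distance used for neighbor-joining in category 2 can be sketched in a few lines of Python. With P the proportion of sites that differ by a transition and Q the proportion that differ by a transversion, the distance is d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)]. The function name k2p_distance is hypothetical, and skipping gapped or ambiguous sites and returning infinity for saturated pairs are assumptions of this sketch, not conventions stated in the report.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumption: ignore gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        # A<->G and C<->T are transitions; every other change is a transversion.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        return math.inf  # saturated: the correction is undefined
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites (hypothetical sequences).
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

A matrix of such pairwise distances is the input a neighbor-joining program needs; the tree-building step itself is not shown here.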
3. Coalescent theory. Coalescent theory provides a tool for studying the ancestry of a sample of sequences by looking backwards in time. Unlike phylogenetic methods, which are based on parsimony principles or on assumptions of the constancy of evolutionary rates (the “molecular clock”), coalescent theory is based on our present understanding of the actual mechanism of evolutionary change within species. Models based on coalescent theory include parameters that represent forces such as random drift and natural selection. Coalescent theory lends itself easily to computer simulations, allowing one to run a series of simulations (classically between 1,000 and 10,000) to assess the probability of an assumption leading to the observed state of the dataset. Its applications are not limited to the classical mutation-drift equilibrium neutral model. It is thus possible to explore the parameter space along individual axes (e.g., panmixia vs. population structuring, changes vs. constancy in population size). When there is no diagnostic character that separates species, it may be counterintuitive to obtain a result in the form of a probability that an accession belongs to some species. However, such outputs may be useful in further research. For instance, they may allow one to estimate the optimum sample size, based on prior information and assuming some population model. Applications of coalescent theory may thus be intervening steps in a research protocol.
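The simulation approach described in category 3 can be sketched as follows, assuming the standard neutral model with constant population size and no structure. Each replicate draws the coalescent waiting times for a sample of n sequences, drops mutations on the resulting tree, and records the number of segregating sites; repeating this 10,000 times gives the probability of a value at least as large as the one observed. The function names, the parameter values and the choice of segregating sites as the summary are assumptions of this sketch, not methods reported at the workshop. Time is measured in the usual coalescent units, so theta is the scaled mutation rate.

```python
import random


def poisson_draw(mean, rng):
    """Poisson variate: count unit-rate arrivals that fall before `mean`."""
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def simulate_segregating_sites(n, theta, rng):
    """One neutral coalescent replicate for a sample of n sequences."""
    tree_length = 0.0
    for k in range(n, 1, -1):
        # While k lineages remain, the next coalescence occurs at rate k(k-1)/2.
        waiting_time = rng.expovariate(k * (k - 1) / 2)
        tree_length += k * waiting_time
    # Mutations fall on the tree at rate theta/2 per unit of branch length.
    return poisson_draw(theta / 2 * tree_length, rng)


def tail_probability(observed_s, n, theta, replicates=10_000, seed=1):
    """Fraction of replicates with at least `observed_s` segregating sites."""
    rng = random.Random(seed)
    hits = sum(
        simulate_segregating_sites(n, theta, rng) >= observed_s
        for _ in range(replicates)
    )
    return hits / replicates


# Hypothetical example: 10 sequences, theta = 5, 30 segregating sites observed.
p_value = tail_probability(observed_s=30, n=10, theta=5.0)
```

Replacing the constant-size waiting times with those of a structured or growing population is how one would explore the other axes of parameter space mentioned above.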
4. Bayesian statistics and maximum likelihood. These are statistical methods that can be used in a wide range of applications, including those referred to above, such as coalescent theory. They are very powerful, but their use assumes some preliminary knowledge of the model being applied (maximum likelihood), or of the distribution of one of the parameters given some knowledge of another one (Bayesian). The main difficulty with these methods is their high computation time. A minor problem is that it is generally difficult to say which character distinguishes two species, which can be counterintuitive. Approximate Bayesian Computation (ABC) methods (referred to in the meeting) are much less demanding in computer time.
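The ABC approach mentioned in category 4 avoids computing a likelihood: parameter values are drawn from a prior, a dataset is simulated for each draw, and the draw is kept only if the simulated summary is close to the observed one. The kept values approximate the posterior distribution. The sketch below applies rejection ABC to the scaled mutation rate theta under a neutral coalescent model. The uniform prior, the tolerance, the use of segregating sites as the summary statistic and all names are assumptions made for illustration, not choices reported at the workshop.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Neutral coalescent replicate: segregating sites in n sequences."""
    tree_length = sum(
        k * rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)
    )
    mean = theta / 2 * tree_length
    # Poisson draw: count unit-rate arrivals that fall before `mean`.
    count, t = 0, rng.expovariate(1.0)
    while t <= mean:
        count += 1
        t += rng.expovariate(1.0)
    return count


def abc_rejection(observed_s, n, draws=5000, theta_max=20.0, tolerance=1, seed=1):
    """Return prior draws of theta whose simulated summary matches the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(0.0, theta_max)  # assumed flat prior on theta
        simulated = simulate_segregating_sites(n, theta, rng)
        if abs(simulated - observed_s) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical data: 14 segregating sites observed in 10 sequences.
posterior = abc_rejection(observed_s=14, n=10)
posterior_mean = sum(posterior) / len(posterior)
```

Widening the tolerance accepts more draws and runs faster, at the cost of a cruder approximation to the posterior.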
5. Miscellaneous points. As the barcode dataset grows larger, it may become difficult to identify the reference sequences closest to a query sequence. This question was addressed at the meeting by one proposal to use the Google search engine, and by another that aims to identify the sister-clade of a query at the appropriate taxonomic level. Two groups (working with CART and the coalescent, respectively) have identified an error in the Astraptes dataset.
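The search problem raised in category 5 can be illustrated with a generic k-mer index, a common way to shortlist candidate reference sequences before a more careful comparison. This sketch is neither of the two proposals made at the meeting; the function names, the value of k and the toy sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k=4):
    """Map each k-mer to the names of the reference sequences containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def closest_references(query, index, k=4, top=3):
    """Rank references by the number of distinct k-mers shared with the query."""
    votes = defaultdict(int)
    for kmer in kmers(query, k):
        for name in index.get(kmer, ()):
            votes[name] += 1
    # Highest count first; ties broken by name for a stable order.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))[:top]


# Toy reference library (hypothetical sequences).
references = {
    "refA": "ACGTACGTAC",
    "refB": "TTTTGGGGCC",
    "refC": "ACGTTCGTAC",
}
index = build_index(references)
hits = closest_references("ACGTACGTAA", index)
```

Only the shortlisted references would then need a full distance calculation or tree placement, which keeps the cost of a query roughly independent of library size.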
Meeting Results
In addition to providing the presenters with feedback on their preliminary results, the workshop participants agreed on the need to:
• Develop standard methods for comparing results of competing techniques (e.g., common sample sizes, effective population sizes, mutation rates, other population genetics parameters). Javier Cabrera agreed to develop a draft standard for comment by the workshop participants.
• Provide additional online datasets with different characteristics and smaller minimum sample sizes.
• Develop consensus recommendations to the barcoding community concerning:
  - Adequate sample sizes. Many presenters had recommendations on sample sizes and DAWG will need a mechanism to compile them, promote comparison, and facilitate discussion leading to a consensus.
  - Standard treatment and presentation of cluster diagrams. Many presenters showed cluster diagrams with a variety of filters on branch nodes based on bootstrapping. DAWG could provide a valuable service by developing recommendations to the barcoding community on standard presentations.
  - Standard vocabulary and usage of statistical terms in discussions of barcode data (e.g., accuracy, precision, error rates, false positives/negatives).
• Identify and engage specialists in data visualization and display. Several participants mentioned software programs that might be applicable to barcode data, and visualization specialists who might be interested.
• Determine the best way to disseminate the results of the DAWG initiative. In addition to posting software and protocols on the BOLI Data Portal being developed by CBOL, there will be a proceedings volume based on the Second International Barcode Conference. Participants discussed whether it would be best to publish data analysis papers in the proceedings volume or in another journal, such as Systematic Biology. The Steering Committee needs to facilitate this discussion and promote a consensus.
Next Steps
The DAWG Steering Committee agreed to use the NBII Portal as a platform for sharing information and conducting electronic discussions of the issues described above. CBOL will probably call for submission of proposals for sessions at the Second International Barcode Conference around October 2006. The Committee will apply for a half-day session on data analysis. A Call for submission of abstracts will probably be published in December 2006 or January 2007.
Table 1. Classification of presentations according to techniques
Technique columns: Simulation/Coalescent Model; Clustering; Character-based; Search algorithms; Bayesian Analysis; Phylogenetic

Presenter                    Techniques marked   Datasets
Bautista                     X                   European fish
Hajibabaei                   X X X               illustrations with primates and Lepidoptera; no analysis
Hickerson                    X X                 simulations and marine snails
Munch                        X X                 plants
Bazin                                            broad data compilation
Pasaniuc                     X X                 DIMACS test data, cowries
Austerlitz                   X X X X             simulated data, Litoria, cowries, Astraptes
Sarkar                       X X                 Mopalia
Barraclough                  X                   Australian tiger beetles, rotifers, land plants
Abdo                         X X                 Astraptes, simulated data
Rach                         X X                 dragonflies, ND2 and COI
Little                       X X X               cycads, nuclear, plastid, mitochondrial regions; DAWG training set
Gemeinholzer                 X                   Asteraceae, ITS region
Hickey                       X                   fungi, various gene regions
Cabrera (for Ching Ray Yu)   X                   DAWG training set
Ver. 31 July 2006
APPENDIX 1: Call for Participation
Data Analysis Challenges Arising from the DNA Barcode Initiative
The Challenge: The Data Analysis Working Group (DAWG) of the Consortium for the Barcode of Life (CBOL) has developed interdisciplinary research challenge problems in statistics and computer science arising from DNA barcoding, a method proposed as a tool for differentiating species. Students, postdocs, and researchers from all over the world are challenged to develop new approaches to these problems. Compelling solutions to these challenges will require collaboration among taxonomists, population geneticists, and evolutionary and systematic biologists, so DAWG encourages the formation of multidisciplinary teams.
Presenting Preliminary Ideas at a Workshop in Paris: Preliminary ideas for approaches to these problems will be discussed at a workshop at the National Museum of Natural History in Paris on 6-8 July 2006 (see dimacs.rutgers.edu/Workshops/DNABarcode/). Participation in this workshop will be limited to approximately 40 presenters of preliminary results and attendees who can offer useful feedback to the presenters. Space will therefore be limited and all those wishing to participate in the workshop should register at dimacs.rutgers.edu/Workshops/DNABarcode/registnew.html no later than 29 June 2006. However, you are urged to register early as we will close registration when all spaces are filled.
Travel awards for a limited number of Europeans who would like to give presentations at this workshop will be available through funding from the Conservation Genetics Programme of the European Science Foundation. Travel awards for US presenters will also be available, pending funding agency approval. Travel support will focus primarily on increasing the participation of students, postdocs and junior faculty.
Presenting More Advanced Results at a Conference in Southeast Asia: The preliminary workshop will be followed by an international conference in southeast Asia in February 2007, during which the most promising approaches to these challenge problems will be presented. Travel awards will also be available (pending funding agency approval).
For the full Call for Participation, including the statement of the research challenges, see:
dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges/.
For instructions on how to submit an abstract for the Paris workshop, see
dimacs.rutgers.edu/Workshops/DNABarcode/abstractsubmissionform.html.
To apply for travel funds to give a presentation at the Paris workshop, see
dimacs.rutgers.edu/Workshops/DNABarcode/travelsupport.html
To register for the workshop, see dimacs.rutgers.edu/Workshops/DNABarcode/registnew.html
For information about the DNA Barcode Initiative, see: dimacs.rutgers.edu/Workshops/DNAInitiative/.
Important dates:
Deadline for submission of abstracts: 2 June 2006
Deadline for submission of requests for travel support: 2 June 2006
Deadline for registration: 29 June 2006
Announcement of final agenda of presenters, awards of travel support:
as early as possible after 2 June 2006