Michael Waterman

Levenshtein Distance, Sequence Comparison and Biological Database Search

Bonnie Berger — Wed, 14 Jul 2021 06:00:00 -0400

IEEE Trans Inf Theory. 2021 Jun;67(6):3287-3294. doi: 10.1109/tit.2020.2996543. Epub 2020 May 21.

ABSTRACT

Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to heuristics employed in the most widely used software in bioinformatics, BLAST, a program to search DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates has resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders of magnitude acceleration of biological similarity search.

PMID:34257466 | PMC:PMC8274556 | DOI:10.1109/tit.2020.2996543
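
The dynamic programming approach to Levenshtein distance mentioned in the abstract above can be illustrated with a minimal sketch. It assumes unit costs for insertion, deletion and substitution; the function name and the example strings are illustrative and are not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    # prev[j] is the distance between the letters of a processed so far
    # (initially none) and the first j letters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i letters of a to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory is linear in the length of the second string while time stays proportional to the product of the two lengths.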

Multiscale Feedback Loops in SARS-CoV-2 Viral Evolution

Christopher Barrett — Fri, 04 Dec 2020 06:00:00 -0500

J Comput Biol. 2021 Mar;28(3):248-256. doi: 10.1089/cmb.2020.0343. Epub 2020 Dec 1.

ABSTRACT

COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The viral genome is considered to be relatively stable and the mutations that have been observed and reported thus far are mainly focused on the coding region. This article provides evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2. This view complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns. First, we observe differences in mutational signals for geospatially separated populations such as the prevalence of A23404G in CA versus NY and WA. We show that the feedback between macrolevel dynamics and the viral population can be captured employing a transfer entropy framework. Second, we observe complex interactions within mutational clades. Namely, when C14408T first appeared in the viral population, the frequency of A23404G spiked in the subsequent week. Third, we identify a noncoding mutation, G29540A, within the segment between the coding gene of the N protein and the ORF10 gene, which is largely confined to NY (>95%). These observations indicate that macrolevel sociobehavioral measures have an impact on the viral genomics and may be useful for the dashboard-like tracking of its evolution. Finally, despite the fact that SARS-CoV-2 is a genetically robust organism, our findings suggest that we are dealing with a high degree of adaptability. Owing to its ample spread, mutations of unusual form are observed and a high complexity of mutational interaction is exhibited.

PMID:33275493 | DOI:10.1089/cmb.2020.0343
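
As a rough illustration of the transfer entropy idea named in the abstract above, the sketch below computes a plug-in estimate for two discrete time series with a history length of one step. The estimator choice, the history length and the function name are assumptions made for illustration; the abstract does not specify how the authors estimate transfer entropy.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) from source to target, history length 1."""
    # Each triple is (x_next, x_now, y_now) with x = target and y = source.
    triples = list(zip(target[1:], target[:-1], source[:-1]))
    n = len(triples)
    c_xxy = Counter(triples)
    c_xy = Counter((x0, y0) for _, x0, y0 in triples)
    c_xx = Counter((x1, x0) for x1, x0, _ in triples)
    c_x = Counter(x0 for _, x0, _ in triples)
    te = 0.0
    for (x1, x0, y0), c in c_xxy.items():
        # Ratio p(x1 | x0, y0) / p(x1 | x0), written with raw counts.
        ratio = (c * c_x[x0]) / (c_xy[(x0, y0)] * c_xx[(x1, x0)])
        te += (c / n) * log2(ratio)
    return te


# Made-up binary series: the target copies the source with a one-step delay.
source = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
target = [0] + source[:-1]
print(transfer_entropy(source, target), transfer_entropy(target, source))
```

Real-valued observables, such as weekly mutation frequencies, would have to be discretized before a count-based estimate like this one applies.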

Benchmarking of alignment-free sequence comparison methods

Andrzej Zielezinski — Sat, 27 Jul 2019 06:00:00 -0400

Genome Biol. 2019 Jul 25;20(1):144. doi: 10.1186/s13059-019-1755-7.

ABSTRACT

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment.

RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events.

CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.

PMID:31345254 | PMC:PMC6659240 | DOI:10.1186/s13059-019-1755-7

A new statistic for efficient detection of repetitive sequences

Sijie Chen — Thu, 18 Apr 2019 06:00:00 -0400

Bioinformatics. 2019 Nov 1;35(22):4596-4606. doi: 10.1093/bioinformatics/btz262.

ABSTRACT

MOTIVATION: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions.

RESULTS: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads.

AVAILABILITY AND IMPLEMENTATION: The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:30993316 | PMC:PMC7963086 | DOI:10.1093/bioinformatics/btz262

Generalized correlation measure using count statistics for gene expression data with ordered samples

Y X Rachel Wang — Wed, 18 Oct 2017 06:00:00 -0400

Bioinformatics. 2018 Feb 15;34(4):617-624. doi: 10.1093/bioinformatics/btx641.

ABSTRACT

MOTIVATION: Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species).

RESULTS: We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data.

AVAILABILITY AND IMPLEMENTATION: The code used to perform the analysis is available as part of the Supplementary Material.

CONTACT: msw@usc.edu or hhuang@stat.berkeley.edu.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:29040382 | PMC:PMC5860612 | DOI:10.1093/bioinformatics/btx641

CAFE: aCcelerated Alignment-FrEe sequence analysis

Yang Young Lu — Fri, 05 May 2017 06:00:00 -0400

Nucleic Acids Res. 2017 Jul 3;45(W1):W554-W559. doi: 10.1093/nar/gkx351.

ABSTRACT

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report a standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE allows for both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures, including dendrograms, heatmap, principal coordinate analysis and network display. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.

PMID:28472388 | PMC:PMC5793812 | DOI:10.1093/nar/gkx351

Gene coexpression measures in large heterogeneous samples using count statistics

Y X Rachel Wang — Wed, 08 Oct 2014 06:00:00 -0400

Proc Natl Acad Sci U S A. 2014 Nov 18;111(46):16371-6. doi: 10.1073/pnas.1417128111. Epub 2014 Oct 6.

ABSTRACT

With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.

PMID:25288767 | PMC:PMC4246260 | DOI:10.1073/pnas.1417128111
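
The idea of counting local patterns of expression ranks, described in the abstract above, can be sketched in a simplified form. The code below counts contiguous windows in which two profiles share the same rank ordering; it is an illustrative stand-in, not the statistics defined in the paper, and the window length `k` and the function names are assumptions.

```python
def rank_pattern(values):
    """Order of indices that sorts the window, used as its rank pattern."""
    return tuple(sorted(range(len(values)), key=lambda i: values[i]))


def local_pattern_count(x, y, k=3):
    """Count contiguous length-k windows where x and y share a rank pattern."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(
        rank_pattern(x[i:i + k]) == rank_pattern(y[i:i + k])
        for i in range(len(x) - k + 1)
    )


# Made-up profiles that agree in ordering over the first few samples only.
gene_a = [1.0, 2.0, 3.0, 4.0, 2.5, 0.5]
gene_b = [0.1, 0.4, 0.9, 1.6, 1.9, 2.5]
print(local_pattern_count(gene_a, gene_b, k=3))
```

Because only the rank order within each window matters, a count of this kind is insensitive to monotone rescaling and to isolated extreme values.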

New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

Kai Song — Thu, 26 Sep 2013 06:00:00 -0400

Brief Bioinform. 2014 May;15(3):343-53. doi: 10.1093/bib/bbt067. Epub 2013 Sep 23.

ABSTRACT

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.

PMID:24064230 | PMC:PMC4017329 | DOI:10.1093/bib/bbt067

A geometric interpretation for local alignment-free sequence comparison

Ehsan Behnam — Tue, 09 Jul 2013 06:00:00 -0400

J Comput Biol. 2013 Jul;20(7):471-85. doi: 10.1089/cmb.2012.0280.

ABSTRACT

Local alignment-free sequence comparison arises in the context of identifying similar segments of sequences that may not be alignable in the traditional sense. We propose a randomized approximation algorithm that is both accurate and efficient. We show that under D2 and its important variant [Formula: see text] as the similarity measure, local alignment-free comparison between a pair of sequences can be formulated as the problem of finding the maximum bichromatic dot product between two sets of points in high dimensions. We introduce a geometric framework that reduces this problem to that of finding the bichromatic closest pair (BCP), allowing the properties of the underlying metric to be leveraged. Local alignment-free sequence comparison can be solved by making a quadratic number of alignment-free substring comparisons. We show both theoretically and through empirical results on simulated data that our approximation algorithm requires a subquadratic number of such comparisons and trades only a small amount of accuracy to achieve this efficiency. Therefore, our algorithm can extend the current usage of alignment-free-based methods and can also be regarded as a substitute for local alignment algorithms in many biological studies.

PMID:23829649 | PMC:PMC3704055 | DOI:10.1089/cmb.2012.0280

Topological classification and enumeration of RNA structures by genus

J E Andersen — Fri, 12 Oct 2012 06:00:00 -0400

J Math Biol. 2013 Nov;67(5):1261-78. doi: 10.1007/s00285-012-0594-x. Epub 2012 Oct 2.

ABSTRACT

To an RNA pseudoknot structure is naturally associated a topological surface, which has its associated genus, and structures can thus be classified by the genus. Based on earlier work of Harer-Zagier, we compute the generating function $D_{g,\sigma}(z) = \sum_n d_{g,\sigma}(n) z^n$ for the number $d_{g,\sigma}(n)$ of those structures of fixed genus $g$ and minimum stack size $\sigma$ with $n$ nucleotides so that no two consecutive nucleotides are basepaired and show that $D_{g,\sigma}(z)$ is algebraic. In particular, we prove that $d_{g,2}(n) \sim k_g\, n^{3(g-1/2)} \gamma_2^n$, where $\gamma_2 \approx 1.9685$. Thus, for stack size at least two, the genus only enters through the sub-exponential factor, and the slow growth rate compared to the number of RNA molecules implies the existence of neutral networks of distinct molecules with the same structure of any genus. Certain RNA structures called shapes are shown to be in natural one-to-one correspondence with the cells in the Penner-Strebel decomposition of Riemann's moduli space of a surface of genus $g$ with one boundary component, thus providing a link between RNA enumerative problems and the geometry of Riemann's moduli space.

PMID:23053535 | DOI:10.1007/s00285-012-0594-x

Normal and compound poisson approximations for pattern occurrences in NGS reads

Zhiyuan Zhai — Sat, 16 Jun 2012 06:00:00 -0400

J Comput Biol. 2012 Jun;19(6):839-54. doi: 10.1089/cmb.2012.0029.

ABSTRACT

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).

PMID:22697250 | PMC:PMC3375642 | DOI:10.1089/cmb.2012.0029

New Generations: Sequencing Machines and Their Computational Challenges

David C Schwartz — Tue, 29 Nov 2011 06:00:00 -0500

J Comput Sci Technol. 2010 Jan 1;25(1):3-9. doi: 10.1007/s11390-010-9300-x.

ABSTRACT

New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality with attendant changes for healthcare, including personalized medicine. More broadly, the genomes of many new organisms with large samplings from populations will be commonplace. What is less appreciated is the explosive demands on computation, both for CPU cycles and storage, as well as the need for new computational methods. In this article we will survey some of these developments and demands.

PMID:22121326 | PMC:PMC3222932 | DOI:10.1007/s11390-010-9300-x

New powerful statistics for alignment-free sequence comparison under a pattern transfer model

Xuemei Liu — Tue, 05 Jul 2011 06:00:00 -0400

J Theor Biol. 2011 Sep 7;284(1):106-16. doi: 10.1016/j.jtbi.2011.06.020. Epub 2011 Jun 25.

ABSTRACT

Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic $D_2$ and its variants $D_2^*$ and $D_2^S$ showed that their power approximates a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on $D_2$, $D_2^*$ and $D_2^S$ by comparing local sequence pairs and then summing over all the local sequence pairs of certain length. We show that the new statistics are much more powerful than the corresponding statistics and the power tends to 1 as the sequence length tends to infinity under the pattern transfer model.

PMID:21723298 | PMC:PMC3146591 | DOI:10.1016/j.jtbi.2011.06.020

Integrative analysis of many weighted co-expression networks using tensor computation

Wenyuan Li — Fri, 24 Jun 2011 06:00:00 -0400

PLoS Comput Biol. 2011 Jun;7(6):e1001106. doi: 10.1371/journal.pcbi.1001106. Epub 2011 Jun 16.

ABSTRACT

The rapid accumulation of biological networks poses new challenges and calls for powerful integrative analysis tools. Most existing methods capable of simultaneously analyzing a large number of networks were primarily designed for unweighted networks, and cannot easily be extended to weighted networks. However, it is known that transforming weighted into unweighted networks by dichotomizing the edges of weighted networks with a threshold generally leads to information loss. We have developed a novel, tensor-based computational framework for mining recurrent heavy subgraphs in a large set of massive weighted networks. Specifically, we formulate the recurrent heavy subgraph identification problem as a heavy 3D subtensor discovery problem with sparse constraints. We describe an effective approach to solving this problem by designing a multi-stage, convex relaxation protocol, and a non-uniform edge sampling technique. We applied our method to 130 co-expression networks, and identified 11,394 recurrent heavy subgraphs, grouped into 2,810 families. We demonstrated that the identified subgraphs represent meaningful biological modules by validating against a large set of compiled biological knowledge bases. We also showed that the likelihood for a heavy subgraph to be meaningful increases significantly with its recurrence in multiple networks, highlighting the importance of the integrative approach to biological network analysis. Moreover, our approach based on weighted graphs detects many patterns that would be overlooked using unweighted graphs. In addition, we identified a large number of modules that occur predominately under specific phenotypes. This analysis resulted in a genome-wide mapping of gene network modules onto the phenome. Finally, by comparing module activities across many datasets, we discovered high-order dynamic cooperativeness in protein complex networks and transcriptional regulatory networks.

PMID:21698123 | PMC:PMC3116899 | DOI:10.1371/journal.pcbi.1001106

Sequence alignment as hypothesis testing

Lu Meng — Wed, 11 May 2011 06:00:00 -0400

J Comput Biol. 2011 May;18(5):677-91. doi: 10.1089/cmb.2010.0328.

ABSTRACT

Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and show that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs.

PMID:21554016 | PMC:PMC3122928 | DOI:10.1089/cmb.2010.0328
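
The log-likelihood ratio scoring function discussed in the abstract above can be sketched as follows. The two-letter alphabet and all probabilities are hypothetical values chosen for illustration. The sketch builds scores of the form log(q_ab / (p_a p_b)) from a pair distribution q and a background distribution p, then checks that the expected score of two independent background letters is negative, which is the condition on scoring functions stated in the abstract.

```python
from math import log


def log_odds_scores(pair_probs, background):
    """Score s(a, b) = log(q_ab / (p_a * p_b)) for every letter pair."""
    return {
        (a, b): log(q / (background[a] * background[b]))
        for (a, b), q in pair_probs.items()
    }


def expected_background_score(scores, background):
    """Mean score of a pair of independent background letters."""
    return sum(background[a] * background[b] * s for (a, b), s in scores.items())


# Hypothetical distributions: aligned pairs favor matches over mismatches.
background = {"A": 0.5, "B": 0.5}
pair_probs = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

scores = log_odds_scores(pair_probs, background)
print(scores[("A", "A")], scores[("A", "B")])
print(expected_background_score(scores, background))
```

For log-odds scores the expected background score is minus the Kullback-Leibler divergence between the product of the backgrounds and the pair distribution, so it is negative whenever the two differ.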

Alignment-free sequence comparison (II): theoretical power of comparison statistics

Lin Wan — Wed, 27 Oct 2010 06:00:00 -0400

J Comput Biol. 2010 Nov;17(11):1467-90. doi: 10.1089/cmb.2010.0056. Epub 2010 Oct 25.

ABSTRACT

Rapid methods for alignment-free sequence comparison make large-scale comparisons between sequences increasingly feasible. Here we study the power of the statistic D2, which counts the number of matching k-tuples between two sequences, as well as D2*, which uses centralized counts, and D2S, which is a self-standardized version, both from a theoretical viewpoint and numerically, providing an easy to use program. The power is assessed under two alternative hidden Markov models; the first one assumes that the two sequences share a common motif, whereas the second model is a pattern transfer model; the null model is that the two sequences are composed of independent and identically distributed letters and they are independent. Under the first alternative model, the means of the tuple counts in the individual sequences change, whereas under the second alternative model, the marginal means are the same as under the null model. Using the limit distributions of the count statistics under the null and the alternative models, we find that generally, asymptotically D2S has the largest power, followed by D2*, whereas the power of D2 can even be zero in some cases. In contrast, even for sequences of length 140,000 bp, in simulations D2* generally has the largest power. Under the first alternative model of a shared motif, the power of D2* approaches 100% when sufficiently many motifs are shared, and we recommend the use of D2* for such practical applications. Under the second alternative model of pattern transfer, the power for all three count statistics does not increase with sequence length when the sequence is sufficiently long, and hence none of the three statistics under consideration can be recommended in such a situation. We illustrate the approach on 323 transcription factor binding motifs with length at most 10 from JASPAR CORE (October 12, 2009 version), verifying that D2* is generally more powerful than D2.
The program to calculate the power of D2, D2* and D2S can be downloaded from http://meta.cmb.usc.edu/d2. Supplementary Material is available at www.liebertonline.com/cmb.

PMID:20973742 | PMC:PMC3123933 | DOI:10.1089/cmb.2010.0056
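
The centered and self-standardized variants described in the abstract above can be sketched as follows. The normalizations used here, dividing by sqrt(n m) p_w for D2* and by sqrt(X~_w^2 + Y~_w^2) for D2S, follow the definitions commonly given in this literature and should be checked against the papers. The i.i.d. background with user-supplied letter probabilities and the function names are assumptions of the sketch.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt


def kmer_counts(seq, k):
    """Counts of all overlapping k-tuples in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_star_and_d2s(x, y, k, letter_probs):
    """Return (D2*, D2S) under an i.i.d. background with the given letter probabilities."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = len(x) - k + 1, len(y) - k + 1  # number of k-tuples per sequence
    d2_star = d2_s = 0.0
    for letters in product(sorted(letter_probs), repeat=k):
        w = "".join(letters)
        p_w = prod(letter_probs[c] for c in letters)  # word probability under i.i.d.
        xt = cx[w] - nx * p_w  # centered count in x
        yt = cy[w] - ny * p_w  # centered count in y
        d2_star += xt * yt / (sqrt(nx * ny) * p_w)
        norm = sqrt(xt * xt + yt * yt)
        if norm > 0:  # skip words whose centered counts both vanish
            d2_s += xt * yt / norm
    return d2_star, d2_s


uniform = dict.fromkeys("ACGT", 0.25)  # assumed background, not estimated from data
print(d2_star_and_d2s("ACGTACGTAC", "ACGTTCGTAC", 2, uniform))
```

The loop visits every word of length k over the alphabet, including words absent from both sequences, because their centered counts are nonzero and still contribute to the sums.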

High-resolution human genome structure by single-molecule analysis

Brian Teague — Fri, 11 Jun 2010 06:00:00 -0400

Proc Natl Acad Sci U S A. 2010 Jun 15;107(24):10848-53. doi: 10.1073/pnas.0914638107. Epub 2010 Jun 1.

ABSTRACT

Variation in genome structure is an important source of human genetic polymorphism: It affects a large proportion of the genome and has a variety of phenotypic consequences relevant to health and disease. In spite of this, human genome structure variation is incompletely characterized due to a lack of approaches for discovering a broad range of structural variants in a global, comprehensive fashion. We addressed this gap with Optical Mapping, a high-throughput, high-resolution single-molecule system for studying genome structure. We used Optical Mapping to create genome-wide restriction maps of a complete hydatidiform mole and three lymphoblast-derived cell lines, and we validated the approach by demonstrating a strong concordance with existing methods. We also describe thousands of new variants with sizes ranging from kb to Mb.

PMID:20534489 | PMC:PMC2890719 | DOI:10.1073/pnas.0914638107

The power of detecting enriched patterns: an HMM approach

Zhiyuan Zhai — Fri, 30 Apr 2010 06:00:00 -0400

J Comput Biol. 2010 Apr;17(4):581-92. doi: 10.1089/cmb.2009.0218.

ABSTRACT

The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length $n$. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word $W$ over the alphabet. In the example of interest, $W$ is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let $N_W(n)$ be the number of possibly overlapping occurrences of a pattern $W$ in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of $N_W(n)$ are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of $N_W(n)$ for frequent patterns and compound Poisson approximation of $N_W(n)$ for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.

PMID:20426691 | PMC:PMC3203519 | DOI:10.1089/cmb.2009.0218
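
The count of possibly overlapping occurrences of a word, defined in the abstract above, can be illustrated with a short sketch. The expected count shown here is for a plain i.i.d. sequence with no motif instances, which is a simpler setting than the HMM treated in the paper. The function names and the uniform letter probabilities are assumptions.

```python
def count_occurrences(sequence, word):
    """Number of possibly overlapping occurrences of word in sequence."""
    k = len(word)
    return sum(sequence[i:i + k] == word for i in range(len(sequence) - k + 1))


def expected_count_iid(n, word, letter_probs):
    """Mean count of word in an i.i.d. sequence of length n."""
    p_word = 1.0
    for letter in word:
        p_word *= letter_probs[letter]
    # There are n - k + 1 start positions, each matching with probability p_word.
    return (n - len(word) + 1) * p_word


uniform = dict.fromkeys("ACGT", 0.25)  # assumed background
print(count_occurrences("ACACACGT", "ACAC"))
print(expected_count_iid(8, "ACAC", uniform))
```

Self-overlapping words such as ACAC occur in clumps, which is one reason a compound Poisson approximation is used for rare patterns.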

An integrative modular approach to systematically predict gene-phenotype associations

Michael R Mehan — Thu, 04 Feb 2010 06:00:00 -0500

BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S62. doi: 10.1186/1471-2105-11-S1-S62.

ABSTRACT

BACKGROUND: Complex human diseases are often caused by multiple mutations, each of which contributes only a minor effect to the disease phenotype. To study the basis for these complex phenotypes, we developed a network-based approach to identify coexpression modules specifically activated in particular phenotypes. We integrated these modules, protein-protein interaction data, Gene Ontology annotations, and our database of gene-phenotype associations derived from literature to predict novel human gene-phenotype associations. Our systematic predictions provide us with the opportunity to perform a global analysis of human gene pleiotropy and its underlying regulatory mechanisms.

RESULTS: We applied this method to 338 microarray datasets, covering 178 phenotype classes, and identified 193,145 phenotype-specific coexpression modules. We trained random forest classifiers for each phenotype and predicted a total of 6,558 gene-phenotype associations. We showed that 40.9% of genes are pleiotropic, highlighting that pleiotropy is more prevalent than previously expected. We collected 77 ChIP-chip datasets studying 69 transcription factors binding over 16,000 targets under various phenotypic conditions. Utilizing this unique data source, we confirmed that dynamic transcriptional regulation is an important force driving the formation of phenotype-specific gene modules.

CONCLUSION: We created a genome-wide gene to phenotype mapping that has many potential implications, including providing potential new drug targets and uncovering the basis for human disease phenotypes. Our analysis of these phenotype-specific coexpression modules reveals a high prevalence of gene pleiotropy, and suggests that phenotype-specific transcription factor binding may contribute to phenotypic diversity. All resources from our study are made freely available on our online Phenotype Prediction Database.

PMID:20122238 | PMC:PMC3009536 | DOI:10.1186/1471-2105-11-S1-S62

Alignment-free sequence comparison (I): statistics and power

Gesine Reinert — Thu, 17 Dec 2009 06:00:00 -0500

J Comput Biol. 2009 Dec;16(12):1615-34. doi: 10.1089/cmb.2009.0198.

ABSTRACT

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the $D_2$ statistic, relies on the comparison of the k-tuple content for both sequences. Although it has been known for some years that the $D_2$ statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the $D_2$ word count statistic, which we call $D_2^S$ and $D_2^*$. For $D_2^S$, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed, when sequence lengths tend to infinity, and not dominated by the noise in the individual sequences. The second statistic, $D_2^*$, outperforms $D_2^S$ in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of $D_2^*$, we cannot provide a closed form for power calculations.

PMID:20001252 | PMC:PMC2818754 | DOI:10.1089/cmb.2009.0198
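
As a rough illustration of the word-count idea in the abstract above, the plain D2 statistic can be computed as the sum, over all k-tuples, of the product of the two sequences' counts for that k-tuple. The sketch below covers only this unadjusted statistic; the function names are illustrative and do not come from the paper, and the standardized variants are not implemented here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-tuple (word of length k) in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Plain D2: sum over words w of (count of w in seq_a) * (count of w in seq_b)."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for words it has never seen, so words that are
    # missing from seq_b contribute nothing to the sum.
    return sum(count * counts_b[word] for word, count in counts_a.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "CGTACG", 3))
```

Because the value grows with the raw counts in each sequence, a repetitive sequence inflates it on its own, which is consistent with the single-sequence noise problem the abstract describes.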

A single molecule scaffold for the maize genome

Shiguo Zhou — Thu, 26 Nov 2009 06:00:00 -0500

PLoS Genet. 2009 Nov;5(11):e1000711. doi: 10.1371/journal.pgen.1000711. Epub 2009 Nov 20.

ABSTRACT

About 85% of the maize genome consists of highly repetitive sequences that are interspersed by low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence build. For this purpose, we constructed a genome-wide, high-resolution optical map of the maize inbred line B73 genome containing >91,000 restriction sites (averaging 1 site/~23 kb) accrued from mapping genomic DNA molecules. Our optical map comprises 66 contigs, averaging 31.88 Mb in size and spanning 91.5% (2,103.93 Mb/~2,300 Mb) of the maize genome. A new algorithm was created that considered both optical map and unfinished BAC sequence data for placing 60/66 (2,032.42 Mb) optical map contigs onto the maize iMap. The alignment of optical maps against numerous data sources yielded comprehensive results that proved revealing and productive. For example, gaps were uncovered and characterized within the iMap, the FPC (fingerprinted contigs) map, and the chromosome-wide pseudomolecules. Such alignments also suggested amended placements of FPC contigs on the maize genetic map and proactively guided the assembly of chromosome-wide pseudomolecules, especially within complex genomic regions. Lastly, we think that the full integration of B73 optical maps with the maize iMap would greatly facilitate maize sequence finishing efforts that would make it a valuable reference for comparative studies among cereals, or other maize inbred lines and cultivars.

PMID:19936062 | PMC:PMC2774507 | DOI:10.1371/journal.pgen.1000711

An integrative network approach to map the transcriptome to the phenome

Michael R Mehan — Tue, 28 Jul 2009 06:00:00 -0400

J Comput Biol. 2009 Aug;16(8):1023-34. doi: 10.1089/cmb.2009.0037.

ABSTRACT

Although many studies have been successful in the discovery of cooperating groups of genes, mapping these groups to phenotypes has proved a much more challenging task. In this article, we present the first genome-wide mapping of gene coexpression modules onto the phenome. We annotated coexpression networks from 136 microarray datasets with phenotypes from the Unified Medical Language System (UMLS). We then designed an efficient graph-based simulated annealing approach to identify coexpression modules frequently and specifically occurring in datasets related to individual phenotypes. By requiring phenotype-specific recurrence, we ensure the robustness of our findings. We discovered 118,772 modules specific to 42 phenotypes, and developed validation tests combining Gene Ontology, GeneRIF and UMLS. Our method is generally applicable to any kind of abundant network data with defined phenotype association, and thus paves the way for genome-wide, gene network-phenotype maps.

PMID:19630539 | PMC:PMC3154457 | DOI:10.1089/cmb.2009.0037

HAPLOWSER: a whole-genome haplotype browser for personal genome and metagenome

Jong Hyun Kim — Tue, 30 Jun 2009 06:00:00 -0400

Bioinformatics. 2009 Sep 15;25(18):2430-1. doi: 10.1093/bioinformatics/btp399. Epub 2009 Jun 27.

ABSTRACT

SUMMARY: Haplotype assembly is becoming a very important tool in genome sequencing of human and other organisms. Although haplotypes were previously inferred from genome assemblies, there has never been a comparative haplotype browser that depicts a global picture of whole-genome alignments among haplotypes of different organisms. We introduce a whole-genome HAPLotype brOWSER (HAPLOWSER), providing evolutionary perspectives from multiple aligned haplotypes and functional annotations. Haplowser enables the comparison of haplotypes from metagenomes, and associates conserved regions or the bases at the conserved regions with functional annotations and custom tracks. The associations are quantified for further analysis and presented as pie charts. Functional annotations and custom tracks that are projected onto haplotypes are saved as multiple files in FASTA format. Haplowser provides a user-friendly interface, and can display alignments of haplotypes with functional annotations at any resolution.

AVAILABILITY: Haplowser, written in Java, supports multiple platforms including Windows and Linux. Haplowser is publicly available at http://embio.yonsei.ac.kr/haplowser .

PMID:19561337 | PMC:PMC2735662 | DOI:10.1093/bioinformatics/btp399

The Seventh Asia Pacific Bioinformatics Conference (APBC2009)

Michael Q Zhang — Thu, 12 Feb 2009 06:00:00 -0500

BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S1. doi: 10.1186/1471-2105-10-S1-S1.

NO ABSTRACT

PMID:19208108 | PMC:PMC2648764 | DOI:10.1186/1471-2105-10-S1-S1

A graph-based approach to systematically reconstruct human transcriptional regulatory modules

Xifeng Yan — Wed, 25 Jul 2007 06:00:00 -0400

Bioinformatics. 2007 Jul 1;23(13):i577-86. doi: 10.1093/bioinformatics/btm227.

ABSTRACT

MOTIVATION: A major challenge in studying gene regulation is to systematically reconstruct transcription regulatory modules, which are defined as sets of genes that are regulated by a common set of transcription factors. A commonly used approach for transcription module reconstruction is to derive coexpression clusters from a microarray dataset. However, such results often contain false positives because genes from many transcription modules may be simultaneously perturbed upon a given type of conditions. In this study, we propose and validate that genes, which form a coexpression cluster in multiple microarray datasets across diverse conditions, are more likely to form a transcription module. However, identifying genes coexpressed in a subset of many microarray datasets is not a trivial computational problem.

RESULTS: We propose a graph-based data-mining approach to efficiently and systematically identify frequent coexpression clusters. Given m microarray datasets, we model each microarray dataset as a coexpression graph, and search for vertex sets which are frequently densely connected across [θm] datasets (0 ≤ θ ≤ 1). For this novel graph-mining problem, we designed two techniques to narrow down the search space: (1) partition the input graphs into (overlapping) groups sharing common properties; (2) summarize the vertex neighbor information from the partitioned datasets onto the 'Neighbor Association Summary Graphs' for effective mining. We applied our method to 105 human microarray datasets, and identified a large number of potential transcription modules, activated under different subsets of conditions. Validation by ChIP-chip data demonstrated that the likelihood of a coexpression cluster being a transcription module increases significantly with its recurrence. Our method opens a new way to exploit the vast amount of existing microarray data accumulation for gene regulation study. Furthermore, the algorithm is applicable to other biological networks for approximate network module mining.

AVAILABILITY: http://zhoulab.usc.edu/NeMo/.

PMID:17646346 | DOI:10.1093/bioinformatics/btm227
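
The recurrence criterion in the abstract above can be illustrated with a small check: given m graphs, does a candidate vertex set look dense in enough of them? The sketch below is only a test of one candidate set, not the paper's mining algorithm. It assumes that the bracketed threshold is a ceiling of theta times m and that "densely connected" means edge density at or above a cutoff; both readings, and all names, are assumptions made for illustration.

```python
import math
from itertools import combinations


def density(vertices, edges):
    """Fraction of vertex pairs in `vertices` that are joined by an edge."""
    pairs = list(combinations(sorted(vertices), 2))
    if not pairs:
        return 0.0
    present = sum(1 for u, v in pairs if (u, v) in edges or (v, u) in edges)
    return present / len(pairs)


def is_frequent_dense(vertices, graphs, theta, min_density):
    """True if `vertices` is dense in at least ceil(theta * m) of the m graphs.

    Each graph is a set of undirected edges given as (u, v) tuples.
    """
    needed = math.ceil(theta * len(graphs))
    support = sum(1 for edges in graphs if density(vertices, edges) >= min_density)
    return support >= needed


if __name__ == "__main__":
    graphs = [
        {("a", "b"), ("b", "c"), ("a", "c")},
        {("a", "b")},
        {("a", "b"), ("b", "c"), ("c", "a")},
    ]
    print(is_frequent_dense({"a", "b", "c"}, graphs, theta=0.6, min_density=1.0))
```

Checking one candidate is cheap; the hard part the abstract refers to is searching the space of candidate vertex sets, which this sketch does not attempt.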

Systematic discovery of functional modules and context-specific functional annotation of human genome

Yu Huang — Wed, 25 Jul 2007 06:00:00 -0400

Bioinformatics. 2007 Jul 1;23(13):i222-9. doi: 10.1093/bioinformatics/btm222.

ABSTRACT

MOTIVATION: The rapid accumulation of microarray datasets provides unique opportunities to perform systematic functional characterization of the human genome. We designed a graph-based approach to integrate cross-platform microarray data, and extract recurrent expression patterns. A series of microarray datasets can be modeled as a series of co-expression networks, in which we search for frequently occurring network patterns. The integrative approach provides three major advantages over the commonly used microarray analysis methods: (1) enhance signal-to-noise separation, (2) identify functionally related genes without co-expression, and (3) provide a way to predict gene functions in a context-specific way.

RESULTS: We integrate 65 human microarray datasets, comprising 1105 experiments and over 11 million expression measurements. We develop a data mining procedure based on frequent itemset mining and biclustering to systematically discover network patterns that recur in at least five datasets. This resulted in 143,401 potential functional modules. Subsequently, we design a network topology statistic based on graph random walk that effectively captures characteristics of a gene's local functional environment. Function annotations based on this statistic are then subject to the assessment using the random forest method, combining six other attributes of the network modules. We assign 1126 functions to 895 genes, 779 known and 116 unknown, with a validation accuracy of 70%. Among our assignments, 20% of genes are assigned multiple functions based on different network environments.

AVAILABILITY: http://zhoulab.usc.edu/ContextAnnotation.

PMID:17646300 | DOI:10.1093/bioinformatics/btm222

Diploid genome reconstruction of Ciona intestinalis and comparative analysis with Ciona savignyi

Jong Hyun Kim — Fri, 15 Jun 2007 06:00:00 -0400

Genome Res. 2007 Jul;17(7):1101-10. doi: 10.1101/gr.5894107. Epub 2007 Jun 13.

ABSTRACT

One of the main goals in genome sequencing projects is to determine a haploid consensus sequence even when clone libraries are constructed from homologous chromosomes. However, it has been noticed that haplotypes can be inferred from genome assemblies by investigating phase conservation in sequenced reads. In this study, we seek to infer haplotypes, a diploid consensus sequence, from the genome assembly of an organism, Ciona intestinalis. The Ciona intestinalis genome is an ideal resource from which haplotypes can be inferred because of the high polymorphism rate (1.2%). The haplotype estimation scheme consists of polymorphism detection and phase estimation. The core step of our method is a Gibbs sampling procedure. The mate-pair information from two-end sequenced clone inserts is exploited to provide long-range continuity. We estimate the polymorphism rate of Ciona intestinalis to be 1.2% and 1.5%, according to two different polymorphism counting schemes. The distribution of heterozygosity number is well fit by a compound Poisson distribution. The N50 length of haplotype segments is 37.9 kb in our assembly, while the N50 scaffold length of the Ciona intestinalis assembly is 190 kb. We also infer diploid gene sequences from haplotype segments. According to our reconstruction, 85.4% of predicted gene sequences are continuously covered by single haplotype segments. Our results indicate 97% accuracy in haplotype estimation, based on a simulated data set. We conduct a comparative analysis with Ciona savignyi, and discover interesting patterns of conserved DNA elements in chordates.

PMID:17567986 | PMC:PMC1899121 | DOI:10.1101/gr.5894107
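
The N50 figures quoted in the abstract above follow the usual definition: sort the segment lengths from longest to shortest and report the length at which the running total first reaches half of the combined length. A minimal sketch of that standard calculation follows; the function name is illustrative and the example lengths are made up, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of segment lengths.

    N50 is the largest length L such that segments of length >= L
    together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```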

A quantile method for sizing optical maps

Haifeng Li — Fri, 15 Jun 2007 06:00:00 -0400

J Comput Biol. 2007 Apr;14(3):255-66. doi: 10.1089/cmb.2006.0006.

ABSTRACT

Optical mapping is an integrated system for the analysis of single DNA molecules. It constructs restriction maps (denoted "optical maps") from individual DNA molecules presented on surfaces after they are imaged by fluorescence microscopy. Because restriction digestion and fluorochrome staining are performed after molecules are mounted, resulting restriction fragments retain their order. Maps of fragment sizes and order are constructed by image processing techniques employing integrated fluorescence intensity measurements. Such analysis, in place of molecular length measurements, obviates the need for uniformly elongated molecules, but requires samples containing small fluorescent reference molecules for accurate sizing. Although robust in practice, elimination of internal reference molecules would reduce errors and extend single molecule analysis to other platforms. In this paper, we introduce a new approach that does not use reference molecules for direct estimation of restriction fragment sizes, by the exploitation of the quantiles associated with their expected distribution. We show that this approach is comparable to the current reference-based method as evaluated by map alignment techniques in terms of the rate of placement of optical maps to published sequence.

PMID:17563310 | DOI:10.1089/cmb.2006.0006

Auditory efferent feedback system deficits precede age-related hearing loss: contralateral suppression of otoacoustic emissions in mice

Xiaoxia Zhu — Fri, 15 Jun 2007 06:00:00 -0400

J Comp Neurol. 2007 Aug 10;503(5):593-604. doi: 10.1002/cne.21402.

ABSTRACT

The C57BL/6J mouse has been a useful model of presbycusis, as it displays an accelerated age-related peripheral hearing loss. The medial olivocochlear efferent feedback (MOC) system plays a role in suppressing cochlear outer hair cell (OHC) responses, particularly for background noise. Neurons of the MOC system are located in the superior olivary complex, particularly in the dorsomedial periolivary nucleus (DMPO) and in the ventral nucleus of the trapezoid body (VNTB). We previously discovered that the function of the MOC system declines with age prior to OHC degeneration, as measured by contralateral suppression (CS) of distortion product otoacoustic emissions (DPOAEs) in humans and CBA mice. The present study aimed to determine the time course of age changes in MOC function in C57s. DPOAE amplitudes and CS of DPOAEs were collected for C57s from 6 to 40 weeks of age. MOC responses were observed at 6 weeks but were gone at middle (15-30 kHz) and high (30-45 kHz) frequencies by 8 weeks. Quantitative stereological analyses of Nissl sections revealed smaller neurons in the DMPO and VNTB of young adult C57s compared with CBAs. These findings suggest that reduced neuron size may underlie part of the noteworthy rapid decline of the C57 efferent system. In conclusion, the C57 mouse has MOC function at 6 weeks, but it declines quickly, preceding the progression of peripheral age-related sensitivity deficits and hearing loss in this mouse strain.

PMID:17559088 | DOI:10.1002/cne.21402

On the length of the longest exact position match in a random sequence

Gesine Reinert — Tue, 06 Feb 2007 06:00:00 -0500

IEEE/ACM Trans Comput Biol Bioinform. 2007 Jan-Mar;4(1):153-6. doi: 10.1109/TCBB.2007.1023.

ABSTRACT

A mixed Poisson approximation and a Poisson approximation for the length of the longest exact match of a random sequence across another sequence are provided, where the match is required to start at position 1 in the first sequence. This problem arises when looking for suitable anchors in whole genome alignments.

PMID:17277422 | DOI:10.1109/TCBB.2007.1023
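
The quantity studied in the abstract above can be computed directly for a given pair of sequences: the length of the longest prefix of the first sequence that occurs somewhere in the second. The brute-force sketch below only illustrates that definition; it says nothing about the Poisson approximations themselves, and the function name is illustrative.

```python
def longest_prefix_match(first, second):
    """Length of the longest prefix of `first` occurring anywhere in `second`.

    If a prefix of length n + 1 occurs in `second`, so does the prefix of
    length n, so the prefix can simply be extended until it stops matching.
    """
    length = 0
    while length < len(first) and first[:length + 1] in second:
        length += 1
    return length


if __name__ == "__main__":
    print(longest_prefix_match("ACGT", "TTACGA"))
```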

Accuracy assessment of diploid consensus sequences

Jong Hyun Kim — Tue, 06 Feb 2007 06:00:00 -0500

IEEE/ACM Trans Comput Biol Bioinform. 2007 Jan-Mar;4(1):88-97. doi: 10.1109/TCBB.2007.1007.

ABSTRACT

If the origins of fragments are known in genome sequencing projects, it is straightforward to reconstruct diploid consensus sequences. In reality, however, this is not true. Although there are proposed methods to reconstruct haplotypes from genome sequencing projects, an accuracy assessment is required to evaluate the confidence of the estimated diploid consensus sequences. In this paper, we define the confidence score of diploid consensus sequences. It requires the calculation of the likelihood of an assembly. To calculate the likelihood, we propose a linear time algorithm with respect to the number of polymorphic sites. The likelihood calculation and confidence score are used for further improvements of haplotype estimation in two directions. One direction is that low-scored phases are disconnected. The other direction is that, instead of using nominal frequency 1/2, the haplotype frequency is estimated to reflect the actual contribution of each haplotype. Our method was evaluated on the simulated data whose polymorphism rate (1.2 percent) was based on Ciona intestinalis. As a result, the high accuracy of our algorithm was indicated: The true positive rate of the haplotype estimation was greater than 97 percent.

PMID:17277416 | DOI:10.1109/TCBB.2007.1007

Gene Aging Nexus: a web database and data mining platform for microarray data on aging

Fei Pan — Thu, 09 Nov 2006 06:00:00 -0500

Nucleic Acids Res. 2007 Jan;35(Database issue):D756-9. doi: 10.1093/nar/gkl798. Epub 2006 Nov 7.

ABSTRACT

The recent development of microarray technology provided unprecedented opportunities to understand the genetic basis of aging. So far, many microarray studies have addressed aging-related expression patterns in multiple organisms and under different conditions. The number of relevant studies continues to increase rapidly. However, efficient exploitation of these vast data is frustrated by the lack of an integrated data mining platform or other unifying bioinformatic resource to enable convenient cross-laboratory searches of array signals. To facilitate the integrative analysis of microarray data on aging, we developed a web database and analysis platform 'Gene Aging Nexus' (GAN) that is freely accessible to the research community to query/analyze/visualize cross-platform and cross-species microarray data on aging. By providing the possibility of integrative microarray analysis, GAN should be useful in building the systems-biology understanding of aging. GAN is accessible at http://gan.usc.edu.

PMID:17090592 | PMC:PMC1669755 | DOI:10.1093/nar/gkl798

An algorithm for assembly of ordered restriction maps from single DNA molecules

Anton Valouev — Wed, 18 Oct 2006 06:00:00 -0400

Proc Natl Acad Sci U S A. 2006 Oct 24;103(43):15770-5. doi: 10.1073/pnas.0604040103. Epub 2006 Oct 16.

ABSTRACT

The restriction mapping of a massive number of individual DNA molecules by optical mapping enables assembly of physical maps spanning mammalian and plant genomes; however, not through computational means permitting completely de novo assembly. Existing algorithms are not practical for genomes larger than lower eukaryotes due to their high time and space complexity. In many ways, sequence assembly parallels map assembly, so that the overlap-layout-consensus strategy, recently shown effective in assembling very large genomes in feasible time, sheds new light on solving map construction issues associated with single molecule substrates. Accordingly, we report an adaptation of this approach as the formal basis for de novo optical map assembly and demonstrate its computational feasibility for assembly of very large genomes. As such, we discuss assembly results for a series of genomes: human, plant, lower eukaryote and bacterial. Unlike sequence assembly, the optical map assembly problem is actually more complex because restriction maps from single molecules are constructed, manifesting errors stemming from: missing cuts, false cuts, and high variance of estimated fragment sizes; chimeric maps resulting from artifactually merged molecules; and true overlap scores that are "in the noise" or "slightly above the noise." We address these problems, fundamental to many single molecule measurements, by an effective error correction method using global overlap information to eliminate spurious overlaps and chimeric maps that are otherwise difficult to identify.

PMID:17043225 | PMC:PMC1635078 | DOI:10.1073/pnas.0604040103

Integrative missing value estimation for microarray data

Jianjun Hu — Sat, 14 Oct 2006 06:00:00 -0400

BMC Bioinformatics. 2006 Oct 12;7:449. doi: 10.1186/1471-2105-7-449.

ABSTRACT

BACKGROUND: Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples.

RESULTS: We present the integrative Missing Value Estimation method (iMISS) by incorporating information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking reference data sets into consideration. To determine whether the given reference data sets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS can significantly and consistently improve the accuracy of the state-of-the-art Local Least Square (LLS) imputation algorithm, by up to 15% in our benchmark tests.

CONCLUSION: We demonstrated that the order-statistics-based integrative imputation algorithms can achieve significant improvements over the state-of-the-art missing value estimation approaches such as LLS and are especially good for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.

PMID:17038176 | PMC:PMC1622759 | DOI:10.1186/1471-2105-7-449

Alignment of optical maps

Anton Valouev — Fri, 07 Apr 2006 06:00:00 -0400

J Comput Biol. 2006 Mar;13(2):442-62. doi: 10.1089/cmb.2006.13.442.

ABSTRACT

We introduce a new scoring method for calculation of alignments of optical maps. Missing cuts, false cuts, and sizing errors present in optical maps are addressed by our alignment score through calculation of corresponding likelihoods. The size error model is derived through the application of the Central Limit Theorem and validated by residual plots collected from real data. Missing cuts and false cuts are modeled as Bernoulli and Poisson events, respectively, as suggested by previous studies. Likelihoods are used to derive an alignment score through calculation of likelihood ratios for a certain hypothesis test. This allows us to achieve maximal discriminative power for the alignment score. Our scoring method is naturally embedded within a well known DP framework for finding optimal alignments.

PMID:16597251 | DOI:10.1089/cmb.2006.13.442

Refinement of optical map assemblies

Anton Valouev — Tue, 28 Feb 2006 06:00:00 -0500

Bioinformatics. 2006 May 15;22(10):1217-24. doi: 10.1093/bioinformatics/btl063. Epub 2006 Feb 24.

ABSTRACT

MOTIVATION: Genomic mutations and variations provide insightful information about the functionality of sequence elements and their association with human diseases. Traditionally, variations are identified through analysis of short DNA sequences, usually shorter than 1000 bp per fragment. Optical maps provide both faster and more cost-efficient means for detecting such differences, because a single map can span over 1 million bp. Optical maps are assembled to cover the whole genome, and the accuracy of assembly is critical.

RESULTS: We present a computationally efficient model-based method for improving the quality of such assemblies. Our method provides very high accuracy even with moderate coverage (<20x). We utilize a hidden Markov model to represent the consensus map and use the expectation-maximization algorithm to drive the refinement process. We also provide quality scores to assess the quality of the finished map.

AVAILABILITY: Code is available from www.cmb.usc.edu/people/valouev/

PMID:16500933 | DOI:10.1093/bioinformatics/btl063

An Eulerian path approach to local multiple alignment for DNA sequences

Yu Zhang — Wed, 26 Jan 2005 06:00:00 -0500

Proc Natl Acad Sci U S A. 2005 Feb 1;102(5):1285-90. doi: 10.1073/pnas.0409240102. Epub 2005 Jan 24.

ABSTRACT

Expensive computation in handling a large number of sequences limits the application of local multiple sequence alignment. We present an Eulerian path approach to local multiple alignment for DNA sequences. The computational time and memory usage of this approach are approximately linear in the total size of sequences analyzed; hence, it can handle thousands of sequences or millions of letters simultaneously. By constructing a De Bruijn graph, most of the conserved segments are amplified as heavy Eulerian paths in the graph, and the original patterns distributed in sequences are recovered even if they do not exist in any single sequence. This approach can accurately detect unknown conserved regions, for both short and long, conserved and degenerate patterns. We further present a Poisson heuristic to estimate the significance of a local multiple alignment. The performance of our method is demonstrated by finding Alu repeats in the human genome. We compare the results with Alus marked by RepeatMasker, where the two programs are in good agreement. Our method is robust under various conditions and superior to other methods in terms of efficiency and accuracy.

PMID:15668398 | PMC:PMC547885 | DOI:10.1073/pnas.0409240102
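
To make the graph construction mentioned in the abstract above concrete, the sketch below builds a weighted De Bruijn graph from a set of sequences: each k-tuple becomes an edge from its first k-1 letters to its last k-1 letters, and the edge weight counts how often that k-tuple occurs across all inputs. This is the textbook construction only; how the paper finds heavy paths and resolves them is not shown, and the function name and example data are illustrative.

```python
from collections import Counter


def de_bruijn_edges(sequences, k):
    """Map each (prefix, suffix) edge to the number of k-tuples that produce it.

    Nodes are (k-1)-tuples; a k-tuple w contributes the edge w[:-1] -> w[1:].
    """
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


if __name__ == "__main__":
    graph = de_bruijn_edges(["ACGT", "CGTA"], 3)
    for (src, dst), weight in sorted(graph.items()):
        print(src, "->", dst, "weight", weight)
```

Edges shared by many input sequences accumulate large weights, which is the sense in which conserved segments show up as heavy paths.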

DNA sequence assembly and multiple sequence alignment by an Eulerian path approach

Y Zhang — Thu, 02 Sep 2004 06:00:00 -0400

Cold Spring Harb Symp Quant Biol. 2003;68:205-12. doi: 10.1101/sqb.2003.68.205.

NO ABSTRACT

PMID:15338619 | DOI:10.1101/sqb.2003.68.205

HapBlock: haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms

Kui Zhang — Tue, 31 Aug 2004 06:00:00 -0400

Bioinformatics. 2005 Jan 1;21(1):131-4. doi: 10.1093/bioinformatics/bth482. Epub 2004 Aug 27.

ABSTRACT

Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome with some regions of high LD interspersed with regions of low LD. Such LD patterns make it possible to select a set of single nucleotide polymorphisms (SNPs; tag SNPs) for genome-wide association studies. We have developed a suite of computer programs to analyze the block-like LD patterns and to select the corresponding tag SNPs. Compared to other programs for haplotype block partitioning and tag SNP selection, our program has several notable features. First, the dynamic programming algorithms implemented are guaranteed to find the block partition with the minimum number of tag SNPs for the given criteria of blocks and tag SNPs. Second, both haplotype data and genotype data from unrelated individuals and/or from general pedigrees can be analyzed. Third, several existing measures/criteria for haplotype block partitioning and tag SNP selection have been implemented in the program. Finally, the programs provide flexibility to include specific SNPs (e.g. non-synonymous SNPs) as tag SNPs.

AVAILABILITY: The HapBlock program and its supplemental documents can be downloaded from the website http://www.cmb.usc.edu/~msms/HapBlock.

PMID:15333454 | DOI:10.1093/bioinformatics/bth482
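
The optimality guarantee described in the abstract above is the kind a one-dimensional dynamic program provides: the best partition of the first i SNPs is the best partition of some shorter prefix plus one final block. The sketch below shows that recurrence in generic form. It is not HapBlock's code; the callbacks `is_block` and `num_tags` are hypothetical stand-ins for whatever block criterion and tag SNP criterion a user chooses.

```python
import math


def min_tag_partition(n, is_block, num_tags):
    """Partition SNPs 0..n-1 into consecutive blocks using the fewest tag SNPs.

    is_block(start, end) -> bool : hypothetical test that SNPs [start, end) form a block
    num_tags(start, end) -> int  : hypothetical tag SNP count needed for that block
    Returns (total_tags, list of (start, end) blocks).
    """
    best = [math.inf] * (n + 1)   # best[i] = fewest tags covering SNPs [0, i)
    back = [None] * (n + 1)       # back[i] = start of the last block in that solution
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] < math.inf and is_block(start, end):
                cost = best[start] + num_tags(start, end)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = start
    if best[n] == math.inf:
        raise ValueError("no valid block partition exists")
    # Walk the back-pointers to recover the blocks in left-to-right order.
    blocks = []
    end = n
    while end > 0:
        start = back[end]
        blocks.append((start, end))
        end = start
    blocks.reverse()
    return best[n], blocks


if __name__ == "__main__":
    # Toy criteria: any run of at most 2 SNPs is a block and needs 1 tag SNP.
    print(min_tag_partition(4, lambda s, e: e - s <= 2, lambda s, e: 1))
```

The double loop calls the two criteria once for each (start, end) pair, so the number of criterion evaluations is quadratic in the number of SNPs.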

Haplotype reconstruction from SNP alignment

Lei M Li — Tue, 03 Aug 2004 06:00:00 -0400

J Comput Biol. 2004;11(2-3):505-16. doi: 10.1089/1066527041410454.

ABSTRACT

In this paper, we describe a method for statistical reconstruction of haplotypes from a set of aligned SNP fragments. We consider the case of a pair of homologous human chromosomes, one from the mother and the other from the father. After fragment assembly, we wish to reconstruct the two haplotypes of the parents. Given a set of potential SNP sites inferred from the assembly alignment, we wish to divide the fragment set into two subsets, each of which represents one chromosome. Our method is based on a statistical model of sequencing errors, compositional information, and haplotype memberships. We calculate probabilities of different haplotypes conditional on the alignment. Due to computational complexity, we first determine phases for neighboring SNPs. Then we connect them and construct haplotype segments. Also, we compute the accuracy or confidence of the reconstructed haplotypes. We discuss other issues, such as alternative methods, parameter estimation, computational efficiency, and relaxation of assumptions.

PMID:15285905 | DOI:10.1089/1066527041410454

Haplotype block partitioning and tag SNP selection using genotype data and their applications to association studies

Kui Zhang — Wed, 14 Apr 2004 06:00:00 -0400

Genome Res. 2004 May;14(5):908-16. doi: 10.1101/gr.1837404. Epub 2004 Apr 12.

ABSTRACT

Recent studies have revealed that linkage disequilibrium (LD) patterns vary across the human genome with some regions of high LD interspersed by regions of low LD. A small fraction of SNPs (tag SNPs) is sufficient to capture most of the haplotype structure of the human genome. In this paper, we develop a method to partition haplotypes into blocks and to identify tag SNPs based on genotype data by combining a dynamic programming algorithm for haplotype block partitioning and tag SNP selection based on haplotype data with a variation of the expectation maximization (EM) algorithm for haplotype inference. We assess the effects of using either haplotype or genotype data in haplotype block identification and tag SNP selection as a function of several factors, including sample size, density or number of SNPs studied, allele frequencies, fraction of missing data, and genotyping error rate, using extensive simulations. We find that a modest number of haplotype or genotype samples will result in consistent block partitions and tag SNP selection. The power of association studies based on tag SNPs using genotype data is similar to that using haplotype data.

PMID:15078859 | PMC:PMC479119 | DOI:10.1101/gr.1837404

An Eulerian path approach to global multiple alignment for DNA sequences

Yu Zhang — Thu, 26 Feb 2004 06:00:00 -0500

J Comput Biol. 2003;10(6):803-19. doi: 10.1089/106652703322756096.

ABSTRACT

With the rapid increase in the dataset of genome sequences, the multiple sequence alignment problem is increasingly important and frequently involves the alignment of a large number of sequences. Many heuristic algorithms have been proposed to improve the speed of computation and the quality of alignment. We introduce a novel approach that is fundamentally different from all currently available methods. Our motivation comes from the Eulerian method for fragment assembly in DNA sequencing that transforms all DNA fragments into a de Bruijn graph and then reduces sequence assembly to an Eulerian path problem. The paper focuses on global multiple alignment of DNA sequences, where entire sequences are aligned into one configuration. Our main result is an algorithm with almost linear computational speed with respect to the total size (number of letters) of sequences to be aligned. Five hundred simulated sequences (averaging 500 bases per sequence and as low as 70% pairwise identity) have been aligned within three minutes on a personal computer, and the quality of alignment is satisfactory. As a result, accurate and simultaneous alignment of thousands of long sequences within a reasonable amount of time becomes possible. Data from an Arabidopsis sequencing project is used to demonstrate the performance.

PMID:14980012 | DOI:10.1089/106652703322756096

Whole-genome shotgun assembly and comparison of human genome assemblies

Sorin Istrail — Wed, 11 Feb 2004 06:00:00 -0500

Proc Natl Acad Sci U S A. 2004 Feb 17;101(7):1916-21. doi: 10.1073/pnas.0307971100. Epub 2004 Feb 9.

ABSTRACT

We report a whole-genome shotgun assembly (called WGSA) of the human genome generated at Celera in 2001. The Celera-generated shotgun data set consisted of 27 million sequencing reads organized in pairs by virtue of end-sequencing 2-kbp, 10-kbp, and 50-kbp inserts from shotgun clone libraries. The quality-trimmed reads covered the genome 5.3 times, and the inserts from which pairs of reads were obtained covered the genome 39 times. With the nearly complete human DNA sequence [National Center for Biotechnology Information (NCBI) Build 34] now available, it is possible to directly assess the quality, accuracy, and completeness of WGSA and of the first reconstructions of the human genome reported in two landmark papers in February 2001 [Venter, J. C., Adams, M. D., Myers, E. W., Li, P. W., Mural, R. J., Sutton, G. G., Smith, H. O., Yandell, M., Evans, C. A., Holt, R. A., et al. (2001) Science 291, 1304-1351; International Human Genome Sequencing Consortium (2001) Nature 409, 860-921]. The analysis of WGSA shows 97% order and orientation agreement with NCBI Build 34, where most of the 3% of sequence out of order is due to scaffold placement problems as opposed to assembly errors within the scaffolds themselves. In addition, WGSA fills some of the remaining gaps in NCBI Build 34. The early genome sequences all covered about the same amount of the genome, but they did so in different ways. The Celera results provide more order and orientation, and the consortium sequence provides better coverage of exact and nearly exact repeats.

PMID:14769938 | PMC:PMC357027 | DOI:10.1073/pnas.0307971100

Estimating the repeat structure and length of DNA sequences using L-tuples

Xiaoman Li — Thu, 07 Aug 2003 06:00:00 -0400

Genome Res. 2003 Aug;13(8):1916-22. doi: 10.1101/gr.1251803.

ABSTRACT

In shotgun sequencing projects, the genome or BAC length is not always known. We approach estimating genome length by first estimating the repeat structure of the genome or BAC, sometimes of interest in its own right, on the basis of a set of random reads from a genome project. Moreover, we can find the consensus for repeat families before assembly. Our methods are based on the l-tuple content of the reads.

PMID:12902383 | PMC:PMC403783 | DOI:10.1101/gr.1251803

Haplotype block partition with limited resources and applications to human chromosome 21 haplotype data

Kui Zhang — Fri, 13 Jun 2003 06:00:00 -0400

Am J Hum Genet. 2003 Jul;73(1):63-73. doi: 10.1086/376437. Epub 2003 Jun 10.

ABSTRACT

Recent studies have shown that the human genome has a haplotype block structure such that it can be decomposed into large blocks with high linkage disequilibrium (LD) and relatively limited haplotype diversity, separated by short regions of low LD. One of the practical implications of this observation is that only a small fraction of all the single-nucleotide polymorphisms (SNPs) (referred to as "tag SNPs") can be chosen for mapping genes responsible for human complex diseases, which can significantly reduce genotyping effort, without much loss of power. Algorithms have been developed to partition haplotypes into blocks with the minimum number of tag SNPs for an entire chromosome. In practice, investigators may have limited resources, and only a certain number of SNPs can be genotyped. In the present article, we first formulate this problem as finding a block partition with a fixed number of tag SNPs that can cover the maximal percentage of the whole genome, and we then develop two dynamic programming algorithms to solve this problem. The algorithms are sufficiently flexible to permit knowledge of functional polymorphisms to be considered. We apply the algorithms to a data set of SNPs on human chromosome 21, combining the information of coding and noncoding regions. We study the density of SNPs in intergenic regions, introns, and exons, and we find that the SNP density in intergenic regions is similar to that in introns and is higher than that in exons, results that are consistent with previous studies. We also calculate the distribution of block break points in intergenic regions, genes, exons, and coding regions and do not find any significant differences.

PMID:12802783 | PMC:PMC1180591 | DOI:10.1086/376437

Distributional regimes for the number of k-word matches between two random sequences

Ross A Lippert — Fri, 11 Oct 2002 06:00:00 -0400

Proc Natl Acad Sci U S A. 2002 Oct 29;99(22):13980-9. doi: 10.1073/pnas.202468099. Epub 2002 Oct 8.

ABSTRACT

When comparing two sequences, a natural approach is to count the number of k-letter words the two sequences have in common. No positional information is used in the count, but it has the virtue that the comparison time is linear with sequence length. For this reason, this statistic, D(2), and certain transformations of D(2) are used for EST sequence database searches. In this paper we begin the rigorous study of the statistical distribution of D(2). Using an independence model of DNA sequences, we derive limiting distributions by means of the Stein and Chen-Stein methods and identify three asymptotic regimes, including compound Poisson and normal. The compound Poisson distribution arises when the word size k is large and word matches are rare. The normal distribution arises when the word size is small and matches are common. Explicit expressions for what is meant by large and small word sizes are given in the paper. However, when word size is small and the letters are uniformly distributed, the anticipated limiting normal distribution does not always occur. In this situation the uniform distribution provides the exception to other letter distributions. Therefore a naive, one-distribution-fits-all approach to D(2) statistics could easily create serious errors in estimating significance.
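The word-match count described above can be sketched in a few lines. This is only an illustration, assuming the usual reading of D(2) as the number of position pairs (i, j) at which the two sequences carry the same k-letter word; the function names `kword_counts` and `d2` are hypothetical and do not come from the paper.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-letter word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Each word contributes (occurrences in seq_a) * (occurrences in seq_b),
    so positions are ignored and only word content matters.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # A Counter returns 0 for words that never occur in seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Each sequence is scanned once and the words are tallied in a hash table, which is where the roughly linear comparison time comes from (for fixed k).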

PMID:12374863 | PMC:PMC137823 | DOI:10.1073/pnas.202468099

A dynamic programming algorithm for haplotype block partitioning

Kui Zhang — Wed, 29 May 2002 06:00:00 -0400

Proc Natl Acad Sci U S A. 2002 May 28;99(11):7335-9. doi: 10.1073/pnas.102186799.

ABSTRACT

We develop a dynamic programming algorithm for haplotype block partitioning to minimize the number of representative single nucleotide polymorphisms (SNPs) required to account for most of the common haplotypes in each block. Any measure of haplotype quality can be used in the algorithm and of course the measure should depend on the specific application. The dynamic programming algorithm is applied to analyze the chromosome 21 haplotype data of Patil et al. [Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., et al. (2001) Science 294, 1719-1723], who searched for blocks of limited haplotype diversity. Using the same criteria as in Patil et al., we identify a total of 3,582 representative SNPs and 2,575 blocks that are 21.5% and 37.7% smaller, respectively, than those identified using a greedy algorithm of Patil et al. We also apply the dynamic programming algorithm to the same data set based on haplotype diversity. A total of 3,982 representative SNPs and 1,884 blocks are identified to account for 95% of the haplotype diversity in each block.
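A dynamic program of this general shape can be sketched as follows. It is a minimal illustration, not the authors' implementation: the function `partition_blocks` and the callback `tag_cost` are hypothetical, and the callback stands in for whatever block-quality measure an application chooses. It is assumed to return the number of representative SNPs needed for a candidate block, or `None` when the range does not qualify as a block.

```python
def partition_blocks(n_snps, tag_cost):
    """Split SNPs 0..n_snps-1 into consecutive blocks with minimum total cost.

    tag_cost(i, j) is an assumed, user-supplied function giving the number of
    representative SNPs for the block covering SNPs i..j-1, or None if that
    range is not an acceptable block.
    Returns (total_cost, [(start, end), ...]) with half-open block ranges.
    """
    inf = float("inf")
    best = [0] + [inf] * n_snps      # best[j]: minimum cost for the first j SNPs
    back = [0] * (n_snps + 1)        # back[j]: start of the last block ending at j
    for j in range(1, n_snps + 1):
        for i in range(j):
            cost = tag_cost(i, j)
            if cost is None or best[i] == inf:
                continue
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    if best[n_snps] == inf:
        raise ValueError("no valid block partition exists")
    # Walk the back-pointers from the end to recover the blocks.
    blocks = []
    j = n_snps
    while j > 0:
        blocks.append((back[j], j))
        j = back[j]
    blocks.reverse()
    return best[n_snps], blocks
```

As written, the sketch evaluates `tag_cost` for every pair (i, j), so it makes a quadratic number of calls; limiting the maximum block length would reduce that.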

PMID:12032283 | PMC:PMC124231 | DOI:10.1073/pnas.102186799

An Eulerian path approach to DNA fragment assembly

P A Pevzner — Thu, 16 Aug 2001 06:00:00 -0400

Proc Natl Acad Sci U S A. 2001 Aug 14;98(17):9748-53. doi: 10.1073/pnas.171285098.

ABSTRACT

For the last 20 years, fragment assembly in DNA sequencing followed the "overlap-layout-consensus" paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical "overlap-layout-consensus" approach in favor of a new EULER algorithm that, for the first time, resolves the 20-year-old "repeat problem" in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. EULER, in contrast to the Celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
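The reduction to an Eulerian path can be illustrated on a toy example. The sketch below is not the EULER algorithm itself: it assumes error-free reads, full coverage, and that every k-letter word occurs only once in the target sequence. All function names are hypothetical. Each distinct k-mer becomes an edge between its (k-1)-letter prefix and suffix, and the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read k-mer."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):           # sorted only to make the result repeatable
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a non-empty graph that has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())   # follow an unused edge
        else:
            path.append(stack.pop())           # dead end: emit node
    path.reverse()
    return path


def spell(path):
    """Concatenate a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the reads `["GACA", "ACAC", "CACT"]` with k = 3, the node `AC` is visited twice, and the path spells out `GACACT`.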

PMID:11504945 | PMC:PMC55524 | DOI:10.1073/pnas.171285098

Probabilistic and statistical properties of words: an overview

G Reinert — Thu, 13 Jul 2000 06:00:00 -0400

J Comput Biol. 2000 Feb-Apr;7(1-2):1-46. doi: 10.1089/10665270050081360.

ABSTRACT

In the following, an overview is given on statistical and probabilistic properties of words, as occurring in the analysis of biological sequences. Counts of occurrence, counts of clumps, and renewal counts are distinguished, and exact distributions as well as normal approximations, Poisson process approximations, and compound Poisson approximations are derived. Here, a sequence is modelled as a stationary ergodic Markov chain; a test for determining the appropriate order of the Markov chain is described. The convergence results take the error made by estimating the Markovian transition probabilities into account. The main tools involved are moment generating functions, martingales, Stein's method, and the Chen-Stein method. Similar results are given for occurrences of multiple patterns, and, as an example, the problem of unique recoverability of a sequence from SBH chip data is discussed. Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results can be used to derive approximate, and conservative, confidence intervals for tests.
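The three kinds of counts distinguished above can be made concrete with a short sketch. The definitions used here are the common ones and are stated as assumptions: an occurrence is any (possibly overlapping) match, a clump is a maximal run of occurrences in which each overlaps the next, and a renewal is an occurrence that does not overlap the previously counted renewal in a left-to-right scan. The function names are hypothetical.

```python
def occurrences(seq, word):
    """Start positions of all occurrences of word in seq, overlaps included."""
    k = len(word)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == word]


def clump_count(seq, word):
    """Number of maximal runs of mutually chained, overlapping occurrences."""
    k = len(word)
    clumps = 0
    last = None
    for i in occurrences(seq, word):
        # A gap of at least k means this occurrence cannot overlap the last one.
        if last is None or i - last >= k:
            clumps += 1
        last = i
    return clumps


def renewal_count(seq, word):
    """Number of occurrences that do not overlap the previous counted renewal."""
    k = len(word)
    count = 0
    next_free = 0
    for i in occurrences(seq, word):
        if i >= next_free:
            count += 1
            next_free = i + k
    return count
```

For the word `AA` in `AAAAA` there are four occurrences, one clump, and two renewals, which shows how self-overlap makes the three counts diverge.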

PMID:10890386 | DOI:10.1089/10665270050081360

Estimation for restriction sites observed by optical mapping using reversible-jump Markov Chain Monte Carlo

J K Lee — Sat, 17 Oct 1998 06:00:00 -0400

J Comput Biol. 1998 Fall;5(3):505-15. doi: 10.1089/cmb.1998.5.505.

ABSTRACT

A fundamentally new molecular-biology approach to constructing restriction maps, Optical Mapping, has been developed by Schwartz et al. (1993). Using this method, restriction maps are constructed by measuring the relevant fluorescence intensity and length measurements. However, it is difficult to directly estimate the restriction site locations of single DNA molecules based on these optical mapping data because of the precision of length measurements and the unknown number of true restriction sites in the data. We propose the use of a hierarchical Bayes model based on a mixture model with normals and random noise. In this model we explicitly consider the missing observation structure of the data, such as the orientations of molecules, the allocations of cutting sites to restriction sites, and the indicator variables of whether observed cut sites are true or false. Because of the complexity of the model, the large amount of missing data, and the unknown number of restriction sites, we use Reversible-Jump Markov Chain Monte Carlo (MCMC) to estimate the number and the locations of the restriction sites. Since there exists a high multimodality due to unknown orientations of molecules, we also use a combination of our MCMC approach and the flipping algorithm suggested by Dančík and Waterman (1997). The study is highly computer-intensive and the development of an efficient algorithm is required.

PMID:9773346 | DOI:10.1089/cmb.1998.5.505