  • Uncertainty in homology inferences: assessing and improving genomic sequence alignment.

    12 December 2017

    Sequence alignment underpins all of comparative genomics, yet it remains an incompletely solved problem. In particular, the statistical uncertainty within inferred alignments is often disregarded, while parametric or phylogenetic inferences are considered meaningless without confidence estimates. Here, we report on a theoretical and simulation study of pairwise alignments of genomic DNA at human-mouse divergence. We find that >15% of aligned bases are incorrect in existing whole-genome alignments, and we identify three types of alignment error, each leading to systematic biases in all algorithms considered. Careful modeling of the evolutionary process improves alignment quality; however, these improvements are modest compared with the remaining alignment errors, even with exact knowledge of the evolutionary model, emphasizing the need for statistical approaches to account for uncertainty. We develop a new algorithm, Marginalized Posterior Decoding (MPD), which explicitly accounts for uncertainties, is less biased and more accurate than other algorithms we consider, and reduces the proportion of misaligned bases by a third compared with the best existing algorithm. To our knowledge, this is the first nonheuristic algorithm for DNA sequence alignment to show robust improvements over the classic Needleman-Wunsch algorithm. Despite this, considerable uncertainty remains even in the improved alignments. We conclude that a probabilistic treatment is essential, both to improve alignment quality and to quantify the remaining uncertainty. This is becoming increasingly relevant with the growing appreciation of the importance of noncoding DNA, whose study relies heavily on alignments. Alignment errors are inevitable, and should be considered when drawing conclusions from alignments. Software and alignments to assist researchers in doing this are provided.
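
    For readers unfamiliar with the baseline the abstract compares against, the classic Needleman-Wunsch global alignment can be sketched as below. This is a minimal illustration only, not the authors' MPD algorithm or their software: the function name and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for the example. It returns a single highest-scoring alignment, which is exactly the kind of point estimate whose uncertainty the abstract argues should be quantified.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment with a linear gap penalty.

    Scoring values are illustrative assumptions, not taken from the paper.
    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1]
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back one optimal path (ties are broken in favour of the diagonal)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

    For example, `needleman_wunsch("ACGT", "AGT")` aligns the two sequences with one gap opposite the `C`. Other placements with the same score may exist for longer inputs, and the traceback reports only one of them.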

  • The Pfam protein families database.

    12 December 2017

    Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK, the USA and Sweden. Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.

  • PairsDB atlas of protein sequence space.

    12 December 2017

    Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria, for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in UniProt. The data is stored in a MySQL relational database. Data files will be made available for download, and PairsDB can also be accessed interactively. PairsDB data is a valuable platform on which to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.
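
    As a rough illustration of how an all-against-all similarity graph can seed family clustering, the sketch below groups sequences by single-linkage clustering (connected components over edges above a score threshold). This is an assumption made for the example, not a method the abstract attributes to PairsDB: the function name, the `(id_a, id_b, score)` tuple layout and the threshold are all hypothetical, and real pipelines typically use more selective criteria.

```python
def cluster_by_similarity(pairs, min_score=0.0):
    """Single-linkage clustering of sequence identifiers.

    pairs: iterable of (id_a, id_b, score) tuples (hypothetical layout).
    Two identifiers end up in the same cluster if they are connected by a
    chain of pairs whose score is at least min_score.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for id_a, id_b, score in pairs:
        # Register both identifiers even if the edge is below the threshold
        root_a, root_b = find(id_a), find(id_b)
        if score >= min_score and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)

    # Sorted output so results are deterministic
    return sorted(sorted(members) for members in clusters.values())
```

    Single linkage is the simplest choice and tends to merge families through chance bridging hits, which is one reason the domain delineation and consistency measures mentioned in the abstract matter in practice.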

  • OPTIC: orthologous and paralogous transcripts in clades.

    12 December 2017

    The genome sequences of a large number of metazoan species are now known. As multiple closely related genomes are sequenced, comparative studies that previously focussed on only pairs of genomes can now be extended over whole clades. The orthologous and paralogous transcripts in clades (OPTIC) database currently provides sets of gene predictions and orthology assignments for three clades: (i) amniotes, including human, dog, mouse, opossum, platypus and chicken (17 443 orthologous groups); (ii) a Drosophila clade of 12 species (12 889 orthologous groups) and (iii) a nematode clade of four species (13 626 orthologous groups). Gene predictions, multiple alignments and phylogenetic trees are freely available to browse and download. Further genomes and clades will be added in the future.

  • Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences.

    13 December 2017

    We report a high-quality draft of the genome sequence of the grey, short-tailed opossum (Monodelphis domestica). As the first metatherian ('marsupial') species to be sequenced, the opossum provides a unique perspective on the organization and evolution of mammalian genomes. Distinctive features of the opossum chromosomes provide support for recent theories about genome evolution and function, including a strong influence of biased gene conversion on nucleotide sequence composition, and a relationship between chromosomal characteristics and X chromosome inactivation. Comparison of opossum and eutherian genomes also reveals a sharp difference in evolutionary innovation between protein-coding and non-coding functional elements. True innovation in protein-coding genes seems to be relatively rare, with lineage-specific differences being largely due to diversification and rapid turnover in gene families involved in environmental interactions. In contrast, about 20% of eutherian conserved non-coding elements (CNEs) are recent inventions that postdate the divergence of Eutheria and Metatheria. A substantial proportion of these eutherian-specific CNEs arose from sequence inserted by transposable elements, pointing to transposons as a major creative force in the evolution of mammalian gene regulation.

  • Genome sequence, comparative analysis and haplotype structure of the domestic dog.

    13 December 2017

    Here we report a high-quality draft genome sequence of the domestic dog (Canis familiaris), together with a dense map of single nucleotide polymorphisms (SNPs) across breeds. The dog is of particular interest because it provides important evolutionary information and because existing breeds show great phenotypic diversity for morphological, physiological and behavioural traits. We use sequence comparison with the primate and rodent lineages to shed light on the structure and evolution of genomes and genes. Notably, the majority of the most highly conserved non-coding sequences in mammalian genomes are clustered near a small subset of genes with important roles in development. Analysis of SNPs reveals long-range haplotypes across the entire dog genome, and defines the nature of genetic diversity within and across breeds. The current SNP map now makes it possible for genome-wide association studies to identify genes responsible for diseases and traits, with important consequences for human and companion animal health.

  • Genome-wide association between branch point properties and alternative splicing.

    12 December 2017

    The branch point (BP) is one of the three obligatory signals required for pre-mRNA splicing. In mammals, the degeneracy of the motif combined with the lack of a large set of experimentally verified BPs complicates the task of modeling it in silico, and therefore of predicting the location of natural BPs. Consequently, BPs have been disregarded in a considerable fraction of the genome-wide studies on the regulation of splicing in mammals. We present a new computational approach for mammalian BP prediction. Using sequence conservation and positional bias we obtained a set of motifs with good agreement with U2 snRNA binding stability. Using a Support Vector Machine algorithm, we created a model complemented with polypyrimidine tract features, which considerably improves the prediction accuracy over previously published methods. Applying our algorithm to human introns, we show that BP position is highly dependent on the presence of AG dinucleotides in the 3' end of introns, with distance to the 3' splice site and BP strength strongly correlating with alternative splicing. Furthermore, experimental BP mapping for five exons preceded by long AG-dinucleotide exclusion zones revealed that, for a given intron, more than one BP can be chosen throughout the course of splicing. Finally, the comparison between exons of different evolutionary ages and pseudo exons suggests a key role of the BP in the pathway of exon creation in human. Our computational and experimental analyses suggest that BP recognition is more flexible than previously assumed, and it appears highly dependent on the presence of downstream polypyrimidine tracts. The reported association between BP features and the splicing outcome suggests that this, so far disregarded but yet crucial, element buries information that can complement current acceptor site models.

  • Alternative splicing: Global insights: Minireview

    20 November 2017

    Following the original reports of pre-mRNA splicing in 1977, it was quickly realized that splicing together different combinations of splice sites (alternative splicing) allows individual genes to generate more than one mRNA isoform. The full extent of alternative splicing only began to be revealed once large-scale genome and transcriptome sequencing projects began, rapidly revealing that alternative splicing is the rule rather than the exception. Recent technical innovations have facilitated the investigation of alternative splicing at a global scale. Splice-sensitive microarray platforms and deep sequencing allow quantitative profiling of very large numbers of alternative splicing events, whereas global analysis of the targets of RNA binding proteins reveals the regulatory networks involved in post-transcriptional gene control. Combined with sophisticated computational analysis, these new approaches are beginning to reveal the so-called 'RNA code' that underlies tissue and developmentally regulated alternative splicing, and that can be disrupted by disease-causing mutations. © 2010 FEBS.

  • Evolutionarily conserved human targets of adenosine to inosine RNA editing.

    12 December 2017

    A-to-I RNA editing by ADARs is a post-transcriptional mechanism for expanding the proteomic repertoire. Genetic recoding by editing was so far observed for only a few mammalian RNAs that are predominantly expressed in nervous tissues. However, as these editing targets fail to explain the broad and severe phenotypes of ADAR1 knockout mice, additional targets for editing by ADARs were always expected. Using comparative genomics and expressed sequence analysis, we identified and experimentally verified four additional candidate human substrates for ADAR-mediated editing: FLNA, BLCAP, CYFIP2 and IGFBP7. Additionally, editing of three of these substrates was verified in the mouse while two of them were validated in chicken. Interestingly, none of these substrates encodes a receptor protein but two of them are strongly expressed in the CNS and seem important for proper nervous system function. The editing pattern observed suggests that some of the affected proteins might have altered physiological properties leaving the possibility that they can be related to the phenotypes of ADAR1 knockout mice.

  • Brix from Xenopus laevis and Brx1p from yeast define a new family of proteins involved in the biogenesis of large ribosomal subunits.

    12 December 2017

    A clone was isolated from a cDNA library from early embryos of Xenopus laevis that codes for a highly charged protein containing 339 amino acids. Two putative nuclear localization signals could be identified in its sequence, but no other known motifs or domains. Closely related ORFs are present in the genomes of man, C. elegans, yeast and Arabidopsis. A fusion protein with GFP expressed in HeLa cells or Xenopus oocytes was found to be localized in the nucleolus and coiled (Cajal) bodies. Moreover, immunoprecipitation experiments demonstrated that the new Xenopus protein interacts with 5S, 5.8S and 28S RNAs of large ribosomal subunits. The name Brix (biogenesis of ribosomes in Xenopus) is proposed for this protein and the corresponding gene. In Saccharomyces cerevisiae, the essential gene YOL077c, now named BRX1, codes for the Brix homolog, which is also localized in the nucleolus. Depletion of Brx1p in a conditional yeast mutant leads to defects in rRNA processing, and a block in the assembly of large ribosomal subunits.