Bioinformatics

Erratum to: GADGETS: a genetic algorithm for detecting epistasis using nuclear families

Michael Nodzenski — Fri, 24 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 23:btab833. doi: 10.1093/bioinformatics/btab833. Online ahead of print.

NO ABSTRACT

PMID:34950948 | DOI:10.1093/bioinformatics/btab833

BioProfiling.jl: Profiling biological perturbations with high-content imaging in single cells and heterogeneous populations

Loan Vulliard — Wed, 22 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 22:btab853. doi: 10.1093/bioinformatics/btab853. Online ahead of print.

ABSTRACT

MOTIVATION: High-content imaging screens provide a cost-effective and scalable way to assess cell states across diverse experimental conditions. The analysis of the acquired microscopy images involves assembling and curating raw cellular measurements into morphological profiles suitable for testing biological hypotheses. Despite being a critical step, general-purpose and adaptable tools for morphological profiling are lacking and no solution is available for the high-performance Julia programming language.

RESULTS: Here, we introduce BioProfiling.jl, an efficient end-to-end solution for compiling and filtering informative morphological profiles in Julia. The package contains all the necessary data structures to curate morphological measurements and helper functions to transform, normalize and visualize profiles. Robust statistical distances and permutation tests enable quantification of the significance of the observed changes despite the high fraction of outliers inherent to high-content screens. This package also simplifies visual artifact diagnostics, thus streamlining a bottleneck of morphological analyses. We showcase the features of the package by analyzing a chemical imaging screen, in which the morphological profiles prove to be informative about the compounds' mechanisms of action and can be conveniently integrated with the network localization of molecular targets.

AVAILABILITY: The Julia package is available on GitHub: https://github.com/menchelab/BioProfiling.jl. We also provide Jupyter notebooks reproducing our analyses: https://github.com/menchelab/BioProfilingNotebooks.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34935929 | DOI:10.1093/bioinformatics/btab853

Gene set analysis with graph embedded kernel association test

Jialin Qu — Wed, 22 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 22:btab851. doi: 10.1093/bioinformatics/btab851. Online ahead of print.

ABSTRACT

MOTIVATION: The kernel-based association test (KAT) is a popular approach for evaluating the association between the expression of a gene set (e.g., a pathway) and a phenotypic trait. KATs rely on kernel functions, which measure sample similarity across multiple features and can capture linear or nonlinear relationships among features in a gene set. When these kernel functions are calculated, however, no network (graph) information about the features is considered. Because genes in a functional group (e.g., a pathway) are generally not independent owing to regulatory interactions, incorporating regulatory network (or graph) information can potentially increase the power of KAT. In this work, we propose a graph-embedded kernel association test, termed gKAT, which incorporates prior pathway knowledge into hypothesis testing when constructing the kernel function.

RESULTS: We apply a diffusion kernel to capture the graph structure in a gene set, then incorporate this information into a kernel function for the subsequent association test. We illustrate the geometric meaning of the approach. Through extensive simulation studies, we show that the proposed gKAT algorithm can improve testing power compared with a test that does not consider graph structure. Application to a real data set further demonstrates the utility of the method.
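As a rough illustration of the diffusion kernel mentioned above, the sketch below computes K = exp(-beta * L) for a small gene network, where L = D - A is the graph Laplacian of an adjacency matrix A. This is a generic textbook construction, not the gKAT code: the function name `diffusion_kernel`, the parameter `beta` and the toy three-gene network are assumptions made for illustration, and how gKAT folds such a kernel into its test statistic is not shown here.

```python
import numpy as np


def diffusion_kernel(adjacency, beta=1.0):
    """Return exp(-beta * L) for the graph Laplacian L = D - A.

    `adjacency` is a symmetric gene-by-gene matrix; `beta` (hypothetical
    name) controls how far similarity diffuses along the graph.
    """
    A = np.asarray(adjacency, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    # L is symmetric, so the matrix exponential can be computed
    # from its eigendecomposition: V diag(exp(-beta * lambda)) V^T.
    eigvals, eigvecs = np.linalg.eigh(L)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


# Toy pathway: gene0 - gene1 - gene2 (a path graph).
toy_network = [[0, 1, 0],
               [1, 0, 1],
               [0, 1, 0]]
K = diffusion_kernel(toy_network, beta=0.5)
```

In the resulting matrix, genes that are directly connected receive a larger similarity than genes two steps apart, which is the kind of graph information a plain feature kernel ignores.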

AVAILABILITY: The R code used for the analysis can be accessed at https://github.com/JialinQu/gKAT.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34935928 | DOI:10.1093/bioinformatics/btab851

MACA: Marker-based automatic cell-type annotation for single cell expression data

Yang Xu — Wed, 22 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 22:btab840. doi: 10.1093/bioinformatics/btab840. Online ahead of print.

ABSTRACT

SUMMARY: Accurately identifying cell-types is a critical step in single-cell sequencing analyses. Here, we present marker-based automatic cell-type annotation (MACA), a new tool for annotating single-cell transcriptomics datasets. We developed MACA by testing 4 cell-type scoring methods with 2 public cell-marker databases as reference in 6 single-cell studies. MACA compares favorably to 4 existing marker-based cell-type annotation methods in terms of accuracy and speed. We show that MACA can annotate a large single-nuclei RNA-seq study of human hearts with ∼290k cells in minutes. MACA scales easily to large datasets and can broadly help experts to annotate cell types in single-cell transcriptomics datasets. We envision that MACA provides a new opportunity for integration and standardization of cell-type annotation across multiple datasets.

AVAILABILITY AND IMPLEMENTATION: MACA is written in Python and released under the GNU General Public License v3.0. The source code is available at https://github.com/ImXman/MACA.

PMID:34935911 | DOI:10.1093/bioinformatics/btab840

geoCancerPrognosticDatasetsRetriever, a bioinformatics tool to easily identify cancer prognostic datasets on Gene Expression Omnibus (GEO)

Abbas Alameer — Wed, 22 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 22:btab852. doi: 10.1093/bioinformatics/btab852. Online ahead of print.

ABSTRACT

SUMMARY: Having multiple datasets is a key aspect of robust bioinformatics analyses, because it allows researchers to seek confirmation of their discoveries across multiple cohorts. For this purpose, Gene Expression Omnibus (GEO) can be a useful database, since it provides hundreds of thousands of microarray gene expression datasets freely available for download and use. Despite this large availability, collecting prognostic datasets of a specific cancer type from GEO can be a long and laborious activity for any bioinformatician, who needs to carry it out manually by first performing a search on the GEO website and then checking the datasets found one by one. To solve this problem, we present geoCancerPrognosticDatasetsRetriever, a Perl 5 application which reads a cancer type and a list of microarray platforms, searches GEO for prognostic gene expression datasets of that cancer type based on those platforms, and returns the GEO accession codes of those datasets, if found. Our bioinformatics tool can generate in a few minutes a list of cancer prognostic datasets that would otherwise require numerous hours of manual work from any bioinformatician. geoCancerPrognosticDatasetsRetriever can readily retrieve multiple prognostic gene expression datasets of any cancer type, laying the foundations for numerous bioinformatics studies and meta-analyses that can have a strong impact on oncology research.

AVAILABILITY AND IMPLEMENTATION: geoCancerPrognosticDatasetsRetriever is freely available under the GPLv2 license on the Comprehensive Perl Archive Network (CPAN) at https://metacpan.org/pod/App::geoCancerPrognosticDatasetsRetriever and on GitHub at https://github.com/AbbasAlameer/geoCancerPrognosticDatasetsRetriever.

PMID:34935889 | DOI:10.1093/bioinformatics/btab852

Benchmarking Table Recognition Performance on Biomedical Literature on Neurological Disorders

Tim Adams — Wed, 22 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 22:btab843. doi: 10.1093/bioinformatics/btab843. Online ahead of print.

ABSTRACT

MOTIVATION: Table recognition systems are widely used to extract and structure quantitative information from the vast amount of documents that are increasingly available from different open sources. While many systems already perform well on tables with a simple layout, tables in the biomedical domain are often much more complex. Benchmark and training data for such tables are, however, very limited.

RESULTS: To address this issue, we present a novel, highly curated benchmark data set based on a hand-curated literature corpus on neurological disorders, which can be used to tune and evaluate table extraction applications for this challenging domain. We evaluate several state-of-the-art table extraction systems based on our proposed benchmark and discuss challenges that emerged during the benchmark creation as well as factors that can impact the performance of recognition methods. For the evaluation procedure, we propose a new metric as well as several improvements that result in a better performance evaluation.

AVAILABILITY: The resulting benchmark data set as well as the source code to our novel evaluation approach can be openly accessed.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34935870 | DOI:10.1093/bioinformatics/btab843

EcoPLOT: Dynamic Analysis of Biogeochemical Data

Christopher D Sanchez — Mon, 20 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 20:btab842. doi: 10.1093/bioinformatics/btab842. Online ahead of print.

ABSTRACT

MOTIVATION: We have created EcoPLOT (Parameterized Linkage of Omics-driven Technologies), a web-app for the dynamic, interactive analysis of biogeochemical datasets that combines state-of-the-art analysis tools to statistically and graphically explore environmental, geochemical, and microbiome datasets. Using the iterative Random Forest (iRF), a machine learning algorithm, EcoPLOT allows for the de novo discovery of drivers which exhibit significant impact on plant, microbial, or soil dynamics.

AVAILABILITY AND IMPLEMENTATION: EcoPLOT is built entirely within the R language. It can be accessed through any system where R is installed, including Windows, Mac, and most Linux systems. EcoPLOT is free to use and can be accessed at https://github.com/cdsanchez18/EcoPLOT.

PMID:34927685 | DOI:10.1093/bioinformatics/btab842

CRPMKB: a knowledge base of cancer risk prediction models for systematic comparison and personalized applications

Shumin Ren — Mon, 20 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 20:btab850. doi: 10.1093/bioinformatics/btab850. Online ahead of print.

ABSTRACT

MOTIVATION: In the era of big data and precision medicine, accurate risk assessment is a prerequisite for the implementation of risk screening and preventive treatment. A large number of studies have focused on the risk of cancer, and related risk prediction models have been constructed, but there is a lack of effective resource integration for systematic comparison and personalized applications. Therefore, the establishment and analysis of the cancer risk prediction model knowledge base (CRPMKB) is of great significance.

RESULTS: The current knowledge base contains data on 802 models. The model comparison indicates that the accuracy of cancer risk prediction is greatly affected by regional differences, cancer types and model types. We divided the model variables into four categories: environment, behavioral lifestyle, biological genetics and clinical examination, and found that the distribution of these variables differs among cancer types. As an example, we performed pathway enrichment analyses on 50 genes involved in the lung cancer risk prediction models; the results showed that these genes were significantly enriched in the p53 Signaling and Aryl Hydrocarbon Receptor Signaling pathways, which are associated with cancer and specific diseases. In addition, we verified the biological significance of overlapping lung cancer genes via the STRING database. CRPMKB was established to provide researchers with an online tool for future personalized model application and development. This study of CRPMKB suggests that developing more targeted models based on specific demographic characteristics and cancer types will further improve the accuracy of cancer risk model predictions.

AVAILABILITY: http://www.sysbio.org.cn/CRPMKB/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34927675 | DOI:10.1093/bioinformatics/btab850

TMBleR, a bioinformatic tool to optimize TMB estimation and predictive power

Laura Fancello — Mon, 20 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 20:btab836. doi: 10.1093/bioinformatics/btab836. Online ahead of print.

ABSTRACT

MOTIVATION: Tumor mutational burden (TMB) has been proposed as a predictive biomarker for immunotherapy response in cancer patients, as it is thought to enrich for tumors with high neoantigen load. TMB assessed by Whole Exome Sequencing (WES) is considered the gold standard but remains confined to research settings. In the clinical setting, targeted gene panels sampling various genomic sizes, along with diverse strategies to estimate TMB, have been proposed, and no real standard has emerged yet.

RESULTS: We provide the community with TMBleR, a tool to measure the clinical impact of various strategies of panel-based TMB measurement.

AVAILABILITY: R package and docker container (GPL-3 Open Source license): https://acc-bioinfo.github.io/TMBleR/. Graphical-user interface website: https://bioserver.ieo.it/shiny/app/tmbler.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34927668 | DOI:10.1093/bioinformatics/btab836

Identifying cancer pathway dysregulations using differential causal effects

Kim Philipp Jablonski — Mon, 20 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 20:btab847. doi: 10.1093/bioinformatics/btab847. Online ahead of print.

ABSTRACT

MOTIVATION: Signaling pathways control cellular behavior. Dysregulated pathways, for example, due to mutations that cause genes and proteins to be expressed abnormally, can lead to diseases, such as cancer.

RESULTS: We introduce a novel computational approach, called Differential Causal Effects (dce), which compares normal to cancerous cells using the statistical framework of causality. The method detects individual edges in a signaling pathway that are dysregulated in cancer cells, while accounting for confounding. Hence, technical artifacts have less influence on the results and dce is more likely to detect the true biological signals. We extend the approach to handle unobserved dense confounding, where each latent variable, such as batch effects or cell cycle states, affects many covariates. We show that dce outperforms competing methods on synthetic data sets and on CRISPR knockout screens. We validate its latent confounding adjustment properties on a GTEx dataset. Finally, in an exploratory analysis on breast cancer data from TCGA, we recover known and discover new genes involved in breast cancer progression.

AVAILABILITY: The method dce is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/dce.html) as well as on https://github.com/cbg-ethz/dce. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34927666 | DOI:10.1093/bioinformatics/btab847

InDeep: 3D fully convolutional neural networks to assist in silico drug design on protein-protein interactions

Vincent Mallet — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab849. doi: 10.1093/bioinformatics/btab849. Online ahead of print.

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) are key elements in numerous biological pathways and the subject of a growing number of drug discovery projects, including projects against infectious diseases. Designing drugs on PPI targets remains a difficult task and requires extensive efforts to qualify a given interaction as an eligible target. To this end, besides the evident need to determine the role of PPIs in disease-associated pathways and their experimental characterization as therapeutic targets, prediction of their capacity to be bound by other protein partners or modulated by future drugs is of primary importance.

RESULTS: We present InDeep, a tool for predicting functional binding sites within proteins that could host either protein epitopes or future drugs. Leveraging deep learning on a curated data set of PPIs, this tool can produce enhanced functional binding site predictions either on experimental structures or along molecular dynamics trajectories. The benchmark of InDeep demonstrates that our tool outperforms state-of-the-art ligandable binding site predictors when assessing PPI targets as well as conventional targets. This offers new opportunities to assist drug design projects on PPIs by identifying pertinent binding pockets at or in the vicinity of PPI interfaces.

AVAILABILITY: The tool is available on GitLab at https://gitlab.pasteur.fr/InDeep/InDeep.

PMID:34908131 | DOI:10.1093/bioinformatics/btab849

On the relation between input and output distributions of scRNA-seq experiments

Daniel Schwabe — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab841. doi: 10.1093/bioinformatics/btab841. Online ahead of print.

ABSTRACT

MOTIVATION: Single-cell RNA sequencing determines RNA copy numbers per cell for a given gene. However, technical noise raises the question of how observed distributions (output) are connected to their cellular distributions (input).

RESULTS: We model a single-cell RNA sequencing setup consisting of PCR amplification and sequencing, and derive probability distribution functions for the output distribution given an input distribution. We provide copy number distributions arising from single transcripts during PCR amplification with exact expressions for mean and variance. We prove that the coefficient of variation of the output of sequencing is always larger than that of the input distribution. Experimental data reveal that the variance and mean of the input distribution obey characteristic relations, which we determine specifically for a HeLa data set. We can calculate as many moments of the input distribution as are known of the output distribution (up to all). This, in principle, completely determines the input from the output distribution.
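To give a feel for why sampling inflates the coefficient of variation (CV), the sketch below uses a deliberately simplified model that is an assumption of this note, not the model of the paper: each input molecule is captured independently with probability p (binomial thinning), and PCR amplification is ignored. By the law of total variance, the output then has mean p*E[N] and variance p^2*Var[N] + p*(1-p)*E[N], so the squared CV grows by (1-p)/(p*E[N]). The function name is hypothetical.

```python
def cv2_after_binomial_sampling(mean_in, var_in, p):
    """Squared CV of the output under binomial thinning.

    Assumes each of the N input molecules is captured independently with
    probability p; mean_in and var_in are the mean and variance of N.
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    if mean_in <= 0:
        raise ValueError("mean_in must be positive")
    mean_out = p * mean_in
    # Law of total variance: Var(E[M|N]) + E[Var(M|N)].
    var_out = p ** 2 * var_in + p * (1 - p) * mean_in
    return var_out / mean_out ** 2


cv2_in = 20 / 10 ** 2                                   # input: mean 10, variance 20
cv2_out = cv2_after_binomial_sampling(10, 20, p=0.1)    # 10% capture efficiency
```

Under this toy model the output CV equals the input CV only when p = 1 and is strictly larger otherwise, which is consistent in spirit with the inequality stated in the abstract.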

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34908126 | DOI:10.1093/bioinformatics/btab841

Virtifier: A deep learning-based identifier for viral sequences from metagenomes

Yan Miao — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab845. doi: 10.1093/bioinformatics/btab845. Online ahead of print.

ABSTRACT

MOTIVATION: Viruses, the most abundant biological entities on earth, are important components of microbial communities, and as major human pathogens, they are responsible for human mortality and morbidity. The identification of viral sequences from metagenomes is critical for viral analysis. As massive quantities of short sequences are generated by next-generation sequencing (NGS), most methods utilize discrete and sparse one-hot vectors to encode nucleotide sequences, which are usually ineffective in viral identification.

RESULTS: In this paper, Virtifier, a deep learning-based viral identifier for sequences from metagenomic data, is proposed. It includes a meaningful nucleotide sequence encoding method named Seq2Vec and a variant viral sequence predictor with an attention-based Long Short-Term Memory (LSTM) network. By utilizing a fully trained embedding matrix to encode codons, Seq2Vec can efficiently extract the relationships among those codons in a nucleotide sequence. Combined with an attention layer, the LSTM neural network can further analyze the codon relationships and sift the parts that contribute to the final features. Experimental results on three datasets show that Virtifier can accurately identify short viral sequences (< 500 bp) from metagenomes, surpassing three widely used methods, VirFinder, DeepVirFinder and PPR-Meta. Meanwhile, Virtifier achieved comparable performance at longer lengths (> 5,000 bp).

AVAILABILITY: A Python implementation of Virtifier and the Python code developed for this study have been provided on GitHub at https://github.com/crazyinter/Seq2Vec.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34908121 | DOI:10.1093/bioinformatics/btab845

RNAglib: A python package for RNA 2.5D graphs

Vincent Mallet — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab844. doi: 10.1093/bioinformatics/btab844. Online ahead of print.

ABSTRACT

SUMMARY: RNA 3D architectures are stabilized by sophisticated networks of (non-canonical) base pair interactions, which can be conveniently encoded as multi-relational graphs and efficiently exploited by graph theoretical approaches and recent progress in machine learning techniques. RNAglib is a library that eases the use of this representation, by providing clean data, methods to load it in machine learning pipelines and graph-based deep learning models suited for this representation. RNAglib also offers other utilities to model RNA with 2.5D graphs, such as drawing tools, comparison functions or baseline performances on RNA applications.

AVAILABILITY AND IMPLEMENTATION: The method is distributed as a pip package, RNAglib. The source code, data, and documentation are available at https://rnaglib.cs.mcgill.ca.

PMID:34908108 | DOI:10.1093/bioinformatics/btab844

InterARTIC: an interactive web application for whole-genome nanopore sequencing analysis of SARS-CoV-2 and other viruses

James M Ferguson — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab846. doi: 10.1093/bioinformatics/btab846. Online ahead of print.

ABSTRACT

MOTIVATION: InterARTIC is an interactive web application for the analysis of viral whole-genome sequencing (WGS) data generated on Oxford Nanopore Technologies (ONT) devices. A graphical interface enables users with no bioinformatics expertise to analyse WGS experiments and reconstruct consensus genome sequences from individual isolates of viruses, such as SARS-CoV-2. InterARTIC is intended to facilitate widespread adoption and standardisation of ONT sequencing for viral surveillance and molecular epidemiology.

WORKED EXAMPLE: We demonstrate the use of InterARTIC for the analysis of ONT viral WGS data from SARS-CoV-2 and Ebola virus, using a laptop computer or the internal computer on an ONT GridION sequencing device. We showcase the intuitive graphical interface, workflow customisation capabilities and job-scheduling system that facilitate execution of small- and large-scale WGS projects on any common virus.

IMPLEMENTATION: InterARTIC is a free, open-source web application implemented in Python that executes best-practice command line workflows from the ARTIC network. The application can be downloaded as a set of pre-compiled binaries that are compatible with all common Linux distributions, Windows with Linux subsystems, MacOSX and ARM systems. For further details please visit: https://github.com/Psy-Fer/interARTIC/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34908106 | DOI:10.1093/bioinformatics/btab846

Overcoming the inadaptability of sparse group lasso for data with various group structures by stacking

Huan He — Wed, 15 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 15:btab848. doi: 10.1093/bioinformatics/btab848. Online ahead of print.

ABSTRACT

MOTIVATION: Efficient identification of genes based on gene expression levels has been studied to help classify different cancer types and improve prediction performance. Logistic regression with regularization is often an effective approach for simultaneously performing prediction and feature (gene) selection in high-dimensional genomic data. However, standard methods ignore biological group structure and generally result in poorer predictive models.

RESULTS: In this paper, we develop a classifier named Stacked SGL that satisfies the criteria of prediction, stability and selection, based on the sparse group lasso penalty combined with stacking. Sparse group lasso has a mixing parameter representing the ratio of lasso to group lasso, thus providing a compromise between selecting a sparse subset of feature groups and introducing sparsity within each group. We propose to use stacked generalization to combine different ratios rather than choosing a single ratio, which could help to overcome the inadaptability of sparse group lasso for some data. Because stacking weakens feature selection, we perform a post-hoc feature selection, which might slightly reduce predictive performance but yields superior feature selection. Experimental results on simulated data demonstrate that, compared with other regularization methods, our approach achieves competitive and stable classification performance and a lower false discovery rate in feature selection across varying data sets. In addition, our method achieves better accuracy on three public cancer data sets and identifies more strongly discriminatory genes and potential mutation genes for thyroid carcinoma.
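The role of the mixing parameter can be made concrete with the commonly used form of the sparse group lasso penalty, lam * (alpha * ||beta||_1 + (1 - alpha) * sum_g sqrt(p_g) * ||beta_g||_2), where p_g is the size of group g. The sketch below evaluates this penalty for a given coefficient vector. It is a minimal illustration under that standard formulation; the exact weighting used by Stacked SGL is an assumption here, and the function and argument names are hypothetical.

```python
import math


def sparse_group_lasso_penalty(beta, groups, alpha, lam=1.0):
    """Evaluate the sparse group lasso penalty for coefficients `beta`.

    `groups` is a list of index lists (one per gene group); `alpha` in
    [0, 1] mixes the lasso term (alpha=1) and the group lasso term (alpha=0).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    lasso_term = sum(abs(b) for b in beta)
    group_term = 0.0
    for idx in groups:
        group_norm = math.sqrt(sum(beta[i] ** 2 for i in idx))
        # Weight each group by the square root of its size.
        group_term += math.sqrt(len(idx)) * group_norm
    return lam * (alpha * lasso_term + (1.0 - alpha) * group_term)


coefficients = [3.0, 4.0, 0.0, 0.0]
gene_groups = [[0, 1], [2, 3]]
penalties = {a: sparse_group_lasso_penalty(coefficients, gene_groups, a)
             for a in (0.0, 0.5, 1.0)}
```

Evaluating the penalty over a grid of `alpha` values, as in the last lines, mirrors the idea of fitting one model per ratio; the stacking step that combines those models is not shown.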

AVAILABILITY: https://github.com/huanheaha/Stacked_SGL; https://zenodo.org/record/5761577#.YbAUyciEwk2.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34908103 | DOI:10.1093/bioinformatics/btab848

Generation of ENSEMBL-based proteogenomics databases boosts the identification of non-canonical peptides

Husen M Umer — Tue, 14 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 14:btab838. doi: 10.1093/bioinformatics/btab838. Online ahead of print.

ABSTRACT

SUMMARY: We have implemented the pypgatk package and the pgdb workflow to create proteogenomics databases based on ENSEMBL resources. The tools allow the generation of protein sequences from novel protein-coding transcripts by performing a three-frame translation of pseudogenes, lncRNAs, and other non-canonical transcripts, such as those produced by alternative splicing events. It also includes exonic out-of-frame translation from otherwise canonical protein-coding mRNAs. Moreover, the tool enables the generation of variant protein sequences from multiple sources of genomic variants including COSMIC, cBioportal, gnomAD, and mutations detected from sequencing of patient samples. pypgatk and pgdb provide multiple functionalities for database handling including optimized target/decoy generation by the algorithm DecoyPyrat. Finally, we have reanalyzed six public datasets in PRIDE by generating cell-type specific databases for 65 cell lines using the pypgatk and pgdb workflow, revealing a wealth of non-canonical or cryptic peptides amounting to more than 5% of the total number of peptides identified.
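The three-frame translation mentioned above can be sketched in a few lines of plain Python. This is an illustrative stand-alone example using the standard genetic code, not code from pypgatk or pgdb; the function names are hypothetical, and unknown codons are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, frame=0):
    """Translate a nucleotide sequence starting at offset `frame` (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq):
    """Return the forward-strand translations in frames 0, 1 and 2."""
    return [translate(seq, frame) for frame in range(3)]


frames = three_frame_translation("ATGGCCTAA")  # ['MA*', 'WP', 'GL']
```

Stop codons appear as "*"; a real database builder would typically split the translations at stops and filter short fragments, which this sketch leaves out.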

AVAILABILITY: The software is freely available. pypgatk: (https://github.com/bigbio/py-pgatk/), and pgdb: (https://nf-co.re/pgdb).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34904638 | DOI:10.1093/bioinformatics/btab838

swCAM: estimation of subtype-specific expressions in individual samples with unsupervised sample-wise deconvolution

Lulu Chen — Tue, 14 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 14:btab839. doi: 10.1093/bioinformatics/btab839. Online ahead of print.

ABSTRACT

MOTIVATION: Complex biological tissues are often a heterogeneous mixture of several molecularly distinct cell subtypes. Both subtype compositions and subtype-specific expressions can vary across biological conditions. Computational deconvolution aims to dissect patterns of bulk tissue data into subtype compositions and subtype-specific expressions. Existing deconvolution methods can only estimate averaged subtype-specific expressions in a population, while many downstream analyses such as inferring co-expression networks in particular subtypes require subtype expression estimates in individual samples. However, individual-level deconvolution is a mathematically underdetermined problem because there are more variables than observations.

RESULTS: We report a sample-wise Convex Analysis of Mixtures (swCAM) method that can estimate subtype proportions and subtype-specific expressions in individual samples from bulk tissue transcriptomes. We extend our previous CAM framework to include a new term accounting for between-sample variations and formulate swCAM as a nuclear-norm and ℓ2,1-norm regularized matrix factorization problem. We determine hyperparameter values using cross-validation with random entry exclusion and obtain a swCAM solution using an efficient alternating direction method of multipliers. Experimental results on realistic simulation data show that swCAM can accurately estimate subtype-specific expressions in individual samples and successfully extract co-expression networks in particular subtypes that are otherwise unobtainable using bulk data. In two real-world applications, swCAM analysis of bulk RNASeq data from brain tissue of cases and controls with bipolar disorder or Alzheimer's disease identified significant changes in cell proportion, expression pattern and co-expression module in patient neurons. Comparative evaluation of swCAM versus peer methods is also provided.

AVAILABILITY: The R Scripts of swCAM are freely available at https://github.com/Lululuella/swCAM. A user's guide and a vignette are provided.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34904628 | DOI:10.1093/bioinformatics/btab839

PHIST: fast and accurate prediction of prokaryotic hosts from metagenomic viral sequences

Andrzej Zielezinski — Tue, 14 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 14:btab837. doi: 10.1093/bioinformatics/btab837. Online ahead of print.

ABSTRACT

SUMMARY: PHIST (Phage-Host Interaction Search Tool) predicts prokaryotic hosts of viruses based on exact matches between viral and host genomes. It improves host prediction accuracy at species level over current alignment-based tools (on average by 3 percentage points) as well as alignment-free and CRISPR-based tools (by 14-20 percentage points). PHIST is also two orders of magnitude faster than alignment-based tools making it suitable for metagenomics studies.
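The idea of scoring hosts by exact sequence matches can be illustrated with a toy k-mer comparison. The sketch below counts distinct k-mers shared between a viral sequence and each candidate host and picks the host with the most matches. It is a simplified stand-in, not PHIST's implementation: the function names, the toy sequences and the choice of k are assumptions, and reverse complements and any statistical scoring are ignored.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_count(virus, host, k):
    """Number of distinct k-mers occurring exactly in both sequences."""
    return len(kmers(virus, k) & kmers(host, k))


def predict_host(virus, hosts, k):
    """Rank candidate hosts (name -> genome) by shared exact k-mers."""
    scores = {name: shared_kmer_count(virus, genome, k)
              for name, genome in hosts.items()}
    best = max(scores, key=scores.get)
    return best, scores


toy_hosts = {
    "hostA": "TTTACGTACGTTT",
    "hostB": "GGGGGGCCCC",
}
best_host, host_scores = predict_host("ACGTACGTGG", toy_hosts, k=5)
```

Because only set intersections of exact substrings are involved, no alignment is needed, which hints at why exact-match approaches can be much faster than alignment-based ones.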

AVAILABILITY AND IMPLEMENTATION: GNU-licensed C++ code wrapped in a Python API is available at: https://github.com/refresh-bio/phist.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34904625 | DOI:10.1093/bioinformatics/btab837

Cryo-shift: reducing domain shift in cryo-electron subtomograms with unsupervised domain adaptation and randomization

Hmrishav Bandyopadhyay — Mon, 13 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Nov 23:btab794. doi: 10.1093/bioinformatics/btab794. Online ahead of print.

ABSTRACT

MOTIVATION: Cryo-Electron Tomography (cryo-ET) is a 3D imaging technology that enables the visualization of subcellular structures in situ at near-atomic resolution. Cellular cryo-ET images help in resolving the structures of macromolecules and determining their spatial relationship in a single cell, which has broad significance in cell and structural biology. Subtomogram classification and recognition constitute a primary step in the systematic recovery of these macromolecular structures. Supervised deep learning methods have been proven to be highly accurate and efficient for subtomogram classification, but suffer from limited applicability due to scarcity of annotated data. While generating simulated data for training supervised models is a potential solution, a sizeable difference in the image intensity distribution in generated data as compared with real experimental data will cause the trained models to perform poorly in predicting classes on real subtomograms.

RESULTS: In this work, we present Cryo-Shift, a fully unsupervised domain adaptation and randomization framework for deep learning-based cross-domain subtomogram classification. We use unsupervised multi-adversarial domain adaptation to reduce the domain shift between features of simulated and experimental data. We develop a network-driven domain randomization procedure with 'warp' modules to alter the simulated data and help the classifier generalize better on experimental data. We do not use any labeled experimental data to train our model, whereas some of the existing alternative approaches require labeled experimental samples for cross-domain classification. Nevertheless, Cryo-Shift outperforms the existing alternative approaches in cross-domain subtomogram classification in extensive evaluation studies using both simulated and experimental data.

AVAILABILITYAND IMPLEMENTATION: https://github.com/xulabs/aitom.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34897387 | DOI:10.1093/bioinformatics/btab794

Corrigendum to: Decombinator V4: an improved AIRR-C compliant software package for T-cell receptor sequence annotation

Thomas Peacock — Mon, 13 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 13:btab550. doi: 10.1093/bioinformatics/btab550. Online ahead of print.

NO ABSTRACT

PMID:34897370 | DOI:10.1093/bioinformatics/btab550

CellMeSH: Probabilistic Cell-Type Identification Using Indexed Literature

Shunfu Mao — Sat, 11 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 10:btab834. doi: 10.1093/bioinformatics/btab834. Online ahead of print.

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad-hoc effort that requires expert biological knowledge. Here, we introduce CellMeSH - a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene-cell type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene-cell type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches.

AVAILABILITY: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh.

PMID:34893819 | DOI:10.1093/bioinformatics/btab834

A convolutional neural network for segmentation of yeast cells without manual training annotations

Herbert Kruitbosch — Sat, 11 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 10:btab835. doi: 10.1093/bioinformatics/btab835. Online ahead of print.

ABSTRACT

MOTIVATION: Single-cell time-lapse microscopy is a ubiquitous tool for studying the dynamics of complex cellular processes. While imaging can be automated to generate very large volumes of data, the processing of the resulting movies to extract high-quality single-cell information remains a challenging task. The development of software tools that automatically identify and track cells is essential for realizing the full potential of time-lapse microscopy data. Convolutional neural networks (CNNs) are ideally suited for such applications, but require great amounts of manually annotated data for training, a time-consuming and tedious process.

RESULTS: We developed a new approach to CNN training for yeast cell segmentation based on synthetic data, and present i) a software tool for the generation of synthetic images mimicking brightfield images of budding yeast cells and ii) a convolutional neural network (Mask-RCNN) for yeast segmentation that was trained on a fully synthetic dataset. The Mask-RCNN performed excellently on segmenting actual microscopy images of budding yeast cells, and a density-based clustering algorithm (DBSCAN) was able to track the detected cells across the frames of microscopy movies. Our synthetic data creation tool completely bypassed the laborious generation of manually annotated training datasets, and can be easily adjusted to produce images with many different features. The incorporation of synthetic data creation into the development pipeline of CNN-based tools for budding yeast microscopy is a critical step towards the generation of more powerful, widely applicable and user-friendly image processing tools for this microorganism.
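
As a rough illustration of the tracking step mentioned above, the sketch below clusters per-frame cell detections with a minimal DBSCAN so that detections of the same cell in consecutive frames share a label. The feature space (centroid x, y plus a scaled frame index), the toy detections and the parameter values are assumptions made for this example; they are not taken from the authors' repositories.

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

# Hypothetical detections: (frame, centroid_x, centroid_y) for two drifting cells.
detections = [
    (0, 10.0, 10.0), (0, 50.0, 50.0),
    (1, 11.0, 10.5), (1, 49.0, 51.0),
    (2, 12.0, 11.0), (2, 48.5, 52.0),
]
time_scale = 1.0  # assumed weight of the frame axis relative to pixels
points = [(x, y, frame * time_scale) for frame, x, y in detections]
labels = dbscan(points, eps=3.5, min_samples=2)
tracks = {}
for (frame, _, _), label in zip(detections, labels):
    tracks.setdefault(label, []).append(frame)
```

Each resulting label plays the role of a track identifier. The `time_scale` value controls how far a cell may move between frames before it is split into a new track.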

AVAILABILITY: The synthetic data generation code can be found at https://github.com/prhbrt/synthetic-yeast-cells. The Mask R-CNN, as well as the tuning and benchmarking scripts, can be found at https://github.com/ymzayek/yeastcells-detection-maskrcnn. We also provide Google Colab scripts that reproduce all the results of this work.

SUPPLEMENTARY INFORMATION: Supplementary material is available at Bioinformatics online.

PMID:34893817 | DOI:10.1093/bioinformatics/btab835

HolistIC: leveraging Hi-C and whole genome shotgun sequencing for double minute chromosome discovery

Matthew Hayes — Fri, 10 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 9:btab816. doi: 10.1093/bioinformatics/btab816. Online ahead of print.

ABSTRACT

MOTIVATION: Double minute chromosomes are acentric extrachromosomal DNA artifacts that are frequently observed in the cells of numerous cancers. They are highly amplified and contain oncogenes and drug resistance genes, making their presence a challenge for effective cancer treatment. Algorithmic discovery of double minutes (DM) can potentially improve bench-derived therapies for cancer treatment. A hindrance to this task is that DMs evolve, yielding circular chromatin that shares segments from progenitor double minutes. This creates double minutes with overlapping amplicon coordinates. Existing DM discovery algorithms use whole genome shotgun sequencing in isolation, which can potentially incorrectly classify DMs that share overlapping coordinates.

RESULTS: In this study, we describe an algorithm called "HolistIC" that can predict double minutes in tumor genomes by integrating whole genome shotgun sequencing (WGS) and Hi-C sequencing data. The consolidation of these sources of information resolves ambiguity in double minute amplicon prediction that exists in DM prediction with WGS data used in isolation. We implemented and tested our algorithm on the tandem Hi-C and WGS datasets of three cancer datasets and a simulated dataset. Results on the cancer datasets demonstrated HolistIC's ability to predict DMs from Hi-C and WGS data in tandem. The results on the simulated data showed that HolistIC can accurately distinguish double minutes that have overlapping amplicon coordinates, an advance over methods that predict extrachromosomal amplification using WGS data in isolation.

AVAILABILITY: Our software, named "HolistIC", is available at http://www.github.com/mhayes20/HolistIC.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34888626 | DOI:10.1093/bioinformatics/btab816

EnGRaiN: A Supervised Ensemble Learning Method for Recovery of Large-scale Gene Regulatory Networks

Maneesha Aluru — Fri, 10 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 9:btab829. doi: 10.1093/bioinformatics/btab829. Online ahead of print.

ABSTRACT

MOTIVATION: Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ between the types of interactions they uncover with varying trade-offs between sensitivity and specificity have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised.

RESULTS: In this paper, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of A. thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN shows better results not only in terms of ROC and PR characteristics for both real and simulated datasets compared to unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions.

AVAILABILITY: EnGRaiN software and the datasets used in the study are publicly available at the github repository: https://github.com/AluruLab/EnGRaiN.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34888624 | DOI:10.1093/bioinformatics/btab829

scShaper: an ensemble method for fast and accurate linear trajectory inference from single-cell RNA-seq data

Johannes Smolander — Fri, 10 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 9:btab831. doi: 10.1093/bioinformatics/btab831. Online ahead of print.

ABSTRACT

MOTIVATION: Computational models are needed to infer a representation of the cells, i.e. a trajectory, from single-cell RNA-sequencing data that model cell differentiation during a dynamic process. Although many trajectory inference methods exist, their performance varies greatly depending on the dataset and hence there is a need to establish more accurate, better generalisable methods.

RESULTS: We introduce scShaper, a new trajectory inference method that enables accurate linear trajectory inference. The ensemble approach of scShaper generates a continuous smooth pseudotime based on a set of discrete pseudotimes. We demonstrate that scShaper is able to infer accurate trajectories for a variety of trigonometric trajectories, including many for which the commonly used principal curves method fails. A comprehensive benchmarking with state-of-the-art methods revealed that scShaper achieved superior accuracy of the cell ordering and, in particular, the differentially expressed genes. Moreover, scShaper is a fast method with few hyperparameters, making it a promising alternative to the principal curves method for linear pseudotemporal ordering.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

AVAILABILITY AND IMPLEMENTATION: scShaper is available as an R package at https://github.com/elolab/scshaper. The submitted software version of scShaper and test data are available at https://doi.org/10.5281/zenodo.5734488.

PMID:34888622 | DOI:10.1093/bioinformatics/btab831

Unsupervised construction of computational graphs for gene expression data with explicit structural inductive biases

Paul Scherer — Fri, 10 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 9:btab830. doi: 10.1093/bioinformatics/btab830. Online ahead of print.

ABSTRACT

MOTIVATION: Gene expression data is commonly used at the intersection of cancer research and machine learning for better understanding of the molecular status of tumour tissue. Deep learning predictive models have been employed for gene expression data due to their ability to scale and remove the need for manual feature engineering. However, gene expression data is often very high dimensional, noisy, and presented with a low number of samples. This poses significant problems for learning algorithms: models often overfit, learn noise, and struggle to capture biologically relevant information. In this article we utilise external biological knowledge embedded within structures of gene interaction graphs such as protein-protein interaction networks (PPI) to guide the construction of predictive models.

RESULTS: We present GINCCo (Gene Interaction Network Constrained Construction), an unsupervised method for automated construction of computational graph models for gene expression data that are structurally constrained by prior knowledge of gene interaction networks. We employ this methodology in a case study on incorporating a PPI network in cancer phenotype prediction tasks. Our computational graphs are structurally constructed using topological clustering algorithms on the PPI networks which incorporate inductive biases stemming from network biology research on protein complex discovery. Each of the entities in the GINCCo computational graph represents a biological entity such as a gene, candidate protein complex or phenotype instead of an arbitrary hidden node of a neural network. This provides a biologically relevant mechanism for model regularisation yielding strong predictive performance whilst drastically reducing the number of model parameters and enabling guided post-hoc enrichment analyses of influential gene sets with respect to target phenotypes. Our experiments analysing a variety of cancer phenotypes show that GINCCo often outperforms SVM, Fully-Connected MLP, and Randomly-Connected MLPs despite greatly reduced model complexity.

AVAILABILITY: https://github.com/paulmorio/gincco contains the source code for our approach. We also release a library with algorithms for protein complex discovery within protein-protein interaction networks at https://github.com/paulmorio/protclus. This repository contains implementations of the clustering algorithms used in this paper.

PMID:34888618 | DOI:10.1093/bioinformatics/btab830

Structural feature-driven pattern analysis for multitarget modulator landscapes

Vigneshwaran Namasivayam — Fri, 10 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 9:btab832. doi: 10.1093/bioinformatics/btab832. Online ahead of print.

ABSTRACT

MOTIVATION: Multitargeting features of small molecules have been of increasing interest in recent years. Polypharmacological drugs that address several therapeutic targets may provide greater therapeutic benefits for patients. Furthermore, multitarget compounds can be used to address proteins of the same (or similar) protein families for their exploration as potential pharmacological targets. In addition, the knowledge of multitargeting features is of major importance in the drug selection process, particularly in ultra-large virtual screening procedures to gain high-quality compound collections. However, large-scale multitarget modulator landscapes are almost non-existent.

RESULTS: We implemented a specific feature-driven computer-aided pattern analysis (C@PA) to extract molecular-structural features of inhibitors of the model protein family of ATP-binding cassette (ABC) transporters. New molecular-structural features have been identified that successfully expanded the known multitarget modulator landscape of pan-ABC transporter inhibitors. The prediction capability was biologically confirmed by the successful discovery of pan-ABC transporter inhibitors with a distinct inhibitory activity profile.

AVAILABILITY AND IMPLEMENTATION: The multitarget dataset is available under the http://www.panabc.info URL and its use is free of charge.

SUPPLEMENTARY INFORMATION: Supplementary data is available at Bioinformatics online.

PMID:34888617 | DOI:10.1093/bioinformatics/btab832

CNApy: a CellNetAnalyzer GUI in Python for Analyzing and Designing Metabolic Networks

Sven Thiele — Wed, 08 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 8:btab828. doi: 10.1093/bioinformatics/btab828. Online ahead of print.

ABSTRACT

SUMMARY: Constraint-based reconstruction and analysis (COBRA) is a widely used modeling framework for analyzing and designing metabolic networks. Here we present CNApy, an open source cross-platform desktop application written in Python, which offers a state-of-the-art graphical front-end for the intuitive analysis of metabolic networks with COBRA methods. While the basic look-and-feel of CNApy is similar to the user interface of the MATLAB toolbox CellNetAnalyzer (CNA), it provides various enhanced features by using components of the powerful Qt library. CNApy supports a number of standard and advanced COBRA techniques and further functionalities can be easily embedded in its GUI facilitating modular extension in the future.

AVAILABILITY AND IMPLEMENTATION: CNApy can be installed via conda and its source code is freely available at https://github.com/cnapy-org/CNApy under the Apache 2 license.

PMID:34878104 | DOI:10.1093/bioinformatics/btab828

Back Translation for Molecule Generation

Yang Fan — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab817. doi: 10.1093/bioinformatics/btab817. Online ahead of print.

ABSTRACT

MOTIVATION: Molecule generation, the task of generating new molecules, is an important problem in bioinformatics. Typical tasks include generating molecules with given properties, molecular property improvement (i.e., improving specific properties of an input molecule), retrosynthesis (i.e., predicting the molecules that can be used to synthesize a target molecule), etc. Recently, deep learning-based methods have received more attention for molecule generation. Labeled data in bioinformatics is usually costly to obtain, but there are millions of unlabeled molecules. Inspired by the success of sequence generation in natural language processing with unlabeled data (He et al., 2016), we would like to explore an effective way of using unlabeled molecules for molecule generation.

RESULTS: We propose a new method, back translation for molecule generation, which is a simple yet effective semi-supervised method. Let X be the source domain, which is the collection of properties, the molecules to be optimized, etc. Let Y be the target domain, which is the collection of molecules. In particular, given a main task which is to learn a mapping from the source domain X to the target domain Y, we first train a reversed model g for the Y to X mapping. After that, we use g to back translate the unlabeled data in Y to X and obtain more synthetic data. Finally, we combine the synthetic data with the labeled data and train a model for the main task. We conduct experiments on molecular property improvement and retrosynthesis, and we achieve state-of-the-art results on four molecule generation tasks proposed by Jin et al. (2020) and one retrosynthesis benchmark, USPTO-50k.
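
The three steps described above (train a reversed model g, back translate unlabeled targets, retrain on the combined data) can be sketched with a toy stand-in. Here a one-dimensional least-squares fit replaces the sequence models used in the paper, and all function names and data values are hypothetical; only the data flow is meant to mirror the abstract.

```python
def fit_linear(pairs):
    """Toy 'model': least-squares fit of b = w * a + c, returned as a callable."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var = sum((a - mean_a) ** 2 for a, _ in pairs)
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    w = cov / var
    c = mean_b - w * mean_a
    return lambda a: w * a + c

def back_translation(labeled, unlabeled_targets, train):
    """labeled: (x, y) pairs; unlabeled_targets: y values without a known x."""
    # Step 1: train the reversed model g for the Y -> X mapping.
    g = train([(y, x) for x, y in labeled])
    # Step 2: back translate unlabeled Y data to obtain synthetic (x, y) pairs.
    synthetic = [(g(y), y) for y in unlabeled_targets]
    # Step 3: train the main X -> Y model on labeled plus synthetic data.
    return train(labeled + synthetic), synthetic

labeled = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # toy pairs following y = 2x + 1
unlabeled_targets = [7.0, 9.0]                   # targets with no paired source
model, synthetic = back_translation(labeled, unlabeled_targets, fit_linear)
prediction = model(5.0)
```

In this toy setting the synthetic pairs lie exactly on the underlying line, so they simply extend the training range. With real models the synthetic sources are noisy, which is why the labeled data is kept in the final training set.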

AVAILABILITY: Our code and data are available at https://github.com/fyabc/BT4MolGen.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34875015 | DOI:10.1093/bioinformatics/btab817

No one tool to rule them all: Prokaryotic gene prediction tool annotations are highly dependent on the organism of study

Nicholas J Dimonaco — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab827. doi: 10.1093/bioinformatics/btab827. Online ahead of print.

ABSTRACT

MOTIVATION: The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CoDing Sequence (CDS) prediction tool and allow them to choose the right tool for their analysis.

RESULTS: We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations.

AVAILABILITY: https://github.com/NickJD/ORForise.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34875010 | DOI:10.1093/bioinformatics/btab827

pystablemotifs: Python library for attractor identification and control in Boolean networks

Jordan C Rozum — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab825. doi: 10.1093/bioinformatics/btab825. Online ahead of print.

ABSTRACT

SUMMARY: pystablemotifs is a Python 3 library for analyzing Boolean networks. Its non-heuristic and exhaustive attractor identification algorithm was previously presented in (Rozum et al. 2021). Here, we illustrate its performance improvements over similar methods and discuss how it uses outputs of the attractor identification process to drive a system to one of its attractors from any initial state. We implement six attractor control algorithms, five of which are new in this work. By design, these algorithms can return different control strategies, allowing for synergistic use. We also give a brief overview of the other tools implemented in pystablemotifs.
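
To make the notion of an attractor concrete, the following sketch finds all attractors of a tiny Boolean network by brute-force enumeration of its state space. It does not use the pystablemotifs API and is not the algorithm the library implements; synchronous updating, the two-node example network and the function name `find_attractors` are assumptions chosen for brevity.

```python
from itertools import product

def find_attractors(rules):
    """rules maps each node name to a function of the current state dict.

    Returns a set of attractors under synchronous update, each attractor
    being a frozenset of states (tuples ordered by sorted node name).
    """
    nodes = sorted(rules)

    def step(state):
        current = dict(zip(nodes, state))
        return tuple(rules[node](current) for node in nodes)

    attractors = set()
    for start in product((False, True), repeat=len(nodes)):
        seen = []
        state = start
        while state not in seen:  # follow the trajectory until it repeats
            seen.append(state)
            state = step(state)
        cycle = seen[seen.index(state):]  # the repeating part is the attractor
        attractors.add(frozenset(cycle))
    return attractors

# Hypothetical network: A and B activate each other.
rules = {
    "A": lambda s: s["B"],
    "B": lambda s: s["A"],
}
attractors = find_attractors(rules)
```

This exhaustive search visits all 2^n states, so it only works for very small networks, which is the scaling problem that dedicated attractor identification methods are designed to avoid.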

AVAILABILITY: The source code is on GitHub at https://github.com/jcrozum/pystablemotifs/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34875008 | DOI:10.1093/bioinformatics/btab825

miRe2e: a full end-to-end deep model based on Transformers for prediction of pre-miRNAs

Jonathan Raad — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab823. doi: 10.1093/bioinformatics/btab823. Online ahead of print.

ABSTRACT

MOTIVATION: MicroRNAs (miRNAs) are small RNA sequences with key roles in the regulation of gene expression at post-transcriptional level in different species. Accurate prediction of novel miRNAs is needed due to their importance in many biological processes and their associations with complicated diseases in humans. Many machine learning approaches were proposed in the last decade for this purpose, but they require handcrafted feature extraction in order to identify possible de novo miRNAs. More recently, the emergence of deep learning has allowed automatic feature extraction, with models learning relevant representations by themselves. However, the state-of-the-art deep models require complex pre-processing of the input sequences and prediction of their secondary structure in order to reach an acceptable performance.

RESULTS: In this work we present miRe2e, the first full end-to-end deep learning model for pre-miRNA prediction. This model is based on Transformers, a neural architecture that uses attention mechanisms to infer global dependencies between inputs and outputs. It is capable of receiving the raw genome-wide data as input, without any pre-processing or feature engineering. After a training stage with known pre-miRNAs, hairpin and non-hairpin sequences, it can identify all the pre-miRNA sequences within a genome. The model has been validated through several experimental setups using the human genome, and it was compared with state-of-the-art algorithms obtaining 10 times better performance.

AVAILABILITY AND IMPLEMENTATION: Webdemo available at https://sinc.unl.edu.ar/web-demo/miRe2e/ and source code available for download at https://github.com/sinc-lab/miRe2e.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34875006 | DOI:10.1093/bioinformatics/btab823

A Network-Based Drug Repurposing Method Via Non-Negative Matrix Factorization

Shagahyegh Sadeghi — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab826. doi: 10.1093/bioinformatics/btab826. Online ahead of print.

ABSTRACT

MOTIVATION: Drug repurposing is a potential alternative to the traditional drug discovery process. Drug repurposing can be formulated as a recommender system that recommends novel indications for available drugs based on known drug-disease associations. This paper presents a method based on non-negative matrix factorization (NMF-DR) to predict the drug-related candidate disease indications. This work proposes a recommender system-based method for drug repurposing to predict novel drug indications by integrating drug and diseases related data sources. For this purpose, this framework first integrates two types of disease similarities, the associations between drugs and diseases, and the various similarities between drugs from different views to make a heterogeneous drug-disease interaction network. Then, an improved non-negative matrix factorization-based method is proposed to complete the drug-disease adjacency matrix with predicted scores for unknown drug-disease pairs.
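
The matrix-completion idea can be illustrated with plain non-negative matrix factorization using the standard multiplicative update rules. This is a generic sketch rather than the NMF-DR method: it omits the similarity integration the abstract describes, treats unknown drug-disease pairs as zeros, and uses a made-up association matrix and hypothetical helper names.

```python
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into non-negative W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

# Made-up drug (rows) x disease (columns) associations; 0 means unknown.
associations = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
w, h = nmf(associations, rank=2)
scores = matmul(w, h)  # predicted scores, including for unknown pairs
```

Entries of `scores` at positions where `associations` is zero can then be ranked to propose candidate indications. A practical method would also mask the unknown entries or add regularization rather than fit them as true zeros.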

RESULTS: The comprehensive experimental results show that NMF-DR achieves superior prediction performance when compared with several existing methods for drug-disease association prediction.

AVAILABILITY: The program is available at https://github.com/sshaghayeghs/NMF-DR.

PMID:34875000 | DOI:10.1093/bioinformatics/btab826

Automated classification of cytogenetic abnormalities in hematolymphoid neoplasms

Andrew Cox — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab822. doi: 10.1093/bioinformatics/btab822. Online ahead of print.

ABSTRACT

MOTIVATION: Algorithms for classifying chromosomes, like convolutional deep neural networks (CNNs), show promise to augment cytogeneticists' workflows; however, a critical limitation is their inability to accurately classify various structural chromosomal abnormalities. In hematopathology, recurrent structural cytogenetic abnormalities herald diagnostic, prognostic, and therapeutic implications, but are laborious for expert cytogeneticists to identify. Non-recurrent cytogenetic abnormalities also occur frequently in cancerous cells. Here, we demonstrate the feasibility of using CNNs to accurately classify many recurrent cytogenetic abnormalities while being able to reliably detect non-recurrent, spurious abnormal chromosomes, as well as provide insights into dataset assembly, model selection, and training methodology that improve overall generalizability and performance for chromosome classification.

RESULTS: Our top-performing model achieved a mean weighted F1 score of 96.86% on the validation set and 94.03% on the test set. Gradient class activation maps indicated that our model learned biologically-meaningful feature maps, reinforcing the clinical utility of our proposed approach. Altogether, this work: proposes a new dataset framework for training chromosome classifiers for use in a clinical environment, reveals that residual CNNs and cyclical learning rates confer superior performance, and demonstrates the feasibility of using this approach to automatically screen for many recurrent cytogenetic abnormalities while adeptly classifying non-recurrent abnormal chromosomes.

AVAILABILITY AND IMPLEMENTATION: Software is freely available at https://github.com/DaehwanKimLab/Chromosome-ReAd. The data underlying this article cannot be shared publicly due to it being protected patient information.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34874998 | DOI:10.1093/bioinformatics/btab822

SCRIP: an accurate simulator for single-cell RNA sequencing data

Fei Qin — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab824. doi: 10.1093/bioinformatics/btab824. Online ahead of print.

ABSTRACT

MOTIVATION: Recent advancements in single-cell RNA sequencing (scRNA-seq) have enabled time-efficient transcriptome profiling in individual cells. To optimize sequencing protocols and develop reliable analysis methods for various application scenarios, solid simulation methods for scRNA-seq data are required. However, due to the noisy nature of scRNA-seq data, currently available simulation methods cannot sufficiently capture and simulate important properties of real data, especially the biological variation. In this study, we developed SCRIP, a novel simulator for scRNA-seq that is accurate and enables simulation of bursting kinetics.

RESULTS: Compared to existing simulators, SCRIP showed a significantly higher accuracy of simulating key data features, including mean-variance dependency in all experiments. SCRIP also outperformed other methods in recovering cell-cell distances. The application of SCRIP in evaluating differential expression analysis methods showed that edgeR outperformed other examined methods in differential expression analyses, and ZINB-WaVE improved the AUC at high dropout rates. Collectively, this study provides the research community with a rigorous tool for scRNA-seq data simulation.

AVAILABILITY AND IMPLEMENTATION: https://CRAN.R-project.org/package=SCRIP.

SUPPLEMENTARY INFORMATION: Supplementary files are available at Bioinformatics online.

PMID:34874992 | DOI:10.1093/bioinformatics/btab824

DiscoRhythm: an easy-to-use web application and R package for discovering rhythmicity

Matthew Carlucci — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Jul 28:btab517. doi: 10.1093/bioinformatics/btab517. Online ahead of print.

NO ABSTRACT

PMID:34874990 | DOI:10.1093/bioinformatics/btab517

3MCor: an integrative web server for metabolome-microbiome-metadata correlation analysis

Tao Sun — Tue, 07 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 7:btab818. doi: 10.1093/bioinformatics/btab818. Online ahead of print.

ABSTRACT

MOTIVATION: The metabolome and microbiome disorders are highly associated with human health and there are great demands for dual-omics interaction analysis. Here, we designed and developed an integrative platform, 3MCor, for metabolome and microbiome correlation analysis under the instruction of phenotype and with the consideration of confounders.

RESULTS: Many traditional and novel correlation analysis methods were integrated for intra- and inter-correlation analysis. Three inter-correlation pipelines are provided for global, hierarchical, and pairwise analysis. The incorporated network analysis function is conducive to rapid identification of network clusters and key nodes from a complicated correlation network. Complete numerical results (csv files) and rich figures (pdf files) will be generated in minutes. To our knowledge, 3MCor is the first platform developed specifically for the correlation analysis of metabolome and microbiome. Its functions were compared with corresponding modules of existing omics data analysis platforms. A real-world data set was used to demonstrate its simple and flexible operation, comprehensive outputs, and distinctive contribution to dual-omics studies.

AVAILABILITY: 3MCor is available at http://3mcor.cn and the backend R script is available at https://github.com/chentianlu/3MCorServer.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34874987 | DOI:10.1093/bioinformatics/btab818

Hubness reduction improves clustering and trajectory inference in single-cell transcriptomic data

Elise Amblard — Mon, 06 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Nov 23:btab795. doi: 10.1093/bioinformatics/btab795. Online ahead of print.

ABSTRACT

MOTIVATION: Single-cell RNA-seq (scRNAseq) datasets are characterized by large ambient dimensionality, and their analyses can be affected by various manifestations of the dimensionality curse. One of these manifestations is the hubness phenomenon, i.e. the existence of data points with surprisingly large incoming connectivity degree in the datapoint neighbourhood graph. The conventional approach to dampen the unwanted effects of high dimension consists in applying drastic dimensionality reduction. It remains unexplored whether this step can be avoided, thus retaining more information than contained in the low-dimensional projections, by directly correcting hubness.
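
The incoming connectivity degree mentioned above can be computed as the k-occurrence of each point, meaning the number of times it appears among the k nearest neighbours of other points. The sketch below uses Euclidean distance and a handful of invented low-dimensional points purely for illustration; the function name `k_occurrence` is hypothetical.

```python
from math import dist

def k_occurrence(points, k):
    """In-degree of each point in the directed k-nearest-neighbour graph."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        # Rank all other points by distance to p and keep the k closest.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        for j in others[:k]:
            counts[j] += 1
    return counts

# Invented toy data: three nearby points and one distant outlier.
points = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (10.0, 0.0)]
counts = k_occurrence(points, k=1)
```

Points with a k-occurrence far above k behave as hubs, while points with a count of zero are never anyone's neighbour. In high-dimensional data the distribution of these counts tends to become strongly right-skewed.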

RESULTS: We investigated hubness in scRNAseq data. We show that hub cells do not represent any visible technical or biological bias. The effect of various hubness reduction methods is investigated with respect to the clustering, trajectory inference and visualization tasks in scRNAseq datasets. We show that hubness reduction generates neighbourhood graphs with properties more suitable for applying machine learning methods; and that it outperforms other state-of-the-art methods for improving neighbourhood graphs. As a consequence, clustering, trajectory inference and visualization perform better, especially for datasets characterized by large intrinsic dimensionality. Hubness is an important phenomenon characterizing data point neighbourhood graphs computed for various types of sequencing datasets. Reducing hubness can be beneficial for the analysis of scRNAseq data with large intrinsic dimensionality in which case it can be an alternative to drastic dimensionality reduction.

AVAILABILITY AND IMPLEMENTATION: The code used to analyze the datasets and produce the figures of this article is available from https://github.com/sysbio-curie/schubness.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34871374 | DOI:10.1093/bioinformatics/btab795

HDMC: a novel deep learning based framework for removing batch effects in single-cell RNA-seq data

Xiao Wang — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 4:btab821. doi: 10.1093/bioinformatics/btab821. Online ahead of print.

ABSTRACT

MOTIVATION: With the development of single-cell RNA sequencing (scRNA-seq) techniques, increasingly more large-scale gene expression datasets are becoming available. However, to analyze datasets produced by different experiments, batch effects among the datasets must be considered. Although several methods have recently been published to remove batch effects in scRNA-seq data, two problems remain challenging and not completely solved: 1) how to reduce the distribution differences between batches more accurately; 2) how to align samples from different batches to recover the cell type clusters.

RESULTS: We propose a novel deep learning approach, a hierarchical distribution matching framework assisted by contrastive learning, to address these two problems. First, we design a hierarchical framework for distribution matching based on a deep autoencoder. This framework employs an adversarial training strategy to match the global distribution of different batches. This provides an improved foundation to further match the local distributions with a maximum mean discrepancy (MMD) based loss. For local matching, we divide cells in each batch into clusters and develop a contrastive learning mechanism to simultaneously align similar cluster pairs and keep noisy pairs apart from each other. This makes it possible to obtain clusters in which all cells are of the same type (true positives) and to avoid clusters containing cells of different types (false positives). We demonstrate the effectiveness of our method on both simulated and real datasets. Results show that our new method significantly outperforms the state-of-the-art methods and has the ability to prevent overcorrection.
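As a rough illustration of the MMD term mentioned above, the Python sketch below computes the standard biased estimate of the squared maximum mean discrepancy between two samples with a Gaussian (RBF) kernel. It is not the HDMC implementation: the kernel choice, the gamma parameter, the function names rbf and mmd2 and the toy batches are assumptions made for this example.

```python
import math


def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between samples xs and ys:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(a, b):
        total = sum(rbf(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy batches (hypothetical): batch_b is batch_a shifted by a constant offset.
batch_a = [(0.0, 0.0), (0.5, 0.1), (0.2, 0.4)]
batch_b = [(x + 3.0, y + 3.0) for x, y in batch_a]

same = mmd2(batch_a, batch_a)      # identical samples
shifted = mmd2(batch_a, batch_b)   # shifted samples
```

The estimate is zero for identical samples and grows as the two distributions move apart, which is why a term of this kind can serve as a loss for aligning batches.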

AVAILABILITY: The python code to generate results and figures in this paper is available at https://github.com/zhanglabNKU/HDMC.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864918 | DOI:10.1093/bioinformatics/btab821

TreeAndLeaf: an R/Bioconductor package for graphs and trees with focus on the leaves

Milena A Cardoso — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab819. doi: 10.1093/bioinformatics/btab819. Online ahead of print.

ABSTRACT

MOTIVATION: The dendrogram is a classical diagram for visualizing binary trees. Although efficient at representing hierarchical relations, it provides limited space for displaying information on the leaf elements, especially for large trees.

RESULTS: Here we present TreeAndLeaf, an R/Bioconductor package that implements a hybrid layout strategy to represent tree diagrams with focus on the leaves. The TreeAndLeaf package combines force-directed graph and tree layout algorithms using a single visualization system, allowing projection of multiple layers of information onto a graph-tree diagram. The Supplementary Information provides two case studies that use breast cancer data from epidemiological and experimental studies.

AVAILABILITY: TreeAndLeaf is written in the R language, and is available from the Bioconductor project at http://bioconductor.org/packages/TreeAndLeaf/ (version >= 1.4.2).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864914 | DOI:10.1093/bioinformatics/btab819

Detecting Spatially Co-expressed Gene Clusters with Functional Coherence by Graph-regularized Convolutional Neural Network

Tianci Song — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab812. doi: 10.1093/bioinformatics/btab812. Online ahead of print.

ABSTRACT

MOTIVATION: Clustering spatially resolved gene expression is an essential analysis to reveal gene activities in the underlying morphological context by their functional roles. However, conventional clustering analysis considers neither gene expression co-localization in tissue for detecting spatial expression patterns nor functional relationships among the genes for biological interpretation in the spatial context. In this paper, we present a Convolutional Neural Network (CNN) regularized by the graph of a Protein-Protein Interaction (PPI) network to cluster spatially resolved gene expression. This method improves the coherence of spatial patterns and provides biological interpretation of the gene clusters in the spatial context by exploiting spatial localization through convolution and gene functional relationships through graph-Laplacian regularization.
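To make the graph-Laplacian regularization idea concrete, the Python sketch below evaluates the usual penalty tr(Z^T L Z) with L = D - W, which equals the sum over graph edges of w_ij * ||z_i - z_j||^2 and is small when genes linked in the graph have similar representations. This is a generic formulation and not the CNN-PReg code: the function names, the edge-list representation and the toy embeddings are assumptions made for this example.

```python
def laplacian_penalty(embeddings, edges):
    """Sum over undirected edges (i, j, w) of w * ||z_i - z_j||^2."""
    total = 0.0
    for i, j, w in edges:
        total += w * sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
    return total


def laplacian_quadratic_form(embeddings, edges):
    """Same quantity computed explicitly as tr(Z^T L Z) with L = D - W."""
    n = len(embeddings)
    lap = [[0.0] * n for _ in range(n)]
    for i, j, w in edges:
        lap[i][i] += w   # degree terms (D)
        lap[j][j] += w
        lap[i][j] -= w   # adjacency terms (-W)
        lap[j][i] -= w
    dim = len(embeddings[0])
    return sum(
        lap[i][j] * embeddings[i][d] * embeddings[j][d]
        for i in range(n)
        for j in range(n)
        for d in range(dim)
    )


# Toy example (hypothetical): three genes, two weighted PPI edges.
toy_z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
toy_edges = [(0, 1, 1.0), (1, 2, 0.5)]
penalty = laplacian_penalty(toy_z, toy_edges)
```

Adding such a term to a clustering or reconstruction loss, scaled by a tunable weight, pulls the representations of interacting genes together.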

RESULTS: In the experiments, we tested clustering the spatially variable genes or all expressed genes in the transcriptome in 22 Visium spatial transcriptomics datasets of different tissue sections publicly available from 10x Genomics and spatialLIBD. The results demonstrate that the PPI-regularized CNN consistently detects gene clusters that have coherent spatial patterns and are significantly enriched in gene functions, with state-of-the-art performance. Additional case studies on mouse kidney tissue and human breast cancer tissue suggest that the PPI-regularized CNN also detects spatially co-expressed genes to define the corresponding morphological context in the tissue with valuable insights.

AVAILABILITY: Source code is available at https://github.com/kuanglab/CNN-PReg.

PMID:34864909 | DOI:10.1093/bioinformatics/btab812

RCSB Protein Data Bank: Improved Annotation, Search, and Visualization of Membrane Protein Structures Archived in the PDB

Sebastian Bittrich — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab813. doi: 10.1093/bioinformatics/btab813. Online ahead of print.

ABSTRACT

MOTIVATION: Membrane proteins are encoded by approximately one fifth of human genes but account for more than half of all US FDA approved drug targets. Thanks to new technological advances, the number of membrane proteins archived in the PDB is growing rapidly. However, automatic identification of membrane proteins or inference of membrane location is not a trivial task.

RESULTS: We present recent improvements to the RCSB Protein Data Bank web portal (RCSB PDB, rcsb.org) that provide a wealth of new membrane protein annotations integrated from 4 external resources: OPM, PDBTM, MemProtMD, and mpstruc. We have substantially enhanced the presentation of data on membrane proteins. The number of membrane proteins with annotations available on rcsb.org was increased by ∼80%. Users can search for these annotations, explore corresponding tree hierarchies, display membrane segments at the 1D amino acid sequence level, and visualize the predicted location of the membrane layer in 3D.

AVAILABILITY: Annotations, search, tree data, and visualization are available at our rcsb.org web portal. Membrane visualization is supported by the open-source Mol* viewer (molstar.org and github.com/molstar/molstar).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864908 | DOI:10.1093/bioinformatics/btab813

GLYCO: a tool to quantify glycan shielding of glycosylated proteins

Myungjin Lee — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Nov 23:btab791. doi: 10.1093/bioinformatics/btab791. Online ahead of print.

ABSTRACT

MOTIVATION: Glycans play important roles in protein folding and cell-cell interactions; furthermore, glycosylation of protein antigens can dramatically impact immune responses. While there have been attempts to quantify the glycan shielding or coverage of a protein surface, none of the publicly available tools analyzes glycan shielding computationally at an atomistic level.

RESULTS: Here, we developed an in silico approach, GLYCO (GLYcan COverage), to quantify the glycan shielding of a protein surface. The software provides insights into glycan-dense/sparse regions of the entire protein surface or a subset of the protein surface. GLYCO calculates glycan shielding from a single coordinate file or from multiple coordinate files, for instance, as obtained from molecular dynamics simulations or by nuclear magnetic resonance spectroscopy structure determination, enabling analysis of glycan dynamics. Overall, GLYCO provides fundamental insights into the glycan shielding of glycosylated proteins.

AVAILABILITY AND IMPLEMENTATION: GLYCO is freely available at GitHub (https://github.com/myungjinlee/GLYCO).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864901 | DOI:10.1093/bioinformatics/btab791

Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions

Michele Leone — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab815. doi: 10.1093/bioinformatics/btab815. Online ahead of print.

ABSTRACT

MOTIVATION: Approaches such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) represent the standard for the identification of binding sites of DNA-associated proteins, including transcription factors and histone marks. Public repositories of omics data contain a huge amount of experimental ChIP-seq data, but their reuse and integrative analysis across multiple conditions remain a daunting task.

RESULTS: We present the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE), an efficient computational method able to integrate and take advantage of the valuable and numerous, but heterogeneous, ChIP-seq data publicly available in big data repositories. Leveraging natural language processing techniques, it integrates omics data samples with semantic annotations from selected biomedical ontologies; then, using hidden Markov models, it identifies combinations of static and dynamic functional elements throughout the genome for the corresponding samples. CombSAFE allows analyzing the whole genome, by clustering patterns of regions with similar functional elements and through enrichment analyses to discover ontological terms significantly associated with them. Moreover, it allows comparing functional states of a specific genomic region to analyze their different behavior throughout the various semantic annotations. Such findings can provide novel insights by identifying unexpected combinations of functional elements in different biological conditions.

AVAILABILITY AND IMPLEMENTATION: The Python implementation of the CombSAFE pipeline is freely available for non-commercial use at: https://github.com/DEIB-GECO/CombSAFE.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864898 | DOI:10.1093/bioinformatics/btab815

Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier

Yushuang Liu — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab811. doi: 10.1093/bioinformatics/btab811. Online ahead of print.

ABSTRACT

MOTIVATION: Multi-label protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the SARS-CoV-2) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as COVID-19.

RESULTS: The paper proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information. These methods include pseudo amino acid composition (PseAAC), encoding based on grouped weight (EBGW), gene ontology (GO), multi-scale continuous and discontinuous (MCD), residue probing transformation (RPT) and evolutionary distance transformation (EDT). Next, we utilize the multi-label information latent semantic index (MLSI) method to avoid the interference of redundant information. Finally, multi-label learning with feature induced labeling information enrichment (MLFE) is adopted to predict the multi-label protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. The overall actual accuracy (OAA) of the first four datasets is 99.23%, 93.82%, 93.24%, and 96.72% by the leave-one-out cross validation (LOOCV). It is worth mentioning that the OAA prediction result of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has obvious advantages in predicting the SCL of multi-label proteins, which provides new ideas for further research on the SCL of multi-label proteins.
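The abstract does not define OAA. In the multi-label localization literature it usually denotes the fraction of proteins whose predicted set of locations matches the true set exactly, and the Python sketch below assumes that definition. The function name overall_actual_accuracy and the toy labels are hypothetical and are not taken from the ML-locMLFE code.

```python
def overall_actual_accuracy(true_sets, predicted_sets):
    """Fraction of samples whose predicted label set equals the true
    label set exactly (assumed definition of OAA)."""
    if len(true_sets) != len(predicted_sets):
        raise ValueError("true and predicted lists must have the same length")
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return exact / len(true_sets)


# Toy labels (hypothetical): three proteins, each with one or more locations.
toy_true = [{"membrane"}, {"cytoplasm", "nucleus"}, {"extracellular"}]
toy_pred = [{"membrane"}, {"cytoplasm"}, {"extracellular"}]
toy_oaa = overall_actual_accuracy(toy_true, toy_pred)
```

Under this definition a partially correct prediction, such as the second protein above, counts as an error, which makes OAA a strict measure for multi-label predictors.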

AVAILABILITY AND IMPLEMENTATION: The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864897 | DOI:10.1093/bioinformatics/btab811

RCandy: an R package for visualising homologous recombinations in bacterial genomes

Chrispin Chaguza — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab814. doi: 10.1093/bioinformatics/btab814. Online ahead of print.

ABSTRACT

MOTIVATION: Homologous recombination is an important evolutionary process in bacteria and other prokaryotes, which increases genomic sequence diversity and can facilitate adaptation. Several methods and tools have been developed to detect genomic regions recently affected by recombination. Exploration and visualisation of such recombination events can reveal valuable biological insights, but it remains challenging. Here, we present RCandy, a platform-independent R package for rapid, simple, and flexible visualisation of recombination events in bacterial genomes.

AVAILABILITY: RCandy is an R package freely available for use under the MIT license. It is platform-independent and has been tested on Windows, Linux, and MacOSX. The source code comes together with a detailed vignette available on GitHub at https://github.com/ChrispinChaguza/RCandy.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864895 | DOI:10.1093/bioinformatics/btab814

trfermikit: a tool to discover VNTR-associated deletions

Peter McHale — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab805. doi: 10.1093/bioinformatics/btab805. Online ahead of print.

ABSTRACT

RESULTS: We present trfermikit, a software tool designed to detect deletions larger than 50 bp occurring in Variable Number Tandem Repeats (VNTRs) using Illumina DNA sequencing reads. In such regions, it achieves a better trade-off between sensitivity and false discovery than a state-of-the-art structural variation (SV) caller, Manta, and complements it by recovering a significant number of deletions that Manta missed. trfermikit is based upon the fermikit pipeline, which performs read assembly, maps the assembly to the reference genome, and calls variants from the alignment.
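The trade-off mentioned above is between two standard quantities: sensitivity, TP / (TP + FN), and false discovery rate, FP / (TP + FP). The Python sketch below only illustrates these formulas; the function names and the counts are made up for the example and are not results from trfermikit or Manta.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real deletions that the caller recovers."""
    return true_positives / (true_positives + false_negatives)


def false_discovery_rate(true_positives, false_positives):
    """Fraction of reported deletions that are not real."""
    return false_positives / (true_positives + false_positives)


# Hypothetical counts for one caller evaluated against a truth set.
tp, fp, fn = 80, 20, 20
toy_sensitivity = sensitivity(tp, fn)
toy_fdr = false_discovery_rate(tp, fp)
```

A caller achieves a better trade-off than another when it reaches higher sensitivity at the same false discovery rate, or a lower false discovery rate at the same sensitivity.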

AVAILABILITY: https://github.com/petermchale/trfermikit.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864893 | DOI:10.1093/bioinformatics/btab805

SBGNview: towards data analysis, integration and visualization on all pathways

Xiaoxi Dong — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Nov 23:btab793. doi: 10.1093/bioinformatics/btab793. Online ahead of print.

ABSTRACT

SUMMARY: Pathway analysis is widely used in genomics and omics research, but the data visualization has been highly limited in function, pathway coverage and data format. Here, we develop SBGNview, a comprehensive R package to address these needs. By adopting the standard SBGN format, SBGNview greatly extends the coverage of pathway-based analysis and data visualization to essentially all major pathway databases beyond KEGG, including 5200 reference pathways and over 3000 species. In addition, SBGNview substantially extends or exceeds current tools (esp. Pathview) in both design and function, including standard input format (SBGN), high-quality output graphics (SVG format) convenient for both interpretation and further update, and a flexible and open-ended workflow for iterative editing and interactive visualization (Highlighter module). In addition to pathway analysis and data visualization, SBGNview provides essential infrastructure for SBGN data manipulation and processing.

AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available as part of the SBGNview package, which is available on both GitHub and Bioconductor: https://github.com/datapplab/SBGNview, https://bioconductor.org/packages/SBGNview.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34864890 | DOI:10.1093/bioinformatics/btab793

Web interface for 3D visualization and analysis of SARS-CoV-2-human mimicry and interactions

Damla Ovek — Sun, 05 Dec 2021 06:00:00 -0500

Bioinformatics. 2021 Dec 2:btab799. doi: 10.1093/bioinformatics/btab799. Online ahead of print.

ABSTRACT

SUMMARY: We present a web-based server for navigating and visualizing possible interactions between SARS-CoV-2 and human host proteins. The interactions are obtained from HMI_Pred, which relies on the rationale that virus proteins mimic host proteins. The structural alignment of the viral protein with one side of the human protein-protein interface determines the mimicry. The mimicked human proteins, the predicted interactions and the binding sites are presented. The user can choose one of the 18 SARS-CoV-2 protein structures and visualize the potential 3D complexes it forms with human proteins. The mimicked interface is also provided. The user can superimpose two interacting human proteins in order to see whether they bind to the same site or different sites on the viral protein. The server also tabulates all available mimicked interactions together with their match scores and number of aligned residues. This is the first server listing and cataloging all interactions between SARS-CoV-2 and human protein structures, enabled by our innovative interface mimicry strategy.

AVAILABILITY: The server is available at https://interactome.ku.edu.tr/sars/.

PMID:34864889 | DOI:10.1093/bioinformatics/btab799