Bioinformatics Made Simple.com

How to create a 3D pie chart in R

2020-09-02T23:06:00.007+05:30

A pie chart is a circular chart that makes complex data easy to understand. Each slice of the circle shows the proportion of one data value. Pie charts can be drawn in two or three dimensions, depending on the taste of the user. In this example, I am going to use the R package plotrix to draw a 3D pie chart.

Prerequisite

We need the following R libraries to run the script

plotrix

File format

Category    Number of genes    Color
Transcription_O    24    #ed2f52
Transport_O    13    #efc023
Catalytic     31    #008080
Phosphorylation    5    #8FBC8B
Cell Wall     7    #AFEEEE
Defense    16    #CD853F
Secondary metabolites    4    #A0522D
Unknown    26    #9ACD32
Miscellaneous    28    #D8BFD8
Uncharacterized    10    #E6E6FA

Script

Result

How to add function descriptions to FASTA sequences HERE

KEGG Sequence Downloader : retrieve gene sequences in Fasta format from KEGG database

2020-04-09T00:11:00.001+05:30

I wanted to download the gene sequences of tobacco from NCBI. Since NCBI also contains isoforms and some other unwanted genes, I chose to get them from KEGG. Although KEGGREST is a wonderful R package to retrieve data from KEGG, it limits the number of entries per retrieval. The following bash script can download thousands of sequences in a single go without that limitation. This is a crude solution and there must be a more efficient way to do it, but it worked for me. Basically, this bash script works in three steps:

Split the IDs into chunks of a given size
Download the FASTA sequences as HTML files
Clean the HTML files and save the result

Usage

bash KEGG_sequence_downloader.sh query_file number_of_sequence
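
The first of the three steps, splitting the ID list into chunks, can be sketched as below. This is only an illustration and not the downloadable script: the function name make_id_chunks is made up, and joining the IDs of a chunk with "+" is an assumption based on how KEGG's REST interface accepts several entries in one request.

```bash
#!/bin/bash
# make_id_chunks: hypothetical helper, not part of KEGG_sequence_downloader.sh.
# $1 = file with one ID per line, $2 = number of IDs per chunk.
# Prints one line per chunk, with the IDs of that chunk joined by "+".
make_id_chunks() {
    awk -v n="$2" '
        NF == 0 { next }                       # skip blank lines
        {
            line = (count % n == 0) ? $1 : line "+" $1
            count++
            if (count % n == 0) { print line; line = "" }
        }
        END { if (line != "") print line }     # last, possibly shorter, chunk
    ' "$1"
}
```

For example, make_id_chunks query_file 10 prints the IDs of query_file in groups of ten, one group per line.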

How to download only viridiplantae miRNA from miRBase HERE

Script

Script name	Download
KEGG_sequence_downloader.sh

Easiest way to find number of cluster in gene expression data

2020-01-22T06:22:00.004+05:30

Gene clustering is a common method to find groups of genes with similar expression patterns. However, it is not always easy to decide the number of clusters in a whole dataset. The following R script uses the most popular methods for determining the optimal number of clusters. It takes "TF_average.csv" as input and saves the result as "optial_cluster.png".

How to perform Non-metric multidimensional scaling (NMDS) analysis script HERE

Prerequisite

We need the following R libraries to run the script

factoextra
NbClust

optimal_cluster_finder.R

Easiest way to calculate Ka Ks ratio and divergence time

2020-01-18T00:09:00.002+05:30

The Ka/Ks ratio is used to estimate whether neutral evolution, purifying selection or positive selection (beneficial mutations) is acting on a set of homologous protein-coding genes. Although there are several wonderful programs to calculate Ka, Ks and the Ka/Ks ratio, the kaks function of the R package seqinr is the easiest way to do it. The function kaks uses the method published by Li (J. Mol. Evol., 36:96-99, 1993) to calculate Ka, Ks and the Ka/Ks ratio. The following script takes FASTA sequences aligned in Clustal format and then uses seqinr to calculate Ka, Ks and the Ka/Ks ratio.

Cheat Sheet to Install and work with R on Ubuntu HERE

Prerequisite

We need the following R libraries to run the script

seqinr
ape
phangorn

Ka/Ks_calculator.R

Draw a heatmap with Custom Symbol in Cell

2019-09-10T21:47:00.003+05:30

A heatmap is a good way to save space when you want to compose a figure with lots of panels. I had some gene expression data that were supposed to go into a big figure. Although I could easily create a bar graph for them, I chose to draw a heatmap to save some space. The easiest way would have been to create this heatmap in Excel, but I chose the ggplot2 R package because it makes it easy to handle big data and customize the annotation. My aim was to draw the heatmap and annotate the cells where the gene expression is statistically significantly different from the control. I chose a star (*) to mark those cells.

Prerequisite

We need the following R libraries to run the script

ggplot2
plyr
scales

Cheat Sheet to Install and work with R on Ubuntu HERE

Sample data

ggplot2_heatmap.R

How to add function descriptions to FASTA sequences

2019-08-28T23:36:00.002+05:30

Short descriptions in FASTA headers help us to quickly gain insight into important information about a sequence. Automated Assignment of Human Readable Descriptions (AHRD) is popularly used to select descriptions and Gene Ontology terms that are concise, informative and precise. This bash script runs DIAMOND to search for homology against three different databases (TAIR, uniprot_sprot, and uniprot_trembl), then runs AHRD and, finally, adds the new information to the FASTA descriptions.

USAGE

bash run_AHRD.sh fasta_input

Extract Part of a FASTA Sequences with Position by python script HERE

Dependencies

Install AHRD
Install DIAMOND (move to dist directory)
Download the database sequences for TAIR, uniprot_sprot, and uniprot_trembl. Make the DIAMOND databases, name them arabidopsis, uniprot_sprot, and uniprot_trembl, and move the files to a new directory named database.
Download and decompress the resources.tar in the working directory.

run_AHRD.sh

How to rename fasta headers according to a matching name list

2019-08-26T21:34:00.001+05:30

FaBox has several utilities to manipulate FASTA sequences. I wanted to replace the FASTA headers with new headers or descriptions that are saved in a file. Although I can do it with FaBox, it becomes difficult to handle when the number of sequences is huge. This PERL script renames the FASTA sequences according to the names stored in another file.

Header

Header and new FASTA header should be separated by TAB

M54089d protein1
M54089c protein2
M54089b protein3
M54089a protein4

Sequence

Each FASTA sequence should be on a single line

Convert Multi line Fasta file into a Single line FASTA File HERE

>M54089d
MEQCRQGSRQNGSVTSGKGLALRAGHGGPSPEPVGCRWTARAAPAARAGRRVPAGGRTGNGSFGGLPRASHSQLRTGTDKGNPTV
>M54089c
MINFDHLFACLHGHYGEVENKLKCILHYFGRICSSMPLGYVSFERKVLSLECTPSCIPYPKEKAWSQSNISLCPIEITISGLIEDQSREAIEVDFANMYLGGGALVRGCVQQEEIRFMINPELIAGMLFLPCMADNEAVEIVGTERFSSYTGRLTKHFVASWINSSVISINSFSKMMASWDFNMIKMLKTPVEGPLLIFCRLVILQLHLKKLRKHRKTS
>M54089b
MIGRADIEGSKSNVAMNAWLHKPVIPVVTFLTPLASNSEGLKIVRPRFHGSYSYWKSESNELLPSVPHEISVRVELILGHLRYLLTDVPPQPNSPPDNVFRRIGLQASLGSKKRGSAPLPLHGISKITLEVVVFHFRLSAPTYTTPLKSFTKSD
>M54089a
MNGLTRFHCPCLLSSETTAKGTGLAESAGKEDPVELDSSRLCEMT

Script

This PERL script will ask for the header list and the FASTA sequences (file formats given above) and save the FASTA file with the new headers in result.fasta
If you are working on a Unix-based system, this AWK one-liner will be very useful

awk 'FNR==NR{  a[">"$1]=$2;next}$1 in a{  sub(/>/,">"a[$1]"|",$1)}1' header_list.txt sequence.fasta
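
The one-liner above keeps the old ID after the new name (for example >protein1|M54089d). If you would rather replace the header completely, a variant could look like the sketch below. The function name replace_fasta_headers is made up for this example, and it assumes that the header list is TAB separated and that each FASTA header contains only the old ID.

```bash
#!/bin/bash
# replace_fasta_headers: hypothetical helper for illustration only.
# $1 = TAB separated list (old ID <TAB> new header), $2 = FASTA file.
replace_fasta_headers() {
    awk -F'\t' '
        FNR == NR { new[">" $1] = $2; next }     # first file: remember old -> new
        ($0 in new) { print ">" new[$0]; next }  # header line with a known ID
        { print }                                # everything else is unchanged
    ' "$1" "$2"
}
```

It can be run as replace_fasta_headers header_list.txt sequence.fasta >result.fasta. Headers that are not in the list are left as they are.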

How to get gene expression value from Arrayexpress

2019-08-16T20:06:00.003+05:30

ArrayExpress has a wonderful R package for data search, download and analysis, but it doesn't always work perfectly. Therefore, it is always good to have an alternative way to download the raw data (CEL files) and analyze them. This R script simply takes the accession number as a command line argument and saves the expression data in a file named Gene_expression.txt.

How to download expression data set from NCBI GEO HERE

Prerequisite

We need the following R libraries to run the script

ArrayExpress
affy

Usage

Rscript script_name accession_number

Limitations

This script is written specifically for Arabidopsis data sets; therefore, you have to modify it as per your requirements.

Script

Easiest way to download multiple sequences from NCBI

2019-07-25T17:49:00.001+05:30

NCBI and I have shared several tricks to download large sets of sequences from the database HERE and HERE, respectively. In this post, I am going to share another easy way to download multiple sequences from NCBI. This script takes a file with the accession list (one accession number per line) and downloads each sequence into an individual file. Finally, it concatenates those files into a single multi-FASTA file and deletes them.

How to BLAST multiple sequences against NCBI database using PERL script HERE

#!/bin/bash

# download every accession listed in id.list into its own file
while read -r i; do curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=${i}&rettype=fasta&retmode=text" > "${i}.fasta"; done < id.list

# combine the individual files and delete them
#find . -name '*.fasta' -exec cat {} \; >protein.fas
cat *.fasta >protein.fas
rm *.fasta

How to perform parallel BLAST

2019-02-24T00:48:00.001+05:30

BLAST can be time-consuming, especially when it involves a large number of query sequences. Therefore, parallel BLAST can be useful. This Bash script performs parallel BLAST in three common steps

Split fasta file into files with a given number of sequences
BLAST them in parallel fashion
Combine their result

Apart from BLAST+, it only needs GNU parallel, so it is easy to use. If you want to parse your BLAST result you can always use NCBI BLAST parser from here.

USAGE

This bash script requires the following command line (the script name followed by four arguments)
script_name: Name of given script
blast_type: What kind of BLAST you want to perform e.g. blastn
query_file: Name of query file
database_name: Name of the database
number_of_sequence_each_file: how many sequences you want per query file

SCRIPT

#!/bin/bash

########################################################################################################################
#
#     split fasta file into files with given number of sequences
#     blast them in parallel fashion
#     combine their result
########################################################################################################################

prog="$1"
query="$2"
database="$3"
num_seq="$4"

# all four arguments are required
if [[ $# -lt 4 ]] ; then
    printf "\033[1;31mGive me a proper command\033[0m\n"
    printf "\033[1;31mUsage: script_name blast_type query_file database_name number_of_sequence_each_file\033[0m\n\n"
    exit 1
fi

start=$(date +%s)

#split fasta sequences in given number of sequences per file
awk -v a_seq="$num_seq" 'BEGIN {n_seq=0;} /^>/ {if(n_seq%a_seq==0){file=sprintf("myseq%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < "$query"

#run blast on the split files only (requires GNU parallel)
ls myseq*.fa | parallel "$prog" -query {} -db "$database" -out {.}.out -evalue 0.001 -num_descriptions 1 -num_alignments 1 -num_threads 8

#combine the results
cat myseq*.out > "$query.blast.result"

while true; do
    read -p "Do you want to delete intermediate files? " yn
    case $yn in
        [Yy]* ) rm -f myseq*.out myseq*.fa; break;;
        [Nn]* ) break;;
        * ) echo "Please answer yes or no.";;
    esac
done

end=$(date +%s)
printf "\033[1;31mYour analysis was done in %d minutes\033[0m\n" $(( (end - start) / 60 ))
printf "\033[1;31mTHANKS\033[0m\n"

Cheat Sheet to Install and work with R on Ubuntu

2019-02-08T22:09:00.000+05:30

Tips and tricks to Install and work with R on Ubuntu

Run these commands in terminal

#Update and Install R
sudo apt-get update
sudo apt-get install r-base r-base-dev
sudo apt-get upgrade

# Know R version
R --version

Run these commands in R terminal

# Know R version
sessionInfo()

# Know version of specific packages
packageVersion("ggplot2")

# check installed packages
installed.packages()

# list all packages where an update is available
old.packages()

# update all available packages
update.packages()

# update, without prompts for permission/clarification
update.packages(ask = FALSE)
update.packages(checkBuilt = TRUE, ask = FALSE, repos = "https://cran.rstudio.com")

# update only a specific package use install.packages()
install.packages("ggplot2")


# install packages from bioconductor
# (BiocManager replaces the older biocLite() installer in current Bioconductor releases)
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")

# install multiple packages from bioconductor
BiocManager::install(c("ComplexHeatmap", "ggplot2"))


# install packages from github
library(devtools)
install_github("jokergoo/ComplexHeatmap")


# start library
library('ggplot2')

Get multiple strings from a file and replace them in another file with AWK

2019-01-28T00:06:00.005+05:30

I have multiple strings (old strings) and their substitutions (new strings) in tab-delimited format in a file named string. I want to replace them in another file named Inputfile and save the result in Outfile.

Script

awk '
NR==FNR {
    old[NR] = $1
    gsub(/&/,RS,$2)
    new[NR] = $2
    next
}
{
    for (i=1; i in old; i++) {
        gsub(old[i],new[i])
    }
    gsub(RS,"\\&")
    print
}
' string Inputfile >Outfile

Terminal screenshot

Examples For Sed Linux Command In Text Manipulation and File Handling

2019-01-25T00:18:00.000+05:30

1. Replace all occurrences

Replace all occurrences of Ram in Inputfile with Shyam and save the result in Outfile

sed 's/Ram/Shyam/g' Inputfile  >Outfile

2. Replace all occurrences of multiple strings

Replace all occurrences of Ram and Sita in Inputfile with Shyam and Geeta, respectively, and save the result in Outfile

sed -e 's/Ram/Shyam/g; s/Sita/Geeta/g' Inputfile  >Outfile

3. Reading sed commands from a file

I have several sed commands saved in a file named string (one per line, for example s/Ram/Shyam/g) and want to apply them to Inputfile and save the result in Outfile

sed -f string Inputfile  >Outfile

4. Substituting Flags: to control the replacement in file
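
A few common substitution flags are sketched below. The file demo_input.txt and the output file names are made up for this illustration; the block creates the small sample file itself so that it can be run as it is.

```bash
# sample input, created only for this illustration
printf 'Ram met Ram\nram met Sita\n' > demo_input.txt

# no flag: only the first match on each line is replaced
sed 's/Ram/Shyam/' demo_input.txt > first_only.txt

# g: every match on each line is replaced
sed 's/Ram/Shyam/g' demo_input.txt > all_matches.txt

# a number: only that match on each line is replaced (here the 2nd)
sed 's/Ram/Shyam/2' demo_input.txt > second_only.txt

# p together with -n: print only the lines where a replacement happened
sed -n 's/Sita/Geeta/p' demo_input.txt > changed_lines.txt
```

Without any flag sed stops after the first match on a line, so the g flag is what you need to replace every occurrence.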

5. Limiting sed in a file
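
A substitution can be limited to certain lines by putting an address in front of the s command. The sketch below shows the usual forms; the file demo_input.txt and the output file names are made up for this illustration.

```bash
# sample input, created only for this illustration
printf 'Ram 1\nRam 2\nRam 3\nSita and Ram\n' > demo_input.txt

# only on line 2
sed '2s/Ram/Shyam/' demo_input.txt > line2.txt

# only on lines 2 to 3
sed '2,3s/Ram/Shyam/' demo_input.txt > lines2to3.txt

# from line 3 to the end of the file
sed '3,$s/Ram/Shyam/' demo_input.txt > line3_to_end.txt

# only on lines that contain Sita
sed '/Sita/s/Ram/Shyam/' demo_input.txt > sita_lines.txt
```

The same kind of address (a line number, a range or a /pattern/) also works in front of other sed commands such as d.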

6. Delete Lines in a file with sed

- following command will delete the second line

sed '2d' Inputfile  >Outfile

- following command will delete the second and third lines

sed '2,3d' Inputfile  >Outfile

- following command will delete everything from the second line to the end of the file

sed '2,$d' Inputfile  >Outfile

How to compare multiple sets using UpsetR

2018-07-05T19:59:00.000+05:30

Why UpSet

Every day I face problems that require understanding the relationships between sets. A Venn diagram does a great job if the number of sets is limited (up to about 4), but it gets clumsy when the number of sets increases. A Venn diagram with many sets is difficult to interpret and it is easy to get lost in it. UpSet is an alternative "visualization technique for the quantitative analysis of sets, their intersections, and aggregates of intersections".

How to use UpSet

The source code of the Python implementation of UpSet can be downloaded from HERE, while the R version is HERE. The web version of UpSet can be used from HERE or HERE. The web versions are obviously easy to use for any project but, unfortunately, they do not always match our taste. I modified the two main functions that draw the main plots. The new functions give more flexibility to the plot, such as:

The number of unique colours for each comparison is calculated automatically.
The colours of the numbers can be changed and do not have to be the same as the bar colour.
The fonts can also be changed.

Prerequisite

We need the following R libraries to run the script

UpSetR
ggplot2
grid
RColorBrewer
extrafont

Downloads

All files can be downloaded from here

UpsetR_modified - This repository contains the R script to make plots using the UpSetR library. Two functions were modified to make the package more flexible.

How to make a group bar graph with error bars and split y axis

2018-05-19T01:22:00.000+05:30

I wanted to draw a grouped bar graph with error bars and a split y axis to show both smaller and larger values in the same plot. Although plotrix has a function to do that, I don't know how to modify its awful-looking graphs. I got a good solution HERE and just modified it as per my taste and requirements. It needs gplots, extrafont and RColorBrewer and produces a beautiful high-resolution chart.

How to perform Non-metric multidimensional scaling (NMDS) analysis

How to make a Heatmap with multiple annotation

2018-04-11T02:05:00.005+05:30

I was interested in making a heatmap with multiple annotations and the least interference. This script creates a heatmap with multiple annotations such as genotype, treatment, gene and class. The legends for all annotations are placed on the right side of the heatmap. Please visit this page for all files related to this R script. We need the R libraries extrafont, ComplexHeatmap, circlize and RColorBrewer to run this script.

1. Easiest Way to Create A Heat Map In Excel

How to download expression data set from NCBI GEO

2018-03-02T03:33:00.000+05:30

NCBI provides a wonderful tool, GEO2R, to analyze microarray data sets, but sometimes I need the normalized data to check the expression of a given probe across experiments. In this case I can use the R script given here, which downloads the expression data for a whole experiment into a simple text file that can be opened in Excel or a similar program. This script assumes that you have installed GEOquery on your machine. If GEOquery is not installed, run these commands in R

install.packages("BiocManager")
BiocManager::install("GEOquery")

Script can be simply run from your terminal

Rscript GEO_expression.R accession_number

1. Easiest Way to Download All Sra Samples or Multi Experiment file from NCBI SRA database

How to perform Non-metric multidimensional scaling (NMDS) analysis

2017-10-10T23:11:00.002+05:30

Requirements
We need R libraries vegan, ggplot2, extrafont to run this script.

R scripts

For complete result of analysis and raw file, please visit HERE

How To Predict CRISPR-Cas9 target site in R HERE

Easiest Way to Download All Sra Samples or Multi Experiment file from NCBI SRA database- II

2017-07-06T02:07:00.003+05:30

Previously I shared an easy way to download data files from the NCBI SRA database. Although it needed only wget to download the data files, it required a lot of link editing. Here I am going to share another script that can download all files of a study or a single run, depending on the accession number provided to the script. This bash script also converts the native sra files into fastq files at the same time. Hope it will be of some help.

sra_download.sh


#!/bin/bash
id=$1
if [ "$#" -eq  "0" ]
 then
   echo "No accession number provided"
   exit 1
else
   echo -n "Please wait...."
   esearch -db sra -query "$id" | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump --split-files
   echo
   echo -n "Download complete...."
   echo
fi

Save the script as sra_download.sh and run as

 bash sra_download.sh SRA_accession_number

You can use any kind of accession number, including SRR3114162, PRJNA309373 or SRP068795. I have tested this script on Ubuntu, and it assumes that Entrez Direct and the NCBI SRA Toolkit are installed on your machine.

How To Predict CRISPR-Cas9 target site in R

2017-04-28T23:45:00.002+05:30

Although there are several online bioinformatics tools to predict target sites for CRISPR-Cas9, I was looking for an offline solution. I am going to share an R script that uses two R libraries, CRISPRseek and msa. The best part is that it also predicts the restriction enzyme sites in the target region, which helps in the downstream screening of mutants. Hope this script will help someone. As always, the R script is heavily commented to make it easy to follow.

How To Perform Basic Multiple Sequence Alignments In R HERE

How To Perform Basic Multiple Sequence Alignments In R

2017-03-20T01:41:00.000+05:30

There are several servers out there to perform multiple sequence alignment and visualization. I was looking for an all-in-one offline tool to perform the multiple sequence alignment and generate a publication-quality image. We can use the msa package for both amino acid and DNA alignments. The following R script can be used for this purpose. The script is heavily commented to make it easy to understand. Hope it will be helpful for others too.

Sequence Similarity Search - I : Basic Local Alignment Search Tool (BLAST) HERE

Easiest Way to Download All Sra Samples or Multi Experiment file from NCBI SRA database

2016-11-17T20:46:00.000+05:30

The European Nucleotide Archive is a good place to start when downloading raw fastq files, but it is not easy to download multiple run files from the NCBI SRA database. I recently learned a relatively easy way to do it. I want to download these four runs together from the NCBI SRA database: SRR122247, SRR122248, SRR122249 and SRR122250. The format of the basic URL is

ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/{SRR|ERR|DRR}/<first 6 characters of accession>/<accession>/<accession>.sra

where {SRR|ERR|DRR} should be either ‘SRR’, ‘ERR’, or ‘DRR’ and should match the prefix of the target .sra file (described in detail HERE). So my final URLs look like this


ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR122/SRR122247/SRR122247.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR122/SRR122248/SRR122248.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR122/SRR122249/SRR122249.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR122/SRR122250/SRR122250.sra

Now save these URLs in a text file (url_list.txt, for example) and run wget like this

wget -i url_list.txt
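
If you have many accessions, the link editing can be scripted. The sketch below only fills in the URL pattern described above and writes url_list.txt; the function name sra_url is made up for this example.

```bash
#!/bin/bash
# sra_url: hypothetical helper that fills in the URL pattern shown above.
sra_url() {
    local acc="$1"
    local prefix first6
    prefix=$(printf '%.3s' "$acc")     # SRR, ERR or DRR
    first6=$(printf '%.6s' "$acc")     # first 6 characters of the accession
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$prefix" "$first6" "$acc" "$acc"
}

# write one URL per accession into url_list.txt
for acc in SRR122247 SRR122248 SRR122249 SRR122250; do
    sra_url "$acc"
done > url_list.txt
```

The resulting url_list.txt contains the same four URLs as listed above and can be passed to wget -i.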

Run sudo apt-get install wget from the terminal if you don't have wget

Journal Citation Reports 2015 / 2016

2016-06-15T20:37:00.001+05:30

Recently, Thomson Reuters announced the Journal Citation Reports® for the year 2016. Journal Citation Reports® offers a systematic means to critically evaluate the world's leading journals, with quantifiable, statistical information based on citation data compiled from articles' cited references. Thus, in the scientific community the journal impact factor is now considered a quality measure of published work. JCR contains the journal impact factor, the number of times a journal is cited in different journals and other statistical values. This list of journal impact factors is made from the data gathered throughout 2015, which is why it is also called the journal impact factor 2015. You can view this SCI journal impact factor 2015 list in another tab by clicking HERE or you can download it HERE

How to download only viridiplantae miRNA from miRBase

2016-01-19T23:08:00.003+05:30

There is no direct way to download organism-specific miRNAs from the miRBase database. So I extracted the miRNAs of Viridiplantae (plants) from miRBase using some unix commands. The steps are as follows

Download the information regarding organisms from HERE.
Download the mature miRNA sequence from HERE
Extract both files in same directory
Download the fasta dereplicating python script from HERE
Now run the bash script given from the same directory

#!/bin/bash
#script to extract plant mirna from mirbase database

# convert fasta to tab
awk 'BEGIN{RS=">"}{gsub("\n"," ",$0); print ">"$0}' mature.fa >mature.tab


#extract the organisms belong to Viridiplantae. You can extract the miRNA for other
# organism too by changing the word "Viridiplantae"
grep Viridiplantae organisms.txt >plants_mirbase.txt

# extract name of plants
awk '{ print $3 " " $4 }' plants_mirbase.txt >plant_name.txt

#extract mirna for plants
grep -f plant_name.txt mature.tab >plant_mirna.tab

#convert tab to fasta
awk '{print ""$1" "$2" "$3" "$4" "$5"\n"$6}' plant_mirna.tab > plant_mirna.rna

#convert RNA to DNA
sed '/^[^>]/ y/uU/tT/' plant_mirna.rna  >plant_mirna.fasta


#dereplicate mirna file
python derep.py -i plant_mirna.fasta

#cleaning fasta header
cat derep_plant_mirna.fasta | awk -F ';' '{print $1}' >plant_mature_mirna_unique.fasta


rm mature.tab
rm plants_mirbase.txt
rm plant_mirna.tab
rm plant_mirna.rna
rm plant_name.txt
rm derep_plant_mirna.fasta

echo mature mirna from all plants are in plant_mirna.fasta!!!
echo unique mature mirna from all plants are in plant_mature_mirna_unique.fasta!!!
echo all job done!!!

Basically, the above bash script extracts the plant miRNAs deposited in the miRBase database and saves them to a file, plant_mirna.fasta. In the second part, it removes the duplicate miRNAs and saves the unique ones in another file, plant_mature_mirna_unique.fasta.

How to remove duplicate sequences from FASTA file HERE

BLAST Database creation error

2015-12-28T04:32:00.000+05:30

I was trying to create a BLAST database but I got this error

Building a new DB, current time: 12/03/2015 09:44:18
New DB name:   plant_protein
New DB title:  /home/sanjay/bin/Genomes/plant_protein_from_plantgdb.fa
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B

volume: plant_protein

file: plant_protein.pin
file: plant_protein.phr
file: plant_protein.psq

BLAST Database creation error: FASTA-Reader: No residues given

Then I checked whether any of my FASTA sequences was empty by running this command, which counts the empty lines in the file

grep -c "^$" ~/bin/Genomes/plant_protein_from_plantgdb.fa

I found that there was one sequence which had only a FASTA header. To remove the empty FASTA record I ran this command

awk 'BEGIN {RS = ">" ; FS = "\n" ; ORS = ""} $2 {print ">"$0}' ~/bin/Genomes/plant_protein_from_plantgdb.fa >~/bin/Genomes/plant_protein_from_plantgdb.fasta
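
Counting empty lines does not tell you which record is empty. If you want to see the headers that have no residues before removing them, a small sketch like the one below can help. The function name empty_fasta_records is made up for this example.

```bash
#!/bin/bash
# empty_fasta_records: hypothetical helper, prints the headers that have no sequence lines.
# $1 = FASTA file.
empty_fasta_records() {
    awk '
        /^>/ {
            if (header != "" && !has_seq) print header   # previous record was empty
            header = $0; has_seq = 0; next
        }
        NF > 0 { has_seq = 1 }                           # any non-blank sequence line
        END { if (header != "" && !has_seq) print header }
    ' "$1"
}
```

For example, empty_fasta_records ~/bin/Genomes/plant_protein_from_plantgdb.fa would list every header that is followed only by blank lines or directly by the next header.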

And finally I got the happy success message

Building a new DB, current time: 12/03/2015 09:48:01
New DB name:   plant_protein
New DB title:  /home/sanjay/bin/Genomes/plant_protein_from_plantgdb.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 980219 sequences in 24.2583 seconds.

How to install NCBI BLAST program on your computer HERE