Alignment -- local blast uses detailed 1- database sequence retrieval, download and comparison

This is a note from October 7, 2018. At that time, a batch of amoA genes of soil ammonia oxidizing bacteria (AOA/AOB) were cloned and sequenced, and the sequencing results need to be identified in batch. In order to save time, I taught myself about the local comparison of blast. At that time, I also used the blast-2.7.1 + version. Now download some sequences randomly from NCBI, reproduce and update the notes.

1, Software installation

ncbi-blast-2.12.0 + is available at: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/.

# Linux download
wget -c -t 0 ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2
.12.0+-x64-linux.tar.gz
# You can also download the windows version
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.12.0+-win64.exe

# decompression 
tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
 
# Set the bin directory where the blast subroutine is located as the environment variable
echo "export PATH=~/hj/software/ncbi-blast-2.12.0+/bin/:\${PATH}"  >> ${HOME}/.bashrc
source ~/.bashrc # Or export PATH=~/hj/software/ncbi-blast-2.12.0+/bin/:${PATH}

Figure 1 subroutine included in blast. Software user manual: http://www.ncbi.nlm.nih.gov/books/NBK279690 .

Table 1 instructions for use of blast subroutine( https://www.ncbi.nlm.nih.gov/books/NBK52640/ )

2, Data preparation

The NCBI database for annotation can be created using the researcher's own FASTA sequence and makeblastdb. NCBI ready-made blast database can also be used for decompression and direct use. Blast database download address: https://ftp.ncbi.nlm.nih.gov/blast/db/ . Database information description website: https://www.ncbi.nlm.nih.gov/books/NBK62345/#blast_ftp_site.The_blastdb_subdirectory. GENOME TAXONOMY DATABASE(https://gtdb.ecogenomic.org/ )Is a database of bacterial and archaeal genomes. Fungene( http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=358 )The functional gene sequence can be downloaded. At present, it can't be downloaded because of hardware failure. I don't know when it will be ready. There are many other databases that you can learn by yourself.

16S RNA sequence databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz
18S fungal sequence number databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/18S_fungal_sequences.tar.gz
28S fungal sequence databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/28S_fungal_sequences.tar.gz
beta coronavirus nucleic acid databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/Betacoronavirus-nucl-metadata.json
ITS fungal reference sequence databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/ITS_RefSeq_Fungi.tar.gz
ITS eukaryotic reference sequence databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/ITS_eukaryote_sequences.tar.gz
Eukaryotic ribosomal large subunit rRNA(LSU rRNA) databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/LSU_eukaryote_rRNA.tar.gz
Prokaryotic ribosomal large subunit rRNA databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/LSU_prokaryote_rRNA.tar.gz
Eukaryotic ribosomal small subunit rRNA(SSU rRNA) databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/SSU_eukaryote_rRNA.tar.gz
Non redundant protein database (nr)https://ftp.ncbi.nlm.nih.gov/blast/db/nr-prot-metadata.json
Partial non redundant nucleic acid database (nt)https://ftp.ncbi.nlm.nih.gov/blast/db/nt-nucl-metadata.json
Swiss prot database (last major release)https://ftp.ncbi.nlm.nih.gov/blast/db/swissprot.tar.gz
Species classification databasehttps://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

Table 2 | partial blast database. The database has multiple sub files that need to be downloaded together. View the metadata.json file corresponding to the database. You can view the database version, database type, including data files and other information. It can also be found in the backup link of the national Microbial Science Data Center( ftp://download.nmdc.cn/blast/db/ )Download in, but the update is delayed. The RefSeq Select database contains representative and selectable sequences of each protein coding gene( https://www.ncbi.nlm.nih.gov/refseq/refseq_select/ ). Marker gene project: https://www.ncbi.nlm.nih.gov/refseq/targetedloci/.

2.1 download blast v5 database

# Download and update blast V5 database
## Download and decompress by yourself
cd ~/hj/DB
nohup wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/16S_ribosomal_RNA.tar.gz >16s.log 2>&1 &
nohup wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/18S_ fungal_ Sequences. Tar. GZ > 18S. Log 2 > & 1 & # Download Database
nohup wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/ITS_RefSeq_Fungi.tar.gz >ITS_F.log 2>&1 &
nohup wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/ITS_eukaryote_sequences.tar.gz >ITS_E.log 2>&1 &
nohup wget -t 0 -c https://ftp.ncbi.nlm.nih.gov/blast/db/cdd_delta.tar.gz >CDD.log 2>&1 &

for file in `find . -name "*.gz"` 
do 
fold=`basename ${file} .tar.gz`
mkdir "$fold";
tar zxvf  "$file" -C "$fold";
done # Batch decompress the blast database into the specified directory to prevent the taxdb information from being overwritten.

## Using update_blastdb.pl download or update the database
update_blastdb.pl --showall # View all databases
### If the blast environment variable is not set, the full path of the perl script is used to run the program
perl ~/software/ncbi-blast-2.12.0+/bin/update_blastdb.pl --passive \
--decompress  taxdb \
--blastdb_version 5 # If the taxdb database already exists, this step is to update the taxdb database.
###update_blastdb.pl sometimes reports an error, which is more convenient to download manually.

## Set the database BLASTDB environment variable - to improve efficiency( https://www.ncbi.nlm.nih.gov/books/NBK52640/ )
echo "export BLASTDB=~/hj/DB" >> ~/.bashrc 
#### If BLASTDB is not set, blast will search the working directory

## Check the integrity and validity of the database,# pal is the suffix of protein; nal is a nucleic acid suffix
~/software/ncbi-blast-2.12.0+/bin/blastdbcheck -help # View help information
### Detect a specified database, - full means to detect each sequence in the database. The amount of data is large and can be replaced with - random,-ends or - stripe
~/software/ncbi-blast-2.12.0+/bin/blastdbcheck \
-db  /home/zhangguogang/hj/DB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-dbtype nucl -verbosity 3 -full 

### Recursively detect the integrity of all databases in the directory, - verbosity set the output detail level of the detection results, 0-4 optional
~/software/ncbi-blast-2.12.0+/bin/blastdbcheck \
-dir ~/hj/DB/ \
-recursive \
-dbtype guess \
-verbosity 2 \
-full

Figure 2 Update_ Blastdb.pl updates the download database name.

Figure 3 - blastdbcheck checks database integrity.

2.2 sequence preparation for blast alignment

At that time, because I was doing the cloning and sequencing of AOA / AOB amoA genes, I downloaded the fasta format sequences of all AOA / AOB amoA genes in NCBI, built the database with makeblastdb, and then compared them. Here, use blastdbcmd to download some sequences for learning blast operation. DNA or protein sequences can also be downloaded from NCBI (which can be downloaded locally using edirect) or other databases, preferably with GI number.

# LMDB available for blast v5 data( http://www.lmdb.tech/doc/ )
##Or blastdbcmd to retrieve the sequence more quickly through the login number
blastdbcmd -help #View help information

## Download the sequence locally using the login number,
###You can also http://www.ncbi.nlm.nih.gov/sites/batchentrez Website Manual Download
cd ~/hj/blast
### Download a login number sequence
~/software/ncbi-blast-2.12.0+/bin/blastdbcmd \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-entry nr_025000 -out 16S_query.fa 

### Batch download multiple login number sequences
cat batch_Accessions.txt # batch_ Access.txt contains three login numbers
nr_025000
nr_028940
nr_125568

~/software/ncbi-blast-2.12.0+/bin/blastdbcmd \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-entry_batch batch_Accessions.txt -out 16S_query3.fa

Figure 4. Blastdbcmd obtains the sequence from the blast database according to the login number.

## You can use taxid to download data from NCBI,
###taxid is a unique digital ID assigned by NCBI to each taxon.
###It can be found in the taxonomy database( https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi)
###Or use get in blast_ species_ Taxids.sh queries based on Latin or scientific names
### Get the taxid list
~/software/ncbi-blast-2.12.0+/bin/get_species_taxids.sh \
-n Mycobacterium kubicae # taxid is a genus taxon 1763

~/software/ncbi-blast-2.12.0+/bin/get_species_taxids.sh \
-t 1763 > 1763.txids # Get all taxid s at and below 1763 classification level

### Use the taxid list to download data. Not all taxids can download data from the database.
~/software/ncbi-blast-2.12.0+/bin/blastdbcmd \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-taxidlist 1763.txids \
-outfmt "%f %T %S" \
-target_only \
-out 16S_query_MK.fa 
#### Using the taxid download sequence, 1233042 represents ammonia oxidizing archaeon ar.
####In the output format,% a indicates the output login number,% T outputs taxid,% s outputs scientific name.

#### -target_only # Because nr is a non redundant database, a sequence may correspond to multiple login numbers.
####Set - target_only means that only the sequence containing only the login number to be retrieved is returned,
####Otherwise, the sequence containing at least one login number to be retrieved will be returned.

Figure 5|get_species_taxids.sh get the species taxon ID.

Figure 6|get_species_taxids.sh gets all taxid s under a taxon.

Figure 7. Blastdbcmd downloads sequence data using the taxid list.

The NCBI batchentrez website can also be accessed manually( http://www.ncbi.nlm.nih.gov/sites/batchentrez )Download: 1. Sort the serial number of the sequence to be downloaded into a TXT file, one line at a time. 2. Upload and download the txt file from the NCBI batch download sequence website

3, Blast local comparison process - blast n as an example

When the database and the sequences to be compared are ready, the search subroutines of blast (blast n, blast P, blast x, t blastn and tblastx) can be used for sequence retrieval. It can run in both windows and Linux. Blast user manual: https://www.ncbi.nlm.nih.gov/books/NBK279690/

3.1 use the downloaded blast database for sequence retrieval

# 3.1 use the downloaded blast database for sequence retrieval
## 3.1.1 use BLASTN (nucleic acid to nucleic acid) to search 16S RNA database_ Query3.fa sequence
blastn -help # Help information
#blastn comparison type. Open the output file to view the comparison results
~/software/ncbi-blast-2.12.0+/bin/blastn \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-query 16S_query3.fa \
-task blastn \
-strand both \
-evalue 1e-5 \
-max_target_seqs 1 \
-outfmt 6 \
-out 16S_blast_out
## -strand both: you can select the sequence retrieval direction: both, minus or plus.
## -task: set the retrieval method, default ` megablast,
###Generally, blast n is selected, the base number of the sequence to be retrieved is < 50, and blast n-short is selected for retrieval.

## -max_target_seqs 1: displays the sequence with the highest matching degree
## -outfmt 6 #There are 17 formats available,
###Many gene structure and function annotations are based on the output results of blast(dimond is also widely used and runs faster),
###Commonly used XML(5) and Tabular(6, 7) formats,
###The output results are in the form of 6, 7 and 10. You can add additional output parameters yourself.

Table 3|blastn search options. https://www.ncbi.nlm.nih.gov/books/NBK569839/#usrman_BLAST_feat.Sequence_filtering_app

Fig. 8 output result of outfmt 6 of BLASTN. 16S_blast_out.

## 3.1.2 user defined output content: in order not to retrieve the sequence to be retrieved,
###Sort the sequence ID to be retrieved into rm.txt document,
###Use - negative when retrieving_ Seq id list is used to exclude sequence IDs
grep ">" 16S_query3.fa | awk  '{print $1}' | sed  's/>//g' > rm.txt

~/software/ncbi-blast-2.12.0+/bin/blastn \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-query 16S_query3.fa \
-task blastn \
-strand both \
-evalue 1e-5 \
-negative_seqidlist rm.txt \
-num_descriptions 5  \
-num_alignments 5 \
-outfmt "7 std staxid sstrand" \
-num_threads 4 \
-out 16S_blast_out2 
# The output result is in the standard (std) format of 7,
##Add a column of taxon ID(staxid) and a column of the chain direction of the retrieved sequence.

## -num_descriptions 500 #To display the number of database sequences for a single line description,
###And - Max_ target_ If SEQS is incompatible, outfmt > 4 cannot be set.

## -num_alignments 250 #The output result contains the number of sequences matched,
###And - max_target_seqs incompatible

## -num_threads 4 # Sets the number of CPU s used for retrieval

Fig. 9 output results of outfmt 7 plus taxon and chain direction of BLASTN. 16S_blast_out2 comparison results do not contain themselves.

## 3.1.3 if there are many database sequences to be shielded or you want to compare them in the specified sequence of the database
### You can use blastdb_ The alias tool generates a bsl file according to the sequence ID,
###Used to specify a sequence to speed up retrieval.
~/software/ncbi-blast-2.12.0+/bin/blastdb_aliastool \
-seqid_file_in rm.txt  -seqid_file_out rm.txt.bsl

~/software/ncbi-blast-2.12.0+/bin/blastn \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-query 16S_query3.fa \
-task blastn \
-evalue 1e-5 \
-seqidlist rm.txt.bsl \
-outfmt "17 std SQ" \
-out 16S_blast_out3  
#### -seqidlist rm.txt.bsl, which is compared with the specified sequence-
#### outfmt "17 std SQ" output base sequence

Fig. 10 output result of outfmt 17 plus base sequence of|blastn. 16S_ blast_ The out3 comparison result only contains itself.

## 3.1.4 during comparison, you can also specify the taxon ID for comparison or mask the specific taxon ID
###Take the taxid corresponding to the specified species taxon
#/software/ncbi-blast-2.12.0+/bin/get_species_taxids.sh -n Mycobacterium kubicae # taxid is a genus taxon 1763
#/software/ncbi-blast-2.12.0+/bin/get_species_taxids.sh -t 1763 > 1763.txids # Get all taxid s at and below 1763 classification level

### Specify a taxon ID to compare or mask a specific taxon ID
~/software/ncbi-blast-2.12.0+/bin/blastn \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-query 16S_query3.fa \
-taxidlist 1763.txids \
-outfmt "6 std staxid" \
-out 16S_blast_out5 # –taxidlist 1763.txids # Multiple txid s can be specified
##-negative_ The taxidlist option is used to exclude the corresponding option

~/software/ncbi-blast-2.12.0+/bin/blastn \
-db $BLASTDB/16S_ribosomal_RNA/16S_ribosomal_RNA \
-query 16S_query3.fa \
-taxids 1273442 \
-outfmt "7 std staxid" \
-out 16S_blast_out6 # -taxids 1273442 #taxid with species and lower taxonomic level is required
## -negative_taxids is used to exclude the corresponding options
## -query_loc: you can select the specified sequence location of the sequence to be queried for retrieval, such as - query_loc 102-1110 represents the 102nd base to 1110 base of the sequence to be retrieved.
## The parameters that can be set on the NCBI web version of blast can be set in the local version of blast. Just check the help information settings.

Fig. 11 output results of|- taxids 1273442. 16S_blast_out6 contains only 1273442 taxon ID sequences.

3.2 use the FASTA sequences downloaded from other databases or sequenced to build a database for blast comparison

# Building a database using makeblastdb
## chmod 777 ~/software/ncbi-blast-2.12.0+/bin/makeblastdb # Give operation permission
~/software/ncbi-blast-2.12.0+/bin/makeblastdb -help # View help information
~/software/ncbi-blast-2.12.0+/bin/makeblastdb -in 16S_query3.fa \
-dbtype nucl -out 16S_query3db -parse_seqids
## -input_type fasta #There are four optional data forms for building a database: asn1_bin, asn1_txt, blastdb and FASTA (default)
## -dbtype nucl # The constructed database type nucl: nucleic acid database; prot: protein database.
## -out 16S_query3db #When the database is named, it will be output in 16S_ A series of database files prefixed with query3db.

Figure 12 | database construction results. 16S_query3db

# After the database is built, you can use blastn to retrieve it
#blastn -help # Help information
~/software/ncbi-blast-2.12.0+/bin/blastn -db 16S_query3db \
-query 16S_query3.fa \
-task blastn \
-evalue 1e-5 \
-outfmt 6 \
-out 16S_blast_out4

Figure 13 | retrieval results of self built database. The database path should be correct. 16S_blast_out4

# The method is the same under windows. Enter the location of the program
#Enter the disk where the blast stand-alone version is located
H:
#Enter the bin folder. There are many kinds of runnable programs under the bin folder
cd H:\software\NCBI\\bin
#Create a new alignment sequence database, and the database file namedb.nhr/nin/nsq appears
makeblastdb -in Sequence file name.fasta -dbtype Database type -out Output database name db -parse_seqids
#It may appear that the command cannot be recognized as a program name, add '. \' before the command line
.\makeblastdb -in Sequence file name.fasta -dbtype Database type -out Output database name db

The use of other subroutines of blast is similar. Just check the help information and user manual. Blast can also be used for multiple alignment sequences( https://www.ncbi.nlm.nih.gov/books/NBK569842/ )Search. rpsblas can be used for domain search in gene family analysis. Blast has many functions and is very powerful. You can see the user manual of blast if necessary.

Course guide of bioinformatics: handout of bioinformatics Graduate Summer School

The official account of WeChat public sends "blast", which can get software packages and explanation documents, or download them in the folder of QQ communication group.

Recommended reading

isoform Switch Analyzer - transcript based variable splicing and isoform Switch analysis

Expansion and contraction analysis of phylogenomics Cafe gene family

Phylogenomics - Construction of non partitioned tandem genome-wide evolutionary tree (ML method)

Phylogenomics densitree drawing detailed tutorial

Phylogenomics sequence alignment multi FASTA to VCF and VCF version conversion

EcoEvoPhylo: it mainly shares the data analysis course of microbial ecology and phylogenomics.

Scan QR code and pay attention to EcoEvoPhylo. Let's learn together, communicate with each other and make common progress.

Academic exchange QQ group | 438942456

Academic exchange wechat group 𞓜 jingmorensheng412

When adding friends, please note name, company and research direction.

Tags: Database Data Analysis

Posted on Tue, 30 Nov 2021 17:37:15 -0500 by sw9