CAZy_SeqDownloader: a bash script to download fasta sequences by using the CAZy database.

Andrea Strazzulli
28 Oct 2020

Two of the most frequently asked questions from my colleagues and students are:

"Why doesn't CAZy have a sequence download function?"
"How can I easily download all the amino acid sequences (e.g. the characterized ones) of a given family listed in CAZy?"

I have never received an answer to the first question...

As for the second, the multistep solution I used to suggest was rather tortuous and required:

Copying the tables from the web pages of www.cazy.org into a spreadsheet
Saving the GenBank accession numbers as a list (one accession number per row)
Searching for the list of accession numbers with Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez).

This was easy (if boring) when a given family had only a few pages, but it was a real nightmare for larger families (did anyone say GH1?).

In addition, Batch Entrez has unfortunately started giving errors in its output recently and no longer returns the UIDs.

We needed an alternative solution...

For this reason, I wrote CAZy_SeqDownloader, a "spaghetti bash script" (very inelegant, indeed) that lets you quickly download the amino acid sequences of interest. It is intended for basic Linux and macOS users (sorry, Windows users).

The script requires two free programs that must be installed separately on your machine: Lynx and EDirect.

To install Lynx you can find information here:

MacOS: https://osxdaily.com/2011/07/26/get-lynx-for-mac-os-x-10-7-lion/
Linux: http://elinuxbook.com/install-lynx-browser-lynx-web-browser-on-ubuntu-16-04-a-text-web-browser/

To install EDirect: https://www.ncbi.nlm.nih.gov/books/NBK179288/
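
Before running the script, it may help to confirm that both tools are reachable from your shell. The snippet below is only a sketch: `check_deps` is a hypothetical helper that is not part of CAZy_SeqDownloader, and it assumes that a working EDirect installation puts `epost` and `efetch` (the two EDirect commands the script calls) on your PATH.

```shell
#!/bin/bash
# check_deps is a hypothetical helper, not part of CAZy_SeqDownloader.
# It prints every command that cannot be found on PATH and returns 1
# if at least one of them is missing, 0 otherwise.
check_deps() {
    local cmd
    local missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "Missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# lynx is the text browser; epost and efetch are the EDirect commands used by the script.
if check_deps lynx epost efetch; then
    echo "All dependencies found"
fi
```

If a command is reported as missing, install it from the links above and open a new terminal before trying again.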

CAZy_SeqDownloader is certainly not perfect and could be significantly improved.

Below you will find the script (also available as a bash file); you are, of course, free to modify it and use it as you wish.

#!/bin/bash
# This script saves a list of accession numbers from a specific web page of www.CAZy.org (GenBank column), then downloads the sequences into a multi-FASTA file.
# Multipages debug 2020.
# Dependencies: lynx, EDirect (E-utilities)
# CAZy SeqDownloader Copyright: Andrea Strazzulli (andrea.strazzulli@unina.it) - Dpt. of Biology, University of Naples, Federico II - ITALY.
clear
echo "CAZy SeqDownloader"
echo "**********************"
echo "Andrea Strazzulli 2020©"
echo "**********************"
echo "Please type (or copy and paste) the URL of the CAZy web page that you need"
read -r INPUT
echo "Type an output filename"
read -r OUTPUT
echo 
echo "************************"
echo 
echo

#Make a selection for the kind of CAZy page
PS3='Please select the type of CAZy page: [1] All, [2] Kingdom (Archaea, Bacteria, Eukaryota, and single CAZy page), [3] Characterized, [4] Exit: '
options=("All" "Kingdom" "Characterized" "EXIT")

#option1 - Page "all"
select opt in "${options[@]}"
do
    case $opt in
        "All")
            # Find the links to the other pages of the family and write them to a temporary file.
            echo "Accession numbers extraction..."
            lynx -dump "$INPUT" | grep "PRINC" | sed 's/http/@http/g' | cut -f2 -d @ | sort -u > PAGES.list~
            # Save the accession numbers from the first page.
            lynx -dump "$INPUT" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ > "$OUTPUT.list"
            # Append the accession numbers from the following pages.
            while read -r LINE; do
                lynx -dump "$LINE" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ >> "$OUTPUT.list"
            done < PAGES.list~
            # Remove the redundant accession numbers.
            sort -u "$OUTPUT.list" > "$OUTPUT.clean.list"
            # Remove the temporary file.
            rm PAGES.list~
            # Count the accession numbers.
            echo
            echo
            echo "Total GenBank Accession numbers: " $(wc -l < "$OUTPUT.clean.list")
            # Find and download the sequences into a multi-FASTA file (version suffixes are stripped first).
            echo "Download of the sequences..."
            cut -f1 -d "." "$OUTPUT.clean.list" > TEMP.list~
            epost -db protein -format acc -input TEMP.list~ | efetch -format fasta > "$OUTPUT.fasta"
            echo "Total sequences downloaded: "$(grep -c "^>" "$OUTPUT.fasta")
            # Collect the accession numbers of the downloaded sequences.
            grep ">" "$OUTPUT.fasta" | sed 's/>//g' | cut -f1 -d " " | cut -f1 -d "." > downloaded_fasta.list~
            # For each requested accession number, count its exact matches among the downloaded ones.
            cut -f1 -d "." "$OUTPUT.clean.list" | while read -r LINE; do grep -cxF -- "$LINE" downloaded_fasta.list~; done > "$OUTPUT.value.list~"
            paste "$OUTPUT.clean.list" "$OUTPUT.value.list~" > "$OUTPUT.count.list~"
            # Keep the accession numbers whose match count is exactly 0.
            awk -F '\t' '$2 == 0 { print $1 }' "$OUTPUT.count.list~" > "$OUTPUT.nomatch.list"
            echo "Sequences without match in NCBI: "$(wc -l < "$OUTPUT.nomatch.list")
            # Remove the temporary files.
            rm TEMP.list~ downloaded_fasta.list~ "$OUTPUT.value.list~" "$OUTPUT.count.list~"
            # Definitions block.
            echo "---------"
            echo "$OUTPUT.list --> Full list of accession numbers from CAZy"
            echo "$OUTPUT.clean.list --> List of accession numbers without redundancies"
            echo "$OUTPUT.fasta --> Multi-FASTA file of the sequences"
            echo "$OUTPUT.nomatch.list --> List of accession numbers not found in the NCBI database."
            break
            ;;
#option 2 - Kingdom

        "Kingdom")
            # Find the links to the other pages of the family and write them to a temporary file.
            echo "Accession numbers extraction..."
            lynx -dump "$INPUT" | grep "TAXO" | sed 's/http/@http/g' | cut -f2 -d @ | sort -u > PAGES.list~
            # Save the accession numbers from the first page.
            lynx -dump "$INPUT" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ > "$OUTPUT.list"
            # Append the accession numbers from the following pages.
            while read -r LINE; do
                lynx -dump "$LINE" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ >> "$OUTPUT.list"
            done < PAGES.list~
            # Remove the redundant accession numbers.
            sort -u "$OUTPUT.list" > "$OUTPUT.clean.list"
            # Remove the temporary file.
            rm PAGES.list~
            # Count the accession numbers.
            echo
            echo
            echo "Total GenBank Accession numbers: " $(wc -l < "$OUTPUT.clean.list")
            # Find and download the sequences into a multi-FASTA file (version suffixes are stripped first).
            echo "Download of the sequences..."
            cut -f1 -d "." "$OUTPUT.clean.list" > TEMP.list~
            epost -db protein -format acc -input TEMP.list~ | efetch -format fasta > "$OUTPUT.fasta"
            echo "Total sequences downloaded: "$(grep -c "^>" "$OUTPUT.fasta")
            # Collect the accession numbers of the downloaded sequences.
            grep ">" "$OUTPUT.fasta" | sed 's/>//g' | cut -f1 -d " " | cut -f1 -d "." > downloaded_fasta.list~
            # For each requested accession number, count its exact matches among the downloaded ones.
            cut -f1 -d "." "$OUTPUT.clean.list" | while read -r LINE; do grep -cxF -- "$LINE" downloaded_fasta.list~; done > "$OUTPUT.value.list~"
            paste "$OUTPUT.clean.list" "$OUTPUT.value.list~" > "$OUTPUT.count.list~"
            # Keep the accession numbers whose match count is exactly 0.
            awk -F '\t' '$2 == 0 { print $1 }' "$OUTPUT.count.list~" > "$OUTPUT.nomatch.list"
            echo "Sequences without match in NCBI: "$(wc -l < "$OUTPUT.nomatch.list")
            # Remove the temporary files.
            rm TEMP.list~ downloaded_fasta.list~ "$OUTPUT.value.list~" "$OUTPUT.count.list~"
            # Definitions block.
            echo "---------"
            echo "$OUTPUT.list --> Full list of accession numbers from CAZy"
            echo "$OUTPUT.clean.list --> List of accession numbers without redundancies"
            echo "$OUTPUT.fasta --> Multi-FASTA file of the sequences"
            echo "$OUTPUT.nomatch.list --> List of accession numbers not found in the NCBI database."
            break
            ;;
#option 3 - Characterized 

        "Characterized")
            # Find the links to the other pages of the family and write them to a temporary file.
            echo "Accession numbers extraction..."
            lynx -dump "$INPUT" | grep "FUNC" | sed 's/http/@http/g' | cut -f2 -d @ | sort -u > PAGES.list~
            # Save the accession numbers from the first page.
            lynx -dump "$INPUT" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ > "$OUTPUT.list"
            # Append the accession numbers from the following pages.
            while read -r LINE; do
                lynx -dump "$LINE" | grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ >> "$OUTPUT.list"
            done < PAGES.list~
            # Remove the redundant accession numbers.
            sort -u "$OUTPUT.list" > "$OUTPUT.clean.list"
            # Remove the temporary file.
            rm PAGES.list~
            # Count the accession numbers.
            echo
            echo
            echo "Total GenBank Accession numbers: " $(wc -l < "$OUTPUT.clean.list")
            # Find and download the sequences into a multi-FASTA file (version suffixes are stripped first).
            echo "Download of the sequences..."
            cut -f1 -d "." "$OUTPUT.clean.list" > TEMP.list~
            epost -db protein -format acc -input TEMP.list~ | efetch -format fasta > "$OUTPUT.fasta"
            echo "Total sequences downloaded: "$(grep -c "^>" "$OUTPUT.fasta")
            # Collect the accession numbers of the downloaded sequences.
            grep ">" "$OUTPUT.fasta" | sed 's/>//g' | cut -f1 -d " " | cut -f1 -d "." > downloaded_fasta.list~
            # For each requested accession number, count its exact matches among the downloaded ones.
            cut -f1 -d "." "$OUTPUT.clean.list" | while read -r LINE; do grep -cxF -- "$LINE" downloaded_fasta.list~; done > "$OUTPUT.value.list~"
            paste "$OUTPUT.clean.list" "$OUTPUT.value.list~" > "$OUTPUT.count.list~"
            # Keep the accession numbers whose match count is exactly 0.
            awk -F '\t' '$2 == 0 { print $1 }' "$OUTPUT.count.list~" > "$OUTPUT.nomatch.list"
            echo "Sequences without match in NCBI: "$(wc -l < "$OUTPUT.nomatch.list")
            # Remove the temporary files.
            rm TEMP.list~ downloaded_fasta.list~ "$OUTPUT.value.list~" "$OUTPUT.count.list~"
            # Definitions block.
            echo "---------"
            echo "$OUTPUT.list --> Full list of accession numbers from CAZy"
            echo "$OUTPUT.clean.list --> List of accession numbers without redundancies"
            echo "$OUTPUT.fasta --> Multi-FASTA file of the sequences"
            echo "$OUTPUT.nomatch.list --> List of accession numbers not found in the NCBI database."
            break
            ;;
#option 4 - exit command
    "EXIT")
            break
            ;;
        *) echo Invalid option ;;
    esac
done
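
To see what the extraction step of the script does without touching the network, the same grep/sed/cut pipeline can be tried on a few lines of text. This is only a sketch: `extract_accessions` is a hypothetical helper, and the sample below is made up to mimic the link list that `lynx -dump` is assumed to print. Only the `entrez` and `protein&val=` parts of the links are taken from the script itself; the rest of the URL shape and the accession numbers are illustrative.

```shell
#!/bin/bash
# extract_accessions is a hypothetical helper that isolates the extraction
# step of the script: it reads a lynx dump on stdin and prints the unique
# accession numbers, one per line.
extract_accessions() {
    # Keep the NCBI links, mark the start of the accession with "@",
    # keep what follows the "@", then drop duplicates.
    grep "entrez" | sed 's/protein&val=/@/g' | cut -f2 -d @ | sort -u
}

# Made-up sample in the assumed format of a lynx link list.
sample='  10. http://www.cazy.org/
  11. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=AAA00001.1
  12. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=BBB00002.2
  13. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=AAA00001.1'

printf '%s\n' "$sample" | extract_accessions
```

With this sample, the pipeline prints `AAA00001.1` and `BBB00002.2`: the CAZy link is discarded because it does not contain `entrez`, and the repeated accession number is collapsed by `sort -u`.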

Andrea Strazzulli, PhD
