9. Orthology and Phylogeny¶
Warning
Since 2020, none of the internal links are functioning. Please use the Dropbox links in the Downloads section.
9.1. Preface¶
In this section you will use some software to find orthologue genes and do phylogenetic reconstructions.
9.2. Learning outcomes¶
After studying this tutorial you should be able to:
Use bioinformatics software to find orthologues in the NCBI database.
Use bioinformatics software to perform sequence alignment.
Use bioinformatics software to perform phylogenetic reconstructions.
9.3. Before we start¶
Lets see how our directory structure looks so far:
cd ~/analysis
ls -1F
annotation/
assembly/
data/
kraken/
mappings/
trimmed/
trimmed-fastqc/
variants/
Make a directory for the phylogeny results (in your analysis directory):
mkdir phylogeny
Download the fasta file of the S. cerevisiase TEF2 gene to the phylogeny folder:
cd phylogeny
curl -O http://compbio.massey.ac.nz/data/203341/s_cerev_tef2.fas
Note
Should the download fail, download manually from Downloads.
9.4. Installing the software¶
# activate the env
conda activate ngs
conda install blast
This will install a BLAST executable that you can use to remotely query the NCBI database.
conda install muscle
This will install MUSCLE, alignment program that you can use to align nucleotide or protein sequences.
We will also install RAxML-NG, a phylogenetic tree inference tool, which uses maximum-likelihood (ML) optimality criterion. However, there is no conda repository for it yet. Thus, we need to download it manually.
wget
https://github.com/amkozlov/raxml-ng/releases/download/0.5.1/raxml-ng_v0.5.1b_linux_x86_64.zip
unzip raxml-ng_v0.5.1b_linux_x86_64.zip
rm raxml-ng_v0.5.1b_linux_x86_64.zip
9.5. Finding orthologues using BLAST¶
We will first make a BLAST database of our current assembly so that we can
find the orthologous sequence of the S. cerevisiae gene.
To do this, we run the command makeblastdb
:
# create blast db
makeblastdb –in ../assembly/spades_final/scaffolds.fasta –dbtype nucl
To run BLAST, we give it:
-db
: The name of the database that we are BLASTing-query
: A fasta format input fileA name for the output files
Some notes about the format we want
First, we blast without any formatting:
blastn –db ../assembly/spades_final/scaffolds.fasta –query s_cerev_tef2.fas > blast.out
This should output a file with a set of BLAST hits similar to what you might see on the BLAST web site.
Read through the output (e.g. using nano
) to see what the results of your BLAST run was.
Next we will format the output a little so that it is easier to deal with.
blastn –db ../assembly/spades_final/scaffolds.fasta –query s_cerev_tef2.fas –evalue 1e-100 –outfmt “6 length sseq” > blast_formatted.out
This will yield a file that has only the sequences of the subject, so that we can later add those to other fasta files.
However, the formatting is not perfect.
To adjust the format such that it is fasta format, open the file in an editor (e.g. nano
) and edit the first line so that it has a name for your sequence.
You should know the general format of a fasta-file (e.g. the first line start with a “>”).
Hint
To edit in vi
editor, you will need to press the escape key and “a” or “e”.
To save in vi
, you will need to press the escape key and “w” (write).
To quit vi
, you will need to press the escape key and “q” (quit).
Next, you have to replace the dashes (signifying indels in the BLAST result).
This can easily be done in vi
:
Press the escape key, followed by: :%s/\-//g
Now we will BLAST a remote database to get a list of hits that are already in the NCBI database.
Note
It turns out you may not be able to access this database from within BioLinux. In such a case, download the file named blast.fas
and place it into your ~/analysis/phylogeny/
directory.
curl -O http://compbio.massey.ac.nz/data/203341/blast_u.fas
Append the fasta file of your yeast sequence to this file, using whatever set of commands you wish/know.
Note
Should the download fail, download manually from Downloads.
9.6. Performing an alignment¶
We will use MUSCLE to perform our alignment on all the sequences in the BLAST fasta file. This syntax is very simple (change the filenames accordingly):
muscle –in infile.fas –out your_alignment.aln
9.7. Building a phylogeny¶
We will use RAxML-NG to build our phylogeny. This uses a maximum likelihood method to infer parameters of evolution and the topology of the tree. Again, the syntx of the command is fairly simple, except you must make sure that you are using the directory in which RAxML-NG sits.
The arguments are:
-s
: an alignment file-m
: a model of evolution. In this case we will use a general time reversible model with gamma distributed rates (GTR+GAMMA)-n
: outfile-name-p
: specify a random number seed for the parsimony inferences
raxmlHPC -s your_alignment.aln -m GTRGAMMA –n yeast_tree –p 12345
9.8. Visualizing the phylogeny¶
We will use the online software Interactive Tree of Life (iTOL) to visualize the tree.
Navigate to this homepage.
Open the file containing your tree (*bestTree.out
), copy the contents, and paste into the web page (in the Tree text box).
You should then be able to zoom in and out to see where your yeast taxa is. To find out the closest relative, you will have to use the NCBI taxa page.
Todo
Are you certain that the yeast are related in the way that the phylogeny suggests? Why might the topology of this phylogeny not truly reflect the evolutionary history of these yeast species?