9. Orthology and Phylogeny

Warning

Since 2020, none of the internal links are functioning. Please use the Dropbox links in the Downloads section.

9.1. Preface

In this section you will use some software to find orthologue genes and do phylogenetic reconstructions.

9.2. Learning outcomes

After studying this tutorial you should be able to:

  1. Use bioinformatics software to find orthologues in the NCBI database.

  2. Use bioinformatics software to perform sequence alignment.

  3. Use bioinformatics software to perform phylogenetic reconstructions.

9.3. Before we start

Lets see how our directory structure looks so far:

cd ~/analysis
ls -1F
annotation/
assembly/
data/
kraken/
mappings/
trimmed/
trimmed-fastqc/
variants/

Make a directory for the phylogeny results (in your analysis directory):

mkdir phylogeny

Download the fasta file of the S. cerevisiase TEF2 gene to the phylogeny folder:

cd phylogeny
curl -O http://compbio.massey.ac.nz/data/203341/s_cerev_tef2.fas

Note

Should the download fail, download manually from Downloads.

9.4. Installing the software

# activate the env
conda activate ngs

conda install blast

This will install a BLAST executable that you can use to remotely query the NCBI database.

conda install muscle

This will install MUSCLE, alignment program that you can use to align nucleotide or protein sequences.

We will also install RAxML-NG, a phylogenetic tree inference tool, which uses maximum-likelihood (ML) optimality criterion. However, there is no conda repository for it yet. Thus, we need to download it manually.

wget
https://github.com/amkozlov/raxml-ng/releases/download/0.5.1/raxml-ng_v0.5.1b_linux_x86_64.zip

unzip raxml-ng_v0.5.1b_linux_x86_64.zip

rm raxml-ng_v0.5.1b_linux_x86_64.zip

9.5. Finding orthologues using BLAST

We will first make a BLAST database of our current assembly so that we can find the orthologous sequence of the S. cerevisiae gene. To do this, we run the command makeblastdb:

# create blast db
makeblastdb –in ../assembly/spades_final/scaffolds.fasta –dbtype nucl

To run BLAST, we give it:

  • -db: The name of the database that we are BLASTing

  • -query: A fasta format input file

  • A name for the output files

  • Some notes about the format we want

First, we blast without any formatting:

blastn –db ../assembly/spades_final/scaffolds.fasta –query s_cerev_tef2.fas > blast.out

This should output a file with a set of BLAST hits similar to what you might see on the BLAST web site.

Read through the output (e.g. using nano) to see what the results of your BLAST run was.

Next we will format the output a little so that it is easier to deal with.

blastn –db ../assembly/spades_final/scaffolds.fasta –query s_cerev_tef2.fas –evalue 1e-100 –outfmt “6 length sseq” > blast_formatted.out

This will yield a file that has only the sequences of the subject, so that we can later add those to other fasta files. However, the formatting is not perfect. To adjust the format such that it is fasta format, open the file in an editor (e.g. nano) and edit the first line so that it has a name for your sequence. You should know the general format of a fasta-file (e.g. the first line start with a “>”).

Hint

To edit in vi editor, you will need to press the escape key and “a” or “e”. To save in vi, you will need to press the escape key and “w” (write). To quit vi, you will need to press the escape key and “q” (quit).

Next, you have to replace the dashes (signifying indels in the BLAST result). This can easily be done in vi: Press the escape key, followed by: :%s/\-//g

Now we will BLAST a remote database to get a list of hits that are already in the NCBI database.

Note

It turns out you may not be able to access this database from within BioLinux. In such a case, download the file named blast.fas and place it into your ~/analysis/phylogeny/ directory.

curl -O http://compbio.massey.ac.nz/data/203341/blast_u.fas

Append the fasta file of your yeast sequence to this file, using whatever set of commands you wish/know.

Note

Should the download fail, download manually from Downloads.

9.6. Performing an alignment

We will use MUSCLE to perform our alignment on all the sequences in the BLAST fasta file. This syntax is very simple (change the filenames accordingly):

muscle –in infile.fas –out your_alignment.aln

9.7. Building a phylogeny

We will use RAxML-NG to build our phylogeny. This uses a maximum likelihood method to infer parameters of evolution and the topology of the tree. Again, the syntx of the command is fairly simple, except you must make sure that you are using the directory in which RAxML-NG sits.

The arguments are:

  • -s: an alignment file

  • -m: a model of evolution. In this case we will use a general time reversible model with gamma distributed rates (GTR+GAMMA)

  • -n: outfile-name

  • -p: specify a random number seed for the parsimony inferences

raxmlHPC -s your_alignment.aln -m GTRGAMMA –n yeast_tree –p 12345

9.8. Visualizing the phylogeny

We will use the online software Interactive Tree of Life (iTOL) to visualize the tree. Navigate to this homepage. Open the file containing your tree (*bestTree.out), copy the contents, and paste into the web page (in the Tree text box).

You should then be able to zoom in and out to see where your yeast taxa is. To find out the closest relative, you will have to use the NCBI taxa page.

Todo

Are you certain that the yeast are related in the way that the phylogeny suggests? Why might the topology of this phylogeny not truly reflect the evolutionary history of these yeast species?