Created by Evgenii Baiakhmetov
The first genome of Arabidopsis thaliana was sequenced back in 2000 by the Arabidopsis Genome Initiative. The length of this assembled genome was 115,409,949 bp (at least 10% occupied by transposable elements) and contained 25,498 genes. Nonetheless, an approximate genome length is to be evaluated as 134,634,692 bp.
This short tutorial covers some basic analyses aiming to perform a chromosome-level scaffolding and annotation of a diploid plant (2n=2x=10) A. thaliana using primarily data produced by third generation sequencing.
Please note, all data used here was produced by Wang B et al., 2021. Please also take into account that this tutorial does not intend to reproduce all analyses from the above-mentioned article in detail.
After completing this tutorial you will be able to:
Bioinformatics, ONT sequencing, HiFi sequencing, Illumina sequencing, Hi-C sequencing, RNA sequencing, genome assembly, genome annotation, data visualisation.
The tutorial is supposed to be run under a Unix operating system. For Windows users, I propose to use a combination of PuTTY/Visual Studio Code (VSC) to access a server/cluster and WinSCP to transfer files between Windows and Unix. Alternatively, you may install Unix in a virtual machine, e.g., Ubuntu 20.04 LTS on VirtualBox. Personally, I use (1) PuTTY to access a server, (2) VSC to remotely open GUI applications and (3) WinSCP to transfer files (see Figure 1). Please note, you can also open GUI applications with PuTTY.
Figure 1. Top left - PuTTY interface, top right - WinSCP interface and bottom - VSC interface.
Please share your experience if you were able to run all analyses on a PC.
nohup wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh > conda_download.out &
Please notice that we will frequently use a combination of
nohup
at the beginning and &
(the
ampersand sign) at the end of the command. This combination keeps you in
the current shell (bash/zsh) and you are able to run another process. To
sum up:
nohup
keeps your process
running even after exiting the shell.
wget
is a package
for retrieving files via HTTP, HTTPS and FTP.
https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh
is a downloadable file.
>
redirects a standard
output to a file.
conda_download.out
is a redirected
output.
&
runs a process in the background using a
subshell.
Next, we need to install Anaconda
(it comes with Conda):
bash Anaconda3-2021.11-Linux-x86_64.sh
We may also need to update conda:
conda update --all
conda update conda
Additionally, let’s add channels to conda for searching
packages:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n Assembly_tutorial python=3.9
conda
calls the ‘conda’ tool
itself.
create
a command to create an
environment.
-n
a name flag.
Assembly_tutorial
a name of your environment.
python=3.9
creates an environment with a specific version
of Python.
Now, to access the newly created environment we have to type:
conda activate Assembly_tutorial
If you want to deactivate your environment, type:
conda deactivate
If you have several environments, but forgot their names, use:
conda info --envs
conda activate NAME_OF_YOUR_ENVIRONMENT
sudo apt install htop
is an
interactive process viewer.
pip install paralleltask
is a parallel task
engine.
conda install -c conda-forge gnu-coreutils=9.1
provides basic utilities of
the GNU operating system.
conda install -c bioconda flye=2.9
is a de novo
assembler for single-molecule sequencing reads.
conda install -c bioconda necat=0.0.1_update20200803
is an error correction and
de novo assembly tool for long reads.
conda install -c bioconda 3d-dna=201008
is a 3D de
novo assembly (3D DNA) pipeline.
conda install -c bioconda assembly-stats=1.0.1
provides assembly
statistics from FASTA and FASTQ files.
conda install -c bioconda minimap2=2.24
is a versatile pairwise aligner
for genomic and spliced nucleotide sequences.
conda install -c bioconda spoa=4.0.7
is a partial order alignment
algorithm.
conda install -c bioconda fastp=0.23.2
is a tool designed to
provide fast all-in-one preprocessing for FastQ files.
conda install -c bioconda fastqc=0.11.9
is a
quality control tool for high throughput sequence data.
conda install -c bioconda samtools=1.15.1
is a set of tools for
manipulating next-generation sequencing data.
conda install -c bioconda blast=2.12.0
is a
new suite of BLAST tools that utilizes the NCBI C++ Toolkit.
conda install -c bioconda seqtk=1.3
is a fast and lightweight tool for
processing sequences in the FASTA or FASTQ format.
conda install -c bioconda bwa=0.7.17
is a Burrow-Wheeler Aligner for
short-read alignments.
conda install -c bioconda dgenies=1.3.1
is a dot plot tool
designed to compare two genomes.
conda install -c bioconda kmer-jellyfish=2.3.0
is a tool for fast,
memory-efficient counting of k-mers in DNA.
conda install -c bioconda repeatmodeler=2.0.3
is a de
novo repeat family identification and modeling package.
conda install -c bioconda repeatmasker=4.1.2.p1
is a program that screens DNA
sequences for transposable elements.
conda install -c bioconda braker2=2.1.6
is a pipeline for fully
automated prediction of protein coding gene structures.
conda install -c bioconda/label/cf201901 sra-tools=2.11.0
is a collection of tools and
libraries for using SRA data.
conda install -c bioconda trim-galore=0.6.7
is a wrapper script
to automate quality, adapter trimming and quality control.
conda install -c bioconda/label/cf201901 hisat2=2.2.1
is an alignment program
for mapping reads to a reference genome.
conda install -c bioconda/label/cf201901 busco=5.3.2
provides measures for quantitative
assessment of genome assemblies.
conda install -c bioconda igv=2.13.1
is a
visualisation tool for genomics data and annotations.
conda install -c bioconda/label/cf201901 circos=0.69.8
is
a package for visualising data in a
circular layout.
conda install -c conda-forge gawk=5.1.0
this utility makes it easy to
handle simple data-reformatting jobs.
conda install -c conda-forge ncurses=6.3
is a free
software emulation of curses in System V Release 4.0 (SVr4) and
more.
conda install -c bioconda qualimap=2.2.2d
evaluates next generation
sequencing alignment data.
conda install -c bioconda quast=5.2.0
provides tools for genome assemblies
evaluation and comparison.
conda install -c bioconda syri=1.6
is a synteny and
rearrangement identifier between
whole-genome assemblies.
conda install -c bioconda plotsr=0.5.4
is a visualiser for
structural annotations between multiple genomes.
Please note, you may ignore some conda
packages; however, there is also a possibility that you may need to
install some additional packages.
NextDenovo is a fast and accurate de novo assembler for long reads.
Juicer 1.6 is a one-click system for analysing loop-resolution Hi-C experiments. The installation process is here.
Juicebox is a visualisation and analysis software for Hi-C data.
GeneMark-ES/ET/EP v. 4.69_lic (it should also contain ProtHint v. 2.6.0) is a part of Braker2 pipelines.
TSEBRA is a transcript selector software for BRAKER2.
Please try to organise your files in a
proper way. For instance, let’s create a new folder for your
downloads:
mkdir Downloads
Open the newly created folder:
cd Downloads
And download two reference genomes of A. thaliana (the first one produced by Wang B et al., 2021 and the second one that was submitted to the GenBank by The Arabidopsis Information Resource (TAIR). Later on, we will be comparing our assembly with these reference genomes:
nohup wget https://download.cncb.ac.cn/gwh/Plants/Arabidopsis_thaliana_AT_GWHBDNP00000000/GWHBDNP00000000.genome.fasta.gz &
nohup wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/latest_assembly_versions/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz &
Then, unzip these archives:
gzip -d GWHBDNP00000000.genome.fasta.gz
gzip -d GCF_000001735.4_TAIR10.1_genomic.fna.gz
Next, we will create some more folders for Illumina, Hi-C, Hi-Fi, ONT
and RNAseq data:
mkdir Illumina
mkdir HiC
mkdir HiFi
mkdir ONT
mkdir RNAseq
Subsequently, we have to download the following files:
cd Illumina
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302670/CRR302670_f1.fastq.gz > CRR302670_f1.out &
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302670/CRR302670_r2.fastq.gz > CRR302670_r2.out &
cd -
it will navigate you to the previous directory
Downloads
cd HiC
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302669/CRR302669_f1.fastq.gz > CRR302669_f1.out &
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302669/CRR302669_r2.fastq.gz > CRR302669_r2.out &
cd -
cd HiFi
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302668/CRR302668.fastq.gz > CRR302668.out &
cd -
cd ONT
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302667/CRR302667.fastq.gz > CRR302667.out &
cd -
cd RNAseq
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581356/SRR3581356 > wget_SRR3581356.out &
root
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581693/SRR3581693 > wget_SRR3581693.out &
flower
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581681/SRR3581681 > wget_SRR3581681.out &
leaf
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581705/SRR3581705 > wget_SRR3581705.out &
internode
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581706/SRR3581706 > wget_SRR3581706.out &
seed
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581708/SRR3581708 > wget_SRR3581708.out &
silique
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581703/SRR3581703 > wget_SRR3581703.out &
pedicel
cd -
Please also consider to use soft
links for these files. For example, one may create a new folder
Assembly
and create a soft/symbolic link there:
cd
redirect to the home directory.
mkdir Assembly
create a new folder.
cd Assembly
open the folder.
ln -s /home/Downloads/ONT/CRR302667.fastq.gz CRR302667.fastq.gz
create a symbolic link for ONT reads.
ls
will list all
files in this directory.
We will correct the ONT reads with NECAT:
cd /home/Downloads/ONT/
move
to the ONT
folder.
cat > read_list.txt
create a .txt file read_list
.
/home/Downloads/ONT/CRR302667.fastq.gz
type the path to the
ONT reads and save read_list.txt
.
necat config arabidopsis_config.txt
create a config file
template.
nano arabidopsis_config.txt
modify the
config file with the relative information, e.g.:
PROJECT=Arabidopsis
ONT_READ_LIST=read_list.txt
GENOME_SIZE=135000000
THREADS=16
MIN_READ_LENGTH=10000
PREP_OUTPUT_COVERAGE=40
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true.
After modifying
arabidopsis_config.txt
, press
CTRL
+O
to save the file and
CTRL
+X
to exit the editing mode.
Alternatively, if you are not familiar with Unix, you can modify
arabidopsis_config.txt
via VSC or WinSCP.
Please note, we will keep only the longest
30x corrected reads for the downstream assembly.
We will use NextDenovo to assemble the genome.
cd Assembly
move to the
Assembly
folder
ln /home/Downloads/ONT/Arabidopsis/1-consensus/cns_final.fasta.gz > input.fofn
prepare an input file with the corrected ONT reads and update
run.cfg
, e.g.:
[General]
job_type = local # local, slurm, sge,
pbs, lsf
job_prefix = nextDenovo
task = assemble # all,
correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 16 # number of tasks used to run in parallel
input_type = corrected # raw, corrected
read_type = ont # clr, ont,
hifi
input_fofn = input.fofn
workdir = /home/Assembly/
[correct_option]
read_cutoff = 5k
genome_size = 135m # estimated genome size
sort_options = -m 10g -t 10
minimap2_options_raw = -t 8
pa_correction = 3 # number of corrected tasks used to run in parallel,
each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory
usage.
correction_options = -p 15
[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1
nohup nextDenovo run.cfg &
run the program.
assembly-stats /home/Assembly/nextDenovo/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph1/nd.asm.p.fasta > NextDenovo_ONT_corr_30x.txt
get a brief overview of the assembly:
stats for
nd.asm.p.fasta
sum = 122664679, n = 15, ave = 8177645.27, largest =
16219875
N50 = 15030013, n = 4
N60 = 14767803, n = 5
N70
= 13827629, n = 6
N80 = 11316166, n = 7
N90 = 9383066, n =
8
N100 = 251324, n = 15
N_count = 0
Gaps = 0
In this step, we may need to remove
contaminants, e.g., chloroplast, mitochondrial and/or bacterial
sequences.
ln -s /home/Assembly/nextDenovo/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph1/nd.asm.p.fasta Assembly_NextDenovo_ONT_15_fragments.fasta
create a soft link for the assembly.egrep 'loop' Assembly_NextDenovo_ONT_15_fragments.fasta
check if the final assembly contains circular fragments: >ctg000013 type:s:loop
length:i:304269 node:i:16
>ctg000014 type:s:loop length:i:251324
node:i:7
We will remove these fragments from the assembly. Firstly, let’s
create a list with the headers of all contigs:
egrep '>' Assembly_NextDenovo_ONT_15_fragments.fasta > name.lst
you can find some basics on grep/egrep here.
sed 's/^.//' name.lst > name1.lst
remove
>
from the list. More on sed is here.
sed '14,15d' name1.lst > 13_fragments.lst
remove two
looped fragments from the list. Here is one
more useful link.
seqtk subseq Assembly_NextDenovo_ONT_15_fragments.fasta 13_fragments.lst > Assembly_NextDenovo_ONT_13_fragments.fasta
extract the linear sequences from the assembly.
Just out of curiosity, you may extract the looped sequences in the same manner and BLAST them to verify if these fragments do represent chloroplast, mitochondrial and bacterial sequences.
In order to improve the draft genome by
searching misassemblies and other inconsistencies and correcting them,
we will polish our assembly with the highly accurate long HiFi
reads.
cd /home/
mkdir Nextpolish
cd Nextpolish
ln -s /home/Assembly/Assembly_NextDenovo_ONT_13_fragments.fasta Assembly_ONT.fasta
ls /home/Downloads/HiFi/CRR302668.fastq.gz > hifi.fofn
prepare an input file with the HiFi reads and create a
run.cfg
file, e.g.:
[General]
job_type = local
job_prefix =
nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome =
/home/Nextpolish/Assembly_ONT.fasta #genome file
genome_size =
auto
workdir = /home/Nextpolish/
polish_options = -p
{multithread_jobs}
[hifi_option]
hifi_fofn =
/home/Nextpolish/hifi.fofn
hifi_options = -min_read_len 10k
-max_read_len 45k -max_depth 150
hifi_minimap2_options = -x
map-hifi
nohup nextPolish run.cfg &
run the program.
Let’s consider a case when we have a reference genome. We will use D-Genies that
provides a synthetic similarity overview. This software can be run locally or online. All you need is
to provide a reference and your newly assembled genome. Here, we will
compare genome.nextpolish.fasta
, which we got in the
previous step, and GWHBDNP00000000.genome.fasta.gz
that was
downloaded before. In Figure 2 you can see (A)
comparisons ‘Reference’ vs ‘Reference’ and (B) ‘Reference’ vs
‘genome.nextpolish’.
Figure 2. Synthetic similarities produced by D-Genies. (A) Reference vs Reference. (B) Reference vs Polished genome.
Notice, the first chromosome is
represented by contigs 4 and 10 (shown in Figure 2);
the second chromosome by contigs 7 and 11; the third chromosome by
contigs 2 and 12; the fourth chromosome by contigs 5 and 8; the fifth
chromosome by contigs 3 and 6.
Now we need to order our contigs into chromosomes according to
Figure 2B. Here is a potential scheme for Chromosome
1:
touch ctg000010.txt
create a text file.
nano ctg000010.txt
write the contig name
ctg000010
and save the file.
seqtk subseq genome.nextpolish.fasta ctg000010.txt > ctg000010.fasta
extract the contig ctg000010
that we want to reverse.
seqtk seq -r ctg000010.fasta > ctg000010_REV.fasta
reverse the contig.
touch ctg000004.txt
create a new
text file.
nano ctg000004.txt
write the contig name
ctg000004
and save the file.
seqtk subseq genome.nextpolish.fasta ctg000004.txt > ctg000004.fasta
extract the contig ctg000004
that represents the first
chromosome.
paste -d '' ctg000004.fasta ctg000010_REV.fasta > Chromosome_1.fasta
concatenate two contigs representing Chromosome 1.
nano Chromosome_1.fasta
you may want to edit the
header.
In the end, you should get the following result (Figure 3):
Figure 3. Synthetic similarities produced by D-Genies. (A) Reference vs Reference. (B) Reference vs Scaffolded genome. Red ellipses indicate the missing sequences in the centromeric regions.
Note, here we will not fill the remaining gaps in the centromeric regions. Please see an approach proposed in the original article (Replacing ONT-HiC-assemblies with HiFi-contigs).
If you have genome-wide chromatin interactions data (Hi-C), you can scaffold contigs into chromosomes without a reference genome.
You may want to know more about the
workflow. Here are some useful links:
The
Genome Assembly Cookbook provides a strategy on the
chromosome-length genome assemblies in detail.
The Aiden Lab YouTube
channel provides a brief overview on Juicebox Assembly
Tools.
IMPORTANT! You have to
properly organise your directory structure for running Juicer. Please
try to follow the
instruction.
bwa index genome.nextpolish.fasta
builds an index for the
genome.
nohup juicer.sh -D /juicer/scripts/ -g AT_2022 -z /home/juicer/references/genome.nextpolish.fasta -p contig.sizes -s none -d /home/juicer/work/AT_2022 -t 16 &
runs Juicer.
nohup
keeps your process
running even after exiting the shell.
juicer.sh
the
executed script.
-D
a path to the Juicer tools.
-g
the genomeID.
-z
a genome file.
-p
a file containing chromosome/contig
sizes*
.
-s
a restriction enzyme
file.
-d
a path to the directory containing the Hi-C
data (a fastq
folder).
-t
a number of
threads used.
&
runs a process in the background
using a subshell.
*
- you may get these sizes from
genome.nextpolish.fasta.fai
:
cut -f1,2 genome.nextpolish.fasta.fai > contig.sizes
Next, we will generate a candidate assembly with 3D de novo
assembly (3D-DNA)
pipeline:
nohup 3d-dna /home/juicer/references/genome.nextpolish.fasta /home/juicer/work/AT_2022/aligned/merged_nodups.txt &
Open
the .final.hic
file in Juicebox
and load the .final.assembly
file. You will need to examine
and rectify the assembly if necessary (see Figure
4).