[I] Preface

In a nutshell

The first genome of Arabidopsis thaliana was sequenced back in 2000 by the Arabidopsis Genome Initiative. The length of this assembled genome was 115,409,949 bp (at least 10% occupied by transposable elements) and contained 25,498 genes. Nonetheless, an approximate genome length is to be evaluated as 134,634,692 bp.

This short tutorial covers some basic analyses aiming to perform a chromosome-level scaffolding and annotation of a diploid plant (2n=2x=10) A. thaliana using primarily data produced by third generation sequencing.

Please note, all data used here was produced by Wang B et al., 2021. Please also take into account that this tutorial does not intend to reproduce all analyses from the above-mentioned article in detail.

Learning outcomes

After completing this tutorial you will be able to:

Assemble a genome of your interest using data produced by third generation sequencing, i.e., Oxford Nanopore Technologies (ONT) and Pacific Biosciences Circular Consensus Sequencing (CCS) reads.
Scaffold assembled contigs to the scale of full chromosomes using a reference genome.
Scaffold assembled contigs to the scale of full chromosomes using Hi-C data.
Fill sequencing gaps if necessary (pending).
Identify, annotate and mask transposable elements.
Annotate genomes using RNA-Seq and protein homology information.
Visualise the results.

Keywords

Bioinformatics, ONT sequencing, HiFi sequencing, Illumina sequencing, Hi-C sequencing, RNA sequencing, genome assembly, genome annotation, data visualisation.

DOI

[II] Software & sequencing data

Note for Windows users

The tutorial is supposed to be run under a Unix operating system. For Windows users, I propose to use a combination of PuTTY/Visual Studio Code (VSC) to access a server/cluster and WinSCP to transfer files between Windows and Unix. Alternatively, you may install Unix in a virtual machine, e.g., Ubuntu 20.04 LTS on VirtualBox. Personally, I use (1) PuTTY to access a server, (2) VSC to remotely open GUI applications and (3) WinSCP to transfer files (see Figure 1). Please note, you can also open GUI applications with PuTTY.

(See Figure 1)

Figure 1. Top left - PuTTY interface, top right - WinSCP interface and bottom - VSC interface.

Specs of the server

Operating system: Ubuntu 20.04.3 LTS.
Kernel: Linux 5.13.19-1-pve.
Architecture: x86-64.
CPU: Intel Xeon Silver 4114 (2.20 GHz), 8 cores available (16 threads).
RAM: DDR4 (2666 MHz), 128 GB.

Please share your experience if you were able to run all analyses on a PC.

Conda and setting an environment

[Step 1]

Let’s download Anaconda (a package manager):
nohup wget https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh > conda_download.out &

Please notice that we will frequently use a combination of nohup at the beginning and & (the ampersand sign) at the end of the command. This combination keeps you in the current shell (bash/zsh) and you are able to run another process. To sum up:

nohup keeps your process running even after exiting the shell.
wgetis a package for retrieving files via HTTP, HTTPS and FTP.
https://repo.anaconda.com/archive/Anaconda3-2021.11-Linux-x86_64.sh is a downloadable file.
> redirects a standard output to a file.
conda_download.out is a redirected output.
& runs a process in the background using a subshell.

[Step 2]

Next, we need to install Anaconda (it comes with Conda):

bash Anaconda3-2021.11-Linux-x86_64.sh

[Step 3]

We may also need to update conda:

conda update --all
conda update conda

Additionally, let’s add channels to conda for searching packages:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

[Step 4]

Now, it is time to create an environment where we will keep all required programs:
conda create -n Assembly_tutorial python=3.9

conda calls the ‘conda’ tool itself.
create a command to create an environment.
-n a name flag.
Assembly_tutorial a name of your environment.
python=3.9 creates an environment with a specific version of Python.

[Step 5]

Now, to access the newly created environment we have to type:
conda activate Assembly_tutorial

If you want to deactivate your environment, type:
conda deactivate

If you have several environments, but forgot their names, use:
conda info --envs
conda activate NAME_OF_YOUR_ENVIRONMENT

Packages installation

sudo apt install htop is an interactive process viewer.
pip install paralleltask is a parallel task engine.
conda install -c conda-forge gnu-coreutils=9.1 provides basic utilities of the GNU operating system.
conda install -c bioconda flye=2.9 is a de novo assembler for single-molecule sequencing reads.
conda install -c bioconda necat=0.0.1_update20200803 is an error correction and de novo assembly tool for long reads.
conda install -c bioconda 3d-dna=201008 is a 3D de novo assembly (3D DNA) pipeline.
conda install -c bioconda assembly-stats=1.0.1 provides assembly statistics from FASTA and FASTQ files.
conda install -c bioconda minimap2=2.24 is a versatile pairwise aligner for genomic and spliced nucleotide sequences.
conda install -c bioconda spoa=4.0.7 is a partial order alignment algorithm.
conda install -c bioconda fastp=0.23.2 is a tool designed to provide fast all-in-one preprocessing for FastQ files.
conda install -c bioconda fastqc=0.11.9 is a quality control tool for high throughput sequence data.
conda install -c bioconda samtools=1.15.1 is a set of tools for manipulating next-generation sequencing data.
conda install -c bioconda blast=2.12.0 is a new suite of BLAST tools that utilizes the NCBI C++ Toolkit.
conda install -c bioconda seqtk=1.3 is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
conda install -c bioconda bwa=0.7.17 is a Burrow-Wheeler Aligner for short-read alignments.
conda install -c bioconda dgenies=1.3.1 is a dot plot tool designed to compare two genomes.
conda install -c bioconda kmer-jellyfish=2.3.0 is a tool for fast, memory-efficient counting of k-mers in DNA.
conda install -c bioconda repeatmodeler=2.0.3 is a de novo repeat family identification and modeling package.
conda install -c bioconda repeatmasker=4.1.2.p1 is a program that screens DNA sequences for transposable elements.
conda install -c bioconda braker2=2.1.6 is a pipeline for fully automated prediction of protein coding gene structures.
conda install -c bioconda/label/cf201901 sra-tools=2.11.0 is a collection of tools and libraries for using SRA data.
conda install -c bioconda trim-galore=0.6.7 is a wrapper script to automate quality, adapter trimming and quality control.
conda install -c bioconda/label/cf201901 hisat2=2.2.1 is an alignment program for mapping reads to a reference genome.
conda install -c bioconda/label/cf201901 busco=5.3.2 provides measures for quantitative assessment of genome assemblies.
conda install -c bioconda igv=2.13.1 is a visualisation tool for genomics data and annotations.
conda install -c bioconda/label/cf201901 circos=0.69.8 is a package for visualising data in a circular layout.
conda install -c conda-forge gawk=5.1.0 this utility makes it easy to handle simple data-reformatting jobs.
conda install -c conda-forge ncurses=6.3 is a free software emulation of curses in System V Release 4.0 (SVr4) and more.
conda install -c bioconda qualimap=2.2.2d evaluates next generation sequencing alignment data.
conda install -c bioconda quast=5.2.0 provides tools for genome assemblies evaluation and comparison.
conda install -c bioconda syri=1.6 is a synteny and rearrangement identifier between whole-genome assemblies.
conda install -c bioconda plotsr=0.5.4 is a visualiser for structural annotations between multiple genomes.

Please note, you may ignore some conda packages; however, there is also a possibility that you may need to install some additional packages.

Additional software (not in Conda)

NextDenovo is a fast and accurate de novo assembler for long reads.
Juicer 1.6 is a one-click system for analysing loop-resolution Hi-C experiments. The installation process is here.
Juicebox is a visualisation and analysis software for Hi-C data.
GeneMark-ES/ET/EP v. 4.69_lic (it should also contain ProtHint v. 2.6.0) is a part of Braker2 pipelines.
TSEBRA is a transcript selector software for BRAKER2.

Downloading raw sequence data

Please try to organise your files in a proper way. For instance, let’s create a new folder for your downloads:
mkdir Downloads

Open the newly created folder:
cd Downloads

And download two reference genomes of A. thaliana (the first one produced by Wang B et al., 2021 and the second one that was submitted to the GenBank by The Arabidopsis Information Resource (TAIR). Later on, we will be comparing our assembly with these reference genomes:

nohup wget https://download.cncb.ac.cn/gwh/Plants/Arabidopsis_thaliana_AT_GWHBDNP00000000/GWHBDNP00000000.genome.fasta.gz &
nohup wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/plant/Arabidopsis_thaliana/latest_assembly_versions/GCF_000001735.4_TAIR10.1/GCF_000001735.4_TAIR10.1_genomic.fna.gz &

Then, unzip these archives:
gzip -d GWHBDNP00000000.genome.fasta.gz
gzip -d GCF_000001735.4_TAIR10.1_genomic.fna.gz

Next, we will create some more folders for Illumina, Hi-C, Hi-Fi, ONT and RNAseq data:
mkdir Illumina
mkdir HiC
mkdir HiFi
mkdir ONT
mkdir RNAseq

Subsequently, we have to download the following files:
cd Illumina
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302670/CRR302670_f1.fastq.gz > CRR302670_f1.out &
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302670/CRR302670_r2.fastq.gz > CRR302670_r2.out &
cd - it will navigate you to the previous directory Downloads

cd HiC
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302669/CRR302669_f1.fastq.gz > CRR302669_f1.out &
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302669/CRR302669_r2.fastq.gz > CRR302669_r2.out &
cd -

cd HiFi
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302668/CRR302668.fastq.gz > CRR302668.out &
cd -

cd ONT
nohup wget ftp://download.big.ac.cn/gsa/CRA004538/CRR302667/CRR302667.fastq.gz > CRR302667.out &
cd -

cd RNAseq
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581356/SRR3581356 > wget_SRR3581356.out & root
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581693/SRR3581693 > wget_SRR3581693.out & flower
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581681/SRR3581681 > wget_SRR3581681.out & leaf
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581705/SRR3581705 > wget_SRR3581705.out & internode
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581706/SRR3581706 > wget_SRR3581706.out & seed
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581708/SRR3581708 > wget_SRR3581708.out & silique
nohup wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR3581703/SRR3581703 > wget_SRR3581703.out & pedicel
cd -

Please also consider to use soft links for these files. For example, one may create a new folder Assembly and create a soft/symbolic link there:

cd redirect to the home directory.
mkdir Assembly create a new folder.
cd Assembly open the folder.
ln -s /home/Downloads/ONT/CRR302667.fastq.gz CRR302667.fastq.gz create a symbolic link for ONT reads.
ls will list all files in this directory.

[III] Genome assembly

[Step 1 | NECAT]

We will correct the ONT reads with NECAT:

cd /home/Downloads/ONT/ move to the ONT folder.
cat > read_list.txt create a .txt file read_list.
/home/Downloads/ONT/CRR302667.fastq.gz type the path to the ONT reads and save read_list.txt.
necat config arabidopsis_config.txt create a config file template.
nano arabidopsis_config.txt modify the config file with the relative information, e.g.:

PROJECT=Arabidopsis
ONT_READ_LIST=read_list.txt
GENOME_SIZE=135000000
THREADS=16
MIN_READ_LENGTH=10000
PREP_OUTPUT_COVERAGE=40
OVLP_FAST_OPTIONS=-n 500 -z 20 -b 2000 -e 0.5 -j 0 -u 1 -a 1000
OVLP_SENSITIVE_OPTIONS=-n 500 -z 10 -e 0.5 -j 0 -u 1 -a 1000
CNS_FAST_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
CNS_SENSITIVE_OPTIONS=-a 2000 -x 4 -y 12 -l 1000 -e 0.5 -p 0.8 -u 0
TRIM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 1 -a 400
ASM_OVLP_OPTIONS=-n 100 -z 10 -b 2000 -e 0.5 -j 1 -u 0 -a 400
NUM_ITER=2
CNS_OUTPUT_COVERAGE=30
CLEANUP=1
USE_GRID=false
GRID_NODE=0
GRID_OPTIONS=
SMALL_MEMORY=0
FSA_OL_FILTER_OPTIONS=
FSA_ASSEMBLE_OPTIONS=
FSA_CTG_BRIDGE_OPTIONS=
POLISH_CONTIGS=true.

After modifying arabidopsis_config.txt, press CTRL+O to save the file and CTRL+X to exit the editing mode. Alternatively, if you are not familiar with Unix, you can modify arabidopsis_config.txt via VSC or WinSCP.

Please note, we will keep only the longest 30x corrected reads for the downstream assembly.

[Step 2 | NextDenovo]

We will use NextDenovo to assemble the genome.

cd Assembly move to the Assembly folder
ln /home/Downloads/ONT/Arabidopsis/1-consensus/cns_final.fasta.gz > input.fofn prepare an input file with the corrected ONT reads and update run.cfg, e.g.:

[General]
job_type = local # local, slurm, sge, pbs, lsf
job_prefix = nextDenovo
task = assemble # all, correct, assemble
rewrite = yes # yes/no
deltmp = yes
parallel_jobs = 16 # number of tasks used to run in parallel
input_type = corrected # raw, corrected
read_type = ont # clr, ont, hifi
input_fofn = input.fofn
workdir = /home/Assembly/
[correct_option]
read_cutoff = 5k
genome_size = 135m # estimated genome size
sort_options = -m 10g -t 10
minimap2_options_raw = -t 8
pa_correction = 3 # number of corrected tasks used to run in parallel, each corrected task requires ~TOTAL_INPUT_BASES/4 bytes of memory usage.
correction_options = -p 15
[assemble_option]
minimap2_options_cns = -t 8
nextgraph_options = -a 1

nohup nextDenovo run.cfg & run the program.
assembly-stats /home/Assembly/nextDenovo/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph1/nd.asm.p.fasta > NextDenovo_ONT_corr_30x.txt get a brief overview of the assembly:

stats for nd.asm.p.fasta
sum = 122664679, n = 15, ave = 8177645.27, largest = 16219875
N50 = 15030013, n = 4
N60 = 14767803, n = 5
N70 = 13827629, n = 6
N80 = 11316166, n = 7
N90 = 9383066, n = 8
N100 = 251324, n = 15
N_count = 0
Gaps = 0

Please note, a similar result may be obtained with the Flye assembler. Particularly, I got an assembly with the total length of 118,548,372 bp represented by 19 fragments. The subsequent BLASTing against chloroplast and mitochondrial genomes revealed a similarity to three contigs. In addition, one contig was too short (707 bp). Thus, the final nuclear genome assembly contained 15 fragments with the total length of 118,420,125 bp.

[Step 3 | Remove contaminants]

In this step, we may need to remove contaminants, e.g., chloroplast, mitochondrial and/or bacterial sequences.

ln -s /home/Assembly/nextDenovo/03.ctg_graph/01.ctg_graph.sh.work/ctg_graph1/nd.asm.p.fasta Assembly_NextDenovo_ONT_15_fragments.fasta

create a soft link for the assembly.
egrep 'loop' Assembly_NextDenovo_ONT_15_fragments.fasta check if the final assembly contains circular fragments:

>ctg000013 type:s:loop length:i:304269 node:i:16
>ctg000014 type:s:loop length:i:251324 node:i:7

We will remove these fragments from the assembly. Firstly, let’s create a list with the headers of all contigs:
egrep '>' Assembly_NextDenovo_ONT_15_fragments.fasta > name.lst you can find some basics on grep/egrep here.
sed 's/^.//' name.lst > name1.lst remove > from the list. More on sed is here.
sed '14,15d' name1.lst > 13_fragments.lst remove two looped fragments from the list. Here is one more useful link.
seqtk subseq Assembly_NextDenovo_ONT_15_fragments.fasta 13_fragments.lst > Assembly_NextDenovo_ONT_13_fragments.fasta extract the linear sequences from the assembly.

Just out of curiosity, you may extract the looped sequences in the same manner and BLAST them to verify if these fragments do represent chloroplast, mitochondrial and bacterial sequences.

[Step 4 | NextPolish]

In order to improve the draft genome by searching misassemblies and other inconsistencies and correcting them, we will polish our assembly with the highly accurate long HiFi reads.

cd /home/
mkdir Nextpolish
cd Nextpolish
ln -s /home/Assembly/Assembly_NextDenovo_ONT_13_fragments.fasta Assembly_ONT.fasta
ls /home/Downloads/HiFi/CRR302668.fastq.gz > hifi.fofn prepare an input file with the HiFi reads and create a run.cfg file, e.g.:

[General]
job_type = local
job_prefix = nextPolish
task = best
rewrite = yes
rerun = 3
parallel_jobs = 6
multithread_jobs = 5
genome = /home/Nextpolish/Assembly_ONT.fasta #genome file
genome_size = auto
workdir = /home/Nextpolish/
polish_options = -p {multithread_jobs}
[hifi_option]
hifi_fofn = /home/Nextpolish/hifi.fofn
hifi_options = -min_read_len 10k -max_read_len 45k -max_depth 150
hifi_minimap2_options = -x map-hifi

nohup nextPolish run.cfg & run the program.

[Step 5 | Scaffolding]

With a reference

Let’s consider a case when we have a reference genome. We will use D-Genies that provides a synthetic similarity overview. This software can be run locally or online. All you need is to provide a reference and your newly assembled genome. Here, we will compare genome.nextpolish.fasta, which we got in the previous step, and GWHBDNP00000000.genome.fasta.gz that was downloaded before. In Figure 2 you can see (A) comparisons ‘Reference’ vs ‘Reference’ and (B) ‘Reference’ vs ‘genome.nextpolish’.

(See Figure 2)

Figure 2. Synthetic similarities produced by D-Genies. (A) Reference vs Reference. (B) Reference vs Polished genome.

Notice, the first chromosome is represented by contigs 4 and 10 (shown in Figure 2); the second chromosome by contigs 7 and 11; the third chromosome by contigs 2 and 12; the fourth chromosome by contigs 5 and 8; the fifth chromosome by contigs 3 and 6.

Now we need to order our contigs into chromosomes according to Figure 2B. Here is a potential scheme for Chromosome 1:

touch ctg000010.txt create a text file.
nano ctg000010.txt write the contig name ctg000010 and save the file.
seqtk subseq genome.nextpolish.fasta ctg000010.txt > ctg000010.fasta extract the contig ctg000010 that we want to reverse.
seqtk seq -r ctg000010.fasta > ctg000010_REV.fasta reverse the contig.
touch ctg000004.txt create a new text file.
nano ctg000004.txt write the contig name ctg000004 and save the file.
seqtk subseq genome.nextpolish.fasta ctg000004.txt > ctg000004.fasta extract the contig ctg000004 that represents the first chromosome.
paste -d '' ctg000004.fasta ctg000010_REV.fasta > Chromosome_1.fasta concatenate two contigs representing Chromosome 1.
nano Chromosome_1.fasta you may want to edit the header.

In the end, you should get the following result (Figure 3):

(See Figure 3)

Figure 3. Synthetic similarities produced by D-Genies. (A) Reference vs Reference. (B) Reference vs Scaffolded genome. Red ellipses indicate the missing sequences in the centromeric regions.

Note, here we will not fill the remaining gaps in the centromeric regions. Please see an approach proposed in the original article (Replacing ONT-HiC-assemblies with HiFi-contigs).

Without a reference

If you have genome-wide chromatin interactions data (Hi-C), you can scaffold contigs into chromosomes without a reference genome.

You may want to know more about the workflow. Here are some useful links:
The Genome Assembly Cookbook provides a strategy on the chromosome-length genome assemblies in detail.
The Aiden Lab YouTube channel provides a brief overview on Juicebox Assembly Tools.

Juicer

IMPORTANT! You have to properly organise your directory structure for running Juicer. Please try to follow the instruction.

bwa index genome.nextpolish.fasta builds an index for the genome.
nohup juicer.sh -D /juicer/scripts/ -g AT_2022 -z /home/juicer/references/genome.nextpolish.fasta -p contig.sizes -s none -d /home/juicer/work/AT_2022 -t 16 & runs Juicer.

nohup keeps your process running even after exiting the shell.
juicer.sh the executed script.
-D a path to the Juicer tools.
-g the genomeID.
-z a genome file.
-p a file containing chromosome/contig sizes*.
-s a restriction enzyme file.
-d a path to the directory containing the Hi-C data (a fastq folder).
-t a number of threads used.
& runs a process in the background using a subshell.
* - you may get these sizes from genome.nextpolish.fasta.fai:
cut -f1,2 genome.nextpolish.fasta.fai > contig.sizes

Next, we will generate a candidate assembly with 3D de novo assembly (3D-DNA) pipeline:

nohup 3d-dna /home/juicer/references/genome.nextpolish.fasta /home/juicer/work/AT_2022/aligned/merged_nodups.txt &

Juicebox

Open the .final.hic file in Juicebox and load the .final.assembly file. You will need to examine and rectify the assembly if necessary (see Figure 4).

(See Figure 4)

Figure 4. Scaffolding using Hi-C data. (A) The Hi-C map for the assembly obtained by scaffolding the original input scaffolds, without any editing. (B) The Hi-C map after editing.

Notice, Figure 4B is not exemplary. Ideally, this assembly requires additional changes in the centromeric and telomeric regions (I did not integrate ‘debris’ to the final assembly).

Save changes via the Export Assembly menu item in Juicebox as, e.g., Juicebox.final.review.assembly and run the following script:

nohup run-asm-pipeline-post-review.sh -r Juicebox.final.review.assembly genome.nextpolish.fasta merged_nodups.txt & generates the final assembly fasta file.

In the end, we should receive (Figure 5):

(See Figure 5)

Figure 5. Synthetic similarities produced by D-Genies. (A) Reference vs Reference. (B) Reference vs Scaffolded ‘Hi-C’ genome. Red ellipses indicate the missing sequences in the centromeric regions.

[IV] Genome annotation

[Step 1 | Repeat identification and masking]

Now, we need to identify transposable elements and mask them for forthcoming structural and functional genome annotations.

RepeatModeler

Firstly, we will use RepeatModeler, a de novo Transposable Element (TE) family identification and modeling package:

BuildDatabase -name A.thaliana.DB -engine rmblast A.thaliana_Scaffolded.fasta

BuildDatabase creates a database for RepeatModeler.
-name specifies a name of your database.
-engine chooses a search engine, e.g., abblast/wublast or ncbi/rmblast version.
A.thaliana_Scaffolded.fasta is the name of your scaffolded genome.

nohup RepeatModeler -database A.thaliana.DB -engine rmblast -pa 16 &

RepeatModeler calls the ‘RepeatModeler’ package itself.
-database indicates the database created in the previous step.
-pa specifies the number of shared-memory processors available.

RepeatMasker

Secondly, we will use RepeatMasker to mask TE identified with RepeatModeler:

nohup RepeatMasker -pa 16 -gff -lib consensi.fa.classified A.thaliana_Scaffolded.fasta &

RepeatMasker calls the program.
-gff produces a .gff output.
-lib specifies the TE library identified with RepeatModeler.

Notice, we do hardmasking (repetitive sequences will be replaced by stretches of N). Alternatively, you can use the ‘-xsmall’ flag for softmasking (repetitive elements will be written in lower case). The usage of the hardmasked genome has one advantage. The annotation pipeline Braker, which we will use in the next step, does not “see” genes in masked regions and avoids the gene prediction there. On the other hand, it may lead to an annotation where some genes may be truncated or missing. Please consider the usage of both options, hardmasking and softmasking, and compare the results.

Your results will be stored in a ‘.fasta.tbl’ file, e.g.:

==================================================
file name: A.thaliana_Scaffolded.fasta
sequences:          5
total length:       121411830 bp  (121411830 bp excl N/X-runs)
GC level:             36.10 %
bases masked:   21536817 bp ( 17.74 %)
==================================================
                              number of             length        percentage
                              elements*         occupied       of sequence ==================================================
Retroelements         6901      6477344 bp    5.34 %
   \(\underline{SINEs:}\)                             75             4468 bp   0.00 %
   \(\underline{LINEs:}\)                        2276       1287298 bp    1.06 %
     L1/CIN4                    2276       1287298 bp    1.06 %
   \(\underline{LTR\space elements:}\)              4550       5185578 bp    4.27 %
     Ty1/Copia                 1062        1135249 bp    0.94 %
     Gypsy/DIRS1           3488      4050329 bp    3.34 %

DNA transposons    2577        1673976 bp   1.38 %
   hobo-Activator            260           155922 bp   0.13 %
   Tc1-IS630-Pogo           189            39029 bp   0.03 %
   Tourist/Harbinger       121             35031 bp   0.03 %

Rolling-circles           1204            711175 bp   0.59 %
Unclassified:             25101     10564094 bp   8.70 %
Total interspersed repeats: 18715414 bp   15.41 %
Small RNA:                    460          297266 bp   0.24 %
Simple repeats:        35123        1397657 bp   1.15 %
Low complexity:        8649          415305 bp   0.34 %
==================================================
* most repeats fragmented by insertions or deletions have been counted as one element

RepeatMasker version 4.1.2-p1 , default mode
run with rmblastn version 2.11.0+
The query was compared to classified sequences in “…/consensi.fa.classified”

[Step 2 | Structural annotation]

In this step, we will use BRAKER2 to structurally annotate the assembled genome with RNA-Seq data and information from proteins of any evolutionary distance. Before we start we need to process RNA-seq reads.

Let’s convert RNA-Seq archives into the FASTQ format:

nohup fasterq-dump --split-files SRR3581356 &
nohup fasterq-dump --split-files SRR3581681 &
nohup fasterq-dump --split-files SRR3581693 &
nohup fasterq-dump --split-files SRR3581703 &
nohup fasterq-dump --split-files SRR3581705 &
nohup fasterq-dump --split-files SRR3581706 &
nohup fasterq-dump --split-files SRR3581708 &

Then, we will trim adaptors and remove low quality bases with Trim Galore:

nohup trim_galore -q 20 --cores 4 SRR3581356.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581681.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581693.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581703.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581705.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581706.fastq &
nohup trim_galore -q 20 --cores 4 SRR3581708.fastq &

Next, we will use hisat2 for mapping the reads to the masked genome:

hisat2-build A.thaliana_Scaffolded.fasta.masked hisat2_masked_genome creates an index file.

nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581706_trimmed.fq -S SRR3581706.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581708_trimmed.fq -S SRR3581708.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581693_trimmed.fq -S SRR3581693.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581681_trimmed.fq -S SRR3581681.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581356_trimmed.fq -S SRR3581356.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581705_trimmed.fq -S SRR3581705.sam &
nohup hisat2 -x hisat2_masked_genome --rna-strandness F -U SRR3581703_trimmed.fq -S SRR3581703.sam &

Note, for paired-end reads alignment you should use hisat2 -x genome -1 reads_1.fq -2 read2_2.fq -S output.sam.

Now, we need to convert the .sam files to the .bam format:

nohup samtools view --threads 16 -b -F 4 SRR3581706.sam > SRR3581706.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581708.sam > SRR3581708.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581693.sam > SRR3581693.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581681.sam > SRR3581681.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581356.sam > SRR3581356.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581703.sam > SRR3581703.bam &
nohup samtools view --threads 16 -b -F 4 SRR3581705.sam > SRR3581705.bam &

samtools view calls the program.
--threads is the number of BAM compression threads to use.
-b provides an output in the BAM format.
-F 4 keeps only mapped reads.

Subsequently, we have to sort the .bam files:

nohup samtools sort -@ 4 -m 10G SRR3581706.bam -o SRR3581706_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581708.bam -o SRR3581708_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581693.bam -o SRR3581693_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581681.bam -o SRR3581681_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581356.bam -o SRR3581356_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581703.bam -o SRR3581703_sorted.bam &
nohup samtools sort -@ 4 -m 10G SRR3581705.bam -o SRR3581705_sorted.bam &

samtools sort calls the program.
-@ sets the number of sorting and compression threads.
-m approximately specifies the maximum required memory per thread.
-o output file.

Thereafter, we need to download protein data related to plants:

nohup wget https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz > odb10.out &
tar xvf odb10_plants_fasta.tar.gz
cat plants/Rawdata/* > proteins.fasta

Finally, we will run Braker2.

Note, although BRAKER supports the combination of RNA-Seq and protein data within the BRAKER pipeline, e.g.: braker.pl --genome=genome.fasta.masked --bam=RNAseq.bam, --prot_seq=orthodb.fa --etpmode the developers strongly recommend to run BRAKER twice (1x with RNA-Seq only, 1x with protein data only) and subsequently combine the results of both runs with TSEBRA, the Transcript Selector for BRAKER. Additionally, if you use a softmasked genome, you need to specify it with --softmasking. You also may consider detecting untranslated regions (UTRs) with the --UTR=on flag; however, please carefully read about potential pitfalls of doing so.

Thus, we will firstly run BRAKER with RNA-seq data:

nohup braker.pl --species=A.thaliana --genome=A.thaliana_Scaffolded.fasta.masked --bam=SRR3581706_sorted.bam,SRR3581708_sorted.bam,SRR3581693_sorted.bam,SRR3581681_sorted.bam, SRR3581356_sorted.bam,SRR3581703_sorted.bam,SRR3581705_sorted.bam --cores=16 &

braker.pl calls the program.
--species specifies the original species of the original run.
--genome provides a reference genome.
--bam provides RNA-seq data.
--cores specifies the maximum number of cores that can be used during computation.

Secondly, we will run BRAKER with the protein data:

nohup braker.pl --species=A.thaliana2 --genome=A.thaliana_Scaffolded.fasta.masked --prot_seq=proteins.fasta --epmode --cores=16 &

--prot_seq provides protein data.
--epmode runs BRAKER in EP-mode, i.e., with proteins of any evolutionary distance as processed by ProtHint within BRAKER.

Now, we will combine the results:

tsebra.py -g braker1_out/augustus.hints.gtf,braker2_out/augustus.hints.gtf -c default.cfg -e braker1_out/hintsfile.gff,braker2_out/hintsfile.gff -o braker1+2_combined.gtf

IMPORTANT! “BRAKER produces a second complete gene set named braker.gtf, besides the official output augustus.hints.gtf. The braker.gtf is the result of merging augustus.hints.gtf with some ‘high-confidents’ genes from the GeneMark prediction. However, the merging process leads to a formatting issue in braker.gtf. A quick fix for this formatting issue is the script fix_gtf_ids.py; however, take note that the braker.gtf and fix_gtf_ids.py haven’t been tested sufficently and there is no guarantee that this gene set is superior to augustus.hints.gtf” Additionally, TSEBRA provides several configuration files: (1) default, (2) keep_ab_initio.cfg and (3) pref_braker1.cfg.

My analysis resulted in:
Braker [only proteins data] - 29’472 genes.
Braker [only RNA-seq data] - 28’444 genes.
Braker [ETP mode] - 30’721 genes.
Braker + TSEBRA [default] - 27’695 genes.
Braker + TSEBRA [keep_ab_initio] - 31’671 genes.
Braker + TSEBRA [pref_braker1] - 28’309 genes.
Further we will assess these results with BUSCO.

[Step 3 | Functional annotation]

There are several options to functionally annotate genomes. Here, I would like to share three external tutorials:

(1) Annotation with Gene Ontology (GO) terms written by Yi Jin Liew. The pipeline combines a search against three protein database sequences: (a) the SwissProt database (manually reviewed entries), (b) the TrEMBL database (contains high quality computationally analysed records that are enriched with automatic annotation and classification) and (c) the NCBI database (RefSeq non-redundant proteins). In addition, there you will find several scripts to assign GO terms.

(2) The MAKER Tutorial for WGS Assembly and Annotation Winter School 2018 (section Add functional annotations and meta-data). It utilises (a) the SwissProt database and (b) InterProScan to classify protein families and establish GO terms.

(3) Last but not least, a pipeline for crickets genome annotation provided by Guillem Ylla. It uses the UniProt database (a combination of Swiss-Prot, TrEMBL, UniParc and UniRef) and InterProScan to classify protein families and establish GO terms.

IMPORTANT! The above-mentioned databases (decompressed) will need a fairly substantial storage, e.g., TrEMBL and NCBI require > 100 GB each. In addition, the BLASTing of the newly assembled genome against these databases may take several days.

[V] Quality assessment & visualisation

[Step 1 | Qualimap]

We will use Qualimap to (1) examine sequencing alignment data according to the features of the mapped reads and their genomic properties; (2) get an overall view of the data that may help in decision-making for further analysis.

minimap2 -t 16 -ax map-ont A.thaliana_Scaffolded.fasta cns_final.fasta.gz > A.thaliana_Scaffolded.aligned.sam

minimap2 calls the pairwise aligner.
-t specifies the maximum number of cores that can be used during computation.
-ax specifies the output format in .sam (a) and the usage of long sequences (x).
map-ont specifies the the exact type of long reads (here ONT reads).
A.thaliana_Scaffolded.fasta specifies the reference genome.
cns_final.fasta.gz specifies the path for ONT reads (here we use the longest reads from the NECAT step).
A.thaliana_Scaffolded.aligned.sam specifies an output file.

samtools view --threads 16 -b -F 4 A.thaliana_Scaffolded.aligned.sam > A.thaliana_Scaffolded.aligned.bam

samtools sort -@ 10 -m 10G A.thaliana_Scaffolded.aligned.bam -o A.thaliana_Scaffolded.aligned.sorted.bam

nohup qualimap bamqc -bam A.thaliana_Scaffolded.aligned.sorted.bam --java-mem-size=60G -c --outdir /Qualimap outformat pdf &

qualimap bamqc runs the application.
-bam specifies an inmput file.
--java-mem-size sets the desired memory size.
-c paints chromosome limits inside charts.
--outdir specifies an output directory.
outformat specifies an output format.

At the end Qualimap produces a bunch of different metrics, e.g.:

Reference size:          121,411,830
Number of reads:      27,493
Mapped reads:           27,493 / 100%
GC Percentage:          36.87%
General error rate:    1.55%

In addition, Qualimap creates several plots. For instance, Figure 6 shows the number of reads that cover each section of the genome. The red line indicates an average coverage that in our case is around 30x (please keep in mind that we retained only the longest 30x corrected reads in the NECAT step). Nonetheless, you may also want to use Qualimap with HiFi reads that we used to polish our assembly.

(See Figure 6)

Figure 6. Coverage across reference.

Another important plot is present in Figure 7. Notice, the mapping quality drops in the centromeric and two telomeric (chromosomes 2 and 4) regions. You may try to improve the assembly of these sequences by a gap closing step (please see the original article).

(See Figure 7)

Figure 7. Mapping quality across reference.

[Step 2 | QUAST]

We may also use the QUality ASsessment Tool (QUAST) to get a summary report on our genome as well as to assess the quality of this assembly based on a reference.

quast.py --output-dir quast A.thaliana_Scaffolded.fasta gives an overview on our assembly.
nohup quast.py -o /Quast -r GWHBDNP00000000.genome.fasta A.thaliana_Scaffolded.fasta --threads 16 --large --min-alignment 20000 --extensive-mis-size 500000 --min-identity 90 & compares our assembly with a reference genome.

quast.py calls the program.
-o specifies an output directory.
-r specifies a reference genome.
A.thaliana_Scaffolded.fasta specifies the assembled genome for the comparison with the reference.
--threads specifies the maximum number of threads.
--large use this flag for large genomes (typically > 100 Mbp).
--min-alignment specifies the minimum length of alignment in bp.
--extensive-mis-size see the manual for details.
--min-identity alignments with identity (%) worse than this value will be filtered.

You should get a report.html file (See Interactive Figure 8) that contains all the stats. Moreover, you will get an icarus.html file (a genome visualiser for accurate assessment and analysis of genomic draft assemblies) where you can browse for the misassembled sequences (See Interactive Figure 8).

Interactive Figure 8. The QUAST output file.

If you want to scrutinise the output, it is better to use the QUAST manual and the GitHub page.

[Step 3 | k-mer analysis]

In this step, we will estimate genome size and heterozygosity using a kmer-based approach with Illumina reads.

nohup jellyfish count <(zcat CRR302670_f1.fastq.gz) <(zcat CRR302670_r2.fastq.gz) -C -m 21 -s 1G -t 16 -o reads.jf &

jellyfish count calls the program to count all k-mers.
zcat allows to expand and view a compressed file without uncompressing it.
-C considers both a kmer and its reverse “complement” as equivalent.
-m specifies mers.
-s specifies the number of GB (G) or MB (M) of system memory to use for k-mer counting.
-t specifies the number of cores for the computation.
-o specifies an output file.

Next, we need to compute the histogram of the k-mer occurrences:
nohup jellyfish histo -t 16 reads.jf > reads.histo &

Lastly, we will visualise the result via GenomeScope 2.0 using defult parameters:

Figure 9. The k-mer analysis result for the A. thaliana specimen using the paired-end Illumina data.

Please note, Figure 9 demonstrates that (1) the estimated genome size is 135.6 Mb, close to the current estimate of 134.6 Mb, and (2) the heterozygosity of this specimen is very low (0.1486%). Nonetheless, the genome size assembled here is only 121 Mb, mostly due to the lack of the centromeric regions. The heterozygosity observed here is similar to that one from the original article (0.0865%).
Additionally, please consider reading a comment of Professor Michael Schatz, a developer of GenomeScope, regarding the usage of HiFi reads for such an analysis and potential pitfalls with Illumina data.

[Step 4 | BUSCO]

To assess genome assembly and annotation completeness we will use Benchmarking Universal Single-Copy Orthologs (BUSCO). So, let’s firstly assess three assemblies (Col-XJTU, TAIR10.1 and our assembly):

nohup busco -i GCF_000001735.4_TAIR10.1_genomic.fasta -o /TAIR10.1 -m genome -l eudicots_odb10 -c 16 &
nohup busco -i GWHBDNP00000000.genome.fasta -o /Col_XJTU -m genome -l eudicots_odb10 -c 16 &
nohup busco -i A.thaliana.assembled.fasta -o /Assembled -m genome -l eudicots_odb10 -c 16 &

busco calls the program.
-i specifies an input genome file.
-o specifies an output file.
-m sets the assessment MODE.
-l specifies a lineage dataset (all BUSCO datasets are placed here).
-c specifies the number of threads/cores to use.

After the successful runs we should get:

                                                                        TAIR10.1                              Col_XJTU                              Assembled here
Complete BUSCOs:                                  99.6% (2316)                        99.7% (2317)                              99.6% (2317)
Complete and single-copy BUSCOs:    98.8% (2297)                        98.8% (2297)                             98.7% (2296)
Complete and duplicated BUSCOs:      0.8% (19)                               0.9% (20)                                   0.9% (21)
Fragmented BUSCOs:                             0.1% (2)                                  0.1% (2)                                      0.1% (2)
Missing BUSCOs:                                     0.3% (8)                                 0.2% (7)                                      0.3% (7)
Total BUSCO groups searched:             100% (2326)                         100% (2326)                              100% (2326)

As we see here, all three assemblies are quite similar. Additionally, we will assess structural annotation of these genomes (use the -m protein flag for annotated gene sets):

                                                                        TAIR10.1                              Col_XJTU                              Assembled here
                                                                                                                                                                       TSEBRA [default]
Complete BUSCOs:                                  99.8% (2320)                        85.7% (1992)                              96.4% (2243)
Complete and single-copy BUSCOs:    59.5% (1383)                          51.1% (1188)                               91.5% (2129)
Complete and duplicated BUSCOs:      40.3% (937)                           34.6% (804)                               4.9% (114)
Fragmented BUSCOs:                             0.0% (0)                                  0.3% (8)                                      0.9% (21)
Missing BUSCOs:                                     0.2% (6)                                 14.0% (326)                                2.7% (62)
Total BUSCO groups searched:             100% (2326)                         100% (2326)                              100% (2326)

IMPORTANT! To get amino acids (AA) data from a .gff file you can use GffRead, e.g.:
gffread -y Output_File_AA.fa -g Input_Genome_assembly.fasta Input_GFF_File.gff
A Col_XJTU annotation is here.
A TAIR10.1 annotation is here. Alternatively, you can download an AA file here or here.

In addition, I have provided all BUSCOs assessments for the annotated genome with Braker2 (see the Structural annotation step for more details):

                  TSEBRA                  TSEBRA                  TSEBRA                  ETP mode                RNA-seq                Proteins
                  [default]           [keep_ab_initio]    [pref_braker1]                                                    only                         only
C:            96.4% (2243)       96.6% (2247)          95.5% (2222)          97.4% (2265)          94.4% (2196)    97.4% (2265)
C & SC: 91.5% (2129)        87.9% (2045)          85.8% (1996)          89.4% (2079)          88.0% (2047) 92.5% (2152)
C & D:    4.9% (114)             8.7% (202)              9.7% (114)                8.0% (186)              6.4% (149)         4.9% (113)
F:            0.9% (21)               0.8% (19)                1.1% (25)                   1.2% (29)                 2.1% (48)           1.0% (23)
M:           2.7% (62)               2.6% (60)               3.4% (79)                  1.4% (32)                 3.5% (82)           1.6% (38)
T:            100% (2326)         100% (2326)          100% (2326)           100% (2326)           100% (2326)      100% (2326)

IMPORTANT! You can try to improve these annotations by using softmasking instead of hardmasking in the RepeatMasker step.

[Step 5 | IGV]

It is also useful to ‘manually’ check the quality of your assembly. Here we will use the .bam files from the Qualimap step. All we need to do is to index the assembled genome and the .bam files:

samtools faidx A.thaliana_Scaffolded.fasta
samtools index A.thaliana_Scaffolded.aligned.sorted.bam

IMPORTANT! It may take quite a bit of time to create .bam files for the HiFi reads. Be sure that you have enough resources to finish the job. Alternatively, you may use a subsampling option of samtools, e.g.:
samtools view -s 0.5 this will subsample half of the reads that were mapped.

Now, we will open IGV. From the main window, click on ‘Genomes’ and load the assembled genome. Then, click on ‘File’ and load the .bam file with the longest 30x corrected ONT reads (Figure 10A). If you have the HiFi reads, you may also add them (Figure 10B).

(See Figure 10)

Figure 10. IGV interface. (A) The longest 30x corrected ONT reads. (B) The HiFi reads.

Please note, the corrected ONT reads still have some noise/mismatches (the low base quality support against the reference, see the red rectangular boxes in Figure 10A). On the other hand, the HiFi reads perfectly support the reference sequence (Figure 10B).

[Step 6 | Synteny]

Here, we will plot synteny and structural rearrangements between the TAIR10.1 genome and our assembly.

minimap2 -ax asm5 --eqx TAIR10.1.fasta A.thaliana.assembled.fasta > TAIR10.1_vs_GENOME.sam

asm5 aligns assembly to reference genome.
--eqx output =/X CIGAR operators for sequence match/mismatch.
TAIR10.1.fasta specifies a reference genome.
A.thaliana.assembled.fasta specifies a query genome.
TAIR10.1_vs_GENOME.sam specifies an output file.

samtools view -b TAIR10.1_vs_GENOME.sam > TAIR10.1_vs_GENOME.bam converts .sam to .bam

syri -c TAIR10.1_vs_GENOME.bam -r TAIR10.1.fasta -q GENOME.fasta -k -F B

syri calls the program.
-c specifies the file containing alignment coordinates.
-r specifies genome a reference genome.
-q specifies a query genome.
-k keeps intermediate output files.
-F specifies an input file type. T: Table, S: SAM, B: BAM, P: PAF.

plotsr --sr syri.out --genomes genomes.txt -H 8 -W 5 -d 600 -o output_plot.png

plotsr calls the program.
--sr specifies structural annotation mappings identified by SyRI.
--genomes specifies a list of genomes.
-H specifies the height of the plot.
-W specifies the width of the plot.
-d specifies the DPI for the final image.
-o specifies an output file name (acceptable formats: pdf, png and svg).

In the end, you should get the following result (Figure 11):

(See Figure 11)

Figure 11. The synteny and structural rearrangements plot (only Chromosome 1 is shown).

[Step 7 | Circos]

There are many options to generate Circos-like plots:

[1] The Circos software package.
[2] Circos in Galaxy.
[3] Circos in TBtools.
[4] circlize in R.
[5] shinyCircos in R.

Here, we will use shinyCircos to produce Figure 12 (please see this tutorial for more details).

Figure 12. Circos plot of the A. thaliana genome assembly calculated in 10-kb windows. (A) The karyotype of the assembled chromosomes. (B) GC density. (C) Density of transposable elements. (D) Gene density. (E) Syntenic blocks across chromosomes.

To reproduce Figure 12 we need:

[1] Create a .csv file with the chromosome data (see the QUAST output) for Figure 12A:

chr        start     end
Ch_1       1         30726451
Ch_2       1         20485188
Ch_3       1        23269929
Ch_4       1        19225652
Ch_5       1         27704610

[2] Generate GC density data for Figure 12B (please download two python scripts, fasta2GCcontentHeatmapTrack.py and gff2circos-heatmap.py, from here):

python fasta2GCcontentHeatmapTrack.py -window 10000 -fasta A.thaliana.assembled.fasta > GC_density.csv

python fasta2GCcontentHeatmapTrack.py calls the script.
-window specifies the number of bases in the sliding window.
-fasta specifies an input file.
> specifies an output file.

You may want to add the proper headings to the GC_density.csv file, e.g.:

chr start end value1
Ch_1 0 9999 0.3315

[3] Generate density of transposable elements for Figure 12C:

python gff2circos-heatmap.py -window 10000 -fasta A.thaliana.assembled.fasta -gff A.thaliana.assembled.fasta.out.gff > TE_density.csv

python gff2circos-heatmap.py calls the script.
-window specifies the number of bases in the sliding window.
-fasta specifies an input file.
-gff specifies a TE input file (from the RepeatMasker step).
> specifies an output file.

Add the headings:

chr start end value1
Ch_1 0 9999 0.0277

[3] Generate gene density for Figure 12D:

python gff2circos-heatmap.py -window 10000 -fasta A.thaliana.assembled.fasta -gff braker.gff3 > Gene_density.csv

-gff specifies a .gff annotation file (from the Braker step).

Add the headings:

chr start end value1
Ch_1 0 9999 0.3928

IMPORTANT! If you have a .gtf file you can convert it to .gff with AGAT.

[4] Identify Syntenic blocks for Figure 12E:

minimap2 -x asm5 A.thaliana.assembled.fasta A.thaliana.assembled.fasta > Synteny_Circos.csv

Keep only relevant data, e.g.:

chr1 start1 end1 chr2 start2 end2
Ch_1 7353566 7365826 Ch_2 13347346 13359606

For a more comprehensive scan for syntenic blocks you may use MCScanX or Blast+ tools, e.g.:

makeblastdb -in Your_Genome.fa -dbtype nucl -out Genome_blast_db
blastn -db Genome_blast_db -query Your_Genome.fa -outfmt 6 -out all_vs_all.tsv -num_threads 4

Acknowledgements

I would like to express my gratitude to Dr Guillem Ylla, the head of the laboratory of Bioinformatics and Genome Biology at the Jagiellonian University (Kraków, Poland), for providing the assistance, feedback and computing resources for this tutorial.

Tutorial: genome assembly and annotation of Arabidopsis thaliana using ONT, HiFi, Illumina and Hi-C data

Created by Evgenii Baiakhmetov

[I] Preface

In a nutshell

Learning outcomes

Keywords

DOI

[II] Software & sequencing data

Note for Windows users

Specs of the server

Conda and setting an environment

Packages installation

Additional software (not in Conda)

Downloading raw sequence data

[III] Genome assembly

[Step 1 | NECAT]

[Step 2 | NextDenovo]

[Step 3 | Remove contaminants]

[Step 4 | NextPolish]

[Step 5 | Scaffolding]

With a reference

Without a reference

Juicer

Juicebox

[IV] Genome annotation

[Step 1 | Repeat identification and masking]

RepeatModeler

RepeatMasker

[Step 2 | Structural annotation]

[Step 3 | Functional annotation]

[V] Quality assessment & visualisation

[Step 1 | Qualimap]

[Step 2 | QUAST]

[Step 3 | k-mer analysis]

[Step 4 | BUSCO]

[Step 5 | IGV]

[Step 6 | Synteny]

[Step 7 | Circos]

Acknowledgements