
HiFi metabarcoding analysis

1. Install FROGS 4.1.0

You can create a conda environment with all the dependencies needed by FROGS by following the instructions on the FROGS github: https://github.com/geraldinepascal/FROGS#from-conda

  1. Get FROGS 4.1.0 repository
wget https://github.com/geraldinepascal/FROGS/archive/refs/tags/v4.1.0.tar.gz

tar -xf v4.1.0.tar.gz
rm v4.1.0.tar.gz
  2. Install with conda
conda env create --name frogs@4.1.0 --file FROGS-4.1.0/frogs-conda-requirements.yaml
# to use FROGS, first you need to activate your environment
conda activate frogs@4.1.0

2. 16S full Frogs analysis with swarm

2.1. Preprocess

Prepare the data for the FROGS preprocess by making a tar.gz of all the fastq files.

The -h option makes tar follow symlinks, so the archive contains the actual files rather than dangling links.

cd fastq
tar -czvhf 16Sfull_zymomock.tar.gz *

Remove primer and dereplicate sequences with preprocess.py

fastq_files="fastq/16Sfull_zymomock.tar.gz"

# 16S primers 
five_prime_primer=AGRGTTYGATYMTGGCTCAG
three_prim_primer=AAGTCGTAACAAGGTARCY 


nb_cpu_preprocess=2

preprocess.py longreads --min-amplicon-size 1000  --max-amplicon-size 2000 \
                         --five-prim-primer $five_prime_primer --three-prim-primer $three_prim_primer \
                          --input-archive $fastq_files -p $nb_cpu_preprocess

2.2. Clustering with swarm


nb_cpu_clustering=10


clustering.py --fastidious --distance 1 --input-fasta preprocess.fasta \
                --input-count preprocess_counts.tsv \
                --nb-cpus $nb_cpu_clustering --output-biom clustering.biom \
                --output-fasta clustering.fasta --log-file clustering.log


clusters_stat.py -i clustering.biom \
                -o clusters_stat_clustering.html \
                --log-file clusters_stat_clustering.log

2.3. Remove chimera

nb_cpu_rm_chimera=12

remove_chimera.py --input-fasta clustering.fasta --input-biom clustering.biom --nb-cpus $nb_cpu_rm_chimera\
                  --out-abundance remove_chimera.biom --non-chimera remove_chimera.fasta \
                  --log-file remove_chimera.log --summary remove_chimera.html


clusters_stat.py -i remove_chimera.biom \
                -o clusters_stat_remove_chimera.html \
                --log-file clusters_stat_remove_chimera.log

2.4. Cluster filters

  • MIN_SAMPLE_PRESENCE: Keep cluster present in at least this number of samples.
nb_cpu_filters=1
#mem_filters=30GB

min_sample_presence=2
cluster_filters.py --min-abundance 0.00005 --min-sample-presence $min_sample_presence --nb-cpus $nb_cpu_filters --input-fasta remove_chimera.fasta --input-biom remove_chimera.biom \
               --output-biom filters.biom --output-fasta filters.fasta --log-file filters.log --summary filters.html
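The --min-abundance value above reads as a proportional threshold (a fraction of total reads rather than a raw count — an interpretation to double-check against the FROGS documentation). A quick back-of-the-envelope check of what it implies:

```python
# Hypothetical illustration of a proportional --min-abundance threshold:
# a cluster must represent at least this fraction of all reads to be kept.
total_reads = 2_000_000   # hypothetical sequencing depth
min_abundance = 0.00005   # value used in the command above

read_threshold = total_reads * min_abundance
print(read_threshold)     # clusters below this read count would be removed
```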

2.5. Taxonomic affiliation

nb_cpu_affiliation=32
mem_affiliation=100GB

taxonomic_ranks='Domain Phylum Class Order Family Genus Species'
database="/save/frogs/galaxy_databanks/SILVA/16S/silva_138.1_16S/silva_138.1_16S.fasta"

taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database \
                          --input-biom filters.biom \
                          --input-fasta filters.fasta --output-biom affiliation_abundance.biom \
                          --summary affiliation.html --log-file affiliation.log \
                          --taxonomy-ranks $taxonomic_ranks

2.6. Affiliation stat


taxonomic_ranks='Domain Phylum Class Order Family Genus Species'


affiliations_stat.py -i affiliation_abundance.biom  \
                                           --tax-consensus-tag 'blast_taxonomy' \
                                          --identity-tag 'perc_identity' \
                                          --coverage-tag 'perc_query_coverage' \
                                          --multiple-tag 'blast_affiliations' \
                                          --rarefaction-ranks Family Genus Species \
                                          --taxonomic-ranks $taxonomic_ranks

3. 16S full Dada2 analysis

Several 16S HiFi analyses with dada2 can be found in this repository: https://github.com/benjjneb/LRASManuscript

The Rmarkdown used here is an adaptation of this Rmarkdown made by Benjamin J Callahan: https://benjjneb.github.io/LRASManuscript/LRASms_Zymo.html

3.1. Remove primers with cutadapt

Dada2 is not as good as cutadapt at removing primers, and we cannot use the FROGS preprocess step because it converts fastq files into fasta files, while dada2 needs the fastq files.

Let's remove primers with cutadapt using this bash script cutadapt_16S_primers.sh.

3.2. Make dada2 ASV with a Rmarkdown

First create a Dada2 conda environment.

conda create -n dada2 bioconductor-dada2 python jupyter
conda activate dada2
conda install -c r r-irkernel
conda install  bioconductor-biomformat
conda install  jupytext
conda install -c conda-forge r-pander

ASV creation is done by this Rmarkdown: dada2_analysis_simple

It can be launched through slurm with this sbatch script: launch_dada2_simple.sh.

The Rmarkdown creates two important files needed to continue the analysis with frogs:

  • clustering.fasta
  • clustering.biom

3.3. Launch frogs on dada2 ASVs

The FROGS pipeline steps can be applied to these two files just as with a swarm clustering, starting from the remove-chimera step.

4. 16S-23S HiFi analysis

The analysis steps are mostly identical to the 16S full analysis. Some parameters in the preprocess step need to be adjusted, and the affiliation step uses a specific 16S-23S database. In addition, a specific affiliation strategy can be applied that uses a 16S and a 23S database on top of the 16S-23S database to improve the affiliations.

Dada2 analysis of 16S-23S data does not work properly on real data.

4.1. 16S-23S specificity at the preprocess step

With 16S-23S sequences, the primers and expected amplicon lengths change compared to 16S full.

Example of command to preprocess 16S-23S sequences:

preprocess.py pacbio-hifi --min-amplicon-size 3500 --max-amplicon-size 5000 \
                         --five-prim-primer AGRGTTTGATYHTGGCTCAG --three-prim-primer AGTTTDACTGGGGYGGT \
                         --input-archive fastq/16S-23S_zymomock.tar.gz -p 2

4.2. 16S-23S specificity at the affiliation step

The database used is built from NCBI RefSeq assemblies, so the taxonomy is the NCBI one, which uses 8 main ranks. These ranks need to be given to the FROGS script taxonomic_affiliation.py with the --taxonomy-ranks parameter.

database="/work/project/seqoccin/metaG/marker_identification/region_identification_workflow/frogs_db_formatting/16S-23S_bacteria_and_archaea_refseq_2022_02_07/refseq-all_16S-23S_2022-02-07_flagged/refseq-all_16S-23S_2022-02-07_flagged.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species Strain'


nb_cpu_affiliation=32
mem_affiliation=20GB


taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
                   --input-fasta filters.fasta --output-biom affiliation_abundance.biom \
                   --summary affiliation.html --log-file affiliation.log --taxonomy-ranks $taxonomic_ranks

Affiliation stat and biom_to_tsv can be launched as with the 16S full data, but using the correct taxonomic ranks:

taxonomic_ranks='Domain Phylum Class Order Family Genus Species Strain'

affiliations_stat.py -i unfiltered_affiliation_abundance.biom  \
                                           --tax-consensus-tag 'blast_taxonomy' \
                                          --identity-tag 'perc_identity' \
                                          --coverage-tag 'perc_query_coverage' \
                                          --multiple-tag 'blast_affiliations' \
                                          --rarefaction-ranks Family Genus Species \
                                          --taxonomic-ranks $taxonomic_ranks
                                          


# Biom to tsv
biom_to_tsv.py --input-biom unfiltered_affiliation_abundance.biom \
                    --input-fasta filters.fasta --output-tsv affiliation_abundance.tsv \
                     --output-multi-affi multiaff.tsv --log-file biom_to_tsv_affi.log

4.3. Filter out suspicious sequences

We have identified in our custom 16S-23S database some suspicious sequences that are likely coming from contamination or horizontal gene transfer (HGT). They are not informative, therefore we want to remove these affiliations. For that we can use the FROGS script affiliation_filters.py:

affiliation_filters.py --input-biom unfiltered_affiliation_abundance.biom --input-fasta filters.fasta \
                        --output-biom affiliation_abundance.biom --output-fasta affiliation_filtered.fasta \
                       --ignore-blast-taxa suspicious --delete --taxonomic-ranks Domain Phylum Class Order Family Genus Species Strain

We can now relaunch the affiliation stat.


affiliations_stat.py -i affiliation_abundance.biom --tax-consensus-tag 'blast_taxonomy' --identity-tag 'perc_identity' --coverage-tag 'perc_query_coverage' --multiple-tag 'blast_affiliations' --rarefaction-ranks Family Genus Species --taxonomic-ranks Domain Phylum Class Order Family Genus Species Strain


biom_to_tsv.py --input-biom affiliation_abundance.biom --input-fasta affiliation_filtered.fasta  --output-tsv affiliation_abundance.tsv --output-multi-affi multiaff.tsv --log-file biom_to_tsv_affi-filtered.log

4.4. Affiliate 16S-23S cluster with Silva 16S and 23S databases

Let's affiliate our 16S-23S clusters with the 16S Silva database and the 23S Silva database.

16S Silva :


database="/save/frogs/galaxy_databanks/SILVA/16S/silva_138.1_16S/silva_138.1_16S.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'

taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
                   --input-fasta filters.fasta --output-biom affiliation_abundance_16Ssilva.biom \
                   --summary affiliation_16Ssilva.html --log-file affiliation_16Ssilva.log --taxonomy-ranks $taxonomic_ranks


affiliations_stat.py -i affiliation_abundance_16Ssilva.biom  \
                                           --tax-consensus-tag 'blast_taxonomy' \
                                          --identity-tag 'perc_identity' \
                                          --coverage-tag 'perc_query_coverage' \
                                          --multiple-tag 'blast_affiliations' \
                                          --rarefaction-ranks Family Genus Species \
                                          --taxonomic-ranks $taxonomic_ranks

biom_to_tsv.py --input-biom affiliation_abundance_16Ssilva.biom \
                    --input-fasta filters.fasta --output-tsv affiliation_abundance_16Ssilva.tsv \
                     --output-multi-affi multiaff_16Ssilva.tsv --log-file biom_to_tsv_affi_16Ssilva.log

23S Silva :


database="/save/frogs/galaxy_databanks/SILVA/23S/silva_138.1/silva_138.1_23S.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'



taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
                   --input-fasta filters.fasta --output-biom affiliation_abundance_23Ssilva.biom \
                   --summary affiliation_23Ssilva.html --log-file affiliation_23Ssilva.log --taxonomy-ranks $taxonomic_ranks


affiliations_stat.py -i affiliation_abundance_23Ssilva.biom  \
                                           --tax-consensus-tag 'blast_taxonomy' \
                                          --identity-tag 'perc_identity' \
                                          --coverage-tag 'perc_query_coverage' \
                                          --multiple-tag 'blast_affiliations' \
                                          --rarefaction-ranks Family Genus Species \
                                          --taxonomic-ranks $taxonomic_ranks

biom_to_tsv.py --input-biom affiliation_abundance_23Ssilva.biom \
                    --input-fasta filters.fasta --output-tsv affiliation_abundance_23Ssilva.tsv \
                     --output-multi-affi multiaff_23Ssilva.tsv --log-file biom_to_tsv_affi_23Ssilva.log

4.5. Post process affiliation tables given by Frogs

To plot and merge the affiliations, we first need to postprocess the affiliation tables with the local python script add_multiaffi_to_abd_table.py. For each database used, we give it the affiliation abundance table and the multiaffi table generated by biom_to_tsv.py (launched on the biom generated by taxonomic_affiliation.py). We also give identity and coverage thresholds.

Be careful: the taxonomic ranks are not the same in the 16S-23S custom database and in the Silva databases.

For the custom 16S23S database:

python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance-filtered.tsv  --multiaffi_table multiaff-filtered.tsv --region 16S-23S --affi_db_name 16S23S_custom --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species Strain' -o abundance_and_multiaffi_16S23S_custom.tsv  --min_identity 98 --min_coverage 99 -v

For 16S Silva database:

python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance_16Ssilva.tsv  --multiaffi_table multiaff_16Ssilva.tsv --region 16S-23S --affi_db_name 16Ssilva --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species' -o abundance_and_multiaffi_16Ssilva.tsv  --min_identity 98 --min_coverage 30 -v 

For 23S Silva database:

python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance_23Ssilva.tsv  --multiaffi_table multiaff_23Ssilva.tsv --region 16S-23S --affi_db_name 23Ssilva --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species' -o abundance_and_multiaffi_23Ssilva.tsv  --min_identity 98 --min_coverage 40 -v

4.6. MultiDB approach: merge the 16S-23S db affiliations with the 16S and 23S Silva db affiliations

We defined weak and valid affiliations for each database used to affiliate the 16S-23S amplicons.

For the 16S-23S db, an affiliation is weak when it has <98% identity and <99% query coverage.

For the 16S and 23S Silva affiliations, the required coverage is lower because sequences in these databases only partially cover our 16S-23S amplicons.

  • 16S silva weak affiliation: <98% identity and <30% query coverage.
  • 23S silva weak affiliation: <98% identity and <40% query coverage.
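These per-database thresholds can be written down as a small predicate — a sketch of the rule as stated above, assuming "valid" means meeting both the identity and the coverage thresholds:

```python
# Per-database thresholds from the text: identity is always 98%,
# the required query coverage depends on the database.
THRESHOLDS = {
    "16S23S_custom": {"identity": 98.0, "coverage": 99.0},
    "16S_silva":     {"identity": 98.0, "coverage": 30.0},
    "23S_silva":     {"identity": 98.0, "coverage": 40.0},
}

def is_valid(db: str, identity: float, coverage: float) -> bool:
    """True when the affiliation meets both thresholds for this database."""
    t = THRESHOLDS[db]
    return identity >= t["identity"] and coverage >= t["coverage"]
```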

Rule to merge the affiliations: when the 16S-23S db affiliation is valid, we keep it. When it is weak, we look at the 16S and 23S Silva affiliations:

  • Only one of the affiliations is valid --> we use it as the new affiliation for the cluster.
  • Neither affiliation is valid --> the cluster is considered weakly affiliated.
  • Both affiliations are valid --> we keep taxa down to the last common taxon. In case of multiaffiliation, if there is an affiliation shared by both databases, we use it as the final affiliation.

4.6.1. Simple example:

16S affiliation: Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus alactolyticus

23S affiliation: Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Turicibacter;Turicibacter sp.

Final affiliation: Bacteria;Firmicutes;Bacilli;Multi-affiliation;Multi-affiliation;Multi-affiliation;Multi-affiliation
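This behaviour — keep taxa down to the last common rank, then fill the remaining ranks with Multi-affiliation — can be sketched as follows (an illustration of the rule, not the actual code of merge_affiliation_from_multiple_db.py):

```python
def merge_down_to_common(taxo_a: str, taxo_b: str, sep: str = ";") -> str:
    """Keep leading ranks where both taxonomies agree; mark the rest
    as Multi-affiliation."""
    merged, agree = [], True
    for ta, tb in zip(taxo_a.split(sep), taxo_b.split(sep)):
        agree = agree and ta == tb
        merged.append(ta if agree else "Multi-affiliation")
    return sep.join(merged)

aff_16s = ("Bacteria;Firmicutes;Bacilli;Lactobacillales;"
           "Streptococcaceae;Streptococcus;Streptococcus alactolyticus")
aff_23s = ("Bacteria;Firmicutes;Bacilli;Erysipelotrichales;"
           "Erysipelotrichaceae;Turicibacter;Turicibacter sp.")
print(merge_down_to_common(aff_16s, aff_23s))
# Bacteria;Firmicutes;Bacilli;Multi-affiliation;Multi-affiliation;Multi-affiliation;Multi-affiliation
```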

4.6.2. Multiaffiliation example:

16S affiliations:
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus alactolyticus
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Lactococcus;Lactococcus lactis

23S affiliation: Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus salivarius CCHSS3

Final affiliation: Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Multi-affiliation

4.6.3. Scripts and commands to merge affiliations

To merge affiliations, we give the tables created by the script add_multiaffi_to_abd_table.py to the local script scripts/merge_affiliation_from_multiple_db.py:

python scripts/merge_affiliation_from_multiple_db.py \
                            --table_16S23Sdb abundance_and_multiaffi_16S23S_custom.tsv \
                            --table_16Sdb abundance_and_multiaffi_16Ssilva.tsv \
                            --table_23Sdb abundance_and_multiaffi_23Ssilva.tsv \
                            -v \
                            --min_identity 98 --min_qcov_23S 40 --min_qcov_16S 30 --min_qcov_16S23S 98 \
                            -o affiliation_multiple_db_merged.tsv 

5. Plot taxonomic rank of the affiliations

To describe how precise the affiliations are under diverse conditions (different targets or databases used), let's plot the abundance of affiliations made at the different taxonomic ranks.

We can visualise that globally per analysis:

(figure: taxonomic_ranks_per_target)

Or per sample:

(figure: taxonomic_ranks_per_target_and_per_sample)

We can produce these plots with the local python script plot_taxo_ranks.py:


python scripts/plot_taxo_ranks.py --affi_tables abundance_and_multiaffi_16S23S_custom.tsv abundance_and_multiaffi_16Ssilva.tsv abundance_and_multiaffi_23Ssilva.tsv \
                                    --labels 16S23S_customDB 1623S_16Ssilva 16S_23S_23Ssilva --samples 1:32 -v

It takes as input the output of the script add_multiaffi_to_abd_table.py.

6. Generate newick tree of taxonomic affiliation and itol metadata

The affiliations of the different targets can be visualised as a taxonomic tree. To build this tree we can use the web tool iTOL (https://itol.embl.de/). iTOL needs the tree in newick format, plus optional metadata files to customise the tree.

To generate the newick tree and the metadata we can use the script affiliation_to_newick.py. It takes as input the output of the script scripts/add_multiaffi_to_abd_table.py, and can also process the multi-db affiliation (output of scripts/merge_affiliation_from_multiple_db.py). It uses only the column blast_taxonomy_cleaned, which is the taxonomy without imprecise taxa such as unknown ranks and "... metagenome" entries.

python scripts/affiliation_to_newick.py --affiliation_table abundance_and_affiliation_16S23S_multiDB.tsv --samples 8  -o abundance_and_multiaffi_16S23S_multiDB  -v 
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sfull.tsv --samples 8 -o  abundance_and_multiaffi_16Sfull -v 
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16S23S_custom.tsv --samples 8 -o  abundance_and_multiaffi_16S23S_custom -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sv3v4.tsv --samples 8 -o  abundance_and_multiaffi_16Sv3v4 -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_affiliation_16S23S_multiDB.tsv --samples 32  -o abundance_and_multiaffi_16S23S_multiDB  -v 
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sfull.tsv --samples 32 -o  abundance_and_multiaffi_16Sfull -v 
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16S23S_custom.tsv --samples 32 -o  abundance_and_multiaffi_16S23S_custom -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sv3v4.tsv --samples 32 -o  abundance_and_multiaffi_16Sv3v4 -v
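The core transformation — nesting the taxonomy strings by shared prefix into a newick string — can be sketched like this (illustrative only; the real affiliation_to_newick.py also writes the iTOL metadata files):

```python
def taxonomies_to_newick(taxonomies: list[str], sep: str = ";") -> str:
    """Build a newick string from ;-separated taxonomy paths."""
    # Nest the taxonomies into a prefix tree (a dict of dicts).
    tree: dict = {}
    for taxo in taxonomies:
        node = tree
        for taxon in taxo.split(sep):
            node = node.setdefault(taxon, {})

    def to_newick(node: dict) -> str:
        parts = []
        for name, children in node.items():
            label = name.replace(" ", "_")  # newick labels avoid spaces
            parts.append((to_newick(children) if children else "") + label)
        return "(" + ",".join(parts) + ")"

    return to_newick(tree) + ";"

print(taxonomies_to_newick(["Bacteria;Firmicutes;Bacilli",
                            "Bacteria;Firmicutes;Clostridia"]))
# (((Bacilli,Clostridia)Firmicutes)Bacteria);
```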

7. Slurm pipeline

For the 16S full analysis, all the FROGS steps can be launched through this slurm pipeline: frogs_pipeline

It uses slurm dependencies to launch each step successively.