Jean Mainguy authored
- HiFi metabarcoding analysis
- 1. Install FROGS 4.1.0
- 2. 16S full Frogs analysis with swarm
- 2.1. Preprocess
- 2.2. Clustering with swarm
- 2.3. Remove chimera
- 2.4. Cluster filters
- 2.5. Taxonomic affiliation
- 2.6. Affiliation stat
- 3. 16S full Dada2 analysis
- 3.1. Remove primers with cutadapt
- 3.2. Make dada2 ASV with a Rmarkdown
- 3.3. Launch frogs on dada2 ASVs
- 4. 16S-23S HiFi analysis
- 4.1. 16S-23S specificity at the preprocess step
- 4.2. 16S-23S specificity at the affiliation step
- 4.3. Filter out suspicious sequences
- 4.4. Affiliate 16S-23S clusters with Silva 16S and 23S databases
- 4.5. Post process affiliation tables given by Frogs
- 4.6. MultiDB approach: merge the 16S23S db affiliations with the 16S and 23S Silva db affiliations
- 4.6.1. Simple example
- 4.6.2. Multiaffiliation example
- 4.6.3. Scripts and commands to merge affiliations
- 5. Plot taxonomic rank of the affiliations
- 6. Generate newick tree of taxonomic affiliation and itol metadata
- 7. Slurm pipeline
HiFi metabarcoding analysis
1. Install FROGS 4.1.0
You can create a conda environment with all the dependencies needed by frogs by following the instructions on the Frogs github: https://github.com/geraldinepascal/FROGS#from-conda
- Get FROGS 4.1.0 repository
wget https://github.com/geraldinepascal/FROGS/archive/refs/tags/v4.1.0.tar.gz
tar -xf v4.1.0.tar.gz
rm v4.1.0.tar.gz
- Install with conda
conda env create --name frogs@4.1.0 --file FROGS-4.1.0/frogs-conda-requirements.yaml
# to use FROGS, first you need to activate your environment
conda activate frogs@4.1.0
2. 16S full Frogs analysis with swarm
2.1. Preprocess
Prepare data for the frogs preprocess step by making a tar.gz of all fastq files, using the tar option h to correctly handle symlinks.
cd fastq
tar -czvhf 16Sfull_zymomock.tar.gz *
Remove primers and dereplicate sequences with preprocess.py:
fastq_files="fastq/16Sfull_zymomock.tar.gz"
# 16S primers
five_prime_primer=AGRGTTYGATYMTGGCTCAG
three_prim_primer=AAGTCGTAACAAGGTARCY
nb_cpu_preprocess=2
preprocess.py longreads --min-amplicon-size 1000 --max-amplicon-size 2000 \
--five-prim-primer $five_prime_primer --three-prim-primer $three_prim_primer \
--input-archive $fastq_files -p $nb_cpu_preprocess
2.2. Clustering with swarm
nb_cpu_clustering=10
clustering.py --fastidious --distance 1 --input-fasta preprocess.fasta \
--input-count preprocess_counts.tsv \
--nb-cpus $nb_cpu_clustering --output-biom clustering.biom \
--output-fasta clustering.fasta --log-file clustering.log
clusters_stat.py -i clustering.biom \
-o clusters_stat_clustering.html \
--log-file clusters_stat_clustering.log
2.3. Remove chimera
nb_cpu_rm_chimera=12
remove_chimera.py --input-fasta clustering.fasta --input-biom clustering.biom --nb-cpus $nb_cpu_rm_chimera \
--out-abundance remove_chimera.biom --non-chimera remove_chimera.fasta \
--log-file remove_chimera.log --summary remove_chimera.html
clusters_stat.py -i remove_chimera.biom \
-o clusters_stat_remove_chimera.html \
--log-file clusters_stat_remove_chimera.log
2.4. Cluster filters
- MIN_SAMPLE_PRESENCE: keep clusters present in at least this number of samples.
nb_cpu_filters=1
#mem_filters=30GB
min_sample_presence=2
cluster_filters.py --min-abundance 0.00005 --min-sample-presence $min_sample_presence --nb-cpus $nb_cpu_filters --input-fasta remove_chimera.fasta --input-biom remove_chimera.biom \
--output-biom filters.biom --output-fasta filters.fasta --log-file filters.log --summary filters.html
2.5. Taxonomic affiliation
nb_cpu_affiliation=32
mem_affiliation=100GB
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'
database="/save/frogs/galaxy_databanks/SILVA/16S/silva_138.1_16S/silva_138.1_16S.fasta"
taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database \
--input-biom filters.biom \
--input-fasta filters.fasta --output-biom affiliation_abundance.biom \
--summary affiliation.html --log-file affiliation.log \
--taxonomy-ranks $taxonomic_ranks
2.6. Affiliation stat
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'
affiliations_stat.py -i affiliation_abundance.biom \
--tax-consensus-tag 'blast_taxonomy' \
--identity-tag 'perc_identity' \
--coverage-tag 'perc_query_coverage' \
--multiple-tag 'blast_affiliations' \
--rarefaction-ranks Family Genus Species \
--taxonomic-ranks $taxonomic_ranks
3. 16S full Dada2 analysis
Several 16S HiFi analyses with dada2 can be found in this repo: https://github.com/benjjneb/LRASManuscript
The Rmarkdown used here is an adaptation of this Rmarkdown made by Benjamin J Callahan: https://benjjneb.github.io/LRASManuscript/LRASms_Zymo.html
3.1. Remove primers with cutadapt
Dada2 is not as good as cutadapt at removing primers, and we cannot use the preprocess step of frogs because it transforms the fastq files into fasta files, while dada2 needs the fastq files.
Let's remove the primers with cutadapt using this bash script: cutadapt_16S_primers.sh.
3.2. Make dada2 ASVs with an Rmarkdown
First create a Dada2 conda environment.
conda create -n dada2 bioconductor-dada2 python jupyter
conda activate dada2
conda install -c r r-irkernel
conda install bioconductor-biomformat
conda install jupytext
conda install -c conda-forge r-pander
ASV creation is done by this Rmarkdown: dada2_analysis_simple
It can be launched through slurm with this sbatch script: launch_dada2_simple.sh.
The Rmarkdown creates two important files needed to continue the analysis with frogs:
- clustering.fasta
- clustering.biom
3.3. Launch frogs on dada2 ASVs
The frogs pipeline steps can be applied to these two files just as with a swarm clustering, starting from the remove chimera step.
4. 16S-23S HiFi analysis
The analysis steps are mostly identical to the 16S full analysis. Some parameters in the preprocess step need to be adjusted, and the affiliation step uses a specific 16S-23S database. A specific affiliation strategy can also be applied, using a 16S and a 23S database on top of the 16S-23S database to improve the affiliations.
Dada2 analysis of 16S-23S data does not work properly on real data.
4.1. 16S-23S specificity at the preprocess step
With 16S-23S sequences, the primers and expected amplicon lengths change compared to the 16S full analysis.
Example command to preprocess 16S-23S sequences:
preprocess.py pacbio-hifi --min-amplicon-size 3500 --max-amplicon-size 5000 \
--five-prim-primer AGRGTTTGATYHTGGCTCAG --three-prim-primer AGTTTDACTGGGGYGGT \
--input-archive fastq/16S-23S_zymomock.tar.gz -p 2
4.2. 16S-23S specificity at the affiliation step
The database used is built from NCBI RefSeq assemblies, so the taxonomy is the NCBI one, which uses 8 main ranks. These ranks need to be given to the frogs script taxonomic_affiliation.py using the parameter --taxonomy-ranks.
database="/work/project/seqoccin/metaG/marker_identification/region_identification_workflow/frogs_db_formatting/16S-23S_bacteria_and_archaea_refseq_2022_02_07/refseq-all_16S-23S_2022-02-07_flagged/refseq-all_16S-23S_2022-02-07_flagged.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species Strain'
nb_cpu_affiliation=32
mem_affiliation=20GB
taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
--input-fasta filters.fasta --output-biom affiliation_abundance.biom \
--summary affiliation.html --log-file affiliation.log --taxonomy-ranks $taxonomic_ranks
Affiliation stat and biom_to_tsv can be launched similarly to the 16S full data, but using the correct taxonomic ranks:
taxonomic_ranks='Domain Phylum Class Order Family Genus Species Strain'
affiliations_stat.py -i unfiltered_affiliation_abundance.biom \
--tax-consensus-tag 'blast_taxonomy' \
--identity-tag 'perc_identity' \
--coverage-tag 'perc_query_coverage' \
--multiple-tag 'blast_affiliations' \
--rarefaction-ranks Family Genus Species \
--taxonomic-ranks $taxonomic_ranks
# Biom to tsv
biom_to_tsv.py --input-biom unfiltered_affiliation_abundance.biom \
--input-fasta filters.fasta --output-tsv affiliation_abundance.tsv \
--output-multi-affi multiaff.tsv --log-file biom_to_tsv_affi.log
4.3. Filter out suspicious sequences
We have identified in our custom 16S-23S database some suspicious sequences that are likely coming from contamination or HGT. They are not informative and we therefore want to remove these affiliations.
For that we can use the frogs script affiliation_filters.py:
affiliation_filters.py --input-biom unfiltered_affiliation_abundance.biom --input-fasta filters.fasta \
--output-biom affiliation_abundance.biom --output-fasta affiliation_filtered.fasta \
--ignore-blast-taxa suspicious --delete --taxonomic-ranks Domain Phylum Class Order Family Genus Species Strain
We can now relaunch the affiliation stat.
affiliations_stat.py -i affiliation_abundance.biom --tax-consensus-tag 'blast_taxonomy' --identity-tag 'perc_identity' --coverage-tag 'perc_query_coverage' --multiple-tag 'blast_affiliations' --rarefaction-ranks Family Genus Species --taxonomic-ranks Domain Phylum Class Order Family Genus Species Strain
biom_to_tsv.py --input-biom affiliation_abundance.biom --input-fasta affiliation_filtered.fasta --output-tsv affiliation_abundance.tsv --output-multi-affi multiaff.tsv --log-file biom_to_tsv_affi-filtered.log
4.4. Affiliate 16S-23S clusters with Silva 16S and 23S databases
Let's affiliate our 16S-23S clusters with the 16S Silva database and the 23S Silva database.
16S Silva:
database="/save/frogs/galaxy_databanks/SILVA/16S/silva_138.1_16S/silva_138.1_16S.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'
taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
--input-fasta filters.fasta --output-biom affiliation_abundance_16Ssilva.biom \
--summary affiliation_16Ssilva.html --log-file affiliation_16Ssilva.log --taxonomy-ranks $taxonomic_ranks
affiliations_stat.py -i affiliation_abundance_16Ssilva.biom \
--tax-consensus-tag 'blast_taxonomy' \
--identity-tag 'perc_identity' \
--coverage-tag 'perc_query_coverage' \
--multiple-tag 'blast_affiliations' \
--rarefaction-ranks Family Genus Species \
--taxonomic-ranks $taxonomic_ranks
biom_to_tsv.py --input-biom affiliation_abundance_16Ssilva.biom \
--input-fasta filters.fasta --output-tsv affiliation_abundance_16Ssilva.tsv \
--output-multi-affi multiaff_16Ssilva.tsv --log-file biom_to_tsv_affi_16Ssilva.log
23S Silva:
database="/save/frogs/galaxy_databanks/SILVA/23S/silva_138.1/silva_138.1_23S.fasta"
taxonomic_ranks='Domain Phylum Class Order Family Genus Species'
taxonomic_affiliation.py --nb-cpus $nb_cpu_affiliation --reference $database --input-biom filters.biom \
--input-fasta filters.fasta --output-biom affiliation_abundance_23Ssilva.biom \
--summary affiliation_23Ssilva.html --log-file affiliation_23Ssilva.log --taxonomy-ranks $taxonomic_ranks
affiliations_stat.py -i affiliation_abundance_23Ssilva.biom \
--tax-consensus-tag 'blast_taxonomy' \
--identity-tag 'perc_identity' \
--coverage-tag 'perc_query_coverage' \
--multiple-tag 'blast_affiliations' \
--rarefaction-ranks Family Genus Species \
--taxonomic-ranks $taxonomic_ranks
biom_to_tsv.py --input-biom affiliation_abundance_23Ssilva.biom \
--input-fasta filters.fasta --output-tsv affiliation_abundance_23Ssilva.tsv \
--output-multi-affi multiaff_23Ssilva.tsv --log-file biom_to_tsv_affi_23Ssilva.log
4.5. Post process affiliation tables given by Frogs
To plot and merge the affiliations, we first need to postprocess the affiliation tables with the local python script add_multiaffi_to_abd_table.py.
For each database used, we use the affiliation abundance table and the multiaffi table generated by biom_to_tsv.py (launched on the biom produced by the affiliation step).
We also give identity and coverage thresholds.
Be careful: the taxonomic_ranks are not the same in the 16S23S custom db and in the Silva databases.
For the custom 16S23S database:
python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance-filtered.tsv --multiaffi_table multiaff-filtered.tsv --region 16S-23S --affi_db_name 16S23S_custom --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species Strain' -o abundance_and_multiaffi_16S23S_custom.tsv --min_identity 98 --min_coverage 99 -v
For 16S Silva database:
python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance_16Ssilva.tsv --multiaffi_table multiaff_16Ssilva.tsv --region 16S-23S --affi_db_name 16Ssilva --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species' -o abundance_and_multiaffi_16Ssilva.tsv --min_identity 98 --min_coverage 30 -v
For 23S Silva database:
python scripts/add_multiaffi_to_abd_table.py --abundance_table affiliation_abundance_23Ssilva.tsv --multiaffi_table multiaff_23Ssilva.tsv --region 16S-23S --affi_db_name 23Ssilva --taxonomic_ranks 'Domain Phylum Class Order Family Genus Species' -o abundance_and_multiaffi_23Ssilva.tsv --min_identity 98 --min_coverage 40 -v
4.6. MultiDB approach: merge the 16S23S db affiliations with the 16S and 23S Silva db affiliations
We defined weak and valid affiliations for each database used to affiliate the 16S23S amplicons.
For the 16S23S db, an affiliation is weak when it has <98% identity and <99% query coverage.
For the 16S and 23S Silva affiliations, the required coverage is lower because sequences in these databases cover our 16S23S amplicons only partially.
- 16S Silva weak affiliation: <98% identity and <30% query coverage.
- 23S Silva weak affiliation: <98% identity and <40% query coverage.
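As a minimal sketch (not the actual script logic; the dictionary keys and function name are illustrative assumptions), the weak/valid rule can be expressed as a predicate over per-database thresholds:

```python
# Illustrative sketch of the weak/valid rule described above.
# Per-database thresholds: (min identity %, min query coverage %).
# The db labels and names here are assumptions, not taken from the scripts.
THRESHOLDS = {
    "16S23S_custom": (98, 99),
    "16Ssilva": (98, 30),
    "23Ssilva": (98, 40),
}

def is_valid_affiliation(db, identity, coverage):
    """An affiliation is valid when it reaches both thresholds, weak otherwise."""
    min_identity, min_coverage = THRESHOLDS[db]
    return identity >= min_identity and coverage >= min_coverage
```

These thresholds mirror the --min_identity/--min_coverage values passed to add_multiaffi_to_abd_table.py above.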
Rule to merge the affiliations: when the 16S23S db affiliation is valid, we keep it. When the 16S23S db affiliation is weak, we look at the 16S and 23S Silva affiliations:
- Only one of the affiliations is valid --> we use it as the new affiliation for the cluster.
- Neither affiliation is valid --> the cluster is considered weakly affiliated.
- Both affiliations are valid --> we go down to the last common taxon. In case of multiaffiliation, if there is an affiliation shared by both databases, we use it as the final affiliation.
4.6.1. Simple example
16S affiliation:
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus alactolyticus
23S affiliation:
Bacteria;Firmicutes;Bacilli;Erysipelotrichales;Erysipelotrichaceae;Turicibacter;Turicibacter sp.
Final affiliation: Bacteria;Firmicutes;Bacilli;Multi-affiliation;Multi-affiliation;Multi-affiliation;Multi-affiliation
4.6.2. Multiaffiliation example
16S affiliations:
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus alactolyticus
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Lactococcus;Lactococcus lactis
23S affiliation:
Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus salivarius CCHSS3
Final affiliation: Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Multi-affiliation
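The "last common taxon" rule illustrated by the two examples above can be sketched as follows. This is a simplified reimplementation consistent with the examples, not the code of merge_affiliation_from_multiple_db.py:

```python
# Simplified sketch of the last-common-taxon merge (consistent with the
# examples above, not the actual merge_affiliation_from_multiple_db.py code).

def common_prefix(ranks_a, ranks_b):
    """Number of leading taxonomic ranks on which two affiliations agree."""
    n = 0
    for taxon_a, taxon_b in zip(ranks_a, ranks_b):
        if taxon_a != taxon_b:
            break
        n += 1
    return n

def merge_affiliations(affis_16S, affis_23S, n_ranks=7):
    """Merge (multi)affiliations from the 16S and 23S Silva databases.

    The best-agreeing pair of affiliations is kept down to their last
    common taxon; deeper ranks become 'Multi-affiliation'.
    """
    best = []
    for affi_16S in affis_16S:
        for affi_23S in affis_23S:
            ranks_16S = affi_16S.split(";")
            ranks_23S = affi_23S.split(";")
            depth = common_prefix(ranks_16S, ranks_23S)
            if depth > len(best):
                best = ranks_16S[:depth]
    return ";".join(best + ["Multi-affiliation"] * (n_ranks - len(best)))
```

Applied to the two examples above, this sketch reproduces both final affiliations: only the ranks shared by the best 16S/23S pair survive, the rest are flagged Multi-affiliation.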
4.6.3. Scripts and commands to merge affiliations
To merge the affiliations, we use the tables created by the script add_multiaffi_to_abd_table.py, which we give to the local script [merge_affiliation_from_multiple_db.py](scripts/merge_affiliation_from_multiple_db.py).
python scripts/merge_affiliation_from_multiple_db.py \
--table_16S23Sdb abundance_and_multiaffi_16S23S_custom.tsv \
--table_16Sdb abundance_and_multiaffi_16Ssilva.tsv \
--table_23Sdb abundance_and_multiaffi_23Ssilva.tsv \
-v \
--min_identity 98 --min_qcov_23S 40 --min_qcov_16S 30 --min_qcov_16S23S 98 \
-o affiliation_multiple_db_merged.tsv
5. Plot taxonomic rank of the affiliations
To describe how precise the affiliations are in diverse conditions (different targets or databases used), let's plot the abundance of the affiliations made at the different taxonomic ranks.
We can visualise that globally per analysis:
Or per sample:
We can produce these plots with the local python script plot_taxo_ranks.py
python scripts/plot_taxo_ranks.py --affi_tables abundance_and_multiaffi_16S23S_custom.tsv abundance_and_multiaffi_16Ssilva.tsv abundance_and_multiaffi_23Ssilva.tsv \
--labels 16S23S_customDB 1623S_16Ssilva 16S_23S_23Ssilva --samples 1:32 -v
It takes as input the output of the script add_multiaffi_to_abd_table.py.
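What such a plot counts can be sketched as follows: for each cluster, we look for the deepest rank that is actually resolved (i.e. not "Multi-affiliation") in a ';'-separated taxonomy string. This helper is illustrative only, not part of plot_taxo_ranks.py:

```python
# Illustrative helper (not part of plot_taxo_ranks.py): find the deepest
# taxonomic rank actually resolved in a ';'-separated taxonomy string.
RANKS = ["Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species"]

def deepest_resolved_rank(taxonomy):
    """Return the last rank whose taxon is not 'Multi-affiliation',
    or None when even the Domain is unresolved."""
    deepest = None
    for rank, taxon in zip(RANKS, taxonomy.split(";")):
        if taxon == "Multi-affiliation":
            break
        deepest = rank
    return deepest
```

Summing cluster abundances grouped by this deepest rank gives the per-analysis and per-sample bars described above.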
6. Generate newick tree of taxonomic affiliation and itol metadata
Affiliations of the different targets can be visualised as a taxonomic tree. To build this tree we can use the web tool iTOL [https://itol.embl.de/]. iTOL needs the tree in newick format and optional metadata files to customise the tree.
To generate the newick tree and the metadata, we can use the script affiliation_to_newick.py. It takes as input the output of the script add_multiaffi_to_abd_table.py, or it can also process the multi db affiliations (output of merge_affiliation_from_multiple_db.py). It uses only the column blast_taxonomy_cleaned, which is the taxonomy without imprecise taxa such as "unknown rank" and "something metagenome".
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_affiliation_16S23S_multiDB.tsv --samples 8 -o abundance_and_multiaffi_16S23S_multiDB -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sfull.tsv --samples 8 -o abundance_and_multiaffi_16Sfull -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16S23S_custom.tsv --samples 8 -o abundance_and_multiaffi_16S23S_custom -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sv3v4.tsv --samples 8 -o abundance_and_multiaffi_16Sv3v4 -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_affiliation_16S23S_multiDB.tsv --samples 32 -o abundance_and_multiaffi_16S23S_multiDB -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sfull.tsv --samples 32 -o abundance_and_multiaffi_16Sfull -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16S23S_custom.tsv --samples 32 -o abundance_and_multiaffi_16S23S_custom -v
python scripts/affiliation_to_newick.py --affiliation_table abundance_and_multiaffi_16Sv3v4.tsv --samples 32 -o abundance_and_multiaffi_16Sv3v4 -v
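The core conversion performed by such a script can be sketched as follows (a simplified, hypothetical reimplementation, not the code of affiliation_to_newick.py): build a trie over the ';'-separated taxonomies, then serialise it to newick.

```python
# Hypothetical sketch (not affiliation_to_newick.py): taxonomies -> newick.

def build_taxo_tree(taxonomies):
    """Build a nested dict (a trie over taxa) from ';'-separated taxonomies."""
    tree = {}
    for taxonomy in taxonomies:
        node = tree
        for taxon in taxonomy.split(";"):
            node = node.setdefault(taxon, {})
    return tree

def to_newick(tree):
    """Serialise the nested dict to a newick string with internal node labels."""
    def serialise(node, name):
        if not node:  # leaf
            return name
        children = ",".join(serialise(child, n) for n, child in node.items())
        return f"({children}){name}"
    return ",".join(serialise(child, n) for n, child in tree.items()) + ";"
```

A real implementation would additionally quote or sanitise taxon names containing spaces, parentheses or semicolons, which are meaningful characters in the newick format.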
7. Slurm pipeline
For the 16S full analysis, all the frogs steps can be launched through this slurm pipeline: frogs_pipeline
It uses slurm dependencies to launch each step successively.