VK6K2ZKRI4XTJKHSJKHC3F4F447JCPEVHX4TQBJJPVUIB3WURIOAC
ZW5JL2RSM3UIBU4DDWMALMRDXGDWGOBW4YFUISRJY3HDEXVXYB7AC
RWZ457JG7GN4JJKPRNLB22OHOGYSKR756OQUWLJDAEXRCHFBK42AC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
2HYO3KBJWTOQFX2B3H4NHADODGNYOIDVINAUHC7JBTEKZMGE2J3AC
KJH535WVR3KSN625AUFRI3GXBUADDP7E5NF6IMBX743E44PAVBLQC
MN5IIXFRUQQTOCWN66V6ZWIPUOUOM62EWE4BAKKYCTK6FOWR3ZDAC
BMKGT3OO2GYX72RBA33J4KG2TRWKTLM2DAV4LITWZB6T3KFGIY3AC
flow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** TODO Ajout spliceAI
SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM POS ID REF ALT
1 9091 . A C
1 69091 . A C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab --dir 109 --merged --pick --use_given_ref --offline --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
SCHEDULED: <2023-05-01 Mon>
**** TODO ACMG incidental
SCHEDULED: <2023-05-01 Mon>
**** TODO Gnomad ?
SCHEDULED: <2023-05-01 Mon>
**** DONE Filtrer après VEP avec filter_vep
CLOSED: [2023-04-29 Sat 15:47]
nNon testé
*** HOLD Ancienne version
**** TODO HGVS
**** TODO Filtrer après VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** TODO Gnomad
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installat
flow
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Haplotype caller
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter variants
CLOSED: [2022-10-09 Sun 22:40]
*** DONE Filter common snp not clinvar path
CLOSED: [2022-11-07 Mon 23:00]
Voir [[*common dbSNP not clinvar patho][common dbSNP not clinvar patho]]
*** DONE Filter variant only in consensual sequence
CLOSED: [2022-11-08 Tue 22:23]
*** DONE Filter technical variants
CLOSED: [2022-11-19 Sat 21:34]
*** DONE Utilise AVX pour accélerer l'exécution
CLOSED: [2023-04-29 Sat 15:46]
Sans cela, on a l'avertissement
#+begin_quote
17:28:00.720 INFO PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported
17:28:00.721 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/nix/store/cy9ckxqwrkifx7wf02hm4ww1p6lnbxg9-gatk-4.2.4.1/bin/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_utils.so
17:28:00.733 WARN NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/Work/Users/apraga/bisonex/out/NA12878_NIST7035/preprocessing/applybqsr/libgkl_utils821485189051585397.so: libgomp.so.1: cannot open shared object file: No such file or directory)
17:28:00.733 WARN IntelPairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow :annotation:
*** KILL VEP : --gene-phenotype ?
CLOSED: [2023-04-18 mar. 18:32]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
*** DONE plugin VEP
CLOSED: [2023-04-18 mar. 18:32]
Cloner dépôt git avec plugin
Puis utiliser --dir_plugins
*** HOLD Utiliser code d’Alexis
*** TODO Nouvelle version avec VEP
Example avec --custom
https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
**** DONE Ajout spliceAI
CLOSED: [2023-05-18 Thu 11:02] SCHEDULED: <2023-04-30 Sun>
plugin VEP
***** DONE Télécharger les données
CLOSED: [2023-05-11 Thu 19:01]
Difficile d'automatiser, le lien est temporaire...
***** DONE PLugin
CLOSED: [2023-05-11 Thu 20:16]
***** DONE Séparer score en plusieurs colonnes
CLOSED: [2023-05-11 Thu 20:16]
Test avec ce fichier pour avoir une ligne avec annotation et une ligne sans
#CHROM POS ID REF ALT
1 9091 . A C
1 69091 . A C
et
#+begin_src sh
rm -f postvep.tsv* && vep -i testspliceai.vcf.gz -o postvep.tsv --tab --dir 109 --merged --pick --use_given_ref --offline --plugin SpliceAI,snv=spliceai_scores.raw.snv.hg38.vcf.gz,indel=spliceai_scores.raw.indel.hg38.vcf.gz
#+end_src
#+begin_src
$ bgzip postvep.tsv
$ python spliceai.py
$ cat postvep2.tsv
,variation,Location,Allele,Gene,Feature,Feature_type,Consequence,cDNA_position,CDS_position,Protein_position,Amino_acids,Codons,Existing_variation,IMPACT,DISTANCE,STRAND,FLAGS,REFSEQ_MATCH,SOURCE,REFSEQ_OFFSET,SpliceAI_AG,SpliceAI_AL,SpliceAI_DG,SpliceAI_DL
0,1_9091_A/C,1:9091,C,ENSG00000290825,ENST00000456328,Transcript,upstream_gene_variant,-,-,-,-,-,-,MODIFIER,2778,1,-,-,Ensembl,-,,,,
1,1_69091_A/C,1:69091,C,ENSG00000186092,ENST00000641515,Transcript,missense_variant,124,64,22,M/L,Atg/Ctg,-,MODERATE,-,1,-,-,Ensembl,-,0.01,0.00,0.00,0.01
#+end_src
Test
cp work/bf/437ae511958509e43072f032f4d495/small.tab.gz tests/vep-spip.tab.gz
cp work/d5/3b1244b5ae83d54409ee0d456e8c55/small_cadd.tab.gz tests/vep-cadd-splice.tab.gz
**** TODO Ajout LOEUF et pli
plugin VEP
**** TODO NMD
**** KILL Ajout LOEUF
CLOSED: [2023-04-19 mer. 16:32]
plugin VEP
**** DONE Spip
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
BED ne semble pas bien marcher (il faut définir une zone)
VCF : trop d’information
Attention, plusieurs transcripts mais résultats identiques. On supprimer les doublons
***** DONE interpretation + score + intervalle de confiance séparé
CLOSED: [2023-05-01 Mon 23:07] SCHEDULED: <2023-04-30 Sun>
Tests :
dans tests/
vep -i 63004925-small.vcf -o postvep.vcf --vcf --fasta genomeRef.fna --dir 109 --merged --pick --offline --custom ../script/spip_annotation.vcf.gz,SPIP,vcf,exact,0,spipInterp,spipScore,spipConfidence
***** DONE Score
CLOSED: [2023-04-22 Sat 15:30]
**** DONE CADD: remplacer par plugin VEP
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-07 Sun>
***** Test
#+begin_src
vep -i test.vcf -o lol.vcf --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --dir_plugins ../VEP_plugins/ -v
#+end_src
Test
#+begin_src sh
vep --id "1 230710048 230710048 A/G 1" --offline --dir /Work/Projects/bisonex/data/vep/GRCh38/ --merged --vcf --fasta /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna --plugin CADD,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.snv.tsv.gz,/Work/Users/apraga/bisonex/work/13/9287a7fef17ab9365f5696f20710cd/gnomad.genomes.r3.0.indel.tsv.gz --hgvsg --plugin pLI --plugin LOEUF -o lol
#+end_src
CSQ=G|missense_variant|MODERATE|AGT|ENSG00000135744|Transcript|ENST00000366667|protein_coding|2/5||||843|776|259|M/T|aTg/aCg|||-1||HGNC|HGNC:333||Ensembl||A|A||1:g.230710048A>G|0.347|-0.277922|
Correspond bien à https://www.ensembl.org/Homo_sapiens/Tools/VEP/Results?tl=I7ZsIbrj14P6lD43-9115494
***** DONE Utiliser whole genome
CLOSED: [2023-04-29 Sat 15:46]
***** KILL Renommer les chromosome avant ...
CLOSED: [2023-05-01 Mon 09:14] SCHEDULED: <2023-04-30 Sun>
Trop long !
- Téléchargement de CADD: 4h20
- renommer les chromosome pour SNV : 6h20
- tabix sur les SNV : job tué au bout de 21h....
***** DONE annoter séparément et fusionner les tableaux
CLOSED: [2023-05-07 Sun 14:45] SCHEDULED: <2023-05-01 Mon>
NB: on pourrait filtrer CADD avec tabix pour se restreindre à nos variants
**** DONE clinvar
CLOSED: [2023-04-22 Sat 15:31]
**** KILL Vérifier résultats HGVS avec mutalyzer
CLOSED: [2023-05-01 Mon 09:26]
**** TODO Parallélisation
***** HOLD par chromosome avec workflow VEP
https://github.com/Ensembl/ensembl-vep/blob/release/109/nextflow/workflows/run_vep.nf
***** HOLD Avec option --fork
**** DONE Utiliser la version de nf-core de VEP
CLOSED: [2023-05-13 Sat 18:27] SCHEDULED: <2023-05-07 Sun>
**** DONE OMIM
CLOSED: [2023-05-08 Mon 15:02] SCHEDULED: <2023-05-01 Mon>
**** TODO Grantham
SCHEDULED: <2023-05-01 Mon>
**** TODO ACMG incidental
SCHEDULED: <2023-05-01 Mon>
**** TODO Gnomad ?
SCHEDULED: <2023-05-01 Mon>
**** DONE Filtrer après VEP avec filter_vep
CLOSED: [2023-04-29 Sat 15:47]
nNon testé
*** TODO Comparer les annotations sur 63003856
SCHEDULED: <2023-05-18 Thu>
**** Relancer le nouveau pipeline
*** HOLD Ancienne version
**** TODO HGVS
**** TODO Filtrer après VEP
**** TODO OMIM
**** TODO clinvar
**** TODO ACMG incidental
**** TODO Grantham
**** KILL LRG
CLOSED: [2023-04-18 mar. 17:22] SCHEDULED: <2023-04-18 Tue>
Vu avec alexis, n’est plus à jour
**** TODO Gnomad
** DONE Porter exactement la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installat
reSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy | SNP | 0.8552 | 0.9708 |
| Bisonex | vcfeval | SNP | 0.8547 | 0.9727 |
| Bisonex | Happy | INDEL | 0.7105 | 0.6929 |
| Bisonex | vcfeval | Non-SNP | 0.7139 | 0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB | happy | INDEL | 0.7551 | 0.7415 |
| GIAB | vcfeval | INDEL | 0.7598 | 0.7445 |
| GIAB | happy | SNP | 0.8937 | 0.9621 |
| giab | vcfeval | SNP | 0.8937 | 0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type | Recall | Precision |
| HG002 | happy | INDEL | 0.851495 | 0.923616 |
| HG002 | happy | SNP | 0.905926 | 0.992158 |
| HG002 | vcfeval | indel | 0.8523 | 0.9212 |
| HG002 | vcfeval | snp | 0.9054 | 0.9934 |
| HG003 | vcfeval | indel | 0.8363 | 0.9115 |
| HG003 | vcfeval | snp | 0.9069 | 0.9928 |
| HG003 | happy | INDEL | 0.838521 | 0.917296 |
| HG003 | happy | SNP | 0.907466 | 0.991204 |
| HG004 | happy | INDEL | 0.856835 | 0.925086 |
| HG004 | happy | SNP | 0.905067 | 0.992704 |
| HG004 | vcfeval | indel | 0.8568 | 0.9240 |
| HG004 | vcfeval | snp | 0.9048 | 0.9938 |
**** DONE télécharger données avec Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
SCHEDULED: <2023-04-30 Sun>
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | SNV | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | SNV | 23946 |
reSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO GIAB Validation :giab:
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** DONE GIAB : exome :giab:
CLOSED: [2023-04-16 Sun 16:33]
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy | SNP | 0.8552 | 0.9708 |
| Bisonex | vcfeval | SNP | 0.8547 | 0.9727 |
| Bisonex | Happy | INDEL | 0.7105 | 0.6929 |
| Bisonex | vcfeval | Non-SNP | 0.7139 | 0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB | happy | INDEL | 0.7551 | 0.7415 |
| GIAB | vcfeval | INDEL | 0.7598 | 0.7445 |
| GIAB | happy | SNP | 0.8937 | 0.9621 |
| giab | vcfeval | SNP | 0.8937 | 0.9621 |
***** DONE HG002, HG003, HG004
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Patient | Algorithm | Type | Recall | Precision |
| HG002 | happy | INDEL | 0.851495 | 0.923616 |
| HG002 | happy | SNP | 0.905926 | 0.992158 |
| HG002 | vcfeval | indel | 0.8523 | 0.9212 |
| HG002 | vcfeval | snp | 0.9054 | 0.9934 |
| HG003 | vcfeval | indel | 0.8363 | 0.9115 |
| HG003 | vcfeval | snp | 0.9069 | 0.9928 |
| HG003 | happy | INDEL | 0.838521 | 0.917296 |
| HG003 | happy | SNP | 0.907466 | 0.991204 |
| HG004 | happy | INDEL | 0.856835 | 0.925086 |
| HG004 | happy | SNP | 0.905067 | 0.992704 |
| HG004 | vcfeval | indel | 0.8568 | 0.9240 |
| HG004 | vcfeval | snp | 0.9048 | 0.9938 |
**** DONE télécharger données avec Nextflow
CLOSED: [2023-04-16 Sun 16:32]
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** KILL VCF de référence
CLOSED: [2023-04-16 Sun 16:32]
****** TODO NA12878 (HG001)
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* TODO Fastq hiseq sans trimming
SCHEDULED: <2023-05-25 Thu>
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** KILL Chinese trio HG005, 6, 7
CLOSED: [2023-04-16 Sun 16:32]
***** KILL Fastq :fastq:
CLOSED: [2023-04-16 Sun 16:32]
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** KILL Chinese trio
CLOSED: [2023-04-16 Sun 16:32]
Whole exome pour HG005 seulement
******* KILL HG005
CLOSED: [2023-04-16 Sun 16:32]
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | SNV | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | SNV | 23946 |
--test.compare=happy,vcfeval --test.query=giab --test.id=HG001
#+end_src
Notre version avec hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
On concatene les csv avec une colonne indicant le type
# awk '{if (NR==1) {print "Data,Algorithm" $0} else {print "bisonx,happy,"$0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| INDEL | PASS | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| SNP | ALL | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
| SNP | PASS | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+-----
-----+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| SNP | ALL | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.6888675840288743 |
| SNP | PASS | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** TODO Résultats sans trimming
SCHEDULED: <2023-04-30 Sun>
**** DONE HG002 :hg002:
CLOSED: [2023-04-14 Fri 09:54] SCHEDULED: <2023-04-10 Mon>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
------------------------------------------------------------------------------------
--test.compare=happy,vcfeval --test.query=giab --test.id=HG001
#+end_src
Notre version avec hap.py + vcfeval
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareNA12878 --test.vcfeval --test.query="out/NA12878_NIST/variantCalling/haplotypecaller/NA12878_NIST.vcf.gz" --test.happy
#+end_src
On concatene les csv avec une colonne indicant le type
# awk '{if (NR==1) {print "Data,Algorithm" $0} else {print "bisonx,happy,"$0}}' compareNA12878/happy/NA12878.summary.csv
compareNA12878/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| INDEL | PASS | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 | 0.281924 | 0.701629 | | | 1.6174985978687606 | 3.0674091441969518 |
| SNP | ALL | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
| SNP | PASS | 46032 | 39367 | 6665 | 44599 | 1186 | 4042 | 304 | 30 | 0.855209 | 0.970757 | 0.09063 | 0.909327 | 2.529551552318896 | 2.402150701647346 | 1.6206857273037931 | 1.6273423688862698 |
compareNA12878/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
|-----------+-------------------+---------------+-----------+-----------+-----------+-------------+-----------|
| 3.000 | 42789 | 42416 | 2598 | 8080 | 0.9423 | 0.8412 | 0.8889 |
| None | 42798 | 42425 | 2616 | 8071 | 0.9419 | 0.8413 | 0.8888 |
Indel avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.71390.7136
SNP avec le plus petit seuil : zcat NA12878.non_snp_roc.tsv.gz
Attention à inverser precision et recall !
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.85470.9727
compareNA12878-giab/vcfeval/NA12878.summary.txt
| Threshold | True-pos-baseline | True-pos-call | False-pos | False-neg | Precision | Sensitivity | F-measure |
| 1.000 | 44812 | 44812 | 2878 | 6057 | 0.9397 | 0.8809 | 0.9093 |
| None | 44813 | 44813 | 2882 | 6056 | 0.9396 | 0.8809 | 0.9093 |
SNP:
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.89370.9621
indel
$ zcat NA12878.non_snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
0.75980.7445
compareNA12878-giab/happy/NA12878.summary.csv
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| INDEL | PASS | 4871 | 3678 | 1193 | 7036 | 1299 | 2011 | 208 | 217 | 0.755081 | 0.741493 | 0.285816 | 0.748225 | | | 1.6174985978687606 | 2.5240506329113925 |
| SNP | ALL | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.6888675840288743 |
| SNP | PASS | 46032 | 41138 | 4894 | 47694 | 1622 | 4930 | 362 | 31 | 0.893683 | 0.962071 | 0.103367 | 0.926617 | 2.529551552318896 | 2.4124463519313304 | 1.6206857273037931 | 1.688867584028874 |
***** TODO Résultats sans trimming
SCHEDULED: <2023-05-25 Thu>
**** DONE HG002 :hg002:
CLOSED: [2023-04-14 Fri 09:54] SCHEDULED: <2023-04-10 Mon>
#+begin_src
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/giabFastq.nf -profile standard,helios
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios -resume --input="/Work/Groups/bisonex/data/giab/GRCh38/HG002_{1,2}.fq.gz --test.id=HG002
Only the capture file differs. Results are better using the capture file given by Agilent, stored in data/
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG002 --test.id=HG002 --test.query=out/HG002_1/variantCalling/haplotypecaller/HG002_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#
#+end_src
***** DONE Mauvais résultats
CLOSED: [2023-04-14 Fri 09:42]
avec vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 24585 24390 10060 39415 0.7080 0.3841 0.4980
None 24585 24390 10060 39415 0.7080 0.3841 0.4980
La sortie du variantCalling est celle d'happy ???
On relance...
***** DONE Vérifier vcf en hg38
CLOSED: [2023-04-12 Wed 10:33] SCHEDULED: <2023-04-12 Wed>
***** KILL Capture en hg19 ?
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-12 Wed>
***** KILL Vraiment fichier de capture ou zone d'intérêt ?
CLOSED: [2023-04-13 Thu 09:45] SCHEDULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
------------------------------------------------------------------------------------
ient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
***** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
****** DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
#+end_src
#+RESULTS:
****** DONE Profile github HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq test-gatad2b-hiseq2000/single_2.fq | samtools sort -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
****** KILL Tester exemple sur github
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
****** KILL Centrer la fenêtre sur les zones de capture
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp par défaut, ce qui est plus grand que les zones de captures...
Changer fragzip ne fonctionne pas
Si on rajoute un offset sur l'exon: 200bp, est encore plus allongé
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
Si on désactive les target ?
Regarder les target sur le chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
samtools index single.bam
#+end_src
**** TODO Vérifier tous les variants sont retrouvés
SCHEDULED: <2023-04-22 Sat>
***** DONE Après alignement
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
****** DONE SNV: avec doublons
CLOSED: [2023-04-28 Fri 18:12]
On utilise [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Nombreux faux homozygouteS
Vérification avec checkFalseHemizygous(snv) : nombreux doublons dans le fichier pour simuscop...
****** DONE SNV sans doublons
CLOSED: [2023-04-29 Sat 18:27]
******* DONE 18 faux homozygote mais avec peu de reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000022.11 42213078 g.42213078T>G snv heterozygous T G 0 1 1
2 │ NC_000012.12 101680427 g.101680427C>A snv heterozygous C A 0 3 3
3 │ NC_000014.9 105385684 g.105385684G>C snv heterozygous G C 0 4 4
4 │ NC_000011.10 125978299 g.125978299C>T snv heterozygous C T 0 3 3
5 │ NC_000023.11 77998618 g.77998618C>T snv heterozygous C T 0 2 2
6 │ NC_000015.10 66703292 g.66703292C>T snv heterozygous C T 0 3 3
7 │ NC_000010.11 87961118 g.87961118G>A snv heterozygous G A 0 3 3
8 │ NC_000012.12 112477719 g.112477719A>G snv heterozygous A G 0 2 2
9 │ NC_000020.11 6778406 g.6778406C>T snv heterozygous C T 0 3 3
10 │ NC_000023.11 68192943 g.68192943G>A snv heterozygous G A 0 2 2
11 │ NC_000004.12 987858 g.987858C>T snv heterozygous C T 0 3 4
12 │ NC_000015.10 66435145 g.66435145G>A snv heterozygous G A 0 1 2
13 │ NC_000002.12 47809595 g.47809595C>T snv heterozygous C T 0 2 2
14 │ NC_000003.12 136477305 g.136477305C>G snv heterozygous C G 0 4 4
15 │ NC_000005.10 157285458 g.157285458C>T snv heterozygous C T 0 3 3
16 │ NC_000012.12 23604413 g.23604413T>G snv heterozygous T G 0 5 5
17 │ NC_000019.10 52219703 g.52219703C>T snv heterozygous C T 0 1 1
18 │ NC_000016.10 88856757 g.88856757C>T snv heterozygous C T 0 8 8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000015.10 74343027 g.74343027C>T snv heterozygous C T 0 0 0
2 │ NC_000011.10 20638345 g.20638345A>G snv heterozygous A G 0 0 0
3 │ NC_000004.12 139370252 g.139370252C>T snv heterozygous C T 0 0 2
4 │ NC_000017.11 61966475 g.61966475G>T snv heterozygous G T 0 0 0
5 │ NC_000019.10 54144058 g.54144058G>A snv heterozygous G A 0 0 0
6 │ NC_000023.11 77635947 g.77635947A>G snv hemizygous A G 0
0 0
7 │ NC_000005.10 1258495 g.1258495G>A snv heterozygous G A 0 0 0
8 │ NC_000012.12 2449086 g.2449086C>G snv heterozygous C G 0 0 0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10 16014966 g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10 74343027 g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** TODO [#B] 200x
SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10 74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval
SCHEDULED: <2023-05-01 Mon>
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
82.000 540 540 60 45 0.9000 0.9231 0.9114
None 546 546 329 39 0.6240 0.9333 0.7479
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision
ient ?
- mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
***** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
****** DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11 153817371 153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
#+end_src
#+RESULTS:
****** DONE Profile github HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq test-gatad2b-hiseq2000/single_2.fq | samtools sort -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
****** KILL Tester exemple sur github
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
****** KILL Centrer la fenêtre sur les zones de capture
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp par défaut, ce qui est plus grand que les zones de captures...
Changer fragzip ne fonctionne pas
Si on rajoute un offset sur l'exon: 200bp, est encore plus allongé
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
Si on désactive les target ?
Regarder les target sur le chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq test-gatad2b/single_2.fq | samtools sort -o single.bam
samtools index single.bam
#+end_src
**** TODO Vérifier tous les variants sont retrouvés en 200x
SCHEDULED: <2023-04-22 Sat>
***** DONE Après alignement
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
****** DONE SNV: avec doublons
CLOSED: [2023-04-28 Fri 18:12]
On utilise [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Nombreux faux homozygouteS
Vérification avec checkFalseHemizygous(snv) : nombreux doublons dans le fichier pour simuscop...
****** DONE SNV sans doublons
CLOSED: [2023-04-29 Sat 18:27]
******* DONE 18 faux homozygote mais avec peu de reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000022.11 42213078 g.42213078T>G snv heterozygous T G 0 1 1
2 │ NC_000012.12 101680427 g.101680427C>A snv heterozygous C A 0 3 3
3 │ NC_000014.9 105385684 g.105385684G>C snv heterozygous G C 0 4 4
4 │ NC_000011.10 125978299 g.125978299C>T snv heterozygous C T 0 3 3
5 │ NC_000023.11 77998618 g.77998618C>T snv heterozygous C T 0 2 2
6 │ NC_000015.10 66703292 g.66703292C>T snv heterozygous C T 0 3 3
7 │ NC_000010.11 87961118 g.87961118G>A snv heterozygous G A 0 3 3
8 │ NC_000012.12 112477719 g.112477719A>G snv heterozygous A G 0 2 2
9 │ NC_000020.11 6778406 g.6778406C>T snv heterozygous C T 0 3 3
10 │ NC_000023.11 68192943 g.68192943G>A snv heterozygous G A 0 2 2
11 │ NC_000004.12 987858 g.987858C>T snv heterozygous C T 0 3 4
12 │ NC_000015.10 66435145 g.66435145G>A snv heterozygous G A 0 1 2
13 │ NC_000002.12 47809595 g.47809595C>T snv heterozygous C T 0 2 2
14 │ NC_000003.12 136477305 g.136477305C>G snv heterozygous C G 0 4 4
15 │ NC_000005.10 157285458 g.157285458C>T snv heterozygous C T 0 3 3
16 │ NC_000012.12 23604413 g.23604413T>G snv heterozygous T G 0 5 5
17 │ NC_000019.10 52219703 g.52219703C>T snv heterozygous C T 0 1 1
18 │ NC_000016.10 88856757 g.88856757C>T snv heterozygous C T 0 8 8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000015.10 74343027 g.74343027C>T snv heterozygous C T 0 0 0
2 │ NC_000011.10 20638345 g.20638345A>G snv heterozygous A G 0 0 0
3 │ NC_000004.12 139370252 g.139370252C>T snv heterozygous C T 0 0 2
4 │ NC_000017.11 61966475 g.61966475G>T snv heterozygous G T 0 0 0
5 │ NC_000019.10 54144058 g.54144058G>A snv heterozygous G A 0 0 0
6 │ NC_000023.11 77635947 g.77635947A>G snv hemizygous A G 0 0 0
7 │ NC_000005.10 1258495 g.1258495G>A snv heterozygous G A 0 0 0
8 │ NC_000012.12 2449086 g.2449086C>G snv heterozygous C G 0 0 0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10 16014966 g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10 74343027 g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10 74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval
SCHEDULED: <2023-05-01 Mon>
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
82.000 540 540 60 45 0.9000 0.9231 0.9114
None 546 546 329 39 0.6240 0.9333 0.7479
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision
C 321 128 453
26 │ NC_000015.10 48488437 g.48488437T>C snv heterozygous T C 356 132 488
CLOSED: [2023-05-01 Mon 17:18]
***** KILL Chromosome1 15 :Test haplotype caller : échec car CIGARE non mis à jour
CLOSED: [2023-05-13 Sat 18:29] SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl `63003856_S135_chr15.bam` 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
puis
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
Aucun variant inséré
- base quality ok
-
****** DONE bam out : non appelé
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.f
na --tmp-dir . -L NC_000015.10 --bam-output debug.bam
****** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --linked-de-bruijn-graph
****** KILL regénérer fastq
CLOSED: [2023-05-13 Sat 18:29]
Non
***** KILL Générer bam données pour tous les chromosomes
CLOSED: [2023-05-13 Sat 18:29]
timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Avertissement:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Inserted.bam et excluded.bam (fichier avant le merge) ont l'air ok...
On réessaie à la main : ça passe
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
❯ mv test-all.bam `63003856_S135_inserted.bam` -f
❯ mv test-all.bam.bai `63003856_S135_chr15_inserted.bam.bai` -f
#+end_src
***** DONE BAm2fastq pour avoir CIGAR à jour : échec (variants "cachés")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
On lance la génération de bam depuis le mesocentro (la copie plante via le VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow après avec désactivé storeDir pour SAMTOOLS_BAM2FQ dans nextflow.config (pourquoi ??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ } from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName } from "${params.modulesDir}/samtools/sort/main"
workflow {
f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
// Important: use "-n" option !!
sortBamByName(f)
SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Puis
#+begin_src
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
***** KILL Lancer pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run main.nf -c nextflow.config -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
*** TODO Bamsurgeon
**** TODO Package nix
1. Patcher la recherche du génome de référence pour bien trouver les index (en utilisant une regexp comme nf-core)
2. Rajouter le chemin de picard dans les arguments
3. Option -O3 pour performance
**** DONE Test sur mini-bam: échec
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Sans spécfier le variant:
#+begin_quote
NC_000001.11 17651 17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
**** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Pas assez de reads, on prend le chromosome 22
#+begin_src sh
samtools view ../simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
Mésocentre
dans tests/bamsurgeno
#+begin_src
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
***** DONE SNV aléatoire:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11 17499704 17499704 0.2
On retrouve bien un variant à cette position A > T
***** DONE SNV avec ALT prédéfini : retrouvée dans IGV (mais pas dans pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11 17499704 17499704 0.2 G
***** DONE Variants patients chr22: ok IGV
CLOSED: [2023-05-15 Mon 23:23]
Fichier non trié donc
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
***** DONE Vérifier qu'il faut POS et POS+1: non
CLOSED: [2023-05-14 Sun 21:21]
**** TODO Variants cento
***** TODO SNV
SCHEDULED: <2023-05-15 Mon>
***** TODO del
***** TODO ins
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
* DONE Plot : ashkenazim trio
CLOSED: [2023-04-18 Tue 21:28] SCHEDULED: <2023-04-16 Sun>
/Entered on/ [2023-04-16 Sun 17:29]
C 321 128 453
26 │ NC_000015.10 48488437 g.48488437T>C snv heterozygous T C 356 132 488
CLOSED: [2023-05-01 Mon 17:18]
***** KILL Chromosome1 15 :Test haplotype caller : échec car CIGARE non mis à jour
CLOSED: [2023-05-13 Sat 18:29] SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl `63003856_S135_chr15.bam` 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
puis
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
Aucun variant inséré
- base quality ok
-
****** DONE bam out : non appelé
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.f
na --tmp-dir . -L NC_000015.10 --bam-output debug.bam
****** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna --tmp-dir . -L NC_000015.10 --linked-de-bruijn-graph
****** KILL regénérer fastq
CLOSED: [2023-05-13 Sat 18:29]
Non
***** KILL Générer bam données pour tous les chromosomes
CLOSED: [2023-05-13 Sat 18:29]
timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Avertissement:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Inserted.bam et excluded.bam (fichier avant le merge) ont l'air ok...
On réessaie à la main : ça passe
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
❯ mv test-all.bam `63003856_S135_inserted.bam` -f
❯ mv test-all.bam.bai `63003856_S135_chr15_inserted.bam.bai` -f
#+end_src
***** DONE BAm2fastq pour avoir CIGAR à jour : échec (variants "cachés")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
On lance la génération de bam depuis le mesocentro (la copie plante via le VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow après avec désactivé storeDir pour SAMTOOLS_BAM2FQ dans nextflow.config (pourquoi ??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ } from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName } from "${params.modulesDir}/samtools/sort/main"
workflow {
f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
// Important: use "-n" option !!
sortBamByName(f)
SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Puis
#+begin_src
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
***** KILL Lancer pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run main.nf -c nextflow.config -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
*** TODO Bamsurgeon :bamsurgeon:
**** DONE Package nix
CLOSED: [2023-05-18 Thu 11:05]
1. Patcher la recherche du génome de référence pour bien trouver les index (en utilisant une regexp comme nf-core)
2. Rajouter le chemin de picard dans les arguments
3. Option -O3 pour performance
**** DONE Test sur mini-bam: échec
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Sans spécfier le variant:
#+begin_quote
NC_000001.11 17651 17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
**** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Pas assez de reads, on prend le chromosome 22
#+begin_src sh
samtools view ../simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
Mésocentre
dans tests/bamsurgeno
#+begin_src
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
***** DONE SNV aléatoire:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11 17499704 17499704 0.2
On retrouve bien un variant à cette position A > T
***** DONE SNV avec ALT prédéfini : retrouvée dans IGV (mais pas dans pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11 17499704 17499704 0.2 G
***** DONE Variants patients chr22: ok IGV
CLOSED: [2023-05-15 Mon 23:23]
Fichier non trié donc
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
***** DONE Vérifier qu'il faut POS et POS+1: non
CLOSED: [2023-05-14 Sun 21:21]
**** TODO Variants cento
***** STRT SNV
SCHEDULED: <2023-05-15 Mon>
Attention à la mémoire: 32G ne semble pas suffire avec 12 threads
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/bamsurgeon.nf -profile standard,helios --input=tests/bamsurgeon/snv-cento.tsv -bg
#+end_src
ET
#+begin_src nextflow
workflow {
f = Channel.fromPath(params.input, checkIfExists: true)
bam = Channel.fromPath("simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam",
checkIfExists: true)
bamIndex = bam.map { it -> it + ".bai" }
downloadGenome | indexGenome
indexGenome.out.index | view
addSNV(f, bam, bamIndex, downloadGenome.out, indexGenome.out.index, indexGenome.out.dict, indexGenome.out.fai)
}
#+end_src
****** TODO Erreur ValueError: quality and sequence mismatch
******* DONE Idem avec dernière version sur github
CLOSED: [2023-05-18 Thu 14:36]
******* TODO Version 1.3
Test sur chr22: variants ok mais VAF=1...
***** TODO del
***** TODO ins
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]