EGBAM5IY6LCNHYLJ2LVMBMGOW4SAUDR4N3PFZOOEN77VSSBRTO2AC
KJH535WVR3KSN625AUFRI3GXBUADDP7E5NF6IMBX743E44PAVBLQC
VBMV6H3DSTLSVZCQUAKSA4CFT7LSANFNALLETHLCPFH6Y3XPEAZAC
RHWQQAAHNHFO3FLCGVB3SIDKNOUFJGZTDNN57IQVBMXXCWX74MKAC
K54WC2QNEXHVKE2YFU3DOCEUYZTGLINXBRLSHQX4BNDADWUMJGFAC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
5ULVEFBQOIIEPYRTQO4UMGYDD756DVKNHUL5OMNGWI7B3XTQ7KEAC
N5WHBFPQS5XZRXGLKUJTHN752TUA2SGAUSLR3LLP3DSLLEO3A5WQC
TUIGPRCLOWRHD2UQD3ZSWHWKKIJTRXB7LP3DDMTGEL4OFORXY24AC
* <2023-04-15 Sat> Running
Trail "Trail'n Loue"
10km 600M de dénivelé 1h30.
Pas de pause sur la montée !! Assez douce
PairHmm - Intel GKL Utils not loaded
17:28:00.733 WA
RN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow
*** TODO VEP
***** KILL Utiliser --gene-phenotype ?
CLOSED: [2023-03-15 mer. 13:43]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
***** TODO Plugin pour CADD, pLI, LOEUF ?
https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#cadd
CADD: n’a pas réussi à le faire fonctionner
pLI, LOEUF : non demandé
***** TODO Utiliser l'option --hgvsg pour remplaer hgvsg.r ?
Non fait par Alexis, oubli a priori
***** TODO Ajout spliceAI ?
*** TODO Spip
**** TODO Checksum sur données
*** TODO Filtrer après VEP
**** TODO Remplacer avec simplement bcftools filter ?
*** TODO OMIM
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO clinvar
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO ACMG incidental
**** TODO Inclure dans vep ?
*** TODO Grantham
*** TODO LRG
*** TODO Gnomad
** DONE Porter exament la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** Procédure d'installation nix + dependences pour VM CHU
SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** WAIT Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho do
PairHmm - Intel GKL Utils not loaded
17:28:00.733 WARN PairHMM - ***WARNING: Machine does not have the AVX instruction set support needed for the accelerated AVX PairHmm. Falling back to the MUCH slower LOGLESS_CACHING implementation!
17:28:00.763 INFO ProgressMeter - Starting traversal
#+end_quote
libgomp.so est fourni par gcc donc il faut charger le module
module load gcc@11.3.0/gcc-12.1.0
** KILL Utiliser subworkflow
CLOSED: [2023-04-02 Sun 18:08]
Notre version permet d'être plus souple
*** KILL Alignement
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
*** KILL Vep
CLOSED: [2023-04-02 Sun 18:08] SCHEDULED: <2023-04-05 Wed>
vcf_annotate_ensemblvep
** TODO Annotation avec nextflow
*** TODO VEP
***** KILL Utiliser --gene-phenotype ?
CLOSED: [2023-03-15 mer. 13:43]
Vu avec alexis : bases de données non à jour
https://www.ensembl.org/info/genome/variation/phenotype/sources_phenotype_documentation.html
***** TODO Plugin pour CADD, pLI, LOEUF ?
https://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#cadd
CADD: n’a pas réussi à le faire fonctionner
pLI, LOEUF : non demandé
***** TODO Utiliser l'option --hgvsg pour remplaer hgvsg.r ?
Non fait par Alexis, oubli a priori
***** TODO Ajout spliceAI ?
*** TODO Spip
**** TODO Checksum sur données
*** TODO Filtrer après VEP
**** TODO Remplacer avec simplement bcftools filter ?
*** TODO OMIM
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO clinvar
**** TODO Remplacer script R par bcftools ?
**** TODO Remplacer script R par vep ?
*** TODO ACMG incidental
**** TODO Inclure dans vep ?
*** TODO Grantham
*** TODO LRG
*** TODO Gnomad
** DONE Porter exament la version d'Alexis sur Helios
CLOSED: [2023-01-14 Sat 17:56]
Branche "prod"
** STRT Tester version d'alexis avec Nix
*** DONE Ajouter clinvar
CLOSED: [2022-11-13 Sun 19:37]
*** DONE Alignement
CLOSED: [2022-11-13 Sun 12:52]
*** DONE Haplotype caller
CLOSED: [2022-11-13 Sun 13:00]
*** TODO Filter
- [X] depth
- [ ] comon snp not path
Problème avec liste des ID
**** TODO variant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** Procédure d'installation nix + dependences pour VM CHU
SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:hg002:
** WAIT Non régression : version prod
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho do
0.959 | 0.057 |
| 2 | SNV | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
| 1 | indel | 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 |
| 2 | indel | 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c
4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
INDEL PASS 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
SNP ALL 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
SNP PASS 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
******* KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
******* DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
INDEL PASS 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
SNP ALL 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
SNP PASS 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
******* DONE Test 4 avec vcfeval sur vep_annot : idem
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 2984 2682 1840 3890296 0.5931 0.0008 0.0015
None 2984 2682 1841 3890296 0.5930 0.0008 0.0015
Exemple du log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
******* DONE Regarder quelques variants à la main
CLOSED: [2023-02-07 Tue 22:01]
Ex:
Il manque NC_000001.11 783006 . A G 50 PASS
Il y a A -> G et C -> A sur cette position
***** DONE Restreindre genome de référence
CLOSED: [2023-03-04 Sat 11:15]
****** Discussion Alexis
le pipeline prend en compte 5', 3', variant canoniques d'épissage + prédit spip
Le plus simple pour le moment est de restreindre seulement aux exons
GENCODE (version eu
vcf-output false | --sites-only-vcf-output false |
| --help false | --help false |
| --version false | --version false |
| --showHidden false | --showHidden false |
| --QUIET false | --QUIET false |
| --use-jdk-deflater false | --use-jdk-deflater false |
| --use-jdk-inflater false | --use-jdk-inflater false |
| --gcs-max-retries 20 | --gcs-max-retries 20 |
| --gcs-project-for-requester-pays | --gcs-project-for-requester-pays |
| --disable-tool-default-read-filters false PN:GATK ApplyBQSR | --disable-tool-default-read-filters false PN:GATK ApplyBQSR |
****** KILL Vérifier sha256sum
CLOSED: [2023-01-24 Tue 23:00]
alignment: différent
****** KILL Comparer bam
CLOSED: [2023-01-25 Wed 21:58]
/Work/Users/apraga/bisonex/script/files〉picard CompareSAMs LENIENT_LOW_MQ_ALIGNMENT=true LENIENT_DUP=true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam O=compare-bam.tsv
picard CompareSAMs -LENIENT_LOW_MQ_ALIGNMENT true -LENIENT_DUP true tmp_63003856_S135/63003856_S135.bam /Work/Groups/bisonex/ref/tmp_63003856_S135/63003856_S135.bam -O compare-bam.tsv
VN Program Record attribute differs.
File 1: 1.13
File 2: 1.10
SAM files differ.
[Tue Jan 24 23:12:50 CET 2023] picard.sam.CompareSAMs done. Elapsed time: 7.32 minutes.
***** DONE Relancer avec la même version de samtools
CLOSED: [2023-01-25 Wed 21:58]
Pas d'impact
***** TODO Comparer tsv de sortie
***** TODO Regarder où sont les variants différents
** TODO GIAB Validation : GIAB
https://github.com/ga4gh/benchmarking-tools
Prérequis :
- [[*hap.py][hap.py]]
- [[*NA12878][NA12878]]
*** TODO GIAB : exome :giab:
**** Notes
https://github.com/genome-in-a-bottle/giab_FAQ
**** Résultats résumés :resultats:
***** DONE HG001 :
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
| Données | Algorithm | Type | Recall | Precision |
|---------+-----------+---------+--------+-----------|
| Bisonex | Happy | SNP | 0.8552 | 0.9708 |
| Bisonex | vcfeval | SNP | 0.8547 | 0.9727 |
| Bisonex | Happy | INDEL | 0.7105 | 0.6929 |
| Bisonex | vcfeval | Non-SNP | 0.7139 | 0.7136 |
|---------+-----------+---------+--------+-----------|
| GIAB | happy | INDEL | 0.7551 | 0.7415 |
| GIAB | vcfeval | INDEL | 0.7598 | 0.7445 |
| GIAB | happy | SNP | 0.8937 | 0.9621 |
| giab | vcfeval | SNP | 0.8937 | 0.9621 |
***** DONE HG002
CLOSED: [2023-04-14 Fri 11:36] SCHEDULED: <2023-04-14 Fri>
Capture Agilent
| Algorithm | Type | Recall | Precision |
| happy | INDEL | 0.851495 | 0.923616 |
| happy | SNP | 0.905926 | 0.992158 |
| vcfeval | indel | 0.8523 | 0.9212 |
| vcfeval | snp | 0.9054 | 0.9934 |
***** DONE HG003
CLOSED: [2023-04-16 Sun 00:25] SCHEDULED: <2023-04-14 Fri>
| Algorithm | Type | METRIC.Recall | METRIC.Precision |
| vcfeval | indel | 0.8363 | 0.9115 |
| vcfeval | snp | 0.9069 | 0.9928 |
| happy | INDEL | 0.838521 | 0.917296 |
| happy | SNP | 0.907466 | 0.991204 |
***** DONE HG004
CLOSED: [2023-04-16 Sun 00:26]
| Algorithm | Type | METRIC.Recall | METRIC.Precision |
| happy | INDEL | 0.856835 | 0.925086 |
| happy | SNP | 0.905067 | 0.992704 |
| vcfeval | indel | 0.8568 | 0.9240 |
| vcfeval | snp | 0.9048 | 0.9938 |
**** TODO télécharger données avec Nextflow
***** DONE Renommer les chromosomes
CLOSED: [2023-02-17 Fri 19:30]
****** DONE Genome de reference NCBI
CLOSED: [2023-02-25 Sat 19:46]
****** DONE Bed avec les exons
CLOSED: [2023-03-29 Wed 23:04]
****** DONE hg19
CLOSED: [2023-02-26 Sun 22:37]
****** DONE hg38
CLOSED: [2023-03-29 Wed 23:04]
- [X] Télécharger hg19 : ok
- [X] convertir bed en interval list
picard BedToIntervalList -I exons_illumina.bed -O exons_illumina.list -SD ../../genome/GRCh19/genomeRef.dict
- [X] puis en hg38
picard LiftOverIntervalList -I exons_illumina.list -O exons_illumina_hg38.list --CHAIN hg19ToHg38.over.chain -SD ../../genome/GRCh38.p13/genomeRef.dict
- [X] puis en bed
***** TODO VCF de référence
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
******* DONE Bed, vcf
CLOSED: [2023-02-24 Fri 23:45]
****** DONE Ashkenazy trio HG002, HG003, HGQ004
CLOSED: [2023-04-06 Thu 21:43] SCHEDULED: <2023-04-01 Sat>
****** TODO Chinese trio HG005, 6, 7
***** TODO Fastq :fastq:
****** DONE NA12878 (HG001)
CLOSED: [2023-02-25 Sat 19:46]
******* DONE Fastq HiSeq
CLOSED: [2023-02-25 Sat 19:46]
On prend le Hiseq, qui est probablement ce qu'utilise Centogène :
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/
On utilisé les données "trimmés" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1069-7), i.e qui ont enlevé les fragments plus petits que la taille d'un read.
Informations:
- https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/Garvan_NA12878_HG001_HiSeq_Exome.README
- Sequencer: HiSeq2500
- kit: Nextera Rapid Capture Exome and Expanded Exome
Il y a 2 samples (NIST7035 et NIST7086), chacun sur 2 lanes -> à concaténer
NB : liste techno illumina https://www.illumina.com/systems/sequencing-platforms.html
Hiseq postérieur nextseq 550
******* DONE Capture : Exons (bed)
CLOSED: [2023-02-25 Sat 19:46]
https://ftp-trace.ncbi.nih.gov/ReferenceSamples/giab/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/nexterarapidcapture_expandedexome_targetedregions.bed.gz
****** DONE Ashkenazy trio HG002, HG003, HG004
CLOSED: [2023-04-15 Sat 23:24] SCHEDULED: <2023-04-05 Wed>
******* DONE Capture
CLOSED: [2023-04-15 Sat 23:24]
https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/wex_Agilent_SureSelect_v05_b37.baits.slop50.merged.list
******* DONE Capture Agilent
CLOSED: [2023-04-15 Sat 23:24]
******* DONE Bam à partir des fastq
CLOSED: [2023-04-15 Sat 23:24]
Bam + index + checksum
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
****** TODO Chinese trio
Whole exome pour HG005 seulement
******* TODO HG005
https://raw.githubusercontent.com/genome-in-a-bottle/giab_data_indexes/master/ChineseTrio/alignment.index.Chinesetrio_HG005_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015
**** DONE NA12878 / HG001 :na12878:
CLOSED: [2023-04-15 Sat 23:53]
***** DONE Discussion alexis : Mail
CLOSED: [2023-03-29 Wed 22:40]
Avec le patient NA12878 et comparaison avec hap.py du VCF de Genome In A Bottle ("gold" standard), on avait pour rappel
- sensibilité (=recall) 71% pour indel, 85% SNP
- précision (= VPP) 69 et 97% respectivement
| Type | TRUTH | TP | FN | QUERY | FP | UNK | FP.gt | FP.al | Recall | Precision |
| INDEL | 4871 | 3461 | 1410 | 7048 | 1554 | 1987 | 193 | 346 | 0.710532 | 0.692946 |
| SNP | 46032 | 39369 | 6663 | 44600 | 1186 | 4041 | 304 | 30 | 0.855253 | 0.970759 |
Les statistiques sur les génomes sont bien meilleurs (cf precisionFDA challenge).
Pour les exome, un article [1] a fait a des meilleures stats sur ce patient avec BWA et GATK mais ils ont moins de variant (on a presque un facteur 2 !).
Je soupçonne qu'on ne travaille pas sur les mêmes zones de capture (pas réussi à récupérer leur .bed)
| Exome | Type | TP | FP | FN | Sensitivity | Precision | F-Score | FDR |
| 1 | SNV | 23689 | 1397 | 613 | 0.975 | 0.944 | 0.959 | 0.057 |
| 2 | SNV | 23946 | 865 | 356 | 0.985 | 0.965 | 0.975 | 0.036 |
| 1 | indel | 1254 | 72 | 75 | 0.944 | 0.946 | 0.945 | 0.054 |
| 2 | indel | 1309 | 10 | 20 | 0.985 | 0.992 | 0.989 | 0.008 |
Pour essayer d'améliorer les statistiques :
- La version du génome GRC38 vs GRCh38.p13 ne change quasiment rien
- Désactiver dbSNP ne change strictement rien pour le variant calling
J'ai exploré les faux négatifs :
- la grande majorité n'est juste pas vue (ce n'est pas un problème d'haploïde/génotype)
- la répartition par chromosome est relativement homogène, sauf sur le 6 ()
- la majorité est en 5' et 3'UTR (selon Best refseq)
Conclusion: je pense m'arrêter là pour la validation du variant calling par manque de temps. Il faudrait creuser pour savoir pourquoi certains variants ne sont pas vus par GATK mais ce n'est pas la majorité. En tout cas, je peux justifier d'une première analyse pour la thèse.
Ça te va ?
[1]
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2928-9
Résultats ici https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-019-2928-9/MediaObjects/12859_2019_2928_MOESM8_ESM.pdf
***** DONE Comparaison
CLOSED: [2023-03-04 Sat 11:14]
HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna ./result/bin/hap.py /Work/Groups/bisonex/NA12878/HG001_GRCh38_1_22_v4.2.1
_benchmark_renamed.vcf.gz script/files/vcf/NA12878_NIST7035_vep_annot.vcf -f /Work/Groups/bison
ex/NA12878/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
na1878.slurm
#+begin_src slurm
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna
dir=/Work/Groups/bisonex/data/NA12878/GRCh38
hap.py ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz script/files/vcf/NA12878_NIST7035.vcf -f ${dir}/HG001_GRCh38_1_22_v4.2.1_benchmark.bed -o test
#+end_src
****** KILL beaucoup trop de faux négatifs
CLOSED: [2023-02-17 Fri 19:37]
******* DONE Test 1 : vep annot : beaucoup trop de faux négatif
CLOSED: [2023-02-06 lun. 13:40]
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
INDEL PASS 276768 274 276494 1500 257 968 26 15 0.000990 0.516917 0.645333 0.001976 NaN NaN 1.483361 6.129187
SNP ALL 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
SNP PASS 1937706 1193 1936513 3338 106 2037 11 2 0.000616 0.918524 0.610246 0.001231 2.0785 1.861183 1.539064 2.703663
******* KILL Test 3 : indexer vcf de reference
CLOSED: [2023-02-06 lun. 17:19]
Même résultat avec vcfeval, qui a besoin de la version indexée
******* DONE Test 3 sans filtre vep : idem
CLOSED: [2023-02-06 lun. 17:19]
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
INDEL PASS 276768 10535 266233 52169 10969 30616 3552 2122 0.038064 0.491069 0.586862 0.070652 NaN NaN 1.483361 0.509510
SNP ALL 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
SNP PASS 1937706 105753 1831953 357652 74634 177259 35111 797 0.054576 0.586270 0.495619 0.099857 2.0785 1.42954 1.539064 0.324923
******* DONE Test 4 avec vcfeval sur vep_annot : idem
CLOSED: [2023-02-06 lun. 17:18]
#+begin_src
#!/bin/bash
#SBATCH -c 4
#SBATCH -p smp
#SBATCH --time=01:00:00
#SBATCH --mem=32G
module load nix/2.11.0
export HGREF=/Work/Groups/bisonex/data-alexis-reference/genome/GRCh38_latest_genomic.fna dir=/Work/Groups/bisonex/data/NA12878/GRCh38
rtg vcfeval -b /Work/Groups/bisonex/data/NA12878/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz -c files/vcf/NA12878_NIST7035_vep_annot.vcf.gz -o test-rtg -t /Work/Groups/bisonex/data/genome/GRCh38.p13/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
1.000 2984 2682 1840 3890296 0.5931 0.0008 0.0015
None 2984 2682 1841 3890296 0.5930 0.0008 0.0015
Exemple du log
2023-02-06 13:50:14 Reference NC_000001.11 baseline contains 307854 variants.
2023-02-06 13:50:14 Reference NC_000001.11 calls contains 426 variants.
2023-02-06 13:50:15 Reference NC_000002.12 baseline contains 325877 variants.
2023-02-06 13:50:15 Reference NC_000002.12 calls contains 320 variants.
******* DONE Regarder quelques variants à la main
CLOSED: [2023-02-07 Tue 22:01]
Ex:
Il manque NC_000001.11 783006 . A G 50 PASS
Il y a A -> G et C -> A sur cette position
***** DONE Restreindre genome de référence
CLOSED: [2023-03-04 Sat 11:15]
****** Discussion Alexis
le pipeline prend en compte 5', 3', variant canoniques d'épissage + prédit spip
Le plus simple pour le moment est de restreindre seulement aux exons
GENCODE (version eu
ULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
**** DONE Résumer résultats pour Paul + article :resultats:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
*** TODO Platinum genome
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Fastq avec tous les variants centogène
*** TODO Extraire liste des variants
*** TODO Générer fastq
*** TODO Vérifier qu
'on les retrouve tous
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
ULED: <2023-04-12 Wed>
"target region" +/- 50bp
[[https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/OsloUniversityHospital_Exome_GATK_jointVC_11242015/README.txt][README]]
list file describing the variant calling regions (target regions extended with 50 bp on each end)
***** DONE .bed fourni par AGilent: sensbilité très mauvaise
CLOSED: [2023-04-13 Thu 09:46] SCHEDULED: <2023-04-13 Thu>
Agilent SureSelect Human All Exon V5 kit
Disponible en hg38
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
0.000 19653 19501 6410 21657 0.7526 0.4757 0.5830
None 19653 19501 6410 21657 0.7526 0.4757 0.5830
***** DONE Trier par nom avec samtools sort : bons résultats
CLOSED: [2023-04-14 Fri 09:25] SCHEDULED: <2023-04-13 Thu>
Avec capture fourni par GIAB
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
***** DONE Capture agilent légment meilleur que celui fourni par GIAB (padding ?)
CLOSED: [2023-04-14 Fri 09:48]
GIAB:
vcf eval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 57443 57032 984 6557 0.9830 0.8975 0.9383
None 57457 57046 1009 6543 0.9826 0.8978 0.9383
Happy
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
|-------+--------+-------------+----------+----------+-------------+----------+-----------+-------+-------+---------------+------------------+----------------+-----------------+------------------------+------------------------+---------------------------+---------------------------|
| INDEL | ALL | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| INDEL | PASS | 6150 | 5007 | 1143 | 6978 | 556 | 1346 | 151 | 168 | 0.814146 | 0.901278 | 0.192892 | 0.8555 | | | 1.5434221840068787 | 1.9467178175618074 |
| SNP | ALL | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
| SNP | PASS | 57818 | 52464 | 5354 | 56016 | 500 | 3046 | 90 | 30 | 0.907399 | 0.990561 | 0.054377 | 0.947158 | 2.4892012548262548 | 2.426824047458871 | 1.5904527117884357 | 1.6107795598657217 |
Agilent
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 37241 36965 449 4069 0.9880 0.9015 0.9428
None 37248 36972 461 4062 0.9877 0.9017 0.9427
| Type | Filter | TRUTH.TOTAL | TRUTH.TP | TRUTH.FN | QUERY.TOTAL | QUERY.FP | QUERY.UNK | FP.gt | FP.al | METRIC.Recall | METRIC.Precision | METRIC.Frac_NA | METRIC.F1_Score | TRUTH.TOTAL.TiTv_ratio | QUERY.TOTAL.TiTv_ratio | TRUTH.TOTAL.het_hom_ratio | QUERY.TOTAL.het_hom_ratio |
| INDEL | ALL | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| INDEL | PASS | 2909 | 2477 | 432 | 3229 | 207 | 519 | 52 | 50 | 0.851495 | 0.923616 | 0.160731 | 0.886091 | | | 1.4964850615114236 | 1.8339222614840989 |
| SNP | ALL | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
| SNP | PASS | 38406 | 34793 | 3613 | 36935 | 275 | 1868 | 37 | 15 | 0.905926 | 0.992158 | 0.050575 | 0.947083 | 2.6247759222568168 | 2.5752854654538417 | 1.588953331534934 | 1.6192536889897844 |
**** DONE HG003 :hg003:
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG003_{1,2}.fq.gz -bg
#+end_src
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/compareVCF.nf -profile standard,helios -resume --outdir=compareHG003 --test.id=HG003 --test.query=out/HG003_1/variantCalling/haplotypecaller/HG003_1.vcf.gz --test.compare=vcfeval,happy --test.capture=data/AgilentSureSelectv05_hg38.bed
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
5.000 36745 36473 486 3988 0.9869 0.9021 0.9426
None 36748 36476 495 3985 0.9866 0.9022 0.9425
$ zcat NA12878.snp_roc.tsv.gz | tail -n 1 | awk '{print $7 $6}'
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
INDEL PASS 2731 2290 441 3092 208 577 62 53 0.838521 0.917296 0.186611 0.876141 NaN NaN 1.505145 1.888993
SNP ALL 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.565915 1.555780 1.621727
SNP PASS 37997 34481 3516 36861 306 2074 33 13 0.907466 0.991204 0.056265 0.947488 2.611269 2.5659
**** DONE HG004
CLOSED: [2023-04-16 Sun 00:20]
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run main.nf -profile standard,helios --input /Work/Groups/bisonex/data/giab/GRCh38/HG004_{1,2}.fq.gz -bg
#+end_src
vcfeval
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
6.000 36938 36678 421 4040 0.9887 0.9014 0.9430
None 36942 36682 432 4036 0.9884 0.9015 0.9429
happy
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt FP.al METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
INDEL PASS 2787 2388 399 3183 195 580 53 38 0.856835 0.925086 0.182218 0.889654 NaN NaN 1.507834 1.848649
SNP ALL 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
SNP PASS 38185 34560 3625 36921 254 2107 46 7 0.905067 0.992704 0.057068 0.946862 2.589175 2.553546 1.632595 1.653534
**** DONE Résumer résultats pour Paul + article :resultats:
CLOSED: [2023-04-06 Thu 21:41] SCHEDULED: <2023-04-02 Sun>
*** TODO Platinum genome
https://emea.illumina.com/platinumgenomes.html
*** TODO Séquencer NA12878
Discussion avec Paul : sous-traitant ne nous donnera pas les données, il faut commander l'ADN
** TODO Fastq avec tous les variants centogène
*** TODO Extraire liste des variants
*** TODO Générer fastq
*** TODO Vérifier qu'on les retrouve tous
** Divers
*** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]