JXA2LMD35VKJZLEJAJYNZWU6RGGQ7HZDXFQHNCDHODQ4AHCQU5ZQC
6SBCU636KE3RXJXGL4SLDL2KEM5C6SRGUGHQNGIKRJTQRX5GVJIAC
VK6K2ZKRI4XTJKHSJKHC3F4F447JCPEVHX4TQBJJPVUIB3WURIOAC
KXGS6Y7FYRAG6W5UCCNQ44ESWXXU4FQO6KQY6E2TUKLRJXVJJP6QC
FXA3ZBV64FML7W47IPHTAJFJHN3J3XHVHFVNYED47XFSBIGMBKRQC
2DIFTF6IXG7IBBDDFV7NIPN2ZOVSYRM4TTA4PYRGGJF32PYP3H4QC
RWKIBFPIDVDJ3XLTWLWEC46FOJ3ID6VKP46U65UVRDERCRIZBD7AC
RHWQQAAHNHFO3FLCGVB3SIDKNOUFJGZTDNN57IQVBMXXCWX74MKAC
iant annotation
Besoin de vep
*** TODO Variant calling
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new.txt |
| 1070310 | total |
***** DONE Tester sur chromosome 19 et 20: ok
CLOSED: [2022-12-11 Sun 15:56]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | C
iant annotation
Besoin de vep
*** TODO Variant calling
** STRT Tester sarek
#+begin_src sh
module load apptainer/1.1.8
nextflow run nf-core/sarek -profile test,singularity --outdir test-sarek
#+end_src
Les dépendences ne se téléchargent pas correctement, on les extrait à la main
#+begin_src sh
rg -IN galaxyproject modules | sed 's/ //g;s/:$//' | sort | uniq > deps.txt
#+end_src
Nettoyage à la main
Puis
#+begin_src sh
cat deps.txt | xargs -L1 singularity pull
#+end_src
* Amélioration :amelioration:
* Documentation :doc:
** DONE Procédure d'installation nix + dependences pour VM CHU
CLOSED: [2023-04-22 Sat 15:27] SCHEDULED: <2023-04-13 Thu>
* Manuscript :manuscript:
* Tests :tests:
** KILL Non régression : version prod
CLOSED: [2023-05-23 Tue 08:46]
*** DONE ID common snp
CLOSED: [2022-11-19 Sat 21:36]
#+begin_src
$ wc -l ID_of_common_snp.txt
23194290 ID_of_common_snp.txt
$ wc -l /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
23194290 /Work/Users/apraga/bisonex/database/dbSNP/ID_of_common_snp.txt
#+end_src
*** DONE ID common snp not clinvar patho
CLOSED: [2022-12-11 Sun 20:11]
**** DONE Vérification du problème
CLOSED: [2022-12-11 Sun 16:30]
Sur le J:
21155134 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref
Version de "non-régression"
21155076 database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt
Nouvelle version
23193391 /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt
Si on enlève les doublons
$ sort database/dbSNP/ID_of_common_snp_not_clinvar_patho.txt | uniq > old.txt
$ wc -l old.txt
21107097 old.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt | uniq > new.txt
$ wc -l new.txt
21174578 new.txt
$ sort /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/ID_of_common_snp_not_clinvar_patho.txt.ref | uniq > ref.txt
$ wc -l ref.txt
21107155 ref.txt
Si on regarde la différence
comm -23 ref.txt old.txt
rs1052692
rs1057518973
rs1057518973
rs11074121
rs112848754
rs12573787
rs145033890
rs147889095
rs1553904159
rs1560294695
rs1560296615
rs1560310926
rs1560325547
rs1560342418
rs1560356225
rs1578287542
...
On cherche le premier
bcftools query -i 'ID="rs1052692"' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 1619351 C A,T
Il est bien patho...
$ bcftools query -i 'POS=1619351' database/clinvar/clinvar.vcf.gz -f '%CHROM %POS %REF %ALT %INFO/CLNSIG\n'
19 1619351 C T Conflicting_interpretations_of_pathogenicity
On vérifie pour tous les autres
$ comm -23 ref.txt old.txt > tocheck.txt
On génère les régions à vérifier (chromosome number:position)
$ bcftools query -i 'ID=@tocheck.txt' database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM\t%POS\n' > tocheck.pos
On génère le mapping inverse (chromosome number -> NC)
$ awk ' { t = $1; $1 = $2; $2 = t; print; } ' database/RefSeq/refseq_to_number_only_consensual.txt > mapping.txt
On remap clinvar
$ bcftools annotate --rename-chrs mapping.txt database/clinvar/clinvar.vcf.gz -o clinvar_remapped.vcf.gz
$ tabix clinvar_remapped.vcf.gz
Enfin, on cherche dans clinvar la classification
$ bcftools query -R tocheck.pos clinvar_remapped.vcf.gz -f '%CHROM %POS %INFO/CLNSIG\n'
$ bcftools query -R tocheck.pos database/dbSNP/dbSNP_common.vcf.gz -f '%CHROM %POS %ID \n' | grep '^NC'
#+RESULTS:
**** DONE Comprendre pourquoi la nouvelle version donne un résultat différent
CLOSED: [2022-12-11 Sun 20:11]
***** DONE Même version dbsnp et clinvar ?
CLOSED: [2022-12-10 Sat 23:02]
Clinvar différent !
$ bcftools stats clinvar.gz
clinvar (Alexis)
SN 0 number of samples: 0
SN 0 number of records: 1492828
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338007
SN 0 number of MNPs: 5562
SN 0 number of indels: 144580
SN 0 number of others: 3714
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
clinvar (new)
SN 0 number of samples: 0
SN 0 number of records: 1493470
SN 0 number of no-ALTs: 965
SN 0 number of SNPs: 1338561
SN 0 number of MNPs: 5565
SN 0 number of indels: 144663
SN 0 number of others: 3716
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
***** DONE Mettre à jour clinvar et dbnSNP pour travailler sur les mêm bases
CLOSED: [2022-12-11 Sun 12:10]
Problème persiste
***** DONE Supprimer la conversion en int du chromosome
CLOSED: [2022-12-10 Sat 19:29]
***** KILL Même NC ?
CLOSED: [2022-12-10 Sat 19:29]
$ zgrep "contig=<ID=NC_\(.*\)" clinvar/GRCh38/clinvar.vcf.gz > contig.clinvar
$ diff contig.txt contig.clinvar
< ##contig=<ID=NC_012920.1>
***** DONE Tester sur chromosome 19: ok
CLOSED: [2022-12-11 Sun 13:53]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10"' /Work/Groups/bisonex/data-alexis/dbSNP/dbSNP_common.vcf.gz -o dbSNP_common_19_old.vcf.gz
bcftools filter -i 'CHROM="19"' /Work/Groups/bisonex/data-alexis/clinvar/clinvar.vcf.gz -o clinvar_19_old.vcf.gz
#+end_src
On récupère les 2 versions du script
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
git checkout regression ../../script/pythonScript/clinvar_sbSNP.py
cp ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP_old.py
git checkout HEAD ../../script/pythonScript/clinvar_sbSNP.py
#+end_src
#+RESULTS:
On compare
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535194 | new.txt |
| 1070349 | total |
Si on prend le premier manquant dans new, il est conflicting patho donc il ne devrait pas y être...
$ bcftools query -i 'ID="rs10418277"' dbSNP
_common_19.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'ID="rs10418277"' dbSNP_common_19_old.vcf.gz -f '%CHROM %POS %REF %ALT\n'
NC_000019.10 54939682 C G,T
$ bcftools query -i 'POS=54939682' clinvar_19.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ bcftools query -i 'POS=54939682' clinvar_19_old.vcf.gz -f '%POS %REF %ALT %INFO/CLNSIG\n'
54939682 C G Conflicting_interpretations_of_pathogenicity
54939682 C T Benign
$ grep rs10418277 *.txt
new.txt:rs10418277
tmp.txt:rs10418277
Le problème venait de la POS qui n'était plus convertie en int (suppression de la ligne par erreur ??)
On vérifie
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
python ../../script/pythonScript/clinvar_sbSNP.py --clinvar clinvar_19.vcf.gz --dbSNP dbSNP_common_19.vcf.gz --output tmp.txt
sort tmp.txt | uniq > new.txt
table=/Work/Groups/bisonex/data-alexis/RefSeq/refseq_to_number_only_consensual.txt
python clinvar_sbSNP_old.py --clinvar clinvar_19_old.vcf.gz --dbSNP dbSNP_common_19_old.vcf.gz --output tmp_old.txt --chrm_name_table $table
sort tmp_old.txt | uniq > old.txt
wc -l old.txt new.txt
diff old.txt new.txt
#+end_src
#+RESULTS:
| 535155 | old.txt |
| 535155 | new.txt |
| 1070310 | total |
***** DONE Tester sur chromosome 19 et 20: ok
CLOSED: [2022-12-11 Sun 15:56]
On prépare les données
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/debug-commonsnp
PATH=$PATH:$HOME/.nix-profile/bin
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/dbSNP/GRCh38.p13/dbSNP_common.vcf.gz -o dbSNP_common_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | CHROM="NC_000020.11"' /Work/Groups/bisonex/data/clinvar/GRCh38/clinvar.vcf.gz -o clinvar_19_20.vcf.gz
bcftools filter -i 'CHROM="NC_000019.10" | C
7305 g.136477305C>G snv heterozygous C G 0 4 4
15 │ NC_000005.10 157285458 g.157285458C>T snv heterozygous C T 0 3 3
16 │ NC_000012.12 23604413 g.23604413T>G snv heterozygous T G 0 5 5
17 │ NC_000019.10 52219703 g.52219703C>T snv heterozygous C T 0 1 1
18 │ NC_000016.10 88856757 g.88856757C>T snv heterozygous C T 0 8 8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000015.10 74343027 g.74343027C>T snv heterozygous C T 0 0 0
2 │ NC_000011.10 20638345 g.20638345A>G snv heterozygous A G 0 0 0
3 │ NC_000004.12 139370252 g.139370252C>T snv heterozygous C T 0 0 2
4 │ NC_000017.11 61966475 g.61966475G>T snv heterozygous G T 0 0 0
5 │ NC_000019.10 54144058 g.54144058G>A snv heterozygous G A 0 0 0
6 │ NC_000023.11 77635947 g.77635947A>G snv hemizygous A G 0 0 0
7 │ NC_000005.10 1258495 g.1258495G>A snv heterozygous G A 0 0 0
8 │ NC_000012.12 2449086 g.2449086C>G snv heterozygous C G 0 0 0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10 16014966 g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10 74343027 g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10 74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval :haplotypecaller:
SCHEDULED: <2023-05-01 Mon>
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
82.000 540 540 60 45 0.9000 0.9231 0.9114
None 546 546 329 39 0.6240 0.9333 0.7479
****** TODO Comparer avec hap.py :haplotypecaller:
SCHEDULED: <2023-05-19 Fri>
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x --query=out/simuscop-centogene-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=centogene_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
12.000 519 519 55 66 0.9042 0.8872 0.8956
None 519 519 55 66 0.9042 0.8872 0.8956
****** DONE Méthode naïve 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** TODO Comparer avec hap.py
***** TODO Après filtre annotation
****** DONE Méthode naïve : 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO Comparer avec hap.py
****** TODO VCf eval
cd cento/annotation/
bgzip postvep-filter.vcf
tabix postvep-filter.vcf.gz
cd ../..
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
12.000 491 491 50 94 0.9076 0.8393 0.8721
None 491 491 50 94 0.9076 0.8393 0.8721
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py -r ../test-simuscop/genomeRef_chr1.fna -o lol -tr ../test-simuscop/gatad2b-exon6.bed -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : exome avec exons comme fasta mais ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
Par contre, rapide
**** DONE Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 19:25]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
On génère les exons
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
NC_000001.11 g.153817496 A>T
N'y est pas ??
***** DONE On test sur les 2 premiers : exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE Sur 200 premiers exons du chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
head -n200 exons.fna > exons-200.fna
bwa index exons-200.fna
#+end_src
Simulation avec 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
Attention, pour l'alignement, il faut le nfa complet ! Sinon erreur du type
Erreurs:::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
Et pas de bam avec
samtools sort: failed to change sort order header to 'coordinate'
#+begin_src
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
#+end_src
Manque des exons et l'allure ne correspond pas...
***** DONE Utiliser le fichier de capture : exons trop petits
CLOSED: [2023-04-30 Sun 19:25]
Comme pour ART
Trop court avec
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
Donc on ajoute 1000 de chaque côté
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.bed
../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
samtools index reseq.bam
#+end_src
**** KILL Sur le chromosome 15 puis trier à la main sur les zones de capture ?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_00
7305 g.136477305C>G snv heterozygous C G 0 4 4
15 │ NC_000005.10 157285458 g.157285458C>T snv heterozygous C T 0 3 3
16 │ NC_000012.12 23604413 g.23604413T>G snv heterozygous T G 0 5 5
17 │ NC_000019.10 52219703 g.52219703C>T snv heterozygous C T 0 1 1
18 │ NC_000016.10 88856757 g.88856757C>T snv heterozygous C T 0 8 8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
Row │ chrom pos variant variantType zygosity ref alt refCount altCount readsCount
│ SubStrin…? Int64 SubStrin…? String? String15 SubStrin… SubStrin… Int64 Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ NC_000015.10 74343027 g.74343027C>T snv heterozygous C T 0 0 0
2 │ NC_000011.10 20638345 g.20638345A>G snv heterozygous A G 0 0 0
3 │ NC_000004.12 139370252 g.139370252C>T snv heterozygous C T 0 0 2
4 │ NC_000017.11 61966475 g.61966475G>T snv heterozygous G T 0 0 0
5 │ NC_000019.10 54144058 g.54144058G>A snv heterozygous G A 0 0 0
6 │ NC_000023.11 77635947 g.77635947A>G snv hemizygous A G 0 0 0
7 │ NC_000005.10 1258495 g.1258495G>A snv heterozygous G A 0 0 0
8 │ NC_000012.12 2449086 g.2449086C>G snv heterozygous C G 0 0 0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10 16014966 g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10 74343027 g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10 74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
--output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval :haplotypecaller:
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
82.000 540 540 60 45 0.9000 0.9231 0.9114
None 546 546 329 39 0.6240 0.9333 0.7479
****** TODO Comparer avec hap.py :haplotypecaller:
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/checkInserted.nf -profile standard,helios --outdir=compare-simuscop-200x --query=out/simuscop-centogene-200x/cento/callVariant/haplotypecaller/cento.vcf.gz --truth=centogene_variants/variants_for_vcfeval.tsv.gz --id=simuscop-200x-check
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
12.000 519 519 55 66 0.9042 0.8872 0.8956
None 519 519 55 66 0.9042 0.8872 0.8956
****** DONE Méthode naïve 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** TODO Comparer avec hap.py
***** TODO Après filtre annotation
****** DONE Méthode naïve : 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO Comparer avec hap.py
****** TODO VCf eval
cd cento/annotation/
bgzip postvep-filter.vcf
tabix postvep-filter.vcf.gz
cd ../..
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
Threshold True-pos-baseline True-pos-call False-pos False-neg Precision Sensitivity F-measure
----------------------------------------------------------------------------------------------------
12.000 491 491 50 94 0.9076 0.8393 0.8721
None 491 491 50 94 0.9076 0.8393 0.8721
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py -r ../test-simuscop/genomeRef_chr1.fna -o lol -tr ../test-simuscop/gatad2b-exon6.bed -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : exome avec exons comme fasta mais ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
Par contre, rapide
**** DONE Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 19:25]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
On génère les exons
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
NC_000001.11 g.153817496 A>T
N'y est pas ??
***** DONE On test sur les 2 premiers : exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE Sur 200 premiers exons du chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
head -n200 exons.fna > exons-200.fna
bwa index exons-200.fna
#+end_src
Simulation avec 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
Attention, pour l'alignement, il faut le nfa complet ! Sinon erreur du type
Erreurs:::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
Et pas de bam avec
samtools sort: failed to change sort order header to 'coordinate'
#+begin_src
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
#+end_src
Manque des exons et l'allure ne correspond pas...
***** DONE Utiliser le fichier de capture : exons trop petits
CLOSED: [2023-04-30 Sun 19:25]
Comme pour ART
Trop court avec
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
Donc on ajoute 1000 de chaque côté
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.bed
../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
samtools index reseq.bam
#+end_src
**** KILL Sur le chromosome 15 puis trier à la main sur les zones de capture ?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_00
el à rand(d, 1)[1] est semblable à un appel de rand(d, 100)
CLOSED: [2023-05-31 Wed 22:24]
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y"));
julia> y = [rand(d, 1)[1] for x in 1:1000];
julia> z = rand(d,1000);
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y
"));
draw(data(df)*histogram(bins=100)*mapping(:y, color=:type,dodge=:type))
****** DONE Améliorer les performances
CLOSED: [2023-06-02 Fri 23:39]
#+begin_src julia
@time include("xamscissors.jl")
#+end_src
430s pour chromosome 22. Majorité dans l'édition de reads:
******* DONE Inserér tous les variants d'un reads d'un coup
CLOSED: [2023-06-01 Thu 23:09]
Ne change rien
******* DONE Test avec -t4: idem
CLOSED: [2023-06-01 Thu 23:17]
******* DONE Test mésocentre : idem
CLOSED: [2023-06-01 Thu 23:40]
348s
******* Changer la structure de données des
Dataframe -> dict = les performances horribles ont disparuse
****** TODO Refaire le test avec la nouvelle version
******* DONE Génération des données
CLOSED: [2023-06-02 Fri 23:40]
Mésocentre
#+begin_src sh
cd /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/mapped
samtools view 63003856_S135_R.bam NC_000022.11 -o 63003856_S135_R_chr22.bam
samtools index 63003856_S135_R_chr22.bam
cp 63003856_S135_R_chr22.bam* /Work/Users/apraga/bisonex/tests/xamscissors/
cd /Work/Users/apraga/bisonex/tests/xamscissors
#+end_src
On génère les données
#+begin_src julia
using XAMScissors
insertSNV("./63003856_S135_R_chr22.bam", "snvs_chr22.csv", "out")
#+end_src
Puis
#+begin_src sh
julia xamscissors.jl
#+end_src
******* DONE Run
CLOSED: [2023-06-03 Sat 18:26]
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/runInserted.nf -profile standard,helios --input="tests/xamscissors/out/inserted_{1,2}.fq.gz"
******* DONE Après haplotypecaller : ok
CLOSED: [2023-06-03 Sat 18:27] SCHEDULED: <2023-06-03 Sat>
******* TODO Après filtre vep
SCHEDULED: <2023-06-03 Sat>
***** TODO PHase 3 : tous les SNV
SCHEDULED: <2023-06-03 Sat>
****** DONE Générer les données
CLOSED: [2023-06-03 Sat 20:16] SCHEDULED: <2023-06-03 Sat>
#+begin_src julia
using XAMScissors
insertSNV("../../out/63003856_S135_R/preprocessing/mapped/63003856_S135_R.bam", "snvs.csv", "out")
#+end_src
temps d'exécution 73min
#+begin_src sh
nohup bash -c 'time julia xamscissors.jl' &
xamscissors-63003856/*.fq.gz /Work/Groups/bisonex/data/xamscissors/
#+end_src
****** DONE Regénérer avec @time pour avoir les performaces
CLOSED: [2023-06-03 Sat 21:45]
markReads 6.265202 seconds (1.36 M allocations: 137.090 MiB, 1.00% gc time, 9.79% compilation time)
editReads 1327.701623 seconds (1.03 G allocations: 81.996 GiB, 0.59% gc time, 0.03% compilation time)
samtools index 117.743727 seconds (53 allocations: 1.883 KiB)
samtools sort 2820.074930 seconds (66 allocations: 2.789 KiB)
bam2fastq 134.148952 seconds (794 allocations: 40.539 KiB, 0.01% compilation time)
real 73m33.273s
user 77m38.194s
sys 1m26.684s
[bam_sort_core] merging from 60 files and 1 in-memory blocks...
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 126905130 reads
real 73m6.934s
user 77m25.397s
sys 1m21.339s
****** TODO Analyses
SCHEDULED: <2023-06-04 Sun>
**** TODO Test Indel
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
el à rand(d, 1)[1] est semblable à un appel de rand(d, 100)
CLOSED: [2023-05-31 Wed 22:24]
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y"));
julia> y = [rand(d, 1)[1] for x in 1:1000];
julia> z = rand(d,1000);
julia> df = vcat(DataFrame(:y => z, :type => "z"), DataFrame(:y => y, :type => "y"));
draw(data(df)*histogram(bins=100)*mapping(:y, color=:type,dodge=:type))
****** DONE Améliorer les performances
CLOSED: [2023-06-02 Fri 23:39]
#+begin_src julia
@time include("xamscissors.jl")
#+end_src
430s pour chromosome 22. Majorité dans l'édition de reads:
******* DONE Inserér tous les variants d'un reads d'un coup
CLOSED: [2023-06-01 Thu 23:09]
Ne change rien
******* DONE Test avec -t4: idem
CLOSED: [2023-06-01 Thu 23:17]
******* DONE Test mésocentre : idem
CLOSED: [2023-06-01 Thu 23:40]
348s
******* Changer la structure de données des
Dataframe -> dict = les performances horribles ont disparuse
****** TODO Refaire le test avec la nouvelle version
******* DONE Génération des données
CLOSED: [2023-06-02 Fri 23:40]
Mésocentre
#+begin_src sh
cd /Work/Users/apraga/bisonex/out/63003856_S135_R/preprocessing/mapped
samtools view 63003856_S135_R.bam NC_000022.11 -o 63003856_S135_R_chr22.bam
samtools index 63003856_S135_R_chr22.bam
cp 63003856_S135_R_chr22.bam* /Work/Users/apraga/bisonex/tests/xamscissors/
cd /Work/Users/apraga/bisonex/tests/xamscissors
#+end_src
On génère les données
#+begin_src julia
using XAMScissors
insertSNV("./63003856_S135_R_chr22.bam", "snvs_chr22.csv", "out")
#+end_src
Puis
#+begin_src sh
julia xamscissors.jl
#+end_src
******* DONE Run
CLOSED: [2023-06-03 Sat 18:26]
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/runInserted.nf -profile standard,helios --input="tests/xamscissors/out/inserted_{1,2}.fq.gz"
******* DONE Après haplotypecaller : ok
CLOSED: [2023-06-03 Sat 18:27] SCHEDULED: <2023-06-03 Sat>
******* TODO Après filtre vep
SCHEDULED: <2023-06-03 Sat>
***** TODO PHase 3 : tous les SNV
SCHEDULED: <2023-06-03 Sat>
****** DONE Générer les données
CLOSED: [2023-06-03 Sat 20:16] SCHEDULED: <2023-06-03 Sat>
#+begin_src julia
using XAMScissors
insertSNV("../../out/63003856_S135_R/preprocessing/mapped/63003856_S135_R.bam", "snvs.csv", "out")
#+end_src
temps d'exécution 73min
#+begin_src sh
nohup bash -c 'time julia xamscissors.jl' &
xamscissors-63003856/*.fq.gz /Work/Groups/bisonex/data/xamscissors/
#+end_src
****** DONE Regénérer avec @time pour avoir les performaces
CLOSED: [2023-06-03 Sat 21:45]
markReads 6.265202 seconds (1.36 M allocations: 137.090 MiB, 1.00% gc time, 9.79% compilation time)
editReads 1327.701623 seconds (1.03 G allocations: 81.996 GiB, 0.59% gc time, 0.03% compilation time)
samtools index 117.743727 seconds (53 allocations: 1.883 KiB)
samtools sort 2820.074930 seconds (66 allocations: 2.789 KiB)
bam2fastq 134.148952 seconds (794 allocations: 40.539 KiB, 0.01% compilation time)
real 73m33.273s
user 77m38.194s
sys 1m26.684s
[bam_sort_core] merging from 60 files and 1 in-memory blocks...
[M::bam2fq_mainloop] discarded 0 singletons
[M::bam2fq_mainloop] processed 126905130 reads
real 73m6.934s
user 77m25.397s
sys 1m21.339s
****** TODO Après haplotypecaller 556/590, majorité = échec alignement
SCHEDULED: <2023-06-04 Sun>
Haplotypecaller 556 found over 590
Amongst 34 missed variant, 2 have a mapping quality > 0
2×7 DataFrame
Row │ chrom pos ref alt zygosity meanQual stdQual
│ String15 Int64 String1 String1 String7 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────────────
1 │ NC_000017.11 39672244 G A het 60.0 0.0
2 │ NC_000001.11 155235252 A G het 0.258065 2.48868
NC_000017.11 39672244 G A het => ok, problème de représentation car 2 variant côte à cote
NC_000001.11 155235252 A G het => peu de reads alternatifs (9/93 donc ok)
Position: chromoe 1 et 6 surtout
34×7 DataFrame
Row │ chrom pos ref alt zygosity
│ String15 Int64 String1 String1 String7
─────┼──────────────────────────────────────────────────────
1 │ NC_000001.11 153817496 A T het
2 │ NC_000001.11 155235252 A G het
3 │ NC_000001.11 155236268 G A het
4 │ NC_000001.11 155290591 C T het
5 │ NC_000001.11 155291918 G A het
6 │ NC_000001.11 155294358 G T het
7 │ NC_000002.12 149010343 C T het
8 │ NC_000006.12 32039426 T A het
9 │ NC_000006.12 32040110 G T het
10 │ NC_000006.12 32040723 G A het
11 │ NC_000006.12 32041006 C T het
12 │ NC_000006.12 32041147 G A het
13 │ NC_000006.12 33443054 G T het
14 │ NC_000006.12 33451815 C T het
15 │ NC_000006.12 170283230 C A het
16 │ NC_000006.12 170283754 G A het
17 │ NC_000006.12 170285637 T C het
18 │ NC_000006.12 170289678 A C het
19 │ NC_000010.11 87961118 G A het
20 │ NC_000012.12 2449086 C G het
21 │ NC_000015.10 74343027 C T het
22 │ NC_000016.10 16163078 G A het
23 │ NC_000016.10 21262032 C G het
24 │ NC_000016.10 21962506 C T homo
25 │ NC_000017.11 7513122 C T het
26 │ NC_000017.11 7513752 C T het
27 │ NC_000017.11 39672244 G A het
28 │ NC_000017.11 46018710 C T het
29 │ NC_000019.10 54144058 G A het
30 │ NC_000021.9 43063074 A G het
31 │ NC_000021.9 43426167 C T het
32 │ NC_000022.11 18918421 A G het
33 │ NC_000022.11 42087168 T A homo
34 │ NC_000022.11 42213078 T G het
****** DONE Voir où est l'alignement alternatif: sur NW_ (zone supprimée)
CLOSED: [2023-06-04 Sun 22:15] SCHEDULED: <2023-06-04 Sun>
ex chr15 74343027
A00853:477:HMLWYDSX3:2:2444:22354:28870
#+begin_src
cd /Work/Groups/bisonex/data/xamscissors
zgrep -A4 "A00853:477:HMLWYDSX3:2:2444:22354:28870" *.fq.gz
#+end_src
63003856_xamscissors_1.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_1.fq.gz:CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC
63003856_xamscissors_1.fq.gz:+
63003856_xamscissors_1.fq.gz:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF
63003856_xamscissors_2.fq.gz:@A00853:477:HMLWYDSX3:2:2444:22354:28870
63003856_xamscissors_2.fq.gz:GACAGAAAGGAAGTGTTCACCACGATTACCGTGGCATCCTCTACAGACTCCTGGGAGACAGCAAGATGTCCTTCGAGGACATCAAGGCCAACGTCACAGAGATGCCGGCAGGAGGGGTGGACACGGTG
63003856_xamscissors_2.fq.gz:+
63003856_xamscissors_2.fq.gz:FF:FFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF:F:FF:FFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF,:FFF,FFFFFF:FFFFFFFFFFFFF
******* DONE Avec BLAT: sur _fix
CLOSED: [2023-06-04 Sun 21:07]
1er =
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHROM STRAND START END SPAN
--------------------------------------------------------------------------------------------------------------
browser details YourSeq 124 1 128 128 98.5% chr15_ML143370v1_fix + 172243 172370 128 What is chrom_fix?
browser details YourSeq 124 1 128 128 98.5% chr15 + 74342974 74343101 128
browser details YourSeq 23 1 25 128 96.0% chr19 - 33396097 33396121 25
Second
--------------------------------------------------------------------------------------------------------------
browser details YourSeq 126 1 128 128 99.3% chr15_ML143370v1_fix - 172243 172370 128 What is chrom_fix?
browser details YourSeq 126 1 128 128 99.3% chr15 - 74342974 74343101 128
browser details YourSeq 23 104 128 128 96.0% chr19 + 33396097 33396121 25
******* DONE Bwa mem à la main GRCh38.p13 : on est dans une zone NW
CLOSED: [2023-06-04 Sun 21:51]
On met les 2 reads dans des fichiers séparés puis
#+begin_src sh
cd /Work/Users/apraga/bisonex/tests/xamscissors/align
bwa mem /Work/Groups/bisonex/data/genome/GRCh38.p13/bwa/genomeRef test1.fq test2.fq
#+end_src
A00853:477:HMLWYDSX3:2:2444:22354:28870 97 NW_021160016.1 172243 0 128M = 172243 128 CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF NM:i:2 MD:Z:22A30C7MC:Z:128M AS:i:118 XS:i:118 XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870 145 NW_021160016.1 172243 0 128M = 172243 -128 CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF NM:i:1 MD:Z:22A105 MC:Z:128M AS:i:123 XS:i:123 XA:Z:NC_000015.10,-74342974,128M,1;
******* DONE GRCh38.p14: idem
CLOSED: [2023-06-04 Sun 21:51]
A00853:477:HMLWYDSX3:2:2444:22354:28870 97 NW_021160016.1 172243 0 128M = 172243 128 CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTTGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFF::FFFFFFFFFFF:FFFFFFFFFFFFFF:FFFFFFF,FFFFFF,FFFFFFFFFFFF:FF::FF NM:i:2 MD:Z:22A30C7MC:Z:128M AS:i:118 XS:i:118 XA:Z:NC_000015.10,+74342974,128M,2;
A00853:477:HMLWYDSX3:2:2444:22354:28870 145 NW_021160016.1 172243 0 128M = 172243 -128 CACCGTGTCCACCCCTCCTGCCGGCATCTCTGTGACGTTGGCCTTGATGTCCTCGAAGGACATCTTGCTGTCTCCCAGGAGTCTGTAGAGGATGCCACGGTAATCGTGGTGAACACTTCCTTTCTGTC FFFFFFFFFFFFF:FFFFFF,FFF:,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFF:FF:F:FFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FF NM:i:1 MD:Z:22A105 MC:Z:128M AS:i:123 XS:i:123 XA:Z:NC_000015.10,-74342974,128M,1;
******* DONE GRCh38 : ok
CLOSED: [2023-06-04 Sun 22:15]
bwa mem /Work/Projects/bisonex/data/genome/GRCh38/GCA_000001405.15_GRCh38_full_analysis_set.fna test1.fq test2.fq
****** DONE Vérifier que les reads ont la même qualité sur les fichiers d'origine: oui
CLOSED: [2023-06-04 Sun 21:07]
****** TODO Supprimer les NW_ ?
SCHEDULED: <2023-06-04 Sun>
Attente réponse alexis
**** TODO Test Indel
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]
* <2023-06-04 Sun>
** Grammaire
V-といいです: Il faudrait que tu V
ex: 「頭が痛いです」「水を飲むといいですよ」
v-たらどうですか: il serait bien que tu V
ex: 「水の飲んだらどうですか。」
ne rien X
nanino tabemasu
nanio hanusu koto ga arimasu
nogaku narimasu = ??
nariso ???
** Vocabulaire
- 自動( じどう ): automatic
- 何日間: how many days
togaru = avoir très envie de