apraga/org - Change R4DZDAYOBE2YDB74CFSPWJHS3WCNT3GGHNIDER36Y3AJASAOBRQAC

Bisonex: bamsurgeon works with an old version

Created by Alexis Praga on May 19, 2023

R4DZDAYOBE2YDB74CFSPWJHS3WCNT3GGHNIDER36Y3AJASAOBRQAC

Dependencies

In channels

main

Change contents

Replacement in projects/bisonex.org at line 40 [4.35]

B:BD[3.13057] → [3.13057:13200]

B:BD[3.13200] → [5.24749:41172]

∅:D[5.41172] → [6.6459:10784]

B:BD[6.6459] → [6.6459:10784]

∅:D[6.10784] → [7.369202:372887]

B:BD[7.369202] → [7.369202:372887]

oire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais pat
ient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
***** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
****** DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
****** DONE Profile github  HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq  test-gatad2b-hiseq2000/single_2.fq | samtools sort  -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
****** KILL Tester exemple sur github
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
****** KILL Centrer la fenêtre sur les zones de capture
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp par défaut, ce qui est plus grand que les zones de captures...
Changer fragzip ne fonctionne pas
Si on rajoute un offset sur l'exon: 200bp, est encore plus allongé
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
Si on désactive les target ?
Regarder les target sur le chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
samtools index single.bam
#+end_src
**** TODO Vérifier tous les variants sont retrouvés en 200x
SCHEDULED: <2023-04-22 Sat>
***** DONE Après alignement
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
****** DONE SNV: avec doublons
CLOSED: [2023-04-28 Fri 18:12]
On utilise [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Nombreux faux homozygouteS
Vérification avec checkFalseHemizygous(snv) : nombreux doublons dans le fichier pour simuscop...
****** DONE SNV sans doublons
CLOSED: [2023-04-29 Sat 18:27]
******* DONE 18 faux homozygote mais avec peu de reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000022.11   42213078  g.42213078T>G   snv          heterozygous  T          G                 0         1           1
   2 │ NC_000012.12  101680427  g.101680427C>A  snv          heterozygous  C          A                 0         3           3
   3 │ NC_000014.9   105385684  g.105385684G>C  snv          heterozygous  G          C                 0         4           4
   4 │ NC_000011.10  125978299  g.125978299C>T  snv          heterozygous  C          T                 0         3           3
   5 │ NC_000023.11   77998618  g.77998618C>T   snv          heterozygous  C          T                 0         2           2
   6 │ NC_000015.10   66703292  g.66703292C>T   snv          heterozygous  C          T                 0         3           3
   7 │ NC_000010.11   87961118  g.87961118G>A   snv          heterozygous  G          A                 0         3           3
   8 │ NC_000012.12  112477719  g.112477719A>G  snv          heterozygous  A          G                 0         2           2
   9 │ NC_000020.11    6778406  g.6778406C>T    snv          heterozygous  C          T                 0         3           3
  10 │ NC_000023.11   68192943  g.68192943G>A   snv          heterozygous  G          A                 0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10   16014966  g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10   74343027  g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10   74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien  - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval
SCHEDULED: <2023-05-01 Mon>
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision 
 Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
****** DONE Méthode naïve 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Après filtre annotation
****** DONE Méthode naïve : 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO VCf eval
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : exome avec exons comme fasta mais ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
Par contre, rapide
**** DONE Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 19:25]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
On génère les exons
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
 NC_000001.11 g.153817496 A>T
N'y est pas ??
***** DONE On test sur les 2 premiers : exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE Sur 200 premiers exons du chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
 bwa index exons-200.fna
 #+end_src
Simulation avec 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
 ../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 #+end_src
 Attention, pour l'alignement, il faut le nfa complet ! Sinon erreur du type
 Erreurs:::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
 Et pas de bam avec
 samtools sort: failed to chang
e sort order header to 'coordinate'
 #+begin_src
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 #+end_src
 Manque des exons et l'allure ne correspond pas...
***** DONE Utiliser le fichier de capture : exons trop petits
CLOSED: [2023-04-30 Sun 19:25]
Comme pour ART
Trop court avec
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
Donc on ajoute 1000 de chaque côté
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.bed
 ../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 samtools index reseq.bam
#+end_src
**** KILL Sur le chromosome 15 puis trier à la main sur les zones de capture ?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
 ../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
 head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
 NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_000001.11 153817371  153817542 -> NC_000001.11 153817321  153817592
On essaie 1000
NC_000001.11 153817371  153817542 -> NC_000001.11 153816371  153818542 ->
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
mieux mais trop large
#+attr_html: :width 800px
[[./art-exon6-offset1000.png]]
Sur les vraies données, on a une large de 5

[3.13057]

[7.372887]

oire)
Mais mauvaise répartitiopn
#+attr_html: :width 800px
[[./simuscop-error.png]]
À tester
- Problème de profile ?
  - mauvais patient ?
  - mauvaise génération ? -> comparer avec ceux donnés sur github
- nom des chromosomes ?
***** DONE [#A] Tester sur exon 6 GATAD2B pour NC_000001.11:g.153817496A>T
CLOSED: [2023-04-29 Sat 19:56] SCHEDULED: <2023-04-29 Sat>
****** DONE Configuration + Profile 63003856.profile: idem, mal centré
CLOSED: [2023-04-29 Sat 19:18]
Téléchargement des données
#+begin_src sh :dir ~/code/bisonex/test-simuscop
scp meso:/Work/Projects/bisonex/data/genome/GRCh38.p14/genomeRef.fna .
scp meso:Work/Projects/bisonex/data/simuscop/*.profile .
scp -r meso:/Work/Projects/bisonex/data/genome/GRCh38.p13/bwa .
#+end_src
On récupère l'exon (NB: org-mode ne lance pas le code...)
#+begin_src julia
using CSV,DataFramesMeta
d = CSV.read("VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed", header=false, delim="\t", DataFrame)
@subset d :Column1 .== "NC_000001.11" :Column2 .<= 153817496 :Column3 .>= 153817496
#+end_src
NC_000001.11  153817371  153817542
Génération du bed
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
#+end_src
#+RESULTS:
Génération d'un variant
#+begin_src sh :dir ~/code/bisonex/test-simuscop
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
#+end_src
#+RESULTS:
Génération du fichier de config
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b
coverage = 20
EOL
#+end_src
#+RESULTS:
On démarre la simulation
#+begin_src sh :dir ~/code/bisonex/test-simuscop
simuReads config_wes.txt
#+end_src
#+RESULTS:
Alignement
#+begin_src sh :dir ~/code/bisonex/test-simuscop
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
#+end_src
#+RESULTS:
****** DONE Profile github  HiSeq2000
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh :dir ~/code/bisonex/test-simuscop :result file
wget https://raw.githubusercontent.com/qasimyu/simuscop/master/testData/Illumina_HiSeq2000.profile
#+end_src
#+RESULTS:
#+begin_src sh :dir ~/code/bisonex/test-simuscop
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./Illumina_HiSeq2000.profile
variation = ./variant.txt
target = ./gatad2b-exon6.bed
layout = PE
threads = 1
name = single
output = test-gatad2b-hiseq2000
coverage = 20
EOL
simuReads config_wes.txt
bwa mem -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA\tPM:Miseq\tCN:lol\tLB:definition_to_add' bwa/genomeRef test-gatad2b-hiseq2000/single_1.fq  test-gatad2b-hiseq2000/single_2.fq | samtools sort  -o single-hiseq2000.bam
samtools index single-hiseq2000.bam
#+end_src
#+RESULTS:
****** KILL Tester exemple sur github
CLOSED: [2023-04-29 Sat 19:56]
#+begin_src sh
git clone https://github.com/qasimyu/simuscop/
cd simuscop
simuReads configFiles/config_test_wes.txt
#+end_src
****** KILL Centrer la fenêtre sur les zones de capture
CLOSED: [2023-04-30 Sun 13:28] SCHEDULED: <2023-04-29 Sat>
1000bp par défaut, ce qui est plus grand que les zones de captures...
Changer fragzip ne fonctionne pas
Si on rajoute un offset sur l'exon: 200bp, est encore plus allongé
NC_000001.11 153817371 153817542 ->
NC_000001.11 153817171 153817742
Si on désactive les target ?
Regarder les target sur le chromosome 1
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
scp meso:/Work/Projects/bisonex/data/simuscop/VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed .
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-simuscop :results silent
head -n 100 VCGS_Exome_Covered_Targets_hg38_40.1MB_renamed.bed > 100exons.bed
echo -e "s\tsingle\tNC_000001.11\t153817496\tA\tT\thet"> variant.txt
cat > config_wes.txt << EOL
ref = genomeRef.fna
profile = ./63003856.profile
variation = ./variant.txt
layout = PE
threads = 4
target = 100exons.bed
name = single
output = test-gatad2b
coverage = 200
EOL
./simuscop/bin/simuReads config_wes.txt
bwa mem bwa/genomeRef test-gatad2b/single_1.fq  test-gatad2b/single_2.fq | samtools sort  -o single.bam
samtools index single.bam
#+end_src
**** TODO Vérifier tous les variants sont retrouvés en 200x
SCHEDULED: <2023-04-22 Sat>
***** DONE Après alignement
CLOSED: [2023-04-29 Sat 18:27] SCHEDULED: <2023-04-28 Fri>
****** DONE SNV: avec doublons
CLOSED: [2023-04-28 Fri 18:12]
On utilise [[file:~/recherche/bisonex/simuscop/checkBam.jl][checkBam.jl]]
#+begin_src julia
d = prepareVariant("../parsevariants/variant_genomic.csv")
root = "/home/alex/code/bisonex/simuscop-centogene/cento"
bam = root * "/preprocessing/applybqsr/cento.bam"
bai = root * "/preprocessing/recalibrated/cento.bam.bai"
snv = getSNV(d, bam, bai)
#+end_src
Nombreux faux homozygouteS
Vérification avec checkFalseHemizygous(snv) : nombreux doublons dans le fichier pour simuscop...
****** DONE SNV sans doublons
CLOSED: [2023-04-29 Sat 18:27]
******* DONE 18 faux homozygote mais avec peu de reads
CLOSED: [2023-04-29 Sat 18:27]
julia> @subset snv :refCount .== 0 :altCount .> 0 :zygosity .== "heterozygous"
18×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000022.11   42213078  g.42213078T>G   snv          heterozygous  T          G                 0         1           1
   2 │ NC_000012.12  101680427  g.101680427C>A  snv          heterozygous  C          A                 0         3           3
   3 │ NC_000014.9   105385684  g.105385684G>C  snv          heterozygous  G          C                 0         4           4
   4 │ NC_000011.10  125978299  g.125978299C>T  snv          heterozygous  C          T                 0         3           3
   5 │ NC_000023.11   77998618  g.77998618C>T   snv          heterozygous  C          T                 0         2           2
   6 │ NC_000015.10   66703292  g.66703292C>T   snv          heterozygous  C          T                 0         3           3
   7 │ NC_000010.11   87961118  g.87961118G>A   snv          heterozygous  G          A                 0         3           3
   8 │ NC_000012.12  112477719  g.112477719A>G  snv          heterozygous  A          G                 0         2           2
   9 │ NC_000020.11    6778406  g.6778406C>T    snv          heterozygous  C          T                 0         3           3
  10 │ NC_000023.11   68192943  g.68192943G>A   snv          heterozygous  G          A                 0         2           2
  11 │ NC_000004.12     987858  g.987858C>T     snv          heterozygous  C          T                 0         3           4
  12 │ NC_000015.10   66435145  g.66435145G>A   snv          heterozygous  G          A                 0         1           2
  13 │ NC_000002.12   47809595  g.47809595C>T   snv          heterozygous  C          T                 0         2           2
  14 │ NC_000003.12  136477305  g.136477305C>G  snv          heterozygous  C          G                 0         4           4
  15 │ NC_000005.10  157285458  g.157285458C>T  snv          heterozygous  C          T                 0         3           3
  16 │ NC_000012.12   23604413  g.23604413T>G   snv          heterozygous  T          G                 0         5           5
  17 │ NC_000019.10   52219703  g.52219703C>T   snv          heterozygous  C          T                 0         1           1
  18 │ NC_000016.10   88856757  g.88856757C>T   snv          heterozygous  C          T                 0         8           8
******* DONE 8 non retrouvé => probablement hors de la zjone de capture
CLOSED: [2023-04-28 Fri 19:49]
julia> @subset snv :refCount .== 0 :altCount .== 0
8×10 DataFrame
 Row │ chrom         pos        variant         variantType  zygosity      ref        alt        refCount  altCount  readsCount
     │ SubStrin…?    Int64      SubStrin…?      String?      String15      SubStrin…  SubStrin…  Int64     Int64     Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ NC_000015.10   74343027  g.74343027C>T   snv          heterozygous  C          T                 0         0           0
   2 │ NC_000011.10   20638345  g.20638345A>G   snv          heterozygous  A          G                 0         0           0
   3 │ NC_000004.12  139370252  g.139370252C>T  snv          heterozygous  C          T                 0         0           2
   4 │ NC_000017.11   61966475  g.61966475G>T   snv          heterozygous  G          T                 0         0           0
   5 │ NC_000019.10   54144058  g.54144058G>A   snv          heterozygous  G          A                 0         0           0
   6 │ NC_000023.11   77635947  g.77635947A>G   snv          hemizygous    A          G                 0         0           0
   7 │ NC_000005.10    1258495  g.1258495G>A    snv          heterozygous  G          A                 0         0           0
   8 │ NC_000012.12    2449086  g.2449086C>G    snv          heterozygous  C          G                 0         0           0
***** TODO Après haplotypecaller
SCHEDULED: <2023-04-28 Fri>
****** KILL 20x
CLOSED: [2023-04-29 Sat 15:39]
Manque 183 sur 766
[[file:~/recherche/bisonex/simuscop/checkVCF.jl][checkVCF.jl]]
#+begin_src julia
@subset leftjoin(d2, dHaplo2, on=:genomic) ismissing.(:Column1)
#+end_src
Problème de profondeur ?
Ex: chr13 nombre de 101081606
NC_000011.10   16014966  g.16014966G>A
1 read sur 11 pour allèle alternative
Sur le patient de référence, 202 reads!
Celui-ci n'est pas le fichier de capture (ni dans le bam !)
ex: NC_000015.10   74343027  g.74343027C>T
Pour les autres, on devrait les retrouver...
Vérifier le nombre de reads sur 63003856
Vérifier la paramétrisation du modèle également
****** DONE [#B] 200x
CLOSED: [2023-05-18 Thu 11:04] SCHEDULED: <2023-04-30 Sun>
120 manquants (99 sans doublon)!
On vérifie dans IGV (vcf + bam après alignement) :
******* snv NC_000015.10   74343027
- rien d'appelé
- pas une région répétée
- base quality (voir [[*Phred score][Phred score]] ) à 37 donc ok
- variant retrouvé à 26/42
- Bam après aplybqsr: base qualità 35 donc ok
chr15 également à 89318565, variant retrouvé à 25/33 avec basequal de 37
Sans oublier de charger les instructions avx
#+begin_src sh
module load gcc@11.3.0/gcc-12.1.0
#+end_src
On coupe le .bam par chromosome pour débugger (sur le mesocentre)
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/simuscop-centogene-200x/cento/testing :results silent
ln -s ../preprocessing/applybqsr/cento.bam .
ln -s ../preprocessing/recalibrated/cento.bam.bai .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
On doit lancer à la main (org-mode ne connait pas le chemin de samtools)
samtools view -b cento.bam NC_000015.10 > cento_chr15.bam
samtools index cento_chr15.bam
Puis on se restreint au chronmosome 15
samtools faidx genomeRef.fna NC_000015.10 > genomeRef_chr15.fa
samtools faidx genomeRef_chr15.fa
gatk CreateSequenceDictionary -R genomeRef_chr15.fa -O genomeRef_chr15.dict
On restreint au chromosome 15 avec l'option -L (dure = 1min)
gatk --java-options "-Xmx3072M" HaplotypeCaller --input cento_chr15.bam \
    --output test.vcf.gz --reference genomeRef.fna --dbsnp dbSNP.gz --tmp-dir . --max-mnp-distance 2 -L NC_000015.10
******* DONE Tutorial haplotycaller
CLOSED: [2023-05-01 Mon 19:58]
Procédure : https://gatk.broadinstitute.org/hc/en-us/articles/360043491652-When-HaplotypeCaller-and-Mutect2-do-not-call-an-expected-variant
******** DONE Supprimer --max-mnp-distance = 2: idem
CLOSED: [2023-04-30 Sun 15:42]
******** DONE --debug &> run.log : Non appelé...
CLOSED: [2023-04-30 Sun 15:52]
******** DONE --linked-de-bruijn-graph: idem
CLOSED: [2023-04-30 Sun 15:55]
******** DONE --recover-all-dangling-branches
CLOSED: [2023-04-30 Sun 16:01]
******** DONE --min-pruning 0 : plus mais pas celui là
CLOSED: [2023-04-30 Sun 15:59]
******** DONE --bam-output
CLOSED: [2023-04-30 Sun 16:50]
********* DONE : rien !
CLOSED: [2023-04-30 Sun 16:08]
********* DONE + --recover-all-dangling-branches : rien !
CLOSED: [2023-04-30 Sun 16:08]
******** DONE Données filtrées ? apparement non
CLOSED: [2023-04-30 Sun 16:41]
183122 read(s) filtered by: MappingQualityReadFilter
3674 read(s) filtered by: NotDuplicateReadFilter
********* DONE --disable-read-filter MappingQualityReadFilter: idem
CLOSED: [2023-04-30 Sun 16:34]
On a bien  - 0 read(s) filtered by: MappingQualityAvailableReadFilter
********* DONE --disable-read-filter NotDuplicateReadFilter: idem
CLOSED: [2023-04-30 Sun 16:40]
******** DONE Essayer freebayes : idem
CLOSED: [2023-04-30 Sun 16:22]
freebayes -f genomeRef.fna -r NC_000015.10 cento_chr15.bam > freebayes-test-chr15.vcf
******** DONE Avec toutes les options : idem
--linked-de-bruijn-graph --recover-all-dangling-branches --min-pruning 0 --bam-output debug.bam
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Vérifier qu'on regarde le même bam : oui
CLOSED: [2023-04-30 Sun 16:50]
******** DONE Désactiver dbSNP : idem
CLOSED: [2023-04-30 Sun 16:52]
******** DONE Changer kmer size : idem
CLOSED: [2023-04-30 Sun 16:56]
par exemple[[https://gatk.broadinstitute.org/hc/en-us/community/posts/360075653152-REAL-Variant-not-called-by-HaplotypeCaller][forum gatk]] --kmer-size 18 --kmer-size 22
******** DONE --adaptive-pruning true
CLOSED: [2023-05-01 Mon 19:57]
******* DONE Mapping quality : est à 0 !!!!
CLOSED: [2023-05-01 Mon 19:58]
****** TODO Comparer VCF avec vcfeval
SCHEDULED: <2023-05-01 Mon>
On prépare les données en julia
#+begin_src ~/recherche/bisonex/simuscop
julia --project=. toVCF.jl
#+end_src
Puis on export sur le mésocentre
#+begin_src
scp variants_for_vcfeval.tsv.gz* meso:centogene_variants/
#+end_src
#+begin_src
z bis
cd simuscop-200x
rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/variantCalling/haplotypecaller/cento.vcf.gz -o compare-haplotypecaller -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   82.000                540            540         60         45     0.9000       0.9231     0.9114
     None                546            546        329         39     0.6240       0.9333     0.7479
****** TODO Comparer avec hap.py
SCHEDULED: <2023-05-19 Fri>
****** DONE Méthode naïve 549/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
***** TODO Avant annotation
SCHEDULED: <2023-04-28 Fri>
#+begin_src
cd cento/variantCalling
bgzip filter-technical.vcf
tabix -p vcf filter-technical.vcf.gz -f
#+end_src
Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                519            519         55         66     0.9042       0.8872     0.8956
     None                519            519         55         66     0.9042       0.8872     0.8956
****** DONE Méthode naïve 521/585
CLOSED: [2023-05-04 Thu 21:57]
Haplotypecaller: Nb reference SNV 692 vs found 585
Variant calling, filter technical: reference SNV 692 vs found 521
****** TODO Comparer avec hap.py
***** TODO Après filtre annotation
****** DONE Méthode naïve : 493/585
CLOSED: [2023-05-04 Thu 22:09]
****** TODO Comparer avec hap.py
****** TODO VCf eval
 cd cento/annotation/
 bgzip postvep-filter.vcf
 tabix postvep-filter.vcf.gz
 cd ../..
 rtg vcfeval -b ~/centogene_variants/variants_for_vcfeval.tsv.gz -c cento/annotation/postvep-filter.vcf.gz  -o compare-vepfilter -t /Work/Groups/bisonex/data/giab/GRCh38/genomeRef.sdf
 Threshold  True-pos-baseline  True-pos-call  False-pos  False-neg  Precision  Sensitivity  F-measure
----------------------------------------------------------------------------------------------------
   12.000                491            491         50         94     0.9076       0.8393     0.8721
     None                491            491         50         94     0.9076       0.8393     0.8721
*** KILL NEAT : trop lent :neat:
CLOSED: [2023-04-29 Sat 22:06]
**** KILL Génération fastq sur exno 5 GATAD2B
CLOSED: [2023-04-29 Sat 22:06]
Trop lent : pour 1 exon : 1500 secondes !
#+begin_src sh
samtools faidx genomeRef.fna NC_000001.11 | save -f genomeRef_chr1.fna
python gen_reads.py  -r ../test-simuscop/genomeRef_chr1.fna -o lol  -tr ../test-simuscop/gatad2b-exon6.bed  -R 147 --pe 150 10
#+end_src
*** KILL ReSeq : exome avec exons comme fasta mais ne gère pas des exons trop petits :reseq:
CLOSED: [2023-04-30 Sun 19:44] SCHEDULED: <2023-04-29 Sat>
#+begin_quote
Can I simulate exome sequencing? Yes. You need to use a reference that only contains the exons as individual scaffolds. Using --refBiasFile you can specify the coverage of individual exons. To simulate intron contamination you can add the whole reference to the reference containing the exons and strongly reduce the coverage for these scaffolds using --refBiasFile.
#+end_quote
Par contre, rapide
**** DONE Fasta pour exons seuls
CLOSED: [2023-04-30 Sun 19:25]
Depuis le GFF
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.gff.gz
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
gunzip -c GCF_000001405.39_GRCh38.p13_genomic.gff.gz | grep -w "exon" > exons.gff
#+end_src
On génère les exons
#+begin_src sh :dir ~/code/bisonex/test-reseq
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed exons.gff -fo exons.fna
#+end_src
A tester avec un profile déjà fait :
https://github.com/schmeing/ReSeq-profiles/tree/master/profiles
On cherche l'exons qui nous intéresse
 NC_000001.11 g.153817496 A>T
N'y est pas ??
***** DONE On test sur les 2 premiers : exec
CLOSED: [2023-04-30 Sun 18:39]
#+begin_src
head exons.fa -n 2 > 2exons.fna
#+end_src
#+begin_src sh
../ReSeq/bin/reseq illuminaPE -j 32 -R exons.fa -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq-sim_1.fq reseq_sim_2.fq
#+end_src
#+begin_quote
error: All reference sequences are too short for simulating. They should have at least 1991 bases
#+end_quote
#+begin_src sh
grep '^>NC_000001.10' exons.fa  | sed 's/:/,/;s/-/,/;s/^>//' > exons.csv
#+end_src
***** DONE Sur 200 premiers exons du chr1
CLOSED: [2023-04-30 Sun 19:17]
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
head -n200 exons.fna > exons-200.fna
 bwa index exons-200.fna
 #+end_src
Simulation avec 30x
#+begin_src sh :dir ~/code/bisonex/test-reseq  :results silent
 ../ReSeq/bin/reseq illuminaPE -R exons-200.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 #+end_src
 Attention, pour l'alignement, il faut le nfa complet ! Sinon erreur du type
 Erreurs:::sam_hdr_create] Duplicated sequence "NC_000001.10:762970-763155" in file "-"
 Et pas de bam avec
 samtools sort: failed to change sort order header to 'coordinate'
 #+begin_src
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 #+end_src
 Manque des exons et l'allure ne correspond pas...
***** DONE Utiliser le fichier de capture : exons trop petits
CLOSED: [2023-04-30 Sun 19:25]
Comme pour ART
Trop court avec
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
Donc on ajoute 1000 de chaque côté
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fna
bwa index gatad2b-exon6.bed
 ../ReSeq/bin/reseq illuminaPE -R gatad2b-exon6.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
 bwa mem ../test-simuscop/bwa/genomeRef.fna reseq1.fq reseq2.fq | samtools sort -o reseq.bam
 samtools index reseq.bam
#+end_src
**** KILL Sur le chromosome 15 puis trier à la main sur les zones de capture ?
CLOSED: [2023-04-30 Sun 19:44]
#+begin_src sh :dir ~/code/bisonex/test-reseq :results silent
samtools faidx ../test-simuscop/genomeRef.fna NC_000015.10 > chr15.fna
 ../ReSeq/bin/reseq illuminaPE -R chr15.fna -s Ec-Hi2000-TruSeq.reseq --ipfIterations 0 -1 reseq1.fq -2 reseq2.fq -c 30
#+end_src
*** DONE ART : fonctionne très mal en targeted
CLOSED: [2023-04-30 Sun 11:49]
**** DONE Génération de reads
CLOSED: [2023-04-30 Sun 11:49]
***** DONE Avec seulement les exons en séquence
CLOSED: [2023-04-30 Sun 10:24]
 head -n6 exons.fa | save three-exons.fna
../art_bin_MountRainier/art_illumina -ss HS25 -i three-exons.fna -o ./paired_end_com -l 150 -f 10 -p -m 500 -s 10 -sam
Le sam n'est pas visible sur igv mais si on aligne avec bwa mem, on a quelques reads
***** DONE Extraire une zone de capture dans le fasta
CLOSED: [2023-04-30 Sun 11:49]
 NC_000001.11 g.153817496 A>T
****** DONE Essai 1: ne dépasse pas la zone
CLOSED: [2023-04-30 Sun 10:49]
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153817371\t153817542" > gatad2b-exon6.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6.bed -fo gatad2b-exon6.fa
#+end_src
-ss HS25 : nom du profile illumina
-l 150 : reads de 150
-f 10 : coverage de 10
-p : paired end
-m 500 : longueur moyenne des fragment d'ADN
-s 10 : déviation standard
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
#+end_src
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.b
am
samtools index gatad2b.bam
#+end_src
#+attr_html: :width 800px
[[./art-capture-1.png]]
****** Avec offset
50bp idem
NC_000001.11 153817371  153817542 -> NC_000001.11 153817321  153817592
On essaie 1000
NC_000001.11 153817371  153817542 -> NC_000001.11 153816371  153818542 ->
#+begin_src sh :dir ~/code/bisonex/test-art :results silent
echo -e "NC_000001.11\t153816371\t153818542" > gatad2b-exon6-offset.bed
bedtools getfasta -fi ../test-simuscop/genomeRef.fna -bed gatad2b-exon6-offset.bed -fo gatad2b-exon6-offset.fa
../art_bin_MountRainier/art_illumina -ss HS25 -i gatad2b-exon6-offset.fa -o ./gatad2b -l 150 -f 100 -p -m 500 -s 10
bwa mem ../test-simuscop/bwa/genomeRef gatad2b1.fq gatad2b2.fq | samtools sort -o gatad2b.bam
samtools index gatad2b.bam
#+end_src
mieux mais trop large
#+attr_html: :width 800px
[[./art-exon6-offset1000.png]]
Sur les vraies données, on a une large de 5

Replacement in projects/bisonex.org at line 42 [4.35]

B:BD[8.24496] → [2.27:8178]

323426  g.89323426C>G  snv          heterozygous  C          G                93       109         202
  25 │ NC_000015.10  89318595  g.89318595T>C  snv          heterozygous  T          C               321       128         453
  26 │ NC_000015.10  48488437  g.48488437T>C  snv          heterozygous  T          C               356       132         488
CLOSED: [2023-05-01 Mon 17:18]
***** KILL Chromosome1 15 :Test haplotype caller : échec car CIGARE non mis à jour
CLOSED: [2023-05-13 Sat 18:29] SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl `63003856_S135_chr15.bam` 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
puis
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
Aucun variant inséré
- base quality ok
  -
****** DONE bam out : non appelé
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.f
na  --tmp-dir . -L NC_000015.10  --bam-output debug.bam
****** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --linked-de-bruijn-graph
****** KILL regénérer fastq
CLOSED: [2023-05-13 Sat 18:29]
Non
***** KILL Générer bam données pour tous les chromosomes
CLOSED: [2023-05-13 Sat 18:29]
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Avertissement:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Inserted.bam et excluded.bam (fichier avant le merge)  ont l'air ok...
On réessaie à la main : ça passe
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
❯ mv test-all.bam `63003856_S135_inserted.bam` -f
❯ mv test-all.bam.bai `63003856_S135_chr15_inserted.bam.bai` -f
#+end_src
***** DONE BAm2fastq pour avoir CIGAR à jour : échec (variants "cachés")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
On lance la génération de bam depuis le mesocentro (la copie plante via le VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl  ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow après avec désactivé storeDir pour SAMTOOLS_BAM2FQ dans nextflow.config (pourquoi ??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ }                            from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName }             from "${params.modulesDir}/samtools/sort/main"
workflow {
    f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
                         checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
    // Important: use "-n" option !!
    sortBamByName(f)
    SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Puis
#+begin_src
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
***** KILL Lancer pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run   main.nf -c nextflow.config  -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
*** TODO Bamsurgeon :bamsurgeon:
**** DONE Package nix
CLOSED: [2023-05-18 Thu 11:05]
1. Patcher la recherche du génome de référence pour bien trouver les index (en utilisant une regexp comme nf-core)
2. Rajouter le chemin de picard dans les arguments
3. Option -O3 pour performance
**** DONE Test sur mini-bam: échec
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Sans spécfier le variant:
#+begin_quote
NC_000001.11	17651	17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
**** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Pas assez de reads, on prend le chromosome 22
#+begin_src sh
samtools view ../simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
Mésocentre
dans tests/bamsurgeno
#+begin_src
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
***** DONE SNV aléatoire:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2
On retrouve bien un variant à cette position A > T
***** DONE SNV avec ALT prédéfini : retrouvée dans IGV (mais pas dans pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2 G
***** DONE Variants patients chr22: ok IGV
CLOSED: [2023-05-15 Mon 23:23]
Fichier non trié donc
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
***** DONE Vérifier qu'il faut POS et POS+1: non
CLOSED: [2023-05-14 Sun 21:21]
**** TODO Variants cento
***** STRT SNV
SCHEDULED: <2023-05-15 Mon>
Attention à la mémoire: 32G ne semble pas suffire avec 12 threads
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/bamsurgeon.nf -profile standard,helios --input=tests/bamsurgeon/snv-cento.tsv -bg
#+end_src
ET
#+begin_src nextflow
workflow {
    f = Channel.fromPath(params.input, checkIfExists: true)
    bam = Channel.fromPath("simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam",
                           checkIfExists: true)
    bamIndex = bam.map { it -> it + ".bai" }
    downloadGenome | indexGenome
    indexGenome.out.index | view
    addSNV(f, bam, bamIndex, downloadGenome.out, indexGenome.out.index, indexGenome.out.dict, indexGenome.out.fai)
}
#+end_src
****** TODO Erreur ValueError: quality and sequence mismatch
******* DONE Idem avec dernière version sur github
CLOSED: [2023-05-18 Thu 14:36]
******* TODO Version 1.3: ok mais
Test sur chr22: variants ok mais VAF=1...
S'exécute "normalement" (échec selon nextflow) mais le bam de sorstie quasiment vide
La fusion du bam avec les variants et du fichier de référence n'a fonctionné correctement.
******** DONE Lancer replacereads.py à la main
CLOSED: [2023-05-18 Thu 21:41]
Dans /Work/Users/apraga/bisonex/work/f3/ce044f80ca91016d68d1bc4f4f5301
#+begin_src sh
 /nix/store/xw277la6w4sjqlsvw9h32cvrlacrfkgm-python3-3.10.9-env/bin/python3.10 /nix/store/abzangf0q8k37053p776cfkw181dzjn3-bamsurgeon-1.3/bin/bamsurgeon/replacereads.py -b cento.bam -r addsnv.e14561be-4fdd-45cc-9989-048ab6da6cc6.muts.bam -o snv-manual.bam
#+end_src
Puis
#+begin_src
samtools sort snv-manual.bam -o snv-manual-sorted.bam
#+end_src
A l'air de fonctionne
******** TODO Forcer la VAF
SCHEDULED: <2023-05-19 Fri>
Est à 1
******* TODO Lancer le pipeline pour vérifier qu'on retrouve les variants
SCHEDULED: <2023-05-18 Thu>
***** TODO del
***** TODO ins
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]

[8.24496]

323426  g.89323426C>G  snv          heterozygous  C          G                93       109         202
  25 │ NC_000015.10  89318595  g.89318595T>C  snv          heterozygous  T          C               321       128         453
  26 │ NC_000015.10  48488437  g.48488437T>C  snv          heterozygous  T          C               356       132         488
CLOSED: [2023-05-01 Mon 17:18]
***** KILL Chromosome1 15 :Test haplotype caller : échec car CIGARE non mis à jour
CLOSED: [2023-05-13 Sat 18:29] SCHEDULED: <2023-05-01 Mon>
#+begin_src
julia -Jbisonex.so --project=. insertVariants.jl `63003856_S135_chr15.bam` 63003856_S135_chr15_inserted.bam
scp 63003856_S135_chr15_inserted.bam* meso:/Work/Users/apraga/bisonex/tests/synthetic/
#+end_src
#+begin_src sh :dir /ssh:meso:/Work/Users/apraga/bisonex/tests/synthetic :results silent
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz .
ln -s /Work/Projects/bisonex/data/dbSNP/GRCh38.p13/dbSNP.gz.tbi
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.dict .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna .
ln -s /Work/Projects/bisonex/data/genome/GRCh38.p13/genomeRef.fna.fai .
#+end_src
puis
#+begin_src
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam --output testchr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10
#+end_src
scp meso:/Work/Users/apraga/bisonex/tests/synthetic/testchr15.vcf.gz haplotypecaller-chr15.vcf.gz
Aucun variant inséré
- base quality ok
  -
****** DONE bam out : non appelé
CLOSED: [2023-05-01 Mon 21:57]
gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.f
na  --tmp-dir . -L NC_000015.10  --bam-output debug.bam
****** DONE --linked-de-bruijn-graph : idem
CLOSED: [2023-05-01 Mon 21:57]
readlink testchr15.vcf.gz -f^C
[apraga@mesointeractive synthetic]$ gatk --java-options "-Xmx3072M" HaplotypeCaller --input 63003856_S135_chr15_inserted.bam     --output haplotypecaller-chr15.vcf.gz --reference genomeRef.fna  --tmp-dir . -L NC_000015.10  --linked-de-bruijn-graph
****** KILL regénérer fastq
CLOSED: [2023-05-13 Sat 18:29]
Non
***** KILL Générer bam données pour tous les chromosomes
CLOSED: [2023-05-13 Sat 18:29]
 timeit julia -Jbisonex.so --project=. insertVariants.jl ~/code/bisonex/out/63003856/preprocessing/63003856_S135.bam 63003856_S135_inserted.bam
40min 516ms 835µs 405ns
Avertissement:
 [W::bam_hdr_read] EOF marker is absent. The input is probably truncated
Inserted.bam et excluded.bam (fichier avant le merge)  ont l'air ok...
On réessaie à la main : ça passe
#+begin_src
samtools merge test-all.bam inserted.bam excluded.bam
❯ mv test-all.bam `63003856_S135_inserted.bam` -f
❯ mv test-all.bam.bai `63003856_S135_chr15_inserted.bam.bai` -f
#+end_src
***** DONE BAm2fastq pour avoir CIGAR à jour : échec (variants "cachés")
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
On lance la génération de bam depuis le mesocentro (la copie plante via le VPN)
#+begin_src sh
cd /Work/Users/apraga/recherche/bisonex/generate
julia --project=. insertVariants.jl  ../../../bisonex/out/63003856_S135/preprocessing/applybqsr/63003856_S135.bam 63003856_S135_inserted.bam
#+end_src
Workflow après avec désactivé storeDir pour SAMTOOLS_BAM2FQ dans nextflow.config (pourquoi ??)
#+begin_src nextflow
include { SAMTOOLS_BAM2FQ }                            from "${params.modulesDir}/samtools/bam2fq/main"
include { SAMTOOLS_SORT as sortBamByName }             from "${params.modulesDir}/samtools/sort/main"
workflow {
    f = Channel.fromPath("${params.dataDir}/synthetic/63003856_S135_inserted.bam",
                         checkIfExists: true).map{it -> [["id": "synthetic_63003856"], it]}
    // Important: use "-n" option !!
    sortBamByName(f)
    SAMTOOLS_BAM2FQ(sortBamByName.out.bam, true)
}
#+end_src
Puis
#+begin_src
cp work/34/fb2fc136f6f6d7f42d0960512f06de/*.fq.gz /Work/Groups/bisonex/data/synthetic/
#+end_src
***** KILL Lancer pipeline
CLOSED: [2023-05-04 Thu 20:30] SCHEDULED: <2023-05-01 Mon>
NXF_OPTS=-D"user.name=apraga" nextflow run   main.nf -c nextflow.config  -profile standard,helios -bg --input="/Work/Groups/bisonex/data/synthetic/synthetic_63003856_{1,2}.fq.gz" --outdir out/synthetic_63003856
*** TODO Bamsurgeon :bamsurgeon:
**** TODO Package nix
1. Patcher la recherche du génome de référence pour bien trouver les index (en utilisant une regexp comme nf-core)
2. Rajouter le chemin de picard dans les arguments
3. Option -O3 pour performance
***** DONE Erreur ValueError: quality and sequence mismatch
CLOSED: [2023-05-19 Fri 18:44]
****** DONE Idem avec dernière version sur github
CLOSED: [2023-05-18 Thu 14:36]
****** DONE Version 1.3: ok mais
CLOSED: [2023-05-19 Fri 18:44]
Test sur chr22: variants ok mais VAF=1...
S'exécute "normalement" (échec selon nextflow) mais le bam de sorstie quasiment vide
La fusion du bam avec les variants et du fichier de référence n'a fonctionné correctement.
******* DONE Lancer replacereads.py à la main
CLOSED: [2023-05-18 Thu 21:41]
Dans /Work/Users/apraga/bisonex/work/f3/ce044f80ca91016d68d1bc4f4f5301
#+begin_src sh
 /nix/store/xw277la6w4sjqlsvw9h32cvrlacrfkgm-python3-3.10.9-env/bin/python3.10 /nix/store/abzangf0q8k37053p776cfkw181dzjn3-bamsurgeon-1.3/bin/bamsurgeon/replacereads.py -b cento.bam -r addsnv.e14561be-4fdd-45cc-9989-048ab6da6cc6.muts.bam -o snv-manual.bam
#+end_src
Puis
#+begin_src
samtools sort snv-manual.bam -o snv-manual-sorted.bam
#+end_src
A l'air de fonctionne
***** TODO Corriger run nextflow pour éviter les erreurs
SCHEDULED: <2023-05-21 Sun>
Trop de message d'erreur en sortie ?
**** DONE Test sur mini-bam: échec
CLOSED: [2023-05-14 Sun 21:12]
❯ samtools view -h ~/code/bisonex/simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam | head -n1000 | samtools view -Sb - > mini.bam
❯ samtools index mini.bam
Sans spécfier le variant:
#+begin_quote
NC_000001.11	17651	17651
#+end_quote
./result/bin/addsnv -v snv.txt -f mini.bam -r ../data/genomeRef.fna -o test.bam
**** DONE Test chr22
CLOSED: [2023-05-15 Mon 23:24]
Pas assez de reads, on prend le chromosome 22
#+begin_src sh
samtools view ../simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam NC_000022.11 -b -o chr22.bam
samtools index chr22.bam
#+end_src
Mésocentre
dans tests/bamsurgeno
#+begin_src
addsnv -v snv.txt -f chr22.bam -r ../genomeRef.fna -o test.bam --aligner mem
#+end_src
***** DONE SNV aléatoire:
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2
On retrouve bien un variant à cette position A > T
***** DONE SNV avec ALT prédéfini : retrouvée dans IGV (mais pas dans pileup)
CLOSED: [2023-05-15 Mon 23:13]
NC_000022.11	17499704	17499704    0.2 G
***** DONE Variants patients chr22: ok IGV
CLOSED: [2023-05-15 Mon 23:23]
Fichier non trié donc
samtools sort test.bam -o test-sorted.bam
samtools index test-sorted.bam
***** DONE Vérifier qu'il faut POS et POS+1: non
CLOSED: [2023-05-14 Sun 21:21]
**** TODO Variants cento
***** STRT SNV
SCHEDULED: <2023-05-15 Mon>
Attention à la mémoire: 32G ne semble pas suffire avec 12 threads
#+begin_src sh
NXF_OPTS=-D"user.name=${USER}" nextflow run workflows/bamsurgeon.nf -profile standard,helios --input=tests/bamsurgeon/snv-cento.tsv -bg
#+end_src
ET
#+begin_src nextflow
workflow {
    f = Channel.fromPath(params.input, checkIfExists: true)
    bam = Channel.fromPath("simuscop-centogene-200x/cento/preprocessing/mapped/cento.bam",
                           checkIfExists: true)
    bamIndex = bam.map { it -> it + ".bai" }
    downloadGenome | indexGenome
    indexGenome.out.index | view
    addSNV(f, bam, bamIndex, downloadGenome.out, indexGenome.out.index, indexGenome.out.dict, indexGenome.out.fai)
}
#+end_src
****** DONE v1.3: Lancer le pipeline pour vérifier qu'on retrouve les variants
CLOSED: [2023-05-19 Fri 18:41] SCHEDULED: <2023-05-18 Thu>
****** TODO Forcer la VAF
SCHEDULED: <2023-05-19 Fri>
Est à 1
***** TODO del
***** TODO ins
*** Divers
**** DONE Vérifier nombre de reads fastq - bam
CLOSED: [2022-10-09 Sun 22:31]