diff --git a/06-annotation/.DS_Store b/06-annotation/.DS_Store
new file mode 100644
index 0000000000000000000000000000000000000000..5008ddfcf53c02e82d7eee2e57c38e5672ef89f6
Binary files /dev/null and b/06-annotation/.DS_Store differ
diff --git a/06-annotation/README.md b/06-annotation/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e55eefdc3595759fd25dc0b8540f6a59ad41064f
--- /dev/null
+++ b/06-annotation/README.md
@@ -0,0 +1,315 @@
+# Annotation
+
+Code in this folder covers the annotation of the genome with gene models, mostly for protein-coding
+genes. The starting point is the scaffolded, decontaminated draft assembly.
+
+## Annotating protein-coding genes on the _Pycnogonum_ genome
+
+### BRAKER3 with developmental RNA-seq data
+
+BRAKER3 is one of the most widely used tools for eukaryotic gene prediction, and one that
+consistently performs well in benchmarks. However, installing the software is a non-trivial
+endeavour on an HPC system without admin privileges. We provide here a summary of our experience in
+the hope that it is useful to other users.
+
+- [Installing BRAKER3 in a shared system without admin access](install_braker.md)
+
+We generated RNA-seq data for a time series spanning development from late embryogenesis until the
+subadult stage (embryo 3-4, instars 1 to 6, juvenile 1, subadult), and our collaborators provided
+data for even earlier timepoints (zygote, early cleavage, embryo 3-5). These data were mapped to the
+draft genome (average: 72.8% uniquely mapping, 15% multimapping, suggesting good completeness), and
+the resulting BAM files were used as input to BRAKER3.
+
+- [`prep-RNA-mkindex.sh`](prep-RNA-mkindex.sh) - before mapping the RNA we need to index the draft
+  genome so that STAR can map to it.
+- [`prep-RNA-map-all.sh`](prep-RNA-map-all.sh) - main script that will submit all mapping jobs.
+- [`prep-RNA-map-single.sh`](prep-RNA-map-single.sh) - worker script that will map the RNA reads
+  from a developmental timepoint to the draft genome.
+- [`prep-RNA-bam.sh`](prep-RNA-bam.sh) - main script that will submit the SAM-to-BAM conversion jobs
+  to the cluster.
+- [`prep-RNA-sam2bam.sh`](prep-RNA-sam2bam.sh) - worker script that will convert a SAM file to a BAM
+  file in the same location.
+- [`prep-RNA-index_bam.sh`](prep-RNA-index_bam.sh) - index the resulting BAM files
+
+BRAKER3 also expects a FASTA file with protein sequences expected to be found in the genome. For
+this, we used the Arthropoda partition from [orthoDB](https://www.orthodb.org).
+
+BRAKER3 was run from a container on the Vienna LiSC (Life Science Cluster; LiSC for short).
+
+- [`annot1-braker.sh`](./annot1-braker.sh)
+
+This predicted 11,451 gene models. However, manual inspection of the RNA-seq mapping on the draft
+genome revealed that BRAKER3 had been rather conservative in its predictions, and that many loci
+that looked like they contained exon structures did not receive any annotation.
+
+### Using Iso-seq collapsed transcripts as gene models
+
+We generated a long-read transcriptome using PacBio Iso-seq technology. After collapsing isoforms by
+mapping onto the draft genome, we were left with 8,904 transcript families, which we interpreted as
+genes. By comparing these Iso-seq "genes" with the BRAKER3 models we identified multiple cases where
+BRAKER3 had merged gene models that lay too close to each other.
+
+We decided to combine the Iso-seq and BRAKER3 annotations, giving precedence to Iso-seq.
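+
+The actual cross-referencing lives in the notebook linked below, but the core of the combination is
+a same-strand overlap filter, which can be expressed with `bedtools intersect`. A minimal sketch,
+assuming gene-level features have been extracted from both annotations (the file names here are
+only placeholders):
+
+```bash
+# keep only BRAKER3 genes with no same-strand (-s) overlap (-v) against any Iso-seq gene
+$ awk -F'\t' '$3 == "gene"' braker.gff3 > braker_genes.gff3
+$ awk -F'\t' '$3 == "gene"' isoseq.gff3 > isoseq_genes.gff3
+$ bedtools intersect -v -s -a braker_genes.gff3 -b isoseq_genes.gff3 > braker_only_genes.gff3
+```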
+
+We filtered the BRAKER3 gene models to remove any that overlapped (same strand, any overlap) with
+the Iso-seq genes, and were left with 3,596 BRAKER3 gene models, raising our total to
+3,596 + 8,904 = 12,500.
+
+- [Map](annot1-isoseq-map.sh) the Iso-seq collapsed transcripts to the draft genome
+- [isoseq_confirm.ipynb](./annot1-isoseq_confirm.ipynb) - cross-reference the Iso-seq mapping
+  locations with BRAKER predictions and produce a new GFF file that contains their union.
+
+Unfortunately, the Iso-seq transcriptome we generated was part of a multiplexed run. This meant that
+we "lost" almost 60% of our effective sequencing depth to other organisms, making the _Pycnogonum_
+Iso-seq less effective for gene annotation. Manual inspection confirmed what the numbers suggested:
+there were still plenty of loci with RNA-seq peaks that looked like _bona fide_ genes but received
+no annotation from either PacBio or BRAKER3.
+
+We consider Iso-seq+BRAKER3 "round 1" of the genome annotation. In the final GFF file, gene models
+derived from Iso-seq have "PacBio" in the `source` column, and their IDs are in the form `PB.XXXX`,
+where `X` are digits. BRAKER3 gene models have "AUGUSTUS" listed as their `source` and their IDs
+have the form `gXXXXX`, with `X` again being digits.
+
+### Re-using unannotated loci for a second round of annotation
+
+There were a couple of indications that the genome annotation was not yet complete. First, BUSCO
+completeness of the gene models was lower than that of the entire genome; second, the mapping rates
+of the RNA-seq data were much lower for the gene models (avg. uniquely mapping: 48.69%) than for the
+entire genome (avg. uniquely mapping: 72.8%); finally, manual inspection easily identified loci that
+looked like genes but had no annotation.
+
+We proceeded to a second round of annotation. We filtered the RNA-seq data to only keep the mapped
+reads that did _not_ overlap with any of the gene models of the first BRAKER/Iso-seq round:
+
+- [annot2-RNA-extracts.sh](./annot2-RNA-extracts.sh) - main script that will submit worker scripts
+  for each transcriptome.
+- [annot2-RNA-extract_uncovered.sh](annot2-RNA-extract_uncovered.sh) - worker script that, for a
+  bulk transcriptome, will extract the reads that map on un-modelled regions.
+
+...and supplied the same orthoDB arthropod partition as protein hints:
+
+- [annot2-braker.sh](./annot2-braker.sh)
+
+As the protein hints overlapped with round 1, it was to be expected that BRAKER3 would also produce
+gene models that overlap with round 1. To safely exclude these cases we used _exons_ to test for
+gene overlap. Testing with entire gene models would wrongly discard genes that overlap other genes
+on the opposite strand, or genes that lie within another gene's intron (such as the CHAT/VCHAT
+genes, conserved in _Platynereis_).
+
+This task was mostly achieved on the command line, using a combination of `bedtools intersect` to
+identify overlaps between GFF files, `AGAT` to convert from GTF to GFF, and the bash utilities
+`grep`, `cut`, `sort`, and `sed` to extract and lightly manipulate lines from the resulting files.
+
+We consider this "round 2" of the genome annotation. In the final GFF file, gene models from round 2
+have "AGAT" as `source` for gene/mRNA entries or "AUGUSTUS" for exons/CDS. Gene IDs are in the form
+`r2_gXXXX`, with `r2` standing for "round 2" and the rest being the usual BRAKER3 gene ID. Round 2
+produced an additional 2,223 gene models, bringing our total to 14,723.
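+
+As a sanity check of these running totals, the contribution of each annotation round can be counted
+directly from the merged annotation, since the rounds use distinct `source` values for their gene
+entries. A minimal sketch, run on the `merged.gff3` assembled in the consolidation step further
+down:
+
+```bash
+# tally gene entries per source (PacBio, AUGUSTUS, AGAT, ...) in the merged annotation
+$ awk -F'\t' '$3 == "gene"' merged.gff3 | cut -f2 | sort | uniq -c
+```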
+
+<details>
+<summary>Click here for details</summary>
+
+First extract the round 2 exons that don't overlap:
+
+```bash
+$ grep exon braker/braker.gtf > braker_exons.gtf
+$ grep exon ../annot_merge/merged.gff3 > draft_exons.gtf
+$ bedtools intersect -v -b draft_exons.gtf -a braker_exons.gtf > unique_exons.gtf
+```
+
+Keep track of which genes these come from:
+
+```bash
+$ cat unique_exons.gtf | cut -f9 | cut -d" " -f 4 | sort -u > keep_candidates.txt
+```
+
+At the same time, there might be round 2 genes that are a bit bigger than round 1 genes, and thus
+mostly overlap but have some unique exons. We'd like to avoid those:
+
+```bash
+$ bedtools intersect -wa -b draft_exons.gtf -a braker_exons.gtf > shared_exons.gtf
+$ cat shared_exons.gtf | cut -f9 | cut -f4 -d" " | sort -u > overlap.txt
+```
+
+We can now filter the candidate genes by those that have some overlap:
+
+```bash
+$ grep -v -f overlap.txt keep_candidates.txt > keep_genes.txt
+```
+
+And finally, we can filter the GTF file:
+
+```bash
+$ grep -f keep_genes.txt braker/braker.gtf > braker2_unique.gtf
+```
+
+Now, before we can use this GTF file we need to make the gene names unique so that they don't
+clash with the round 1 gene names. We can do this by adding a prefix to the gene names:
+
+```bash
+$ sed -r 's/(g[[:digit:]])/r2_\1/g' braker2_unique.gtf > braker2_unique_renamed.gtf
+```
+
+Finally, we can convert to a GFF3 file and append to our existing annotation:
+
+```bash
+$ module load conda
+$ conda activate agat-1.4.0
+$ agat_convert_sp_gxf2gxf.pl -g braker2_unique_renamed.gtf -o ./braker2_unique_renamed.gff3
+```
+
+Let's copy that over to the annotation merge site:
+
+```bash
+$ cp braker2_unique_renamed.gff3 ../annot_merge/
+```
+
+Now we can remove the lines that contain introns or start/stop codons:
+
+```bash
+$ cd ../annot_merge
+$ grep -v -E -i "(intron)|(codon)" braker2_unique_renamed.gff3 > braker2_unique_renamed_nocodon_intron.gff3
+```
+
+</details>
+
+### Using _de novo_ transcriptomes as additional evidence
+
+Despite the improvements, manual inspection continued to suggest that there were loci with real
+RNA-seq peaks and exon-like structures that did not receive gene models. We reasoned that loci that
+represented real genes would be more likely to be present in multiple time points. Since BRAKER3 had
+already seen these loci and didn't identify them as genes, we hypothesized that the signal-to-noise
+ratio might have played a role. Instead of using raw reads we therefore turned to _de novo_
+assembled transcripts, and tried to find loci where transcripts from multiple timepoints mapped onto
+the draft genome.
+
+We used [Trinity](https://github.com/trinityrnaseq/trinityrnaseq) to perform _de novo_ assemblies
+for the in-house RNA-seq datasets, as they had been sequenced at a much greater depth than the
+transcriptomes provided by our collaborators, giving the tool a better chance of assembling real
+sequences (also refer to the [corresponding section](../05-transcriptomes/README.md)). We kept
+complete ORFs, as annotated by [TransDecoder](https://github.com/TransDecoder/TransDecoder), and
+mapped those against the draft genome.
+
+- [`filter_assemblies.sh`](./annot3-RNA-01-filter_assemblies.sh) - main script that submits
+  filtering jobs to the cluster.
+- [`extract_complete_ORFs.sh`](annot3-RNA-01-extract_complete_ORFs.sh) - worker script that will + extract coding regions that represent complete ORFs +- [`map_assemblies.sh`](./annot3-RNA-02-map_assemblies.sh) - main script that will submit + mapping jobs. +- [map complete ORFs to genome](annot3-RNA-02-map_assembly_single.sh) - worker script +- [`pack_sams.sh`](./annot3-RNA-03-pack_sams.sh) - main script that will submit formatting jobs to + the cluster. +- [convert to GFF3](annot3-RNA-03-pack_sam.sh) - worker script that will extract reads that mapped + well and save them in GFF3 format. + +We chose to use the instar 3 transcriptome, which had 99% BUSCO completeness (metazoa), as a +reference. We used `bedtools intersect` to find all draft genome loci where an instar 3 transcript +mapped and overlapped (at 90% of its length) with a mapped transcript from another time point (also +at 90% of its length), and excluded all loci that overlapped with gene models from rounds 1 and 2. +We then translated the cDNA match parts of each locus to exons. Whenever there was disagreement +between the different _de novo_ transcriptomes about exon boundaries, we picked the most common exon +boundary. + +- [mining de novo assembled transcriptomes](annot3-RNA-04-denovo_mining.ipynb) + +While this approach is less sensitive than BRAKER3 and is incapable of distinguishing between +isoforms, it is fairly robust in finding gene and exon boundaries, at least so far as the coding +regions are concerned. + +We consider this "round 3" of the genome annotation. In the final GFF file, gene models from round 3 +have "Trinity" as `source`. Gene IDs are in the form `at_DNXXXX`, with `at` standing for "assembled +transcriptomes", `DN` being a tribute to Trinity transcript naming, and `X` being digits. Round 3 +produced an additional 774 gene models, bringing our total to 15,497. + +### Consolidating results + +Finally, we can form the GFF3 file and sort it: + +```bash +$ cat isoseq.gff > merged.gff3 +$ cat braker.gff >> merged.gff3 +$ cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3 +$ cat denovo_txomes/overlap_translated.gff3 >> merged.gff3 +$ cat ../trnascan/trnascan.gff3 >> merged.gff3 +$ module load genometools/ +$ gt gff3 -tidy -retainids -o merged_sorted.gff3 -force merged.gff3 +``` + +### Testing + +First, convert the GFF to GTF for convenience: + +```bash +$ gt gff3_to_gtf -o merged_sorted.gtf merged_sorted.gff3 +``` + +Now we can use [AGAT](https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html) to +extract transcripts from the genome: + +```bash +$ agat_sp_extract_sequences.pl -g merged_sorted.gff3 -f ../draft_softmasked.fasta -t exon --merge -o transcripts.fa +``` + +We can then use this file as input for TransDecoder... 
+ +```bash +$ TransDecoder.LongOrfs -t transcripts.fa +$ TransDecoder.Predict -t transcripts.fa +``` + +...and finally look at BUSCO completeness: + +```bash +$ conda deactivate +$ conda activate busco-5.7.1 +$ busco -i transcripts.fa.transdecoder.pep -l metazoa -m protein -o metazoa -r -c 4 --offline --download_path ../../busco/busco_downloads/ +``` + +Which returns: + +``` +--------------------------------------------------- +|Results from dataset metazoa_odb10 | +--------------------------------------------------- +|C:96.5%[S:40.7%,D:55.8%],F:1.5%,M:2.0%,n:954 | +|920 Complete BUSCOs (C) | +|388 Complete and single-copy BUSCOs (S) | +|532 Complete and duplicated BUSCOs (D) | +|14 Fragmented BUSCOs (F) | +|20 Missing BUSCOs (M) | +|954 Total BUSCO groups searched | +--------------------------------------------------- +``` + +This is very, very close to what we had for the entire genome (except for the high prevalence of +duplicated BUSCOs, which can be explained by the fact that we included all isoforms). In particular, +it is worth noting that, for the entire genome, we had 1.6% fragmented and 1.7% missing BUSCOs. +Taken together, this suggests that we have a fairly complete set of gene models. + +We can do the same with the arthropod BUSCO set: + +```bash +$ busco -i transcripts.fa.transdecoder.pep -l arthropoda -m protein -o arthropoda -r -c 4 --offline --download_path ../../busco/busco_downloads/ +``` + +Which returns: + +``` +--------------------------------------------------- +|Results from dataset arthropoda_odb10 | +--------------------------------------------------- +|C:95.8%[S:37.3%,D:58.5%],F:2.0%,M:2.2%,n:1013 | +|971 Complete BUSCOs (C) | +|378 Complete and single-copy BUSCOs (S) | +|593 Complete and duplicated BUSCOs (D) | +|20 Fragmented BUSCOs (F) | +|22 Missing BUSCOs (M) | +|1013 Total BUSCO groups searched | +--------------------------------------------------- +``` + +Very similar results; again suggesting that we have a fairly complete set of gene models. + +## tRNAscan + +We also used tRNAscan to predict tRNAs in the draft genome. + +- [run the tool](annot-trnascan.sh) +- [export in GFF3 format](tRNAscan_to_gff3.ipynb) \ No newline at end of file diff --git a/06-annotation/annot-trnascan.sh b/06-annotation/annot-trnascan.sh new file mode 100644 index 0000000000000000000000000000000000000000..68d13a6c99bb685c57bfa4d58a5616df3e29a29c --- /dev/null +++ b/06-annotation/annot-trnascan.sh @@ -0,0 +1,42 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=pycno_trnascan +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=2 +#SBATCH --mem=500M +#SBATCH --time=1:00:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-trnascan-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-trnascan-%j.err + +module load trnascan + +# Testing was performed with a FASTA file with 60 chars/line. + +# | CPUs | input size (lines) | real time | max memory | CPU util. 
| +# |------|--------------------|-----------|------------|-----------| +# | 1 | 78753 | 45s | 57.5Mb | 97% | +# | 2 | 78753 | 31s | 97Mb | 142% | +# | 4 | 78753 | 30s | 16.8Mb | 182% | (testing) +# | 1 | 787530 | 7:04 | 172Mb | 98% | +# | 2 | 787530 | 4:58 | 225Mb | 145% | +# | 4 | 787530 | 4:12 | 336Mb | 188% | +# |------|--------------------|-----------|------------|-----------| +# | 2 | 7875382 | 53:40 | 304Mb | 60% | (run) + +ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta +OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/trnascan/ +mkdir "$OUTPUT" -p || exit + +cd "$OUTPUT" || exit + +tRNAscan-SE \ +-o trnascan.out \ +-f trnascan.fasta \ +-b trnascan.bed \ +--stats trnascan.stats \ +--progress \ +--hitsrc \ +--thread 2 \ +"$ASSEMBLY" \ No newline at end of file diff --git a/06-annotation/annot1-braker.sh b/06-annotation/annot1-braker.sh new file mode 100644 index 0000000000000000000000000000000000000000..74c07ed17334508b95b96bfbf5b45fcaf037f9cb --- /dev/null +++ b/06-annotation/annot1-braker.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=pycno_braker +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=18 +#SBATCH --mem=40G +#SBATCH --time=20:00:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err + +module load conda +conda activate singularity + +# container location +BRAKER=/lisc/user/papadopoulos/containers/braker3.sif +# path to the output directory, where all the magic should happen +OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker +mkdir "$OUTPUT" -p || exit +cd "$OUTPUT" || exit + +# copy AUGUSTUS config files for BRAKER +cp -r ~/repos/Augustus/config . + +# define input files +# the genome assembly +GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta +# the RNA-seq dataset BAMs +BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome +ZYGOTE="$BASE/ZYGOTE/Aligned.out.bam" +EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.bam" +EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.bam" +EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.bam" +EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.bam" +EMBRYO3="$BASE/EMBRYO3/Aligned.out.bam" +INSTAR1="$BASE/INSTAR1/Aligned.out.bam" +INSTAR2="$BASE/INSTAR2/Aligned.out.bam" +INSTAR3="$BASE/INSTAR3/Aligned.out.bam" +INSTAR4="$BASE/INSTAR4/Aligned.out.bam" +INSTAR5="$BASE/INSTAR5/Aligned.out.bam" +INSTAR6="$BASE/INSTAR6/Aligned.out.bam" +JUV1="$BASE/JUV1/Aligned.out.bam" +SUBADULT="$BASE/SUBADULT/Aligned.out.bam" +# the Arthropoda OrthoDB protein sequences +ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa + +# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR, +# in case it is needed. 
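+# Arguments below: --genome is the softmasked draft assembly, --prot_seq the OrthoDB Arthropoda
+# proteins used as protein hints, --bam the comma-separated list of RNA-seq alignments, and
+# --softmasking tells BRAKER3 that repeats are softmasked (lowercase). See the BRAKER3
+# documentation for details.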
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \ +--AUGUSTUS_CONFIG_PATH="${PWD}"/config \ +--genome=$GENOME \ +--prot_seq=$ARTHROPODA \ +--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \ +--softmasking \ +--threads 18 \ No newline at end of file diff --git a/06-annotation/annot1-isoseq-map.sh b/06-annotation/annot1-isoseq-map.sh new file mode 100644 index 0000000000000000000000000000000000000000..628e64f612c6cd77aa42043fe8f890a981a11948 --- /dev/null +++ b/06-annotation/annot1-isoseq-map.sh @@ -0,0 +1,18 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=mmap_pycno +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=16 +#SBATCH --mem=15G +#SBATCH --time=15:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-mmap-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-mmap-%j.err + +module load conda +conda activate minimap2-2.28 + +DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta +ISOSEQ=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/isoseq/isoseq.fastq +/usr/bin/time minimap2 -ax splice -t16 -uf -C5 $DRAFT $ISOSEQ | samtools view -bS -o isoseq.bam \ No newline at end of file diff --git a/06-annotation/annot1-isoseq_confirm.ipynb b/06-annotation/annot1-isoseq_confirm.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..1894b7dc75403f2fba166965600d016b9fa1b473 --- /dev/null +++ b/06-annotation/annot1-isoseq_confirm.ipynb @@ -0,0 +1,515 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Confirming gene models with isoform-level RNA-seq data\n", + "\n", + "## Motivation\n", + "\n", + "Our gene prediction with BRAKER3 produced a low number of gene models (~11.5k). We suspect this\n", + "might be because GeneMark-ET/StringTie are too aggressive with exons/genes that are close to each\n", + "other, merging them and messing up the gene models. We will use the isoform-level RNA-seq data to\n", + "confirm some of the gene models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from tqdm import tqdm" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I extracted all transcript lines from the corresponding gff/gtf files and used bedtools to intersect\n", + "them:\n", + "\n", + "```bash\n", + "$ ISOSEQ=/lisc/scratch/zoology/pycnogonum/transcriptome/isoseq/isoseq_confirmed.gff\n", + "$ BRAKER=/lisc/scratch/zoology/pycnogonum/genome/draft/braker/braker/braker_proposed.gtf\n", + "$ bedtools intersect -wao -b $ISOSEQ -a $BRAKER > brak_iso.bed\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "gtf = [\"seqname\", \"source\", \"feature\", \"start\", \"end\", \"score\", \"strand\", \"frame\", \"attributes\"]\n", + "\n", + "# iso2bra = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/iso_brak.bed\", sep=\"\\t\", header=None)\n", + "# iso2bra.columns = [\"iso_\" + x for x in gtf] + [\"bra_\" + x for x in gtf] + [\"overlap\"]\n", + "# iso2bra[\"iso_gene\"] = iso2bra[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')\n", + "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_attributes\"].str.split(\"\\.\").str[0]\n", + "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_gene\"].replace(\"\", np.nan)\n", + "\n", + "bra2iso = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/brak_iso.bed\", sep=\"\\t\", header=None)\n", + "bra2iso.columns = [\"bra_\" + x for x in gtf] + [\"iso_\" + x for x in gtf] + [\"overlap\"]\n", + "bra2iso[\"bra_gene\"] = bra2iso[\"bra_attributes\"].str.split(\".\").str[0]\n", + "bra2iso[\"iso_gene\"] = bra2iso[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now to find overlaps:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def contained(df, a_start=\"bra_start\", a_end=\"bra_end\", b_start=\"iso_start\", b_end=\"iso_end\"):\n", + " \"\"\"Checks if sequence A is completely contained within sequence B or vice versa.\n", + "\n", + " Parameters\n", + " ----------\n", + " df : pd.DataFrame\n", + " A dataframe of a bedtools intersect file.\n", + " a_start : str, optional\n", + " the start position of sequence A, by default \"bra_start\"\n", + " a_end : str, optional\n", + " the end position of sequence B, by default \"bra_end\"\n", + " b_start : str, optional\n", + " The start position of sequence B, by default \"iso_start\"\n", + " b_end : str, optional\n", + " the end position of sequence B, by default \"iso_end\"\n", + "\n", + " Returns\n", + " -------\n", + " bool\n", + " returns True if the sequence is contained, False otherwise\n", + " \"\"\"\n", + " if df[\"bra_strand\"] != df[\"iso_strand\"]:\n", + " return False # make sure that we are comparing genes on the same strand\n", + " if df[b_start] < 0 or df[b_end] < 0 or df[a_start] < 0 or df[a_end] < 0:\n", + " return False\n", + " # first check if the braker gene is completely contained within the isoseq one\n", + " if df[a_start] >= df[b_start] and df[a_end] <= df[b_end]:\n", + " return True\n", + " # then check if the braker gene completely contains the isoseq one\n", + " if df[a_start] <= df[b_start] and df[a_end] >= df[b_end]:\n", + " return True\n", + " return False" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + 
"source": [ + "genes = bra2iso.groupby(\"bra_gene\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def calc_overlap(df):\n", + " bra_len = df.iloc[0][\"bra_end\"] - df.iloc[0][\"bra_start\"]\n", + " # print(bra_len)\n", + " seq = np.zeros(bra_len, dtype=bool)\n", + " for i, transcript in df.iterrows():\n", + " start = transcript[\"iso_start\"] - transcript[\"bra_start\"]\n", + " start = np.max((0, start))\n", + " end = transcript[\"iso_end\"] - transcript[\"bra_start\"]\n", + " # print(end, len(seq))\n", + " end = np.min([bra_len, end])\n", + " seq[start:end] = True\n", + " return np.sum(seq) / bra_len" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 11451/11451 [00:03<00:00, 2913.12it/s]\n" + ] + } + ], + "source": [ + "braker_unique = []\n", + "significant = []\n", + "unknown = []\n", + "truncate = []\n", + "\n", + "\n", + "for gene in tqdm(genes.groups):\n", + " try:\n", + " transcripts = genes.get_group(gene)\n", + " except KeyError:\n", + " unknown.append(gene)\n", + " continue\n", + " # first see how many candidate ISO-seq genes this BRAKER gene overlaps with\n", + " # if there is zero, then it means no possible corresponding ISO-seq gene was found\n", + " no_iso = transcripts.groupby(\"iso_gene\").size().shape[0] # how many ISO-seq transcripts match this BRAKER gene?\n", + " if no_iso == 0: \n", + " braker_unique.append(gene)\n", + " continue\n", + " \n", + " # if there is one it means we have an 1-to-1 correspondence\n", + " if no_iso == 1:\n", + " # try to verify that most BRAKER isoforms are contained within the ISO-seq isoforms\n", + " # (or vice versa)\n", + " no_contained = transcripts.apply(contained, axis=1).sum()\n", + " if no_contained / transcripts.shape[0] >= 0.4:\n", + " significant.append(gene)\n", + " continue\n", + "\n", + " # test if the ISO-seq isoforms cumulatively cover at least 50% of the BRAKER gene\n", + " total_overlap = transcripts.groupby(\"bra_attributes\").apply(calc_overlap, include_groups=False)\n", + " if np.mean(total_overlap) >= 0.5:\n", + " significant.append(gene)\n", + " continue\n", + "\n", + " # if we have arrived here, all else failed and this gene needs to go for manual curation\n", + " unknown.append(gene)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(7591, 3596, 264, np.int64(11451))" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(significant), len(braker_unique), len(unknown), np.sum([len(significant), len(braker_unique), len(unknown)])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point the strategy should be to copy the ISO-seq and \"braker_unique\" gene models from their\n", + "respective GTF files to a new, merged one. This should already take us from the 11,451 BRAKER gene\n", + "models to 8,904 + 3,442 = 12,346 gene models. We can then screen the unknown modules further." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "isoseq_loc = \"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\"\n", + "braker_loc = \"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gtf\"" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [], + "source": [ + "def gene_string(gene_id, start, stop, last_line):\n", + " chunks = last_line.split(\"\\t\")\n", + "\n", + " seqid = chunks[0]\n", + " source = chunks[1]\n", + " seq_type = \"gene\"\n", + " score = \".\"\n", + " strand = chunks[6]\n", + " phase = \".\"\n", + " attributes = f\"ID={gene_id};gene_id={gene_id};\"\n", + " string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " return string\n", + "\n", + "def transcript_string(gene_id, transcript_id, line):\n", + " chunks = line.split(\"\\t\")\n", + "\n", + " seqid = chunks[0]\n", + " source = chunks[1]\n", + " seq_type = \"mRNA\"\n", + " start = chunks[3]\n", + " stop = chunks[4]\n", + " score = chunks[5]\n", + " strand = chunks[6]\n", + " phase = \".\"\n", + " attributes = f\"ID={transcript_id};Parent={gene_id}\"\n", + " string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " return string\n", + "\n", + "def exon_string(transcript_id, line, exon_counter):\n", + " chunks = line.split(\"\\t\")\n", + "\n", + " seqid = chunks[0]\n", + " source = chunks[1]\n", + " start = chunks[3]\n", + " stop = chunks[4]\n", + " score = chunks[5]\n", + " strand = chunks[6]\n", + "\n", + " seq_type = \"exon\"\n", + " phase = \".\"\n", + " attributes = f\"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}\"\n", + " exon_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " \n", + " seq_type = \"CDS\"\n", + " phase=\"0\"\n", + " attributes = f\"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}\"\n", + " CDS_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " \n", + " return exon_string + CDS_string" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "def write_gene(current_gene_id, gene_lines, edit, start, end):\n", + " chunks = gene_lines[0].split(\"\\t\")\n", + " # write the gene line first\n", + " gene = gene_string(current_gene_id, start, end, gene_lines[0])\n", + " edit.write(gene)\n", + " # now write each of the transcripts\n", + " exon_counter = 1\n", + " for l in gene_lines:\n", + " chunks = l.split(\"\\t\")\n", + " if chunks[2] == \"transcript\":\n", + " transcript_id = chunks[8].split(\";\")[1].split(\" \")[2][1:-1]\n", + " transcript = transcript_string(current_gene_id, transcript_id, l)\n", + " edit.write(transcript)\n", + " else:\n", + " exon = exon_string(transcript_id, l, exon_counter)\n", + " edit.write(exon)\n", + " exon_counter += 1\n", + "\n", + " edit.write(\"###\\n\")" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "gene_lines = []\n", + "current_gene_id = None\n", + "start = np.inf\n", + "end = 0\n", + "\n", + "with open(\"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\", \"r\") as isoseq:\n", + " with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/isoseq.gff\", \"w\") as edit:\n", + " for line in isoseq:\n", + " if 
line.startswith(\"#\"):\n", + " edit.write(line)\n", + " continue\n", + " chunks = line.split(\"\\t\")\n", + " entry_type = chunks[2]\n", + " # print(line)\n", + " if entry_type == \"transcript\":\n", + " gene_id = chunks[8].split(\";\")[0].split(\" \")[1][1:-1]\n", + " if gene_id != current_gene_id:\n", + " if current_gene_id is None: # first gene\n", + " current_gene_id = gene_id\n", + " gene_lines.append(line)\n", + " continue\n", + " # new gene! Write it out:\n", + " write_gene(current_gene_id, gene_lines, edit, start, end)\n", + " # reset the gene\n", + " current_gene_id = gene_id\n", + " gene_lines = []\n", + " start = int(chunks[3])\n", + " end = int(chunks[4])\n", + " else:\n", + " start = min(start, int(chunks[3]))\n", + " end = max(end, int(chunks[4]))\n", + " gene_lines.append(line)\n", + " # write last gene\n", + " write_gene(current_gene_id, gene_lines, edit, start, end)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "gene_lines = []\n", + "current_gene_id = None\n", + "\n", + "with open(\"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gff3\", \"r\") as braker:\n", + " with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/braker.gff\", \"w\") as edit:\n", + " for line in braker:\n", + " chunks = line.split(\"\\t\")\n", + " entry_type = chunks[2]\n", + "\n", + " if entry_type in [\"start_codon\", \"stop_codon\", \"intron\"]:\n", + " continue\n", + "\n", + " if entry_type == \"gene\":\n", + " current_gene_id = chunks[8].split(\";\")[0].split(\"=\")[1]\n", + " line = line.rstrip() + \"gene_id=\" + current_gene_id + \";\\n\"\n", + " \n", + " if current_gene_id in braker_unique:\n", + " edit.write(line)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "def plot_overlap(df):\n", + " # plot the overlapping regions as straight lines, color by transcript ID\n", + " start = np.min((df[\"bra_start\"].min(), df[\"iso_start\"].min()))\n", + " end = np.max((df[\"bra_end\"].max(), df[\"iso_end\"].max()))\n", + "\n", + " fig, ax = plt.subplots()\n", + "\n", + " # for i, transcript in df.iterrows():\n", + " # ax.hlines(i*2, transcript[\"iso_start\"], transcript[\"iso_end\"], label=transcript[\"iso_gene\"])\n", + " # ax.hlines(i*2+1, transcript[\"bra_start\"], transcript[\"bra_end\"], label=transcript[\"bra_gene\"])\n", + " ax.set_xlim(start-1000, end+1000)\n", + " ax.legend()" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [], + "source": [ + "# a function that plots all the unique transcripts in a dataframe, coloring them by gene ID\n", + "def plot_overlap(df):\n", + " # first find out the minimum start and maximum end positions of all the real BRAKER genes\n", + " # (i.e. 
not the -1 values)\n", + " bra_start_min = np.min(df[\"bra_start\"][df[\"bra_start\"] > 0])\n", + " bra_end_max = np.max(df[\"bra_end\"][df[\"bra_end\"] > 0])\n", + " # now do the same for the iso-seq genes\n", + " iso_start_min = np.min(df[\"iso_start\"][df[\"iso_start\"] > 0])\n", + " iso_end_max = np.max(df[\"iso_end\"][df[\"iso_end\"] > 0])\n", + " # the plot should start 1000 bp before the minimum start and end 1000 bp after the maximum end\n", + " start = np.min((bra_start_min, iso_start_min)) - 1000\n", + " end = np.max((bra_end_max, iso_end_max)) + 1000\n", + "\n", + " fig, ax = plt.subplots()\n", + "\n", + " # get a list of all the unique gene IDs\n", + " gene_list = np.unique(df[\"iso_gene\"].tolist() + df[\"bra_gene\"].tolist())\n", + " # generate a color map for the gene IDs - this should be always less than 10, so we can take\n", + " # the Set1 colormap\n", + " colors = plt.cm.Set1(np.linspace(0, 1, len(gene_list)))\n", + " # create a dictionary that maps gene IDs to colors\n", + " cmap = dict(zip(gene_list, colors))\n", + "\n", + " # each transcript should be a straight line, colored by the gene ID. Let's do BRAKER first.\n", + " braker = df.groupby(\"bra_attributes\").first()\n", + " for i, row in braker.iterrows():\n", + " ax.hlines(i, row[\"bra_start\"], row[\"bra_end\"], color=cmap[row[\"bra_gene\"]])\n", + "\n", + " # now do the same for the iso-seq genes\n", + " iso = df.groupby(\"iso_attributes\").first()\n", + " for i, row in iso.iterrows():\n", + " try:\n", + " ax.hlines(i, row[\"iso_start\"], row[\"iso_end\"], color=cmap[row[\"iso_gene\"]])\n", + " except KeyError:\n", + " # there might be some empty IDs here if there is no overlap for this transcript\n", + " # ax.hlines(i, start, end, linestyles=\"dashed\", color=\"black\")\n", + " continue\n", + " \n", + " # set the limits of the plot\n", + " ax.set_xlim(start, end)\n", + " # keep track of what chromosome we're in in case we want to go to the genome browser\n", + " ax.set_title(f\"{df['bra_seqname'].iloc[0]}\")\n", + " # save the plot for manual inspection\n", + " plt.savefig(f\"figs/{df['bra_gene'].iloc[0]}.png\")\n", + " # close plot to save memory\n", + " plt.close()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 264/264 [00:14<00:00, 18.59it/s]\n" + ] + } + ], + "source": [ + "for gene in tqdm(unknown):\n", + " keep = bra2iso[\"bra_gene\"] == gene\n", + " plot_overlap(bra2iso[keep])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Manual inspection of the ISO-seq/BRAKER overlap for these 264 genes identifies another 49 that need\n", + "further examination. Those were aligned against uniref90 to check whether the BRAKER portions\n", + "correspond to real genes or are spurious." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ascc24", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/06-annotation/annot2-RNA-extract_uncovered.sh b/06-annotation/annot2-RNA-extract_uncovered.sh new file mode 100644 index 0000000000000000000000000000000000000000..ab9f0fcc1d986672e1163c07850a6446d2af8883 --- /dev/null +++ b/06-annotation/annot2-RNA-extract_uncovered.sh @@ -0,0 +1,19 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=extr_unannot +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=1 +#SBATCH --mem=2G +#SBATCH --time=2:00:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.err + +module load bedtools + +GFF=/lisc/scratch/zoology/pycnogonum/genome/draft/annot_merge/merged.gff3 +OUTPUT=$1 +cd "$OUTPUT" || exit + +bedtools intersect -v -a Aligned.out.bam -b $GFF > unannotated.bam \ No newline at end of file diff --git a/06-annotation/annot2-RNA-extracts.sh b/06-annotation/annot2-RNA-extracts.sh new file mode 100644 index 0000000000000000000000000000000000000000..6c2ee1c46fd9be49eb498bc2f3ed2d1fe972dd70 --- /dev/null +++ b/06-annotation/annot2-RNA-extracts.sh @@ -0,0 +1,20 @@ +#!/usr/bin/env bash + +EXTRACT=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot2-RNA-extract_uncovered.sh + +BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/ + +sbatch $EXTRACT "$BASE/ZYGOTE/" +sbatch $EXTRACT "$BASE/EARLY_CLEAVAGE/" +sbatch $EXTRACT "$BASE/EMBRYO0_1/" +sbatch $EXTRACT "$BASE/EMBRYO3_5/" +sbatch $EXTRACT "$BASE/EMBRYO9_10/" +sbatch $EXTRACT "$BASE/EMBRYO3/" +sbatch $EXTRACT "$BASE/INSTAR1/" +sbatch $EXTRACT "$BASE/INSTAR2/" +sbatch $EXTRACT "$BASE/INSTAR3/" +sbatch $EXTRACT "$BASE/INSTAR4/" +sbatch $EXTRACT "$BASE/INSTAR5/" +sbatch $EXTRACT "$BASE/INSTAR6/" +sbatch $EXTRACT "$BASE/JUV1/" +sbatch $EXTRACT "$BASE/SUBADULT/" \ No newline at end of file diff --git a/06-annotation/annot2-braker.sh b/06-annotation/annot2-braker.sh new file mode 100644 index 0000000000000000000000000000000000000000..f90d87970caa68c55feae981984bdcf085c2179c --- /dev/null +++ b/06-annotation/annot2-braker.sh @@ -0,0 +1,56 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=pycno_braker +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=10 +#SBATCH --mem=40G +#SBATCH --time=20:00:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err + +module load conda +conda activate singularity + +# container location +BRAKER=/lisc/user/papadopoulos/containers/braker3.sif +# path to the output directory, where all the magic should happen +OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker2 +mkdir "$OUTPUT" -p || exit +cd "$OUTPUT" || exit + +# copy AUGUSTUS config files for BRAKER +cp -r ~/repos/Augustus/config . 
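+# (BRAKER needs write access to the AUGUSTUS config directory to create the species-specific
+# parameter files, hence the local, writable copy)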
+ +# define input files +# the genome assembly +GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta +# the RNA-seq dataset BAMs +BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome +ZYGOTE="$BASE/ZYGOTE/unannotated.bam" +EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/unannotated.bam" +EMBRYO0_1="$BASE/EMBRYO0_1/unannotated.bam" +EMBRYO3_5="$BASE/EMBRYO3_5/unannotated.bam" +EMBRYO9_10="$BASE/EMBRYO9_10/unannotated.bam" +EMBRYO3="$BASE/EMBRYO3/unannotated.bam" +INSTAR1="$BASE/INSTAR1/unannotated.bam" +INSTAR2="$BASE/INSTAR2/unannotated.bam" +INSTAR3="$BASE/INSTAR3/unannotated.bam" +INSTAR4="$BASE/INSTAR4/unannotated.bam" +INSTAR5="$BASE/INSTAR5/unannotated.bam" +INSTAR6="$BASE/INSTAR6/unannotated.bam" +JUV1="$BASE/JUV1/unannotated.bam" +SUBADULT="$BASE/SUBADULT/unannotated.bam" +# the Arthropoda OrthoDB protein sequences +ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa + +# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR, +# in case it is needed. +singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \ +--AUGUSTUS_CONFIG_PATH="${PWD}"/config \ +--genome=$GENOME \ +--prot_seq=$ARTHROPODA \ +--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \ +--softmasking \ +--threads 10 \ No newline at end of file diff --git a/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh b/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh new file mode 100644 index 0000000000000000000000000000000000000000..89bcdcbf726e2222aa22fca6e9243edc3b4498ce --- /dev/null +++ b/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh @@ -0,0 +1,34 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=extr_unannot +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=1 +#SBATCH --mem=500M +#SBATCH --time=17:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.err + +module load perl/5.40.0 +module load transdecoder/5.7.1 + +DENOVO=$1 +OUTPUT=$2 + +# 228.06user +# 747.90user +# say 1000s = 16-17min + +mkdir -p "$OUTPUT" || exit 1 +cd "$OUTPUT" || exit 1 + +# first we need to extract ORFs +TransDecoder.LongOrfs -t "$DENOVO" +# now predict the coding regions +TransDecoder.Predict -t "$DENOVO" +# now convert the resulting .cds file to a single-line fasta +# almost instantaneous +perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' Trinity.fasta.transdecoder.cds > oneline.fasta +# extract complete ORFs. Almost instantaneous. 
+grep -A1 complete oneline.fasta > complete.fasta \ No newline at end of file diff --git a/06-annotation/annot3-RNA-01-filter_assemblies.sh b/06-annotation/annot3-RNA-01-filter_assemblies.sh new file mode 100644 index 0000000000000000000000000000000000000000..fcaddea2962360afd4e8b7f08f71f1223743558c --- /dev/null +++ b/06-annotation/annot3-RNA-01-filter_assemblies.sh @@ -0,0 +1,37 @@ +#!/usr/bin/env bash + +FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-extract_complete_ORFs.sh + +IN=/lisc/project/zoology/pycnogonum/transcriptome/development + +EMBRYO3_5_IN=$IN/embryonic_stage3-4/Trinity.fasta +INSTAR1_IN=$IN/instarI-protonymphon/Trinity.fasta +INSTAR2_IN=$IN/instar_II/Trinity.fasta +INSTAR3_IN=$IN/instar_III/Trinity.fasta +INSTAR4_IN=$IN/instar_IV/Trinity.fasta +INSTAR5_IN=$IN/instar_V/Trinity.fasta +INSTAR6_IN=$IN/instar_VI/Trinity.fasta +JUV1_IN=$IN/juvenile_I/Trinity.fasta +SUBADULT_IN=$IN/subadult/Trinity.fasta + +OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/ + +EMBRYO3_5_OUT=$OUT/EMBRYO3/ +INSTAR1_OUT=$OUT/INSTAR1/ +INSTAR2_OUT=$OUT/INSTAR2/ +INSTAR3_OUT=$OUT/INSTAR3/ +INSTAR4_OUT=$OUT/INSTAR4/ +INSTAR5_OUT=$OUT/INSTAR5/ +INSTAR6_OUT=$OUT/INSTAR6/ +JUV1_OUT=$OUT/JUV1/ +SUBADULT_OUT=$OUT/SUBADULT/ + +sbatch $FILTER $EMBRYO3_5_IN $EMBRYO3_5_OUT +sbatch $FILTER $INSTAR1_IN $INSTAR1_OUT +sbatch $FILTER $INSTAR2_IN $INSTAR2_OUT +sbatch $FILTER $INSTAR3_IN $INSTAR3_OUT +sbatch $FILTER $INSTAR4_IN $INSTAR4_OUT +sbatch $FILTER $INSTAR5_IN $INSTAR5_OUT +sbatch $FILTER $INSTAR6_IN $INSTAR6_OUT +sbatch $FILTER $JUV1_IN $JUV1_OUT +sbatch $FILTER $SUBADULT_IN $SUBADULT_OUT \ No newline at end of file diff --git a/06-annotation/annot3-RNA-02-map_assemblies.sh b/06-annotation/annot3-RNA-02-map_assemblies.sh new file mode 100644 index 0000000000000000000000000000000000000000..469933f1d05d8c1ae8f9da25f3501eea04c595ce --- /dev/null +++ b/06-annotation/annot3-RNA-02-map_assemblies.sh @@ -0,0 +1,25 @@ +#!/usr/bin/env bash + +MAP=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-02-map_assembly_single.sh + +OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/ + +EMBRYO3_5=$OUT/EMBRYO3 +INSTAR1=$OUT/INSTAR1 +INSTAR2=$OUT/INSTAR2 +INSTAR3=$OUT/INSTAR3 +INSTAR4=$OUT/INSTAR4 +INSTAR5=$OUT/INSTAR5 +INSTAR6=$OUT/INSTAR6 +JUV1=$OUT/JUV1 +SUBADULT=$OUT/SUBADULT + +sbatch $MAP $EMBRYO3_5/complete.fasta $EMBRYO3_5 +sbatch $MAP $INSTAR1/complete.fasta $INSTAR1 +sbatch $MAP $INSTAR2/complete.fasta $INSTAR2 +sbatch $MAP $INSTAR3/complete.fasta $INSTAR3 +sbatch $MAP $INSTAR4/complete.fasta $INSTAR4 +sbatch $MAP $INSTAR5/complete.fasta $INSTAR5 +sbatch $MAP $INSTAR6/complete.fasta $INSTAR6 +sbatch $MAP $JUV1/complete.fasta $JUV1 +sbatch $MAP $SUBADULT/complete.fasta $SUBADULT \ No newline at end of file diff --git a/06-annotation/annot3-RNA-02-map_assembly_single.sh b/06-annotation/annot3-RNA-02-map_assembly_single.sh new file mode 100644 index 0000000000000000000000000000000000000000..2e4612f4b15daaf6d0dafeb4ca070785eadcb4a3 --- /dev/null +++ b/06-annotation/annot3-RNA-02-map_assembly_single.sh @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=mmap_pycno +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=3 +#SBATCH --mem=10G +#SBATCH --time=5:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-map-txome-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-map-txome-%j.err + +# with 1% of data and 3 CPUs: +# [M::main] Real time: 448.742 sec; CPU: 1260.678 sec; Peak 
RSS: 1.830 GB +# 1251.43user 9.26system 7:28.76elapsed 280%CPU (0avgtext+0avgdata 1919296maxresident)k + +# with 10% of data and 3 CPUs: +# [M::main] Real time: 1528.465 sec; CPU: 3772.846 sec; Peak RSS: 2.758 GB +# 3743.86user 29.02system 25:28.51elapsed 246%CPU (0avgtext+0avgdata 2892008maxresident)k +# BUT THAT WAS WITH THE WRONG ARGUMENT ORDER, it is much faster the other way round + +module load conda +conda activate minimap2-2.28 + +TXOME=$1 +OUTPUT=$2 +DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta + +cd "$TMPDIR" || exit +minimap2 -ax splice:hq $DRAFT "$TXOME" > "$OUTPUT"/txome_assembly.sam \ No newline at end of file diff --git a/06-annotation/annot3-RNA-03-pack_sam.sh b/06-annotation/annot3-RNA-03-pack_sam.sh new file mode 100644 index 0000000000000000000000000000000000000000..a81b66e6147734600d90558a0a14da257214fb3c --- /dev/null +++ b/06-annotation/annot3-RNA-03-pack_sam.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=pycno_filter +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=2 +#SBATCH --mem=2G +#SBATCH --time=5:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-filter-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-filter-%j.err + +module load samtools + +TXOME=$1 + +cd "$TXOME" || exit +samtools view -@2 -F 0x900 -q 30 -S -b txome_assembly.sam > txome_assembly.hd.bam + + +module load conda +conda activate agat-1.4.0 + +agat_convert_minimap2_bam2gff.pl -i txome_assembly.hd.bam -o txome_assembly.gff \ No newline at end of file diff --git a/06-annotation/annot3-RNA-03-pack_sams.sh b/06-annotation/annot3-RNA-03-pack_sams.sh new file mode 100644 index 0000000000000000000000000000000000000000..149d0554548707ffa6335c8d8858b761a9981ca4 --- /dev/null +++ b/06-annotation/annot3-RNA-03-pack_sams.sh @@ -0,0 +1,25 @@ +#!/usr/bin/env bash + +FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-03-pack_sam.sh + +OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/ + +EMBRYO3_OUT=$OUT/EMBRYO3/ +INSTAR1_OUT=$OUT/INSTAR1/ +INSTAR2_OUT=$OUT/INSTAR2/ +INSTAR3_OUT=$OUT/INSTAR3/ +INSTAR4_OUT=$OUT/INSTAR4/ +INSTAR5_OUT=$OUT/INSTAR5/ +INSTAR6_OUT=$OUT/INSTAR6/ +JUV1_OUT=$OUT/JUV1/ +SUBADULT_OUT=$OUT/SUBADULT/ + +sbatch $FILTER $EMBRYO3_OUT +sbatch $FILTER $INSTAR1_OUT +sbatch $FILTER $INSTAR2_OUT +sbatch $FILTER $INSTAR3_OUT +sbatch $FILTER $INSTAR4_OUT +sbatch $FILTER $INSTAR5_OUT +sbatch $FILTER $INSTAR6_OUT +sbatch $FILTER $JUV1_OUT +sbatch $FILTER $SUBADULT_OUT \ No newline at end of file diff --git a/06-annotation/annot3-RNA-04-denovo_mining.ipynb b/06-annotation/annot3-RNA-04-denovo_mining.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..0e096cb616e3a0c7d42ad4569d286eff1d52ad6c --- /dev/null +++ b/06-annotation/annot3-RNA-04-denovo_mining.ipynb @@ -0,0 +1,426 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mining the _de novo_ transcriptomes to improve genome annotation\n", + "\n", + "An ISO-seq transcriptome and two rounds of BRAKER3 have produced close to 14,500 gene models for the\n", + "_P. litorale_ genome. However, manual inspection of the RNA-seq peaks in the genome browser revealed\n", + "that there are still plenty of loci with plausible gene locations that did not receive a gene model.\n", + "\n", + "To alleviate that I will try to use the _de novo_ transcriptomes that were generated from the same \n", + "RNA-seq data that was used for the genome annotation. 
The idea is to map onto the draft genome and \n", + "keep the transcripts that don't overlap with any gene models, then gather evidence from multiple\n", + "transcriptomes to generate gene models." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/\n", + "cat isoseq.gff > merged.gff3\n", + "cat braker.gff >> merged.gff3\n", + "cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3\n", + "\n", + "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes\n", + "DRAFT=/Volumes/scratch/pycnogonum/genome/draft/annot_merge/merged.gff3\n", + "BASE=/Volumes/scratch/pycnogonum/genome/draft/transcriptome\n", + "\n", + "EMBRYO3=$BASE/EMBRYO3/txome_assembly.gff\n", + "INSTAR1=$BASE/INSTAR1/txome_assembly.gff\n", + "INSTAR2=$BASE/INSTAR2/txome_assembly.gff\n", + "INSTAR3=$BASE/INSTAR3/txome_assembly.gff\n", + "INSTAR4=$BASE/INSTAR4/txome_assembly.gff\n", + "INSTAR5=$BASE/INSTAR5/txome_assembly.gff\n", + "INSTAR6=$BASE/INSTAR6/txome_assembly.gff\n", + "JUV1=$BASE/JUV1/txome_assembly.gff\n", + "SUBADULT=$BASE/SUBADULT/txome_assembly.gff\n", + "\n", + "bedtools intersect -v -a $EMBRYO3 -b $DRAFT | sed s/TRINITY/EMBRYO3/g > EMBRYO3.gff\n", + "bedtools intersect -v -a $INSTAR1 -b $DRAFT | sed s/TRINITY/INSTAR1/g > INSTAR1.gff\n", + "bedtools intersect -v -a $INSTAR2 -b $DRAFT | sed s/TRINITY/INSTAR2/g > INSTAR2.gff\n", + "bedtools intersect -v -a $INSTAR3 -b $DRAFT | sed s/TRINITY/INSTAR3/g > INSTAR3.gff\n", + "bedtools intersect -v -a $INSTAR4 -b $DRAFT | sed s/TRINITY/INSTAR4/g > INSTAR4.gff\n", + "bedtools intersect -v -a $INSTAR5 -b $DRAFT | sed s/TRINITY/INSTAR5/g > INSTAR5.gff\n", + "bedtools intersect -v -a $INSTAR6 -b $DRAFT | sed s/TRINITY/INSTAR6/g > INSTAR6.gff\n", + "bedtools intersect -v -a $JUV1 -b $DRAFT | sed s/TRINITY/JUV1/g > JUV1.gff\n", + "bedtools intersect -v -a $SUBADULT -b $DRAFT | sed s/TRINITY/SUBADULT/g > SUBADULT.gff" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can look at the overlap of all of these GFF files, again using `bedtools intersect`. Since\n", + "this is already tenuous evidence for gene models we should be as strict as possible." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "%%bash \n", + "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes\n", + "\n", + "bedtools intersect -f 0.9 -F 0.9 -wao -a INSTAR3.gff -b EMBRYO3.gff INSTAR1.gff INSTAR2.gff INSTAR4.gff INSTAR5.gff INSTAR6.gff JUV1.gff SUBADULT.gff > overlap.gff" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm \n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "from matplotlib import pyplot as plt\n", + "from matplotlib.lines import Line2D" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "file_order = {\n", + " \"1\": \"EMBRYO3\",\n", + " \"2\": \"INSTAR1\",\n", + " \"3\": \"INSTAR2\",\n", + " \"4\": \"INSTAR4\",\n", + " \"5\": \"INSTAR5\",\n", + " \"6\": \"INSTAR6\",\n", + " \"7\": \"JUV1\",\n", + " \"8\": \"SUBADULT\",\n", + "}\n", + "\n", + "# assign the tab10 colors to each stage\n", + "colors = {\n", + " \"EMBRYO3\": \"tab:blue\",\n", + " \"INSTAR1\": \"tab:orange\",\n", + " \"INSTAR2\": \"tab:green\",\n", + " \"INSTAR4\": \"tab:red\",\n", + " \"INSTAR5\": \"tab:purple\",\n", + " \"INSTAR6\": \"tab:brown\",\n", + " \"JUV1\": \"tab:pink\",\n", + " \"SUBADULT\": \"tab:gray\",\n", + "}\n", + "\n", + "legend_elements = [\n", + " Line2D([0], [0], marker='o', color='w', label='EMBRYO3', markerfacecolor='tab:blue', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='INSTAR1', markerfacecolor='tab:orange', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='INSTAR2', markerfacecolor='tab:green', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='INSTAR4', markerfacecolor='tab:red', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='INSTAR5', markerfacecolor='tab:purple', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='INSTAR6', markerfacecolor='tab:brown', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='JUV1', markerfacecolor='tab:pink', markersize=8),\n", + " Line2D([0], [0], marker='o', color='w', label='SUBADULT', markerfacecolor='tab:gray', markersize=8),\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/var/folders/md/d6lwwbv97xb6g6ddypntnprh0000gp/T/ipykernel_61764/3506524708.py:1: DtypeWarning: Columns (9,15) have mixed types. 
Specify dtype option on import or set low_memory=False.\n", + " overlap = pd.read_csv('/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap.gff', sep='\\t', header=None)\n" + ] + } + ], + "source": [ + "overlap = pd.read_csv('/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap.gff', sep='\\t', header=None)\n", + "overlap = overlap.drop(columns=[1, 11])\n", + "overlap[\"instar3_match\"] = overlap[8].str.split(';').str[0].str.split('=').str[1].str.split(\".\").str[0]\n", + "overlap[\"target_match\"] = overlap[18].str.split(';').str[0].str.split('=').str[1].str.split(\".\").str[0]\n", + "overlap[\"target_dataset\"] = overlap[9].astype(str).replace(file_order)\n", + "overlap[\"target_id\"] = overlap[\"target_dataset\"].astype(str) + \"_\" + overlap[\"target_match\"].astype(str)\n", + "# we should only keep lines that match similar things (cDNA_match or cDNA_match_part)\n", + "keep = overlap[2] == overlap[12]\n", + "overlap = overlap[keep]" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2635957, 22)" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "overlap.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "gene_level_matches = overlap[overlap[2] == 'cDNA_match']['instar3_match'].unique()\n", + "overlap = overlap[overlap[\"instar3_match\"].isin(gene_level_matches)]" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(531784, 22)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "overlap.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "overview = overlap[overlap[2] == \"cDNA_match\"].groupby(\"instar3_match\").apply(lambda x: x[\"target_dataset\"].value_counts()>0, include_groups=False).unstack()\n", + "overview[\"total_score\"] = overview.sum(axis=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAkAAAAGdCAYAAAD60sxaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy80BEi2AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAneUlEQVR4nO3dfXSU9Z3//9fklhAyCUGSQCHhRjCEG1nAwgi2FVJSyHJQOF20USKwdXUDAilUqFRcqAbpijdb7uxiAksRy1ZsxXKPhVWC3AjIzR7uvAmaTMIpkCHxkITk+v3hYX47Bb8lw5VcST7PxznXOc51XTPz/lhOeXrNNYnLsixLAAAABglxegAAAIDGRgABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAME6Y0wM0BXV1dSouLlZMTIxcLpfT4wAAgFtgWZauXLmijh07KiSkftd0CCBJxcXF6ty5s9NjAACAIJw/f16dOnWq13MIIEkxMTGSvvkX6Ha7HZ4GAADcCp/Pp86dO/v/Hq8PAkjyf+zldrsJIAAAmplgbl/hJmgAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABgnzOkBAAAwXZc57zk9Qr19vijT6RFuC1eAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcRwPoueeek8vlCthSU1P9x69evaqcnBy1a9dObdq00fjx41VaWhrwGkVFRcrMzFTr1q2VkJCg2bNn69q1a429FAAA0IyEOT1A7969tWPHDv/jsLD/f6SZM2fqvffe04YNGxQbG6upU6dq3Lhx+vDDDyVJtbW1yszMVFJSkvbu3auSkhJNnDhR4eHheuGFFxp9LQAAoHlwPIDCwsKUlJR0w/7y8nKtWrVK69at0/DhwyVJ+fn56tWrl/bt26chQ4Zo27ZtOnnypHbs2KHExET1799fCxcu1NNPP63nnntOERERjb0cAADQDDh+D9CZM2fUsWNHdevWTVlZWSoqKpIkHTp0SDU1NUpPT/efm5qaquTkZBUWFkqSCgsL1bdvXyUmJvrPycjIkM/n04kTJ771PauqquTz+QI2AABgDkcDaPDgwSooKNCWLVu0fPlyffbZZ7rvvvt05coVeb1eRUREKC4uLuA5iYmJ8nq9kiSv1xsQP9ePXz/2bfLy8hQbG+vfOnfubO/CAABAk+boR2CjRo3y/3O/fv00ePBgpaSk6Pe//72ioqIa7H3nzp2r3Nxc/2Ofz0cEAQBgEMc/Avu/4uLi1LNnT509e1ZJSUmqrq7W5cuXA84pLS313zOUlJR0w7fCrj++2X1F10VGRsrtdgdsAADAHE0qgCoqKnTu3Dl16NBBAwcOVHh4uHbu3Ok/furUKRUVFcnj8UiSPB6Pjh07prKyMv8527dvl9vtVlpaWqPPDwAAmgdHPwKbNWuWxowZo5SUFBUXF2v+/PkKDQ3Vww8/rNjYWE2ZMkW5ubmKj4+X2+3WtGnT5PF4NGTIEEnSyJEjlZaWpkcffVSLFy+W1+vVvHnzlJOTo8jISCeXBgAAmjBHA+jLL7/Uww8/rL/+9a9q3769hg0bpn379ql9+/aSpJdfflkhISEaP368qqqqlJGRoWXLlvmfHxoaqk2bNunJJ5+Ux+NRdHS0srOztWDBAqeWBAAAmgGXZVmW00M4zefzKTY2VuXl5dwPBABodF3mvOf0CPX2+aJMp0e4rb+/m9Q9QAAAAI2BAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYJwmE0CLFi2Sy+XSjBkz/PuuXr2qnJwctWvXTm3atNH48eNVWloa8LyioiJlZmaqdevWSkhI0OzZs3Xt2rVGnh4AADQnTSKADhw4oJUrV6pfv34B+2fOnKl3331XGzZs0O7du1VcXKxx48b5j9fW1iozM1PV1dXau3evVq9erYKCAj377LONvQQAANCMOB5AFRUVysrK0m9/+1u1bdvWv7+8vFyrVq3SkiVLNHz4cA0cOFD5+fnau3ev9u3bJ0natm2bTp48qbVr16p///4aNWqUFi5cqKVLl6q6utqpJQEAgCbO8QDKyclRZmam0tPTA/YfOnRINTU1AftTU1OVnJyswsJCSVJhYaH69u2rxMRE/zkZGRny+Xw6ceJE4ywAAAA0O2FOvvn69ev18ccf68CBAzcc83q9ioiIUFxcXMD+xMREeb1e/zn/N36uH79+7NtUVVWpqqrK/9jn8wW7BAAA0Aw5dgXo/Pnzmj59un73u9+pVatWjfreeXl5io2N9W+dO3du1PcHAADOciyADh06pLKyMg0YMEBhYWEKCwvT7t279dprryksLEyJiYmqrq7W5cuXA55XWlqqpKQkSVJSUtIN3wq7/vj6OTczd+5clZeX+7fz58/buzgAANCkORZAI0aM0LFjx3TkyBH/NmjQIGVlZfn/OTw8XDt37vQ/59SpUyoqKpLH45EkeTweHTt2TGVlZf5ztm/fLrfbrbS0tG9978jISLnd7oANAACYw7F7gGJiYtSnT5+AfdHR0WrXrp1//5QpU5Sbm6v4+Hi53W5NmzZNHo9HQ4YMkSSNHDlSaWlpevTRR7V48WJ5vV7NmzdPOTk5ioyMbPQ1AQCA5sHRm6D/npdfflkhISEaP368qqqqlJGRoWXLlvmPh4aGatOmTXryySfl8XgUHR2t7OxsLViwwMGpAQBAU+eyLMtyegin+Xw+xcbGqry8nI/DAACNrsuc95weod4+X5Tp9Ai39fe34z8HCAAAoLERQAAAwDgEEAAAMA4BBAAAjEMAAQAA4zTpr8EDAMA3pNAQuAIEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BB
AAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwTlAB9Omnn9o9BwAAQKMJKoDuvPNO3X///Vq7dq2uXr1q90wAAAANKqgA+vjjj9WvXz/l5uYqKSlJ//Iv/6L9+/fbPRsAAECDCCqA+vfvr1dffVXFxcV64403VFJSomHDhqlPnz5asmSJLly4YPecAAAAtrmtm6DDwsI0btw4bdiwQS+++KLOnj2rWbNmqXPnzpo4caJKSkrsmhMAAMA2txVABw8e1L/+67+qQ4cOWrJkiWbNmqVz585p+/btKi4u1tixY+2aEwAAwDZhwTxpyZIlys/P16lTpzR69GitWbNGo0ePVkjINz3VtWtXFRQUqEuXLnbOCgAAYIugAmj58uWaPHmyHnvsMXXo0OGm5yQkJGjVqlW3NRwAAEBDCCqAzpw583fPiYiIUHZ2djAvDwAA0KCCugcoPz9fGzZsuGH/hg0btHr16tseCgAAoCEFdQUoLy9PK1euvGF/QkKCHn/8ca78GKbLnPecHqHePl+U6fQIAAAHBRVARUVF6tq16w37U1JSVFRUdNtDtSTEAQAATU9QH4ElJCTok08+uWH/0aNH1a5du9seCgAAoCEFFUAPP/ywnnrqKb3//vuqra1VbW2tdu3apenTp+uhhx6ye0YAAABbBfUR2MKFC/X5559rxIgRCgv75iXq6uo0ceJEvfDCC7YOCAAAYLegAigiIkJvvfWWFi5cqKNHjyoqKkp9+/ZVSkqK3fMBAADYLqgAuq5nz57q2bOnXbMAAAA0iqACqLa2VgUFBdq5c6fKyspUV1cXcHzXrl22DAcAANAQggqg6dOnq6CgQJmZmerTp49cLpfdcwEAADSYoAJo/fr1+v3vf6/Ro0fbPQ8AAECDC+pr8BEREbrzzjvtngUAAKBRBBVAP/vZz/Tqq6/Ksiy75wEAAGhwQX0E9sEHH+j999/X5s2b1bt3b4WHhwccf/vtt20ZDgAAoCEEdQUoLi5ODz74oL7//e/rjjvuUGxsbMB2q5YvX65+/frJ7XbL7XbL4/Fo8+bN/uNXr15VTk6O2rVrpzZt2mj8+PEqLS0NeI2ioiJlZmaqdevWSkhI0OzZs3Xt2rVglgUAAAwR1BWg/Px8W968U6dOWrRokXr06CHLsrR69WqNHTtWhw8fVu/evTVz5ky999572rBhg2JjYzV16lSNGzdOH374oaRvvo6fmZmppKQk7d27VyUlJZo4caLCw8P5idQAAOBbBXUFSJKuXbumHTt2aOXKlbpy5Yokqbi4WBUVFbf8GmPGjNHo0aPVo0cP9ezZU88//7zatGmjffv2qby8XKtWrdKSJUs0fPhwDRw4UPn5+dq7d6/27dsnSdq2bZtOnjyptWvXqn///ho1apQWLlyopUuXqrq6OtilAQCAFi6oAPriiy/Ut29fjR07Vjk5Obpw4YIk6cUXX9SsWbOCGqS2tlbr169XZWWlPB6PDh06pJqaGqWnp/vPSU1NVXJysgoLCyVJhYWF6tu3rxITE/3nZGRkyOfz6cSJE9/6XlVVVfL5fAEbAAAwR1ABNH36dA0aNEiXLl1SVFSUf/+DDz6onTt31uu1jh07pjZt2igyMlJPPPGENm7cqLS0NHm9XkVERCguLi7g/MTERHm9XkmS1+sNiJ/rx68f+zZ5eXkB9yx17ty5XjMDAIDmLah7gP7nf/5He/fuVURERMD+Ll266KuvvqrXa9111106cuSIysvL9d///d/Kzs7W7t27gxnrls2dO1e5ubn+xz6fjwgCAMAgQQVQXV2damtrb9j/5ZdfKiYmpl6v9X9/qOLAgQN14MABvfrqq5owYYKqq6t1+fLlgKtApaWlSkpKkiQlJSVp//79Aa93/Vti18+5mcjISEVGRtZrTgAA0HIE9RHYyJEj9corr/gfu1wuVVRUaP78+bf96zHq6upUVVWlgQMHKjw8POAjtVOnTqmoqEgej0eS5PF4dOzYMZWVlfnP2b59u9xut9LS0m5rDgAA0HIFdQXopZdeUkZGhtLS0nT16lX95Cc/0ZkzZ3THHXfozTffvOXXmTt3rkaNGqXk5GRduXJF69at01/+8hdt3bpVsbGxmjJlinJzcxUfHy+3261p06bJ4/FoyJAhkr4JsbS0ND366KNavHixvF6v5s2bp5ycHK7wAACAbxVUAHXq1ElHjx7V+vXr9cknn6iiokJTpkxRVlZWwE3Rf09ZWZkmTpyokpISxcbGql+/ftq6dat++MMfSpJefvllhYSEaPz48aqqqlJGRoaWLVvmf35oaKg2bdqkJ598Uh6PR9HR0crOztaCBQuCWRYAADBEUAEkSWFhYXrkkUdu681XrVr1/zzeqlUrLV26VEuXLv3Wc1JSUvTnP//5tuYAAABmCSqA1qxZ8/88PnHixKCGAZqiLnPec3qEevt8UabTIwBAkxZUAE2fPj3gcU1Njb7++mtFRESodevWBBAAAGjSgvoW2KVLlwK2iooKnTp1SsOGDavXTdAAAABOCPp3gf2tHj16aNGiRTdcHQIAAGhqbAsg6Zsbo4uLi+18SQAAANsFdQ/Qn/70p4DHlmWppKREv/nNbzR06FBbBgPQOLjJG4CJggqgBx54IOCxy+VS+/btNXz4cL300kt2zAUAtiHyAPytoH8XGAAAQHNl6z1AAAAAzUFQV4Byc3Nv+dwlS5YE8xYAAAANJqgAOnz4sA4fPqyamhrdddddkqTTp08rNDRUAwYM8J/ncrnsmRIAAMBGQQXQmDFjFBMTo9WrV6tt27aSvvnhiJMmTdJ9992nn/3sZ7YOCQAAYKeg7gF66aWXlJeX548fSWrbtq1+9atf8S0wAADQ5AUVQD6fTxcuXLhh/4ULF3TlypXbHgoAAKAhBRVADz74oCZNmqS3335bX375pb788kv94Q9/0JQpUzRu3Di7ZwQAALBVUPcArVixQrNmzdJPfvIT1dTUfPNCYWGaMmWKfv3rX9s6IAAAgN2CCqDWrVtr2bJl+vWvf61z585Jkrp3767o6GhbhwMAAGgIt/WDEEtKSlRSUqIePXooOjpalmXZNRcAAECDCSqA/vrXv2rEiBHq2bOnRo8erZKSEknSlClT+Ao8AABo8oIKoJkzZyo8PFxFRUVq3bq1f/+ECRO0ZcsW24YDAABoCEHdA7Rt2zZt3bpVnTp1Ctjfo0cPffHFF7YMBgAA0FCCugJUWVkZcOXnuosXLyoyMvK2hwIAAGhIQQXQfffdpzVr1vgfu1wu1dXVafHixbr//vttGw4AAKAhBPUR2OLFizVixAgdPHhQ1dXV+vnPf64TJ07o4sWL+vDDD+2eEQAAwFZBXQHq06ePTp8+rWHDhmns2LGqrKzUuHHjdPjwYXXv3t3uGQEAAGxV7ytANTU1+tGPfqQVK1bomWeeaYiZAAAAGlS9Ayg8PFyffPJJQ8wCAAhClznvOT1CvX2+KNPpEWC4oD4Ce+SRR7Rq1Sq7ZwEAAGgUQd0Efe3aNb3xxhvasWOHBg4ceMPvA
FuyZIktwwEAADSEegXQp59+qi5duuj48eMaMGCAJOn06dMB57hcLvumAwAAaAD1CqAePXqopKRE77//vqRvfvXFa6+9psTExAYZDgAAoCHU6x6gv/1t75s3b1ZlZaWtAwEAADS0oG6Cvu5vgwgAAKA5qFcAuVyuG+7x4Z4fAADQ3NTrHiDLsvTYY4/5f+Hp1atX9cQTT9zwLbC3337bvgkBAABsVq8Ays7ODnj8yCOP2DoMAABAY6hXAOXn5zfUHAAAAI3mtm6CBgAAaI4IIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGMfRAMrLy9M999yjmJgYJSQk6IEHHtCpU6cCzrl69apycnLUrl07tWnTRuPHj1dpaWnAOUVFRcrMzFTr1q2VkJCg2bNn69q1a425FAAA0Iw4GkC7d+9WTk6O9u3bp+3bt6umpkYjR45UZWWl/5yZM2fq3Xff1YYNG7R7924VFxdr3Lhx/uO1tbXKzMxUdXW19u7dq9WrV6ugoEDPPvusE0sCAADNQJiTb75ly5aAxwUFBUpISNChQ4f0ve99T+Xl5Vq1apXWrVun4cOHS5Ly8/PVq1cv7du3T0OGDNG2bdt08uRJ7dixQ4mJierfv78WLlyop59+Ws8995wiIiKcWBoAAGjCmtQ9QOXl5ZKk+Ph4SdKhQ4dUU1Oj9PR0/zmpqalKTk5WYWGhJKmwsFB9+/ZVYmKi/5yMjAz5fD6dOHHipu9TVVUln88XsAEAAHM0mQCqq6vTjBkzNHToUPXp00eS5PV6FRERobi4uIBzExMT5fV6/ef83/i5fvz6sZvJy8tTbGysf+vcubPNqwEAAE1ZkwmgnJwcHT9+XOvXr2/w95o7d67Ky8v92/nz5xv8PQEAQNPh6D1A102dOlWbNm3Snj171KlTJ//+pKQkVVdX6/LlywFXgUpLS5WUlOQ/Z//+/QGvd/1bYtfP+VuRkZGKjIy0eRUAAKC5cPQKkGVZmjp1qjZu3Khdu3apa9euAccHDhyo8PBw7dy507/v1KlTKioqksfjkSR5PB4dO3ZMZWVl/nO2b98ut9uttLS0xlkIAABoVhy9ApSTk6N169bpj3/8o2JiYvz37MTGxioqKkqxsbGaMmWKcnNzFR8fL7fbrWnTpsnj8WjIkCGSpJEjRyotLU2PPvqoFi9eLK/Xq3nz5iknJ4erPAAA4KYcDaDly5dLkn7wgx8E7M/Pz9djjz0mSXr55ZcVEhKi8ePHq6qqShkZGVq2bJn/3NDQUG3atElPPvmkPB6PoqOjlZ2drQULFjTWMgAAQDPjaABZlvV3z2nVqpWWLl2qpUuXfus5KSkp+vOf/2znaAAAoAVrMt8CAwAAaCwEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOI4G0J49ezRmzBh17NhRLpdL77zzTsBxy7L07LPPqkOHDoqKilJ6errOnDkTcM7FixeVlZUlt9utuLg4TZkyRRUVFY24CgAA0Nw4GkCVlZW6++67tXTp0pseX7x4sV577TWtWLFCH330kaKjo5WRkaGrV6/6z8nKytKJEye0fft2bdq0SXv27NHjjz/eWEsAAADNUJiTbz5q1CiNGjXqpscsy9Irr7yiefPmaezYsZKkNWvWKDExUe+8844eeugh/e///q+2bNmiAwcOaNCgQZKk//iP/9Do0aP17//+7+rYsWOjrQUAADQfTfYeoM8++0xer1fp6en+fbGxsRo8eLAKCwslSYWFhYqLi/PHjySlp6crJCREH3300be+dlVVlXw+X8AGAADM0WQDyOv1SpISExMD9icmJvqPeb1eJSQkBBwPCwtTfHy8/5ybycvLU2xsrH/r3LmzzdMDAICmrMkGUEOaO3euysvL/dv58+edHgkAADSiJhtASUlJkqTS0tKA/aWlpf5jSUlJKisrCzh+7do1Xbx40X/OzURGRsrtdgdsAADAHE02gLp27aqkpCTt3LnTv8/n8+mjjz6Sx+ORJHk8Hl2+fFmHDh3yn7Nr1y7V1dVp8ODBjT4zAABoHhz9FlhFRYXOnj3rf/zZZ5/pyJEjio+PV3JysmbMmKFf/epX6tGjh7p27apf/vKX6tixox544AFJUq9evfSjH/1IP/3pT7VixQrV1NRo6tSpeuihh/gGGAAA+FaOBtDBgwd1//33+x/n5uZKkrKzs1VQUKCf//znqqys1OOPP67Lly9r2LBh2rJli1q1auV/zu9+9ztNnTpVI0aMUEhIiMaPH6/XXnut0dcCAACaD0cD6Ac/+IEsy/rW4y6XSwsWLNCCBQu+9Zz4+HitW7euIcYDAAAtVJO9BwgAAKChEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIzTYgJo6dKl6tKli1q1aqXBgwdr//79To8EAACaqBYRQG+99ZZyc3M1f/58ffzxx7r77ruVkZGhsrIyp0cDAABNUIsIoCVLluinP/2pJk2apLS0NK1YsUKtW7fWG2+84fRoAACgCQpzeoDbVV1drUOHDmnu3Ln+fSEhIUpPT1dhYeFNn1NVVaWqqir/4/LyckmSz+ezfb66qq9tf82GVt9/Dy19jayv6eHPaCDW1/TwZ7RxZ7Asq/5Ptpq5r776ypJk7d27N2D/7Nmzre9+97s3fc78+fMtSWxsbGxsbGwtYDt37ly9+6HZXwEKxty5c5Wbm+t/XFdXp4sXL6pdu3ZyuVwOTnbrfD6fOnfurPPnz8vtdjs9ju1YX/PX0tfY0tcntfw1sr7m
r7y8XMnJyYqPj6/3c5t9AN1xxx0KDQ1VaWlpwP7S0lIlJSXd9DmRkZGKjIwM2BcXF9dQIzYot9vdYv9gS6yvJWjpa2zp65Na/hpZX/MXElL/W5qb/U3QERERGjhwoHbu3OnfV1dXp507d8rj8Tg4GQAAaKqa/RUgScrNzVV2drYGDRqk7373u3rllVdUWVmpSZMmOT0aAABoglpEAE2YMEEXLlzQs88+K6/Xq/79+2vLli1KTEx0erQGExkZqfnz59/wUV5Lwfqav5a+xpa+Pqnlr5H1NX+3s0aXZQXz3TEAAIDmq9nfAwQAAFBfBBAAADAOAQQAAIxDAAEAAOMQQM3Mnj17NGbMGHXs2FEul0vvvPOO0yPZKi8vT/fcc49iYmKUkJCgBx54QKdOnXJ6LNssX75c/fr18/9gMo/Ho82bNzs9VoNZtGiRXC6XZsyY4fQotnnuuefkcrkCttTUVKfHstVXX32lRx55RO3atVNUVJT69u2rgwcPOj2Wbbp06XLD/4Yul0s5OTlOj2aL2tpa/fKXv1TXrl0VFRWl7t27a+HChcH9vqwm6sqVK5oxY4ZSUlIUFRWle++9VwcOHKjXa7SIr8GbpLKyUnfffbcmT56scePGOT2O7Xbv3q2cnBzdc889unbtmn7xi19o5MiROnnypKKjo50e77Z16tRJixYtUo8ePWRZllavXq2xY8fq8OHD6t27t9Pj2erAgQNauXKl+vXr5/Qotuvdu7d27NjhfxwW1nL+r/TSpUsaOnSo7r//fm3evFnt27fXmTNn1LZtW6dHs82BAwdUW1vrf3z8+HH98Ic/1I9//GMHp7LPiy++qOXLl2v16tXq3bu3Dh48qEmTJik2NlZPPfWU0+PZ4p//+Z91/Phx/dd//Zc6duyotWvXKj09XSdPntR3vvOdW3uRev/2MDQZkqyNGzc6PUaDKisrsyRZu3fvdnqUBtO2bVvrP//zP50ew1ZXrlyxevToYW3fvt36/ve/b02fPt3pkWwzf/586+6773Z6jAbz9NNPW8OGDXN6jEY1ffp0q3v37lZdXZ3To9giMzPTmjx5csC+cePGWVlZWQ5NZK+vv/7aCg0NtTZt2hSwf8CAAdYzzzxzy6/DR2Bo0srLyyUpqF9019TV1tZq/fr1qqysbHG/tiUnJ0eZmZlKT093epQGcebMGXXs2FHdunVTVlaWioqKnB7JNn/60580aNAg/fjHP1ZCQoL+4R/+Qb/97W+dHqvBVFdXa+3atZo8eXKz+WXYf8+9996rnTt36vTp05Kko0eP6oMPPtCoUaMcnswe165dU21trVq1ahWwPyoqSh988MEtv07LuW6LFqeurk4zZszQ0KFD1adPH6fHsc2xY8fk8Xh09epVtWnTRhs3blRaWprTY9lm/fr1+vjjj+v9eXxzMXjwYBUUFOiuu+5SSUmJ/u3f/k333Xefjh8/rpiYGKfHu22ffvqpli9frtzcXP3iF7/QgQMH9NRTTykiIkLZ2dlOj2e7d955R5cvX9Zjjz3m9Ci2mTNnjnw+n1JTUxUaGqra2lo9//zzysrKcno0W8TExMjj8WjhwoXq1auXEhMT9eabb6qwsFB33nnnrb+Q3Zem0HjUwj8Ce+KJJ6yUlBTr/PnzTo9iq6qqKuvMmTPWwYMHrTlz5lh33HGHdeLECafHskVRUZGVkJBgHT161L+vpX0E9rcuXbpkud3uFvMxZnh4uOXxeAL2TZs2zRoyZIhDEzWskSNHWv/4j//o9Bi2evPNN61OnTpZb775pvXJJ59Ya9asseLj462CggKnR7PN2bNnre9973uWJCs0NNS65557rKysLCs1NfWWX4MrQGiSpk6dqk2bNmnPnj3q1KmT0+PYKiIiwv9fKQMHDtSBAwf06quvauXKlQ5PdvsOHTqksrIyDRgwwL+vtrZWe/bs0W9+8xtVVVUpNDTUwQntFxcXp549e+rs2bNOj2KLDh063HBFslevXvrDH/7g0EQN54svvtCOHTv09ttvOz2KrWbPnq05c+booYcekiT17dtXX3zxhfLy8lrMVbzu3btr9+7dqqyslM/nU4cOHTRhwgR169btll+De4DQpFiWpalTp2rjxo3atWuXunbt6vRIDa6urk5VVVVOj2GLESNG6NixYzpy5Ih/GzRokLKysnTkyJEWFz+SVFFRoXPnzqlDhw5Oj2KLoUOH3vCjJ06fPq2UlBSHJmo4+fn5SkhIUGZmptOj2Orrr79WSEjgX++hoaGqq6tzaKKGEx0drQ4dOujSpUvaunWrxo4de8vP5QpQM1NRURHwX5qfffaZjhw5ovj4eCUnJzs4mT1ycnK0bt06/fGPf1RMTIy8Xq8kKTY2VlFRUQ5Pd/vmzp2rUaNGKTk5WVeuXNG6dev0l7/8RVu3bnV6NFvExMTccL9WdHS02rVr12Lu45o1a5bGjBmjlJQUFRcXa/78+QoNDdXDDz/s9Gi2mDlzpu6991698MIL+qd/+ift379fr7/+ul5//XWnR7NVXV2d8vPzlZ2d3aJ+jIEkjRkzRs8//7ySk5PVu3dvHT58WEuWLNHkyZOdHs02W7dulWVZuuuuu3T27FnNnj1bqampmjRp0q2/SIN9QIcG8f7771uSbtiys7OdHs0WN1ubJCs/P9/p0WwxefJkKyUlxYqIiLDat29vjRgxwtq2bZvTYzWolnYP0IQJE6wOHTpYERER1ne+8x1rwoQJ1tmzZ50ey1bvvvuu1adPHysyMtJKTU21Xn/9dadHst3WrVstSdapU6ecHsV2Pp/Pmj59upWcnGy1atXK6tatm/XMM89YVVVVTo9mm7feesvq1q2bFRERYSUlJVk5OTnW5cuX6/UaLstqQT8aEgAA4BZwDxAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4/x8aGBSYY3Q3ogAAAABJRU5ErkJggg==", + "text/plain": [ + "<Figure size 640x480 with 1 Axes>" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "overview[\"total_score\"].plot(kind='hist', bins=np.arange(9)+0.6, color='tab:blue', width=0.8);" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "def find_gene_boundaries(df):\n", + " entire = df[df[2] == \"cDNA_match\"]\n", + " if entire.shape[0] == 0:\n", + " return None\n", + " start = entire[3].unique()\n", + " end = 
entire[4].unique()\n", + " if len(start) > 1 or len(end) > 1:\n", + " return ValueError(\"More than one start or end\")\n", + " return (df[0].iloc[0], start[0], end[0])" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [], + "source": [ + "def find_exon_boundaries(df):\n", + " start = df[13].value_counts().idxmax()\n", + " end = df[14].value_counts().idxmax()\n", + " return (start, end)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "def gene_string(gene_id, start, stop, seqid, source, strand, support):\n", + " seq_type = \"gene\"\n", + " score = \".\"\n", + " phase = \".\"\n", + " attributes = f\"ID={gene_id};gene_id={gene_id};support={support}\"\n", + " string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " return string\n", + "\n", + "def transcript_string(gene_id, transcript_id, seqid, source, start, stop, score, strand):\n", + " seq_type = \"mRNA\"\n", + " phase = \".\"\n", + " attributes = f\"ID={transcript_id};Parent={gene_id}\"\n", + " string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " return string\n", + "\n", + "def exon_string(transcript_id, exon_counter, seqid, source, start, stop, score, strand):\n", + " seq_type = \"exon\"\n", + " phase = \".\"\n", + " attributes = f\"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}\"\n", + " exon_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " \n", + " seq_type = \"CDS\"\n", + " phase=\"0\"\n", + " attributes = f\"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}\"\n", + " CDS_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " \n", + " return exon_string + CDS_string\n", + "\n", + "def contained(gene, catalogue):\n", + " scaffold, start, stop = gene\n", + " for seen in catalogue:\n", + " if scaffold != seen[0]:\n", + " continue\n", + " if start >= seen[1] and stop <= seen[2]:\n", + " return True\n", + " if start == seen[1] or stop == seen[2]:\n", + " return True\n", + " return False" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|██████████| 2686/2686 [02:12<00:00, 20.34it/s]\n" + ] + } + ], + "source": [ + "# test = overlap[overlap[\"instar3_match\"] == \"match262\"]\n", + "problematic = []\n", + "seen = []\n", + "\n", + "with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap_translated.gff3\", \"w\") as gff:\n", + " for i, match in enumerate(tqdm(gene_level_matches)):\n", + " gene = overlap[overlap[\"instar3_match\"] == match]\n", + " pseudochrom, start, stop = find_gene_boundaries(gene)\n", + " if contained([pseudochrom, start, stop], seen):\n", + " continue\n", + " seen.append([pseudochrom, start, stop])\n", + " parts = gene[gene[2] == \"cDNA_match_part\"].copy()\n", + " if parts.shape[0] == 0:\n", + " problematic.append(match)\n", + " continue\n", + " parts[\"boundary\"] = parts[3].astype(str) + \"-\" + parts[4].astype(str)\n", + " exons = parts.groupby(\"boundary\").apply(lambda x: find_exon_boundaries(x), include_groups=False)\n", + "\n", + " seqid = gene[0].values[0]\n", + " gene_id = f\"at_DN{i+1:04d}\"\n", + " source = \"Trinity\"\n", + " strand = gene[6].values[0]\n", + " 
support = len(gene[\"target_dataset\"].unique())\n", + " gene_gff = gene_string(gene_id, start, stop, seqid, source, strand, support)\n", + " gff.write(gene_gff)\n", + " transcript_gff = transcript_string(gene_id, gene_id+\".t1\", seqid, source, start, stop, \".\", strand)\n", + " gff.write(transcript_gff)\n", + " for exon in exons:\n", + " exon_gff = exon_string(gene_id + \".t1\", i, seqid, source, exon[0], exon[1], \".\", strand)\n", + " gff.write(exon_gff)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "['match20947',\n", + " 'match31308',\n", + " 'match31516',\n", + " 'match31653',\n", + " 'match32973',\n", + " 'match36716']" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "problematic" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "780" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(seen)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/06-annotation/install_braker.md b/06-annotation/install_braker.md new file mode 100644 index 0000000000000000000000000000000000000000..ae68eec064208e4a002a4b45716cd2227e8b842a --- /dev/null +++ b/06-annotation/install_braker.md @@ -0,0 +1,361 @@ +# Installing BRAKER3 on LISC + +#### Accurate as of 2023-12-21 + +## Motivation + +BRAKER3 is one of the premier pipelines for eukaryotic gene prediction. Unfortunately, it is also +one of the most difficult to install and run, especially in a HPC environment and _especially_ +without root access. This is because BRAKER3 is a pipeline that combines several different tools, +needs several obscure dependencies, and requires a lot of configuration. + +Since I foresee a lot of gene prediction in my future, I decided to take the time to document my +progress installing and running the software so that future me (and maybe current/future you) can +benefit the next time this is needed. + +## Installation + +My main sources are +* the [BRAKER3 GitHub page](https://github.com/Gaius-Augustus/BRAKER) +* the [BRAKER3 DockerHub page](https://hub.docker.com/r/teambraker/braker3) +* the [LISC HOWTO run containers page](https://lisc.univie.ac.at/wiki/content/working_environment/applications/software_as_container#singularity_and_apptainer) + +### Step 1: Get BRAKER image + +We are going to be working exclusively with Singularity, since Docker is not really optimised for +HPC environments. There is no dedicated Singularity image for BRAKER3, but there is a Docker image +that Singularity can use. Let's pull that. + +First I will create an appropriate directory structure for all of this: + +``` +>niko@laptop$ ssh lisc +>niko@lisc:~$ mkdir containers +>niko@lisc:~$ cd containers +``` + +The LISC website suggests the `apptainer` env, which should contain the newest version of +apptainer/singularity. Unfortunately, this does not seem to be the case, so I will use the +`singularity` env instead. 
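+
+If you want to double-check what each environment actually provides before settling on one (a
+quick, optional sanity check; the env names below are the ones used on LiSC at the time of
+writing and may differ on your cluster), the version commands make the comparison easy:
+
+```
+>niko@lisc:~/containers$ module load conda
+>niko@lisc:~/containers$ conda activate apptainer && apptainer --version
+>niko@lisc:~/containers$ conda deactivate
+>niko@lisc:~/containers$ conda activate singularity && singularity --version
+```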
+ +``` +>niko@lisc:~/containers$ module load conda +>niko@lisc:~/containers$ conda activate singularity +``` + +Pull the image: + +``` +>niko@lisc:~/containers$ singularity build braker3.sif docker://teambraker/braker3:latest +``` + +<details> + <summary>output</summary> + ``` + (lots of text but the end should look similar to this:) + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/url-loader/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2pdf} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2png} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:46 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2vg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/csv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/csv2tsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/dsv2dsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/dsv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2csv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2dsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2tsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topo2geo} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topomerge} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topoquantize} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn 
rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/tsv2csv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/tsv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geo2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geograticule} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geoproject} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geoquantize} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geostitch} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2pdf} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2png} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2vg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/JSONStream} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/envinfo} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/handlebars} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/js-yaml} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/marked} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/mime} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/mkdirp} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn 
rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/pino} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/request/node_modules/.bin/uuid} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/warning/node_modules/.bin/loose-envify} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/.bin/acorn} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/.bin/browserslist} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/terser-webpack-plugin/node_modules/.bin/terser} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/terser-webpack-plugin/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/acorn} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/mkdirp} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/opener} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-cli/node_modules/.bin/import-local-fixture} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-cli/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/worker-loader/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/yarn-deduplicate/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:16:47 info unpack layer: sha256:326593c72c7e03cb9871c467596fa18fc4b639718a0d9b6401fa9d10c651dfa2 + 2023/12/21 11:16:54 info unpack layer: sha256:5c0048263e6eb546f9aca8c9de2e27e5f27ab5d0f65b11ec910ec6c425aa89f3 + 2023/12/21 11:16:54 info unpack layer: sha256:179f754036e2ed1733341fa46bb1701649ad0cb671cbb1c27024e82c28a7c739 + 2023/12/21 11:16:54 info unpack layer: sha256:e0f250156230802be0037b5aab71a25983bc15fd6939ddfe860f8ca769aae171 + 2023/12/21 11:16:59 info unpack layer: 
sha256:722137bce0015e70b1c769e21a9cee9e2db0dad25023d40b7be5de4f6cd36043 + 2023/12/21 11:16:59 info unpack layer: sha256:afd4f74c025028308b4f876a35509b1c8e66fa6cae794cd7cd756fcf8b9b1538 + 2023/12/21 11:16:59 info unpack layer: sha256:1c44777d0dc4b79af795a0c0e36ff5c06de0ebf44d720d2362b736256cc6608e + 2023/12/21 11:16:59 info unpack layer: sha256:d1b34f82f8cdd5b8a1a4b02092856245514c8d9a8f4c6967ec3cec874d1b8829 + 2023/12/21 11:17:00 info unpack layer: sha256:7da56a27f920dac47ec7df9eabd6bd4512b38968bfce850ee97ead3a85060fca + 2023/12/21 11:17:01 info unpack layer: sha256:911b3fb18e1c56358fae9bede839bfc53d2f7c23074805b8e50682f3d127f72e + 2023/12/21 11:17:02 info unpack layer: sha256:a8aaeed7aa6f26e7a85826f0c81a19764c544860568bed2565f89aa5a8eccd91 + 2023/12/21 11:17:03 info unpack layer: sha256:4e55a6c82cf9816efd77f43fda4782801df62d729f37e64cf646c8c830c0202d + 2023/12/21 11:17:03 info unpack layer: sha256:4e8420ef9a6d15a7586ac7a35b91a85cba159711c14b6b3d6c12c60f802749c8 + 2023/12/21 11:17:03 info unpack layer: sha256:d2fcf73697ea247644f6b122f58d8388fbfa87727ef8640485cced96a764a24e + 2023/12/21 11:17:03 info unpack layer: sha256:1079f418230549427d87bfba97ed9f97f16ad8383d036bf380d48674f6ffbd43 + 2023/12/21 11:17:10 warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/74/32/89d2fc016984c6e2e7b9772ef32b1223783f6a7c6bf7c01cedf6b7198fcbdc968d0b79375939cda8169ccce2297c2dd698619b0b73fdbebe8638776523d1} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/a0/42/8e68d25c890945d9f40cfa48df32f539fdb939bb3d2042a5fd60e02d47221c144de430001d63b6e0739ec4f32eb6a889f55b4b5595b4c88f7ddbe1bf78ce} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/b7/f5/6b32b6a074deca06db865ddc0e078371033360075e39ce72e3a92ef0404452feed6e253ab67af36c76240b67aa82b76fdb60d594daa8cdfc5ccddd55815c} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/c8/df/f4f9c4e46618c643c0951ecc7c3d15413af8b6a43f2bfb0d6236d0b02530812035365425fe75b9f42b732f9fd9208a4f97f6281f0b7761f368ef30799b22} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/ff/46/32caf6c684788c1988aefa35dd857d3ebbc978564c6eaf7d91a8395d2b034695845f8dcf346925064e7f4c91f51e7a3637449a48cb0fcc7dc6cac7f66078} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libblas.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libcblas.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libcrypto.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libgfortran.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libgfortran.so.5} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/liblapack.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libopenblas.so} ignoring (usually) harmless EPERM on setxattr 
"user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libopenblas.so.0} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:10 warn rootless{opt/conda/lib/libssl.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:17 warn rootless{opt/conda/ssl/cert.pem} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:17 warn rootless{opt/conda/ssl/misc/tsget} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers" + 2023/12/21 11:17:17 info unpack layer: sha256:72bfc406f45c190828445af56635120ccd43b4cc0ff96b1f4aed725fc31f3dab + 2023/12/21 11:17:20 info unpack layer: sha256:1cc47d4e243f5f6f7538d9d8c1849f08ae21c306410b0d8a8310a6d374a84a4d + INFO: Creating SIF file... + INFO: Build complete: braker3.sif + ``` +</details> + +This should result in a file called "braker3.sif" in the current directory. + +At this point, we should be able to run BRAKER3. Let's test it out: + +```bash +>niko@lisc:~/containers$ singularity exec braker3.sif braker.pl +``` + +This prints BRAKER's help message, so it seems to work. + +### Step 2: GeneMark-ETP license + +There are additional instructions for getting BRAKER3 to run, where a valid license for GeneMark-ETP +is required. BRAKER suggests you run the following command: + +```bash +>niko@lisc:~/containers$ singularity exec braker3.sif braker.pl print_braker3_setup.py +``` + +but for me this hangs _forever_, so I looked for the script itself. While it is not in the +repository any more, you can still find it in an old +[commit](https://github.com/Gaius-Augustus/BRAKER/commit/dfd688adcd71e96fb02cc784e2b05b307874377d). +I extrapolated from the script that this is what has to happen: + +Go to the [GeneMark website](http://exon.gatech.edu/GeneMark/license_download.cgi), and pretend you +want to download `GeneMark-ES/ET/EP ver 4.71_lic`. This will take you to a page where you can choose +to download the software (ignore) and the key. Copy the link for the 64bit key. + +Then, on LISC, do the following: + +```bash +>niko@lisc:~/containers$ cd ~ +>niko@lisc:~$ wget <link> # this is the link for the 64 bit key +>niko@lisc:~$ gunzip gm_key_64.gz +>niko@lisc:~$ mv gm_key_64 ~/.gm_key +``` + +This should be enough to get BRAKER3 with GeneMark-ETP to run. + +### Step 3: Test run & Augustus config + +Let's make a test directory: + +```bash +>niko@lisc:~/containers$ mkdir test +>niko@lisc:~/containers$ cd test +``` + +A critical part of the BRAKER3 pipeline is Augustus, another notoriously finicky piece of software. +If we don't say anything, Augustus will try to overwrite its default config files, somewhere in the +depths of the container, where it doesn't have write access. We need to provide a valid config +collection to Augustus in a writable location. It seems like the most prudent way to do this is to +clone the Augustus repository somewhere safe and then copy the config folder to the test directory: + +```bash +>niko@lisc:~/containers/test$ cd ~/repos +>niko@lisc:~/repos$ git clone git://github.com/Gaius-Augustus/Augustus.git +>niko@lisc:~/repos$ cd - +>niko@lisc:~/containers/test$ cp -r ~/repos/Augustus/config . +``` + +In the future, in the scripts we will write for slurm, we will have to copy the config folder to the +corresponding results directory on `/scratch/` or directly in the `$TMPDIR`. + +Now we can run BRAKER3. 
I am modifying the final call from the [test3 +script](https://github.com/Gaius-Augustus/BRAKER/blob/master/example/singularity-tests/test3.sh), +and benefitting from [najoshi](https://github.com/najoshi)'s [helpful GitHub +comment](https://github.com/Gaius-Augustus/BRAKER/issues/609#issuecomment-1571506735): + +```bash +>niko@lisc:~/containers/test$ singularity exec -B ${PWD} $HOME/containers/braker3.sif braker.pl \ +--AUGUSTUS_CONFIG_PATH=${PWD}/config \ +--genome=/opt/BRAKER/example/genome.fa \ +--prot_seq=/opt/BRAKER/example/proteins.fa \ +--bam=/opt/BRAKER/example/RNAseq.bam \ +--softmasking \ +--threads 8 \ +--gm_max_intergenic 10000 \ +--skipOptimize +``` + +This took less than 10 minutes for me, and the output should look something like that: + +```bash +INFO: Converting SIF file to temporary sandbox... +# Thu Dec 21 14:42:44 2023:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3). +#********* +# Thu Dec 21 14:42:44 2023: Log information is stored in file /lisc/user/papadopoulos/test/braker/braker.log +#********* +# WARNING: Number of reliable training genes is low (114). Recommended are at least 600 genes +#********* +INFO: Cleaning up image... +``` + +## Preparing the run + +It is now time to compose a script for our real-life use cases. We aim to run BRAKER with RNA-seq +and protein data, so we need to produce BRAKER's desired input files first. + +A good reference for this is the recent [TSEBRA +tutorial](https://github.com/KatharinaHoff/BRAKER-TSEBRA-Workshop/blob/main/GenomeAnnotation.ipynb) +offered by Katharina Hoff. This tutorial is pretty detailed, and if a user made it this far they can +probably follow it. I will just outline the steps I took. + +#### Step 0: repeat masking + +Repeats are a big problem for gene prediction; they may look coincidentally like protein-coding +genes, or may be actual protein-coding genes (such as transposases); both cases are usually not what +we are interested in. A genome should be repeat-masked prior to gene prediction. Paraphrasing from +the tutorial: + +```bash +T=72 # you need a large number of threads and fast i/o storage +GENOME=/path/to/your/genome.fa # the fasta file of your genome +DB=some_db_name_that_fits_to_species + +BuildDatabase -name ${DB} ${GENOME} +RepeatModeler -database ${DB} -pa ${T} -LTRStruct +RepeatMasker -pa 72 -lib ${DB}-families.fa -xsmall ${GENOME} +``` + +The `-xsmall` flag is important, as it means we are _soft-masking_ the genome, i.e. replacing the +repeats with lower-case letters instead of Ns. + +#### Step 1: RNA-seq alignment + +Spliced alignments of bulk RNA-seq data are invaluable for finding protein-coding genes, as they +can pinpoint the location of exons, introns, and UTRs. I am using STAR for this, but any other +aligner should work as well. The tutorial uses HISAT2. 
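+
+For completeness, the HISAT2 equivalent of the two STAR steps below would look roughly like this.
+This is only a sketch following the HISAT2 manual (I did not run it for this genome); `$FASTA`,
+`$R1` and `$R2` are the same placeholders as in the STAR commands, and `genome_index` is an
+arbitrary index prefix. Spliced reads should come out with the XS strand tag that GeneMark-ETP
+needs, but it is worth spot-checking the resulting BAM.
+
+```bash
+# build the HISAT2 index once
+hisat2-build "$FASTA" genome_index
+
+# spliced alignment; --dta reports alignments tailored for downstream transcript assembly
+hisat2 -p 30 --dta -x genome_index -1 "$R1" -2 "$R2" -S Aligned.out.sam
+```
+
+The rest of this section sticks with STAR.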
+ +First create an index for your genome: + +```bash +STAR --runThreadN 30 \ # choose an appropriate number of threads +--runMode genomeGenerate \ +--genomeDir "$ASSEMBLY" \ # the path to the directory where you want the index to be +--genomeFastaFiles "$FASTA" \ # the input genome FASTA file +--genomeSAindexNbases 13 # 12 or 13 will make a little difference, we can use either +``` + +Then align your reads: + +```bash +STAR --runThreadN 30 \ # choose an appropriate number of threads +--outSAMstrandField intronMotif \ # very important for BRAKER +--genomeDir "$ASSEMBLY" \ # the path to the directory where the index is +--readFilesIn "$R1" "$R2" \ # Read 1 and Read 2 of your paired-end RNA-seq data +--readFilesCommand gunzip -c # if your reads are gzipped +``` + +This will generate a file called `Aligned.out.sam` in the current directory. We need to convert that +to BAM format and sort it: + +```bash +samtools view -bSh Aligned.out.sam > Aligned.out.bam +``` + +#### Step 2: protein alignment + +For that we need a protein database. I am using the [OrthoDB +database](https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/); in my case, it's Arthropods, +but use whatever fits. + +## Running BRAKER3 on LISC + +I used the following script, with a genome of ~530Mb and 8 RNA-seq datasets. It took about 20h to +run. RAM should not be the bottleneck for BRAKER3, as most of the steps are not so intensive (I also +am using pre-mapped RNA-seq data, so that alleviates the situation somewhat). There are some +guesstimates of performance on the AUGUSTUS repository that could help you decide how much +RAM/runtime to ask for. I was laughably wrong in my estimates. + +```bash +#!/usr/bin/env bash +# +#SBATCH --job-name=pycno_braker +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=32 # 60% CPU efficiency for me, no idea about peak consumption +#SBATCH --mem=100G # ended up needing 30G +#SBATCH --time=10-0 # ended up needing 20h +#SBATCH --mail-type=ALL +#SBATCH --mail-user=your.email@univie.ac.at +#SBATCH --output=/path/to/logfile-%j.out +#SBATCH --error=/path/to/logfile-%j.err + +# +module load conda +conda activate singularity + +# container location +BRAKER=~/containers/braker3.sif # or wherever you put it; I use absolute paths to be safe +# path to the output directory, where all the magic should happen +OUTPUT=/path/to/output/directory +mkdir "$OUTPUT" -p || exit +cd "$OUTPUT" || exit + +# copy AUGUSTUS config files for BRAKER +cp -r ~/repos/Augustus/config . # again, amend directories as needed + +# define input files +# the genome assembly +GENOME=/path/to/some/assembly.fasta # mine was about 530Mbase (and a file size of 514Mbyte) +# the BAMs of the RNA-seq datasets to use as references for AUGUSTUS +BASE=/some/base/directory/that/contains/all/of/them/ +DEV_TIMEPOINT1="$BASE/DEV_TIMEPOINT1/Aligned.out.bam" +DEV_TIMEPOINT2="$BASE/DEV_TIMEPOINT2/Aligned.out.bam" +DEV_TIMEPOINT3="$BASE/DEV_TIMEPOINT3/Aligned.out.bam" +# the corresponding OrthoDB protein sequences +ORTHODB=/path/to/fasta/file.fa + +# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR, +# in case it is needed. 
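+# the -B argument takes a comma-separated list of bind paths; the scratch path below is only a
+# placeholder, so list whichever filesystems actually hold your genome, BAM files, and protein FASTA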
+singularity exec -B "$PWD,/lisc/scratch/partition/of/your/institute/where/all/your/files/live/,$TMPDIR" "$BRAKER" braker.pl \ +--AUGUSTUS_CONFIG_PATH="${PWD}"/config \ +--genome=$GENOME \ +--prot_seq=$ORTHODB \ +--bam="$DEV_TIMEPOINT1","$DEV_TIMEPOINT2","$DEV_TIMEPOINT3" \ +--softmasking \ +--threads 32 +``` \ No newline at end of file diff --git a/06-annotation/prep-RNA-bam.sh b/06-annotation/prep-RNA-bam.sh new file mode 100644 index 0000000000000000000000000000000000000000..6e65b2e89b665a45132b9276cd316b37a383464f --- /dev/null +++ b/06-annotation/prep-RNA-bam.sh @@ -0,0 +1,35 @@ +#!/usr/bin/env bash + +CONVERTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-sam2bam.sh + +BASE="/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome" + +ZYGOTE="$BASE/ZYGOTE/Aligned.out.sam" +EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.sam" +EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.sam" +EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.sam" +EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.sam" +EMBRYO3="$BASE/EMBRYO3/Aligned.out.sam" +INSTAR1="$BASE/INSTAR1/Aligned.out.sam" +INSTAR2="$BASE/INSTAR2/Aligned.out.sam" +INSTAR3="$BASE/INSTAR3/Aligned.out.sam" +INSTAR4="$BASE/INSTAR4/Aligned.out.sam" +INSTAR5="$BASE/INSTAR5/Aligned.out.sam" +INSTAR6="$BASE/INSTAR6/Aligned.out.sam" +JUV1="$BASE/JUV1/Aligned.out.sam" +SUBADULT="$BASE/SUBADULT/Aligned.out.sam" + +sbatch $CONVERTER $ZYGOTE +sbatch $CONVERTER $EARLY_CLEAVAGE +sbatch $CONVERTER $EMBRYO0_1 +sbatch $CONVERTER $EMBRYO3_5 +sbatch $CONVERTER $EMBRYO9_10 +sbatch $CONVERTER $EMBRYO3 +sbatch $CONVERTER $INSTAR1 +sbatch $CONVERTER $INSTAR2 +sbatch $CONVERTER $INSTAR3 +sbatch $CONVERTER $INSTAR4 +sbatch $CONVERTER $INSTAR5 +sbatch $CONVERTER $INSTAR6 +sbatch $CONVERTER $JUV1 +sbatch $CONVERTER $SUBADULT \ No newline at end of file diff --git a/06-annotation/prep-RNA-index_bam.sh b/06-annotation/prep-RNA-index_bam.sh new file mode 100644 index 0000000000000000000000000000000000000000..2f770f30b4e6c692df0b1ac66cb59011fd370807 --- /dev/null +++ b/06-annotation/prep-RNA-index_bam.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=converter +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=6 +#SBATCH --mem=16G +#SBATCH --time=15:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-indexbam-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-indexbam-%j.err + +module load samtools + +INPUT=$1 +BASENAME=$(dirname "$INPUT") +OUTPUT=$BASENAME/Aligned.out.sorted.bam + +cd "$TMPDIR" || exit +samtools sort -@ 8 "$INPUT" > "$OUTPUT" +samtools index -@ 8 -b "$OUTPUT" \ No newline at end of file diff --git a/06-annotation/prep-RNA-map-all.sh b/06-annotation/prep-RNA-map-all.sh new file mode 100644 index 0000000000000000000000000000000000000000..85389b5b4aa3a92654631ce91ed2fabe78793014 --- /dev/null +++ b/06-annotation/prep-RNA-map-all.sh @@ -0,0 +1,42 @@ +#!/usr/bin/env bash + +MAPPER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-map-single.sh +ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/ + +# early development +BASE="/lisc/scratch/zoology/pycnogonum/raw/early_RNA" +ZYGOTE="$BASE/Plit35_S8" +EARLY_CLEAVAGE="$BASE/Plit31_S6" +EMBRYO0_1="$BASE/Plit40_S11" +EMBRYO3_5="$BASE/Plit32_S7" +EMBRYO9_10="$BASE/Plit36_S9" +# INSTAR2_5="$BASE/Plit39_S10" + +sbatch $MAPPER $ASSEMBLY ZYGOTE "$ZYGOTE"_L001 +sbatch $MAPPER $ASSEMBLY EARLY_CLEAVAGE "$EARLY_CLEAVAGE"_L001 +sbatch $MAPPER $ASSEMBLY EMBRYO0_1 "$EMBRYO0_1"_L001 +sbatch $MAPPER $ASSEMBLY EMBRYO3_5 
"$EMBRYO3_5"_L001 +sbatch $MAPPER $ASSEMBLY EMBRYO9_10 "$EMBRYO9_10"_L001 + +# instars +BASE="/lisc/scratch/zoology/pycnogonum/raw/HTT33DSX5_4_R15615_20230713/demultiplexed" + +EMBRYO3="$BASE/235255/235255_S49" +INSTAR1="$BASE/235253/235253_S47" +INSTAR2="$BASE/235256/235256_S50" +INSTAR3="$BASE/235257/235257_S51" +INSTAR4="$BASE/235258/235258_S52" +INSTAR5="$BASE/235259/235259_S53" +INSTAR6="$BASE/235254/235254_S48" +JUV1="$BASE/235260/235260_S54" +SUBADULT="$BASE/235261/235261_S55" + +sbatch $MAPPER $ASSEMBLY EMBRYO3 "$EMBRYO3"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR1 "$INSTAR1"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR2 "$INSTAR2"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR3 "$INSTAR3"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR4 "$INSTAR4"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR5 "$INSTAR5"_L004 +sbatch $MAPPER $ASSEMBLY INSTAR6 "$INSTAR6"_L004 +sbatch $MAPPER $ASSEMBLY JUV1 "$JUV1"_L004 +sbatch $MAPPER $ASSEMBLY SUBADULT "$SUBADULT"_L004 \ No newline at end of file diff --git a/06-annotation/prep-RNA-map-single.sh b/06-annotation/prep-RNA-map-single.sh new file mode 100644 index 0000000000000000000000000000000000000000..ead0fda848349f5be4928230ed97890324c21b4e --- /dev/null +++ b/06-annotation/prep-RNA-map-single.sh @@ -0,0 +1,73 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=star_pycno +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=12 +#SBATCH --mem=32G +#SBATCH --time=2:00:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-star-worker-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-star-worker-%j.err + +# GeneMark-ETP utilizes Stringtie2 to assemble RNA-Seq data, which requires that the aligned reads +# (BAM files) contain the XS (strand) tag for spliced reads. Therefore, when aligning with STAR we +# must use the --outSAMstrandField intronMotif option. 
+ +# rule minimap2_txome_to_genome: +# input: +# assem = config["REF"], +# txome = config["TXOME"] +# output: +# bam = config["tool"] + "/output/bams/txome_to_ref.filtered.sorted.bam" +# threads: workflow.cores - 1 +# params: +# sort_threads = int(workflow.cores/5) +# shell: +# """ +# minimap2 -t {threads} -ax splice {input.assem} {input.txome} | \ +# samtools view -F 4 -q 10 -hb -@ {params.sort_threads} - | \ +# samtools sort -@ {params.sort_threads} - > {output.bam} +# """ + +# rule minimap2_long_to_genome: +# input: +# assem = config["REF"], +# txome = config["LONGREADS"] +# output: +# bam = config["tool"] + "/output/bams/long_to_ref.filtered.sorted.bam" +# threads: workflow.cores - 1 +# params: +# sort_threads = int(workflow.cores/5) +# shell: +# """ +# minimap2 -t {threads} -ax splice:hq {input.assem} {input.txome} | \ +# samtools view -F 4 -q 10 -hb -@ {params.sort_threads} - | \ +# samtools sort -@ {params.sort_threads} - > {output.bam} +# """ + +module load ngstools/star/2.7.11b + +ASSEMBLY=$1 +STAGENAME=$2 +STAGE=$3 +OUTPUT=$ASSEMBLY/transcriptome/$STAGENAME + +mkdir "$OUTPUT" -p || exit + +R1="$STAGE"_R1_001.fastq.gz +R2="$STAGE"_R2_001.fastq.gz + +cd "$TMPDIR" || exit + +STAR --runThreadN 30 \ +--outSAMstrandField intronMotif \ +--genomeDir "$ASSEMBLY" \ +--readFilesIn "$R1" "$R2" \ +--readFilesCommand gunzip -c + +cp -r "$TMPDIR"/Log.* "$OUTPUT"/ +cp -r "$TMPDIR"/Aligned.* "$OUTPUT"/ + +# remove $TMPDIR +rm -rf "$TMPDIR" \ No newline at end of file diff --git a/06-annotation/prep-RNA-mkindex.sh b/06-annotation/prep-RNA-mkindex.sh new file mode 100644 index 0000000000000000000000000000000000000000..45cf65d4687a85d32c058bfab2eed9ed6a67d46c --- /dev/null +++ b/06-annotation/prep-RNA-mkindex.sh @@ -0,0 +1,33 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=star_pycno +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=30 +#SBATCH --mem=20G +#SBATCH --time=15:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-star-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-star-%j.err + +# GeneMark-ETP utilizes Stringtie2 to assemble RNA-Seq data, which requires that the aligned reads +# (BAM files) contain the XS (strand) tag for spliced reads. Therefore, when aligning with STAR we +# must use the --outSAMstrandField intronMotif option. 
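+
+# usage (illustrative path): sbatch prep-RNA-mkindex.sh /path/to/draft/assembly.fasta
+# the STAR index files are written into the assembly's directory, and the transcriptome/
+# output directory used by the mapping jobs is created next to them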
+ +module load ngstools/star/2.7.11b + +FASTA=$1 +ASSEMBLY=$(dirname "$FASTA") # path to the genome directory where the STAR index will be generated +OUTPUT=$ASSEMBLY/transcriptome + +mkdir "$OUTPUT" -p || exit + +# first go into the assembly directory and create the STAR index +cd "$ASSEMBLY" || exit + +# create the STAR index +STAR --runThreadN 30 \ +--runMode genomeGenerate \ +--genomeDir "$ASSEMBLY" \ +--genomeFastaFiles "$FASTA" \ +--genomeSAindexNbases 13 \ No newline at end of file diff --git a/06-annotation/prep-RNA-sam2bam.sh b/06-annotation/prep-RNA-sam2bam.sh new file mode 100644 index 0000000000000000000000000000000000000000..17f6d97658b634517f9d57d8253e30ee16525991 --- /dev/null +++ b/06-annotation/prep-RNA-sam2bam.sh @@ -0,0 +1,19 @@ +#!/usr/bin/env bash +# +#SBATCH --job-name=converter +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=4 +#SBATCH --mem=100M +#SBATCH --time=90:00 +#SBATCH --mail-type=ALL +#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at +#SBATCH --output=/lisc/user/papadopoulos/log/pycno-sam2bam-%j.out +#SBATCH --error=/lisc/user/papadopoulos/log/pycno-sam2bam-%j.err + +module load samtools + +INPUT=$1 +BASENAME=$(dirname "$INPUT") +OUTPUT=$BASENAME/Aligned.out.bam + +samtools view -bSh --threads 4 "$INPUT" > "$OUTPUT" \ No newline at end of file diff --git a/06-annotation/tRNAscan_to_gff3.ipynb b/06-annotation/tRNAscan_to_gff3.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..ee76b0889c9f00c5c04eca53cc93aeedeb45e6a4 --- /dev/null +++ b/06-annotation/tRNAscan_to_gff3.ipynb @@ -0,0 +1,146 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Converting tRNAscan-SE output to GFF3 format\n", + "\n", + "tRNAscan-SE produces multiple output files:\n", + "\n", + "- trnascan.bed: a BED file with the coordinates of the tRNAs\n", + "- trnascan.fasta: a FASTA file with the sequences of the tRNAs (and their coordinates in the headers)\n", + "- trnascan.stats: a human-readable file with summary statistics from the program output\n", + "- trnascan.out: a human-readable table with lots of metadata for each tRNA\n", + "\n", + "For our resulting GFF3 file we would like to keep all of the predicted tRNAs that \n", + "\n", + "1. are not pseudogenes\n", + "2. don't have introns\n", + "3. don't overlap with any protein-coding genes\n", + "\n", + "Info for points 1 and 2 can be found in the `trnascan.out` file, while point 3 requires a separate\n", + "GFF3 file with the protein-coding genes." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "scan_loc = \"/Volumes/scratch/pycnogonum/genome/draft/trnascan/\"\n", + "out = scan_loc + 'trnascan.out'\n", + "\n", + "trna = pd.read_csv(out, sep='\\t', header=None, skiprows=3)\n", + "trna.columns = ['seqname', 'tRNA', 'start', 'end', 'type', 'anti', 'intron_start', 'intron_end', 'inf score', 'origin', 'note']\n", + "\n", + "trna['id'] = trna['seqname'].str.strip() + '.tRNA' + trna['tRNA'].astype(str) + '-' + trna['type'] + trna['anti']\n", + "trna.set_index('id', inplace=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For point 3 we need to check the overlap of the tRNAs with the protein-coding genes. 
We can use\n", + "`bedtools intersect` and the draft gff file:\n", + "\n", + "```bash\n", + "$ bedtools intersect -wao -v -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 > no_overlap.bed\n", + "$ bedtools intersect -wao -S -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 | grep -e \"-1\" -v > overlap_otherstrand.bed\n", + "$ cat overlap_otherstrand.bed > keep.bed\n", + "$ cat no_overlap.bed >> keep.bed\n", + "$ cut -d\" \" -f4,6 keep.bed | sort -u > keep_unique.txt # the delimiter should be a tab character; try ctrl+v tab\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "trna_ids = pd.read_table(scan_loc + \"keep_unique.txt\", header=None, index_col=0)\n", + "trna_ids.columns = ['strand']" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "keep = trna.join(trna_ids, how='inner')\n", + "not_pseudo = keep['note'].fillna(\"\") == \"\"\n", + "not_intron = (keep['intron_start'] == 0) & (keep['intron_end'] == 0)\n", + "\n", + "gff = keep[not_pseudo & not_intron]" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "def gff_string(index, name, row):\n", + " seqid = row['seqname'].strip()\n", + " source = \"tRNAscan-SE\"\n", + " seq_type = \"tRNA\"\n", + " start = min(row['start'], row['end'])\n", + " stop = max(row['start'], row['end'])\n", + " score = \".\"\n", + " strand = row['strand']\n", + " phase = \".\"\n", + " # :Gly-GCC-1-1\n", + " attributes = f\"ID=tRNA{index+1};gene_id={name};name=tRNA:{row['type']}-{row['anti']};score={row['inf score']}\"\n", + " gene = f\"{seqid}\\t{source}\\tgene\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " attributes = f\"ID=tRNAscan{index+1};name=tRNA:{row['type']}-{row['anti']}\"\n", + " trna = f\"{seqid}\\t{source}\\ttRNA\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n", + " return gene + trna" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "with open(scan_loc + 'trnascan.gff3', 'w') as f:\n", + " for i, (name, row) in enumerate(gff.iterrows()):\n", + " f.write(gff_string(i, name, row))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "ascc24", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.19" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}