genome annotation

2e4251db · Niko (Nikolaos) Papadopoulos · 119cf2be · 2e4251db · 2e4251db · 2e4251db
Commit 2e4251db authored 9 months ago by Niko (Nikolaos) Papadopoulos
--- a/06-annotation/.DS_Store
+++ b/06-annotation/.DS_Store
--- a/06-annotation/README.md
+++ b/06-annotation/README.md
+# Annotation
+
+Code in this folder covers the annotation of the genome with gene models, mostly for protein-coding
+genes. The starting point is the scaffolded, decontaminated draft assembly.
+
+## Annotating protein-coding genes on the _Pycnogonum_ genome
+
+### BRAKER3 with developmental RNA-seq data
+
+BRAKER3 is one of the most widely used tools for eukaryotic gene prediction, and one that
+consistently performs well in benchmarks. However, installing the software is a non-trivial
+endeavour in a HPC system without admin privileges. We provide here a summary of our experience in
+the hopes that it is useful to other hopeful users.
+
+- [Installing BRAKER3 in a shared system without admin access](install_braker.md)
+
+We generated RNA-seq data for a time series spanning development from late embryogenesis until the
+subadult stage (embryo 3-4, instars 1 to 6, juvenile 1, subadult), and our collaborators provided
+data for even earlier timepoints (zygote, early cleavage, embryo 3-5). This data was mapped to the
+draft genome (average: 72.8% uniquely mapping, 15% multimapping, so pretty good completeness) and
+the resulting BAM files used as input to BRAKER3.
+
+- [`prep-RNA-mkindex.sh`](prep-RNA-mkindex.sh) - before mapping the RNA we need to index the draft
+  genome so that STAR can actually map to it.
+- [`prep-RNA-map-all.sh`](prep-RNA-map-all.sh) - main script that will submit all mapping jobs.
+- [`prep-RNA-map-single.sh`](prep-RNA-map-single.sh) - worker script that will map the RNA reads
+  from a developmental timepoint to the draft genome.
+- [`prep-RNA-bam.sh`](prep-RNA-bam.sh) - main script that will submit the SAM-to-BAM conversion jobs
+  to the cluster.
+- [`prep-RNA-sam2bam.sh`](prep-RNA-sam2bam.sh) - worker script that will convert a SAM file to a BAM
+  file in the same location.
+- [`prep-RNA-index_bam.sh`](prep-RNA-index_bam.sh) - index the resulting BAM files
+
+BRAKER3 also expects a FASTA file with protein sequences expected to be found in the genome. For
+this, we used the Arthropoda partition from [orthoDB](https://www.orthodb.org).
+
+BRAKER3 was run from a container on the Vienna LiSC (Life Science Cluster; LiSC for short).
+
+- [annotate/annot-braker.sh](./annot1-braker.sh)
+
+This predicted 11,451 gene models. However, manual inspection of the RNA-seq mapping on the draft
+genome revealed that BRAKER3 had been rather conservative in its predictions, and that many loci
+that looked like they contained exon structures did not receive any annotation.
+
+### using Iso-seq collapsed transcripts as gene models
+
+We generated a long-read transcriptome using PacBio Iso-seq technology. After collapsing isoforms by
+mapping onto the draft genome, we were left with 8,904 transcript families, which we interpreted as
+genes. By comparing the Iso-seq "genes" with the BRAKER3 models we were able to identify multiple
+cases where BRAKER3 occasionally merged gene models if they were too close to each other.
+
+We decided to combine the Iso-seq and BRAKER3 annotation by giving precedence to Iso-seq. We
+filtered the BRAKER3 gene models to remove any that overlapped (same strand, any overlap) with the
+Iso-seq genes, and were left with 3,596 gene models, raising our total to 3,596 + 8,904 = 12,500.
+
+- [Map](annot1-isoseq-map.sh) the Iso-seq collapsed transcripts to the draft genome
+- [isoseq_confirm.ipynb](./annot1-isoseq_confirm.ipynb) - cross-reference the Iso-seq mapping
+  locations with BRAKER predictions and produce a new GFF file that contains their union.
+
+Unfortunately, the Iso-seq transcriptome we generated was part of a multiplexed run. This meant that
+we "lost" almost 60% of our effective sequence depth on other organisms, making the _Pycnogonum_
+Iso-seq less effective for gene annotation. Manual inspection confirmed what the numbers suggested:
+there were still plenty of loci with RNA-seq peaks that looked like _bona fide_ genes but received
+no annotation from either PacBio or BRAKER3.
+
+We consider Iso-seq+BRAKER3 "round 1" of the genome annotation. In the final GFF file, gene models
+that are derived from Iso-seq are have "PacBio" in the `source` column, and their IDs are in the
+form `PB.XXXX`, where `X` are digits. BRAKER3 gene models have "AUGUSTUS" listed as their `source`
+and their IDs have the form `gXXXXX`, with `X` again being digits.
+
+### Re-using unannotated loci for a second round of annotation
+
+There were a couple of indications that the genome annotation was not yet complete. First, BUSCO
+completeness of the gene models was lower than the entire genome; second, the mapping rates of the
+RNA-seq data was much lower for gene models (avg. uniquely mapping: 48.69%) compared to the entire
+genome (avg. uniquely mapping: 72.8%); finally, manual inspection easily identified loci that looked
+like genes but had no annotation.
+
+We proceeded to a second round of annotation. We filtered the RNA-seq data to only keep the mapped
+reads that did _not_ overlap with any of the gene models of the first BRAKER/Iso-seq round:
+
+- [annot2-RNA-extracts.sh](./annot2-RNA-extracts.sh) - main script that will submit worker scripts
+  for each transcriptome.
+- [text](annot2-RNA-extract_uncovered.sh) - worker script that, for a bulk transcriptome will
+  extract the reads that map on un-modelled regions.
+
+...and supplied the same orthoDB arthropod partition as protein hints:
+
+- [annot2-braker.sh](./annot2-braker.sh)
+
+As the protein hints overlapped with round 1, it was to be expected that BRAKER3 would also produce
+gene models that would overlap with round 1. To safely exclude these cases we used _exons_ to test
+for gene overlap. Using entire gene models would preclude genes that overlap with other genes (but
+on the other strand), or genes that lie in another gene's intron (such as the CHAT/VCHAT genes,
+conserved in _Platynereis_).
+
+This task was mostly achieved on the command line, using a combination of `bedtools intersect` to
+identify overlaps between GFF files, `AGAT` to convert from GTF to GFF, and bash utility tools
+`grep`, `cut`, `sort`, and `sed` to extract and lightly manipulate lines from the resulting files.
+
+We consider this "round 2" of the genome annotation. In the final GFF file, gene models from round 2
+have "AGAT" as `source` for gene/mRNA entries or "AUGUSTUS" for exons/CDS. Gene IDs are in the form
+`r2_gXXXX`, with `r2` standing for "round 2" and the rest being the usual BRAKER3 gene ID. Round 2
+produced an additional 2,223 gene models, bringing our total to 14,723.
+
+<details>
+<summary>Click here for details</summary>
+
+First extract the round 2 exons that don't overlap:
+
+```bash
+$ grep exon braker/braker.gtf > braker_exons.gtf
+$ grep exon ../annot_merge/merged.gff3 > draft_exons.gtf
+$ bedtools intersect -v -b draft_exons.gtf -a braker_exons.gtf > unique_exons.gtf
+```
+
+Keep track of which genes these come from:
+
+```bash
+$ cat unique_exons.gtf | cut -f9 | cut -d" " -f 4 | sort -u > keep_candidates.txt
+```
+
+At the same time, there might be round 2 genes that are a bit bigger than round 1 genes, and thus
+mostly overlap but have some unique exons. We'd like to avoid those:
+
+```bash
+$ bedtools intersect -wa -b draft_exons.gtf -a braker_exons.gtf > shared_exons.gtf
+$ cat shared_exons.gtf | cut -f9 | cut -f4 -d" " | sort -u > overlap.txt
+```
+
+We can now filter the candidate genes by those that have some overlap:
+
+```bash
+$ grep -v -f overlap.txt keep_candidates.txt > keep_genes.txt
+```
+
+And finally, we can filter the GTF file:
+    
+```bash
+$ grep -f keep_genes.txt braker/braker.gtf > braker2_unique.gtf
+```
+
+Now, before we can use this GTF file we need to make the gene names unique so that they don't
+overlap with the round 1 gene names. We can do this by adding a suffix to the gene names:
+
+```bash
+$ sed -r 's/(g[[:digit:]])/r2_\1/g' braker2_unique.gtf > braker2_unique_renamed.gtf
+```
+
+Finally, we can convert to a GFF3 file and append to our existing annotation:
+
+```bash
+$ module load conda
+$ conda activate agat-1.4.0
+$ agat_convert_sp_gxf2gxf.pl -g braker2_unique_renamed.gtf -o ./braker2_unique_renamed.gff3
+```
+
+Let's copy that over to the annotation merge site:
+
+```bash 
+$ cp braker2_unique_renamed.gff3 ../annot_merge/braker2_unique.gff3
+```
+
+Now we can remove the lines that contain introns or start/stop codons:
+
+```bash
+$ cd ../annot_merge
+$ grep -v -E -i "(intron)|(codon)" braker2_unique_renamed.gff3 > braker2_unique_renamed_nocodon_intron.gff3
+```
+
+</details>
+
+### Using _de novo_ transcriptomes as additional evidence
+
+Despite the improvements, manual inspection continued to suggest that there were loci with real
+RNA-seq peaks and exon-like structures that did not receive gene models. We reasoned that loci that
+represented real genes would be more likely to be present in multiple time points. Since BRAKER3 had
+already seen these loci and didn't identify them as genes, we hypothesized that the signal-to-noise
+ratio might have played a role. Instead of using raw reads we therefore turned to _de novo_
+assembled transcripts, and tried to find loci where transcripts from multiple timepoints mapped onto
+the draft genome.
+
+We used [Trinity](https://github.com/trinityrnaseq/trinityrnaseq) to perform _de novo_ assemblies
+for the in-house RNA-seq datasets, as they had been sequenced at a much deeper level than the
+transcriptomes provided by our collaborators, giving the tool better chances to assemble real
+sequences (also refer to the [corresponding section](../05-transcriptomes/README.md)) We kept
+complete ORFs, as annotated by [TransDecoder](https://github.com/TransDecoder/TransDecoder), and
+mapped those against the draft genome.
+
+- [`filter_assemblies.sh`](./annot3-RNA-01-filter_assemblies.sh) - main script that submits
+  filtering jobs to the cluster.
+- [`extract_complete_ORFs.sh`](annot3-RNA-01-extract_complete_ORFs.sh) - worker script that will
+  extract coding regions that represent complete ORFs
+- [`map_assemblies.sh`](./annot3-RNA-02-map_assemblies.sh) - main script that will submit
+  mapping jobs.
+- [map complete ORFs to genome](annot3-RNA-02-map_assembly_single.sh) - worker script
+- [`pack_sams.sh`](./annot3-RNA-03-pack_sams.sh) - main script that will submit formatting jobs to
+  the cluster.
+- [convert to GFF3](annot3-RNA-03-pack_sam.sh) - worker script that will extract reads that mapped
+  well and save them in GFF3 format.
+
+We chose to use the instar 3 transcriptome, which had 99% BUSCO completeness (metazoa), as a
+reference. We used `bedtools intersect` to find all draft genome loci where an instar 3 transcript
+mapped and overlapped (at 90% of its length) with a mapped transcript from another time point (also
+at 90% of its length), and excluded all loci that overlapped with gene models from rounds 1 and 2.
+We then translated the cDNA match parts of each locus to exons. Whenever there was disagreement
+between the different _de novo_ transcriptomes about exon boundaries, we picked the most common exon
+boundary.
+
+- [mining de novo assembled transcriptomes](annot3-RNA-04-denovo_mining.ipynb)
+
+While this approach is less sensitive than BRAKER3 and is incapable of distinguishing between
+isoforms, it is fairly robust in finding gene and exon boundaries, at least so far as the coding
+regions are concerned.
+
+We consider this "round 3" of the genome annotation. In the final GFF file, gene models from round 3
+have "Trinity" as `source`. Gene IDs are in the form `at_DNXXXX`, with `at` standing for "assembled
+transcriptomes", `DN` being a tribute to Trinity transcript naming, and `X` being digits. Round 3
+produced an additional 774 gene models, bringing our total to 15,497.
+
+### Consolidating results
+
+Finally, we can form the GFF3 file and sort it:
+
+```bash
+$ cat isoseq.gff > merged.gff3
+$ cat braker.gff >> merged.gff3
+$ cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3
+$ cat denovo_txomes/overlap_translated.gff3 >> merged.gff3
+$ cat ../trnascan/trnascan.gff3 >> merged.gff3
+$ module load genometools/
+$ gt gff3 -tidy -retainids -o merged_sorted.gff3 -force merged.gff3
+```
+
+### Testing
+
+First, convert the GFF to GTF for convenience:
+
+```bash
+$ gt gff3_to_gtf -o merged_sorted.gtf merged_sorted.gff3
+```
+
+Now we can use [AGAT](https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html) to
+extract transcripts from the genome:
+
+```bash
+$ agat_sp_extract_sequences.pl -g merged_sorted.gff3 -f ../draft_softmasked.fasta -t exon --merge -o transcripts.fa
+```
+
+We can then use this file as input for TransDecoder...
+
+```bash
+$ TransDecoder.LongOrfs -t transcripts.fa
+$ TransDecoder.Predict -t transcripts.fa
+```
+
+...and finally look at BUSCO completeness:
+
+```bash
+$ conda deactivate
+$ conda activate busco-5.7.1
+$ busco -i transcripts.fa.transdecoder.pep -l metazoa -m protein -o metazoa -r -c 4 --offline --download_path ../../busco/busco_downloads/
+```
+
+Which returns:
+
+```
+---------------------------------------------------
+|Results from dataset metazoa_odb10                |
+---------------------------------------------------
+|C:96.5%[S:40.7%,D:55.8%],F:1.5%,M:2.0%,n:954      |
+|920    Complete BUSCOs (C)                        |
+|388    Complete and single-copy BUSCOs (S)        |
+|532    Complete and duplicated BUSCOs (D)         |
+|14    Fragmented BUSCOs (F)                       |
+|20    Missing BUSCOs (M)                          |
+|954    Total BUSCO groups searched                |
+---------------------------------------------------
+```
+
+This is very, very close to what we had for the entire genome (except for the high prevalence of
+duplicated BUSCOs, which can be explained by the fact that we included all isoforms). In particular,
+it is worth noting that, for the entire genome, we had 1.6% fragmented and 1.7% missing BUSCOs.
+Taken together, this suggests that we have a fairly complete set of gene models.
+
+We can do the same with the arthropod BUSCO set:
+
+```bash
+$ busco -i transcripts.fa.transdecoder.pep -l arthropoda -m protein -o arthropoda -r -c 4 --offline --download_path ../../busco/busco_downloads/
+```
+
+Which returns:
+
+```
+---------------------------------------------------
+|Results from dataset arthropoda_odb10             |
+---------------------------------------------------
+|C:95.8%[S:37.3%,D:58.5%],F:2.0%,M:2.2%,n:1013     |
+|971    Complete BUSCOs (C)                        |
+|378    Complete and single-copy BUSCOs (S)        |
+|593    Complete and duplicated BUSCOs (D)         |
+|20    Fragmented BUSCOs (F)                       |
+|22    Missing BUSCOs (M)                          |
+|1013    Total BUSCO groups searched               |
+---------------------------------------------------
+```
+
+Very similar results; again suggesting that we have a fairly complete set of gene models.
+
+## tRNAscan
+
+We also used tRNAscan to predict tRNAs in the draft genome. 
+
+- [run the tool](annot-trnascan.sh)
+- [export in GFF3 format](tRNAscan_to_gff3.ipynb)
\ No newline at end of file
--- a/06-annotation/annot-trnascan.sh
+++ b/06-annotation/annot-trnascan.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_trnascan
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --mem=500M
+#SBATCH --time=1:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-trnascan-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-trnascan-%j.err
+
+module load trnascan
+
+# Testing was performed with a FASTA file with 60 chars/line.
+
+# | CPUs | input size (lines) | real time | max memory | CPU util. |
+# |------|--------------------|-----------|------------|-----------|
+# | 1    | 78753              | 45s       | 57.5Mb     | 97%       | 
+# | 2    | 78753              | 31s       | 97Mb       | 142%      | 
+# | 4    | 78753              | 30s       | 16.8Mb     | 182%      | (testing)
+# | 1    | 787530             | 7:04      | 172Mb      | 98%       |
+# | 2    | 787530             | 4:58      | 225Mb      | 145%      |
+# | 4    | 787530             | 4:12      | 336Mb      | 188%      |
+# |------|--------------------|-----------|------------|-----------|
+# | 2    | 7875382            | 53:40     | 304Mb      | 60%       | (run)
+
+ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/trnascan/
+mkdir "$OUTPUT" -p || exit
+
+cd "$OUTPUT" || exit
+
+tRNAscan-SE \
+-o trnascan.out \
+-f trnascan.fasta \
+-b trnascan.bed \
+--stats trnascan.stats \
+--progress \
+--hitsrc \
+--thread 2 \
+"$ASSEMBLY"
\ No newline at end of file
--- a/06-annotation/annot1-braker.sh
+++ b/06-annotation/annot1-braker.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_braker
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=18
+#SBATCH --mem=40G
+#SBATCH --time=20:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
+
+module load conda
+conda activate singularity
+
+# container location
+BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
+# path to the output directory, where all the magic should happen
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker
+mkdir "$OUTPUT" -p || exit
+cd "$OUTPUT" || exit
+
+# copy AUGUSTUS config files for BRAKER
+cp -r ~/repos/Augustus/config .
+
+# define input files
+# the genome assembly
+GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
+# the RNA-seq dataset BAMs
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
+ZYGOTE="$BASE/ZYGOTE/Aligned.out.bam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.bam"
+EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.bam"
+EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.bam"
+EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.bam"
+EMBRYO3="$BASE/EMBRYO3/Aligned.out.bam"
+INSTAR1="$BASE/INSTAR1/Aligned.out.bam"
+INSTAR2="$BASE/INSTAR2/Aligned.out.bam"
+INSTAR3="$BASE/INSTAR3/Aligned.out.bam"
+INSTAR4="$BASE/INSTAR4/Aligned.out.bam"
+INSTAR5="$BASE/INSTAR5/Aligned.out.bam"
+INSTAR6="$BASE/INSTAR6/Aligned.out.bam"
+JUV1="$BASE/JUV1/Aligned.out.bam"
+SUBADULT="$BASE/SUBADULT/Aligned.out.bam"
+# the Arthropoda OrthoDB protein sequences
+ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
+
+# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
+# in case it is needed.
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
+--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+--genome=$GENOME \
+--prot_seq=$ARTHROPODA \
+--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
+--softmasking \
+--threads 18
\ No newline at end of file
--- a/06-annotation/annot1-isoseq-map.sh
+++ b/06-annotation/annot1-isoseq-map.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=mmap_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=15G
+#SBATCH --time=15:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-mmap-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-mmap-%j.err
+
+module load conda
+conda activate minimap2-2.28
+
+DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+ISOSEQ=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/isoseq/isoseq.fastq
+/usr/bin/time minimap2 -ax splice -t16 -uf -C5 $DRAFT $ISOSEQ | samtools view -bS -o isoseq.bam
\ No newline at end of file
--- a/06-annotation/annot1-isoseq_confirm.ipynb
+++ b/06-annotation/annot1-isoseq_confirm.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Confirming gene models with isoform-level RNA-seq data\n",
+    "\n",
+    "## Motivation\n",
+    "\n",
+    "Our gene prediction with BRAKER3 produced a low number of gene models (~11.5k). We suspect this\n",
+    "might be because GeneMark-ET/StringTie are too aggressive with exons/genes that are close to each\n",
+    "other, merging them and messing up the gene models. We will use the isoform-level RNA-seq data to\n",
+    "confirm some of the gene models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I extracted all transcript lines from the corresponding gff/gtf files and used bedtools to intersect\n",
+    "them:\n",
+    "\n",
+    "```bash\n",
+    "$ ISOSEQ=/lisc/scratch/zoology/pycnogonum/transcriptome/isoseq/isoseq_confirmed.gff\n",
+    "$ BRAKER=/lisc/scratch/zoology/pycnogonum/genome/draft/braker/braker/braker_proposed.gtf\n",
+    "$ bedtools intersect -wao -b $ISOSEQ -a $BRAKER > brak_iso.bed\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gtf = [\"seqname\", \"source\", \"feature\", \"start\", \"end\", \"score\", \"strand\", \"frame\", \"attributes\"]\n",
+    "\n",
+    "# iso2bra = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/iso_brak.bed\", sep=\"\\t\", header=None)\n",
+    "# iso2bra.columns = [\"iso_\" + x for x in gtf] + [\"bra_\" + x for x in gtf] + [\"overlap\"]\n",
+    "# iso2bra[\"iso_gene\"] = iso2bra[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')\n",
+    "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_attributes\"].str.split(\"\\.\").str[0]\n",
+    "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_gene\"].replace(\"\", np.nan)\n",
+    "\n",
+    "bra2iso = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/brak_iso.bed\", sep=\"\\t\", header=None)\n",
+    "bra2iso.columns = [\"bra_\" + x for x in gtf] + [\"iso_\" + x for x in gtf] + [\"overlap\"]\n",
+    "bra2iso[\"bra_gene\"] = bra2iso[\"bra_attributes\"].str.split(\".\").str[0]\n",
+    "bra2iso[\"iso_gene\"] = bra2iso[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now to find overlaps:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def contained(df, a_start=\"bra_start\", a_end=\"bra_end\", b_start=\"iso_start\", b_end=\"iso_end\"):\n",
+    "    \"\"\"Checks if sequence A is completely contained within sequence B or vice versa.\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    df : pd.DataFrame\n",
+    "        A dataframe of a bedtools intersect file.\n",
+    "    a_start : str, optional\n",
+    "        the start position of sequence A, by default \"bra_start\"\n",
+    "    a_end : str, optional\n",
+    "        the end position of sequence B, by default \"bra_end\"\n",
+    "    b_start : str, optional\n",
+    "        The start position of sequence B, by default \"iso_start\"\n",
+    "    b_end : str, optional\n",
+    "        the end position of sequence B, by default \"iso_end\"\n",
+    "\n",
+    "    Returns\n",
+    "    -------\n",
+    "    bool\n",
+    "        returns True if the sequence is contained, False otherwise\n",
+    "    \"\"\"\n",
+    "    if df[\"bra_strand\"] != df[\"iso_strand\"]:\n",
+    "        return False # make sure that we are comparing genes on the same strand\n",
+    "    if df[b_start] < 0 or df[b_end] < 0 or df[a_start] < 0 or df[a_end] < 0:\n",
+    "        return False\n",
+    "    # first check if the braker gene is completely contained within the isoseq one\n",
+    "    if df[a_start] >= df[b_start] and df[a_end] <= df[b_end]:\n",
+    "        return True\n",
+    "    # then check if the braker gene completely contains the isoseq one\n",
+    "    if df[a_start] <= df[b_start] and df[a_end] >= df[b_end]:\n",
+    "        return True\n",
+    "    return False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genes = bra2iso.groupby(\"bra_gene\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def calc_overlap(df):\n",
+    "    bra_len = df.iloc[0][\"bra_end\"] - df.iloc[0][\"bra_start\"]\n",
+    "    # print(bra_len)\n",
+    "    seq = np.zeros(bra_len, dtype=bool)\n",
+    "    for i, transcript in df.iterrows():\n",
+    "        start = transcript[\"iso_start\"] - transcript[\"bra_start\"]\n",
+    "        start = np.max((0, start))\n",
+    "        end = transcript[\"iso_end\"] - transcript[\"bra_start\"]\n",
+    "        # print(end, len(seq))\n",
+    "        end = np.min([bra_len, end])\n",
+    "        seq[start:end] = True\n",
+    "    return np.sum(seq) / bra_len"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 11451/11451 [00:03<00:00, 2913.12it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "braker_unique = []\n",
+    "significant = []\n",
+    "unknown = []\n",
+    "truncate = []\n",
+    "\n",
+    "\n",
+    "for gene in tqdm(genes.groups):\n",
+    "    try:\n",
+    "        transcripts = genes.get_group(gene)\n",
+    "    except KeyError:\n",
+    "        unknown.append(gene)\n",
+    "        continue\n",
+    "    # first see how many candidate ISO-seq genes this BRAKER gene overlaps with\n",
+    "    # if there is zero, then it means no possible corresponding ISO-seq gene was found\n",
+    "    no_iso = transcripts.groupby(\"iso_gene\").size().shape[0] # how many ISO-seq transcripts match this BRAKER gene?\n",
+    "    if no_iso == 0: \n",
+    "        braker_unique.append(gene)\n",
+    "        continue\n",
+    "    \n",
+    "    # if there is one it means we have an 1-to-1 correspondence\n",
+    "    if no_iso == 1:\n",
+    "        # try to verify that most BRAKER isoforms are contained within the ISO-seq isoforms\n",
+    "        # (or vice versa)\n",
+    "        no_contained = transcripts.apply(contained, axis=1).sum()\n",
+    "        if no_contained / transcripts.shape[0] >= 0.4:\n",
+    "            significant.append(gene)\n",
+    "            continue\n",
+    "\n",
+    "        # test if the ISO-seq isoforms cumulatively cover at least 50% of the BRAKER gene\n",
+    "        total_overlap = transcripts.groupby(\"bra_attributes\").apply(calc_overlap, include_groups=False)\n",
+    "        if np.mean(total_overlap) >= 0.5:\n",
+    "            significant.append(gene)\n",
+    "            continue\n",
+    "\n",
+    "    # if we have arrived here, all else failed and this gene needs to go for manual curation\n",
+    "    unknown.append(gene)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(7591, 3596, 264, np.int64(11451))"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(significant), len(braker_unique), len(unknown), np.sum([len(significant), len(braker_unique), len(unknown)])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point the strategy should be to copy the ISO-seq and \"braker_unique\" gene models from their\n",
+    "respective GTF files to a new, merged one. This should already take us from the 11,451 BRAKER gene\n",
+    "models to 8,904 + 3,442 = 12,346 gene models. We can then screen the unknown modules further."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "isoseq_loc = \"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\"\n",
+    "braker_loc = \"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gtf\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def gene_string(gene_id, start, stop, last_line):\n",
+    "    chunks = last_line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    seq_type = \"gene\"\n",
+    "    score = \".\"\n",
+    "    strand = chunks[6]\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={gene_id};gene_id={gene_id};\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def transcript_string(gene_id, transcript_id, line):\n",
+    "    chunks = line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    seq_type = \"mRNA\"\n",
+    "    start = chunks[3]\n",
+    "    stop = chunks[4]\n",
+    "    score = chunks[5]\n",
+    "    strand = chunks[6]\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id};Parent={gene_id}\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def exon_string(transcript_id, line, exon_counter):\n",
+    "    chunks = line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    start = chunks[3]\n",
+    "    stop = chunks[4]\n",
+    "    score = chunks[5]\n",
+    "    strand = chunks[6]\n",
+    "\n",
+    "    seq_type = \"exon\"\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}\"\n",
+    "    exon_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    seq_type = \"CDS\"\n",
+    "    phase=\"0\"\n",
+    "    attributes = f\"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}\"\n",
+    "    CDS_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    return exon_string + CDS_string"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def write_gene(current_gene_id, gene_lines, edit, start, end):\n",
+    "    chunks = gene_lines[0].split(\"\\t\")\n",
+    "    # write the gene line first\n",
+    "    gene = gene_string(current_gene_id, start, end, gene_lines[0])\n",
+    "    edit.write(gene)\n",
+    "    # now write each of the transcripts\n",
+    "    exon_counter = 1\n",
+    "    for l in gene_lines:\n",
+    "        chunks = l.split(\"\\t\")\n",
+    "        if chunks[2] == \"transcript\":\n",
+    "            transcript_id = chunks[8].split(\";\")[1].split(\" \")[2][1:-1]\n",
+    "            transcript = transcript_string(current_gene_id, transcript_id, l)\n",
+    "            edit.write(transcript)\n",
+    "        else:\n",
+    "            exon = exon_string(transcript_id, l, exon_counter)\n",
+    "            edit.write(exon)\n",
+    "            exon_counter += 1\n",
+    "\n",
+    "    edit.write(\"###\\n\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_lines = []\n",
+    "current_gene_id = None\n",
+    "start = np.inf\n",
+    "end = 0\n",
+    "\n",
+    "with open(\"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\", \"r\") as isoseq:\n",
+    "    with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/isoseq.gff\", \"w\") as edit:\n",
+    "        for line in isoseq:\n",
+    "            if line.startswith(\"#\"):\n",
+    "                edit.write(line)\n",
+    "                continue\n",
+    "            chunks = line.split(\"\\t\")\n",
+    "            entry_type = chunks[2]\n",
+    "            # print(line)\n",
+    "            if entry_type == \"transcript\":\n",
+    "                gene_id = chunks[8].split(\";\")[0].split(\" \")[1][1:-1]\n",
+    "                if gene_id != current_gene_id:\n",
+    "                    if current_gene_id is None: # first gene\n",
+    "                        current_gene_id = gene_id\n",
+    "                        gene_lines.append(line)\n",
+    "                        continue\n",
+    "                    # new gene! Write it out:\n",
+    "                    write_gene(current_gene_id, gene_lines, edit, start, end)\n",
+    "                    # reset the gene\n",
+    "                    current_gene_id = gene_id\n",
+    "                    gene_lines = []\n",
+    "                    start = int(chunks[3])\n",
+    "                    end = int(chunks[4])\n",
+    "                else:\n",
+    "                    start = min(start, int(chunks[3]))\n",
+    "                    end = max(end, int(chunks[4]))\n",
+    "            gene_lines.append(line)\n",
+    "        # write last gene\n",
+    "        write_gene(current_gene_id, gene_lines, edit, start, end)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_lines = []\n",
+    "current_gene_id = None\n",
+    "\n",
+    "with open(\"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gff3\", \"r\") as braker:\n",
+    "    with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/braker.gff\", \"w\") as edit:\n",
+    "        for line in braker:\n",
+    "            chunks = line.split(\"\\t\")\n",
+    "            entry_type = chunks[2]\n",
+    "\n",
+    "            if entry_type in [\"start_codon\", \"stop_codon\", \"intron\"]:\n",
+    "                continue\n",
+    "\n",
+    "            if entry_type == \"gene\":\n",
+    "                current_gene_id = chunks[8].split(\";\")[0].split(\"=\")[1]\n",
+    "                line = line.rstrip() + \"gene_id=\" + current_gene_id + \";\\n\"\n",
+    "            \n",
+    "            if current_gene_id in braker_unique:\n",
+    "                edit.write(line)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def plot_overlap(df):\n",
+    "    # plot the overlapping regions as straight lines, color by transcript ID\n",
+    "    start = np.min((df[\"bra_start\"].min(), df[\"iso_start\"].min()))\n",
+    "    end = np.max((df[\"bra_end\"].max(), df[\"iso_end\"].max()))\n",
+    "\n",
+    "    fig, ax = plt.subplots()\n",
+    "\n",
+    "    # for i, transcript in df.iterrows():\n",
+    "    #     ax.hlines(i*2, transcript[\"iso_start\"], transcript[\"iso_end\"], label=transcript[\"iso_gene\"])\n",
+    "    #     ax.hlines(i*2+1, transcript[\"bra_start\"], transcript[\"bra_end\"], label=transcript[\"bra_gene\"])\n",
+    "    ax.set_xlim(start-1000, end+1000)\n",
+    "    ax.legend()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# a function that plots all the unique transcripts in a dataframe, coloring them by gene ID\n",
+    "def plot_overlap(df):\n",
+    "    # first find out the minimum start and maximum end positions of all the real BRAKER genes\n",
+    "    # (i.e. not the -1 values)\n",
+    "    bra_start_min = np.min(df[\"bra_start\"][df[\"bra_start\"] > 0])\n",
+    "    bra_end_max = np.max(df[\"bra_end\"][df[\"bra_end\"] > 0])\n",
+    "    # now do the same for the iso-seq genes\n",
+    "    iso_start_min = np.min(df[\"iso_start\"][df[\"iso_start\"] > 0])\n",
+    "    iso_end_max = np.max(df[\"iso_end\"][df[\"iso_end\"] > 0])\n",
+    "    # the plot should start 1000 bp before the minimum start and end 1000 bp after the maximum end\n",
+    "    start = np.min((bra_start_min, iso_start_min)) - 1000\n",
+    "    end = np.max((bra_end_max, iso_end_max)) + 1000\n",
+    "\n",
+    "    fig, ax = plt.subplots()\n",
+    "\n",
+    "    # get a list of all the unique gene IDs\n",
+    "    gene_list = np.unique(df[\"iso_gene\"].tolist() + df[\"bra_gene\"].tolist())\n",
+    "    # generate a color map for the gene IDs - this should be always less than 10, so we can take\n",
+    "    # the Set1 colormap\n",
+    "    colors = plt.cm.Set1(np.linspace(0, 1, len(gene_list)))\n",
+    "    # create a dictionary that maps gene IDs to colors\n",
+    "    cmap = dict(zip(gene_list, colors))\n",
+    "\n",
+    "    # each transcript should be a straight line, colored by the gene ID. Let's do BRAKER first.\n",
+    "    braker = df.groupby(\"bra_attributes\").first()\n",
+    "    for i, row in braker.iterrows():\n",
+    "        ax.hlines(i, row[\"bra_start\"], row[\"bra_end\"], color=cmap[row[\"bra_gene\"]])\n",
+    "\n",
+    "    # now do the same for the iso-seq genes\n",
+    "    iso = df.groupby(\"iso_attributes\").first()\n",
+    "    for i, row in iso.iterrows():\n",
+    "        try:\n",
+    "            ax.hlines(i, row[\"iso_start\"], row[\"iso_end\"], color=cmap[row[\"iso_gene\"]])\n",
+    "        except KeyError:\n",
+    "            # there might be some empty IDs here if there is no overlap for this transcript\n",
+    "            # ax.hlines(i, start, end, linestyles=\"dashed\", color=\"black\")\n",
+    "            continue\n",
+    "    \n",
+    "    # set the limits of the plot\n",
+    "    ax.set_xlim(start, end)\n",
+    "    # keep track of what chromosome we're in in case we want to go to the genome browser\n",
+    "    ax.set_title(f\"{df['bra_seqname'].iloc[0]}\")\n",
+    "    # save the plot for manual inspection\n",
+    "    plt.savefig(f\"figs/{df['bra_gene'].iloc[0]}.png\")\n",
+    "    # close plot to save memory\n",
+    "    plt.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 264/264 [00:14<00:00, 18.59it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for gene in tqdm(unknown):\n",
+    "    keep = bra2iso[\"bra_gene\"] == gene\n",
+    "    plot_overlap(bra2iso[keep])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Manual inspection of the ISO-seq/BRAKER overlap for these 264 genes identifies another 49 that need\n",
+    "further examination. Those were aligned against uniref90 to check whether the BRAKER portions\n",
+    "correspond to real genes or are spurious."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ascc24",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Confirming gene models with isoform-level RNA-seq data
+
+## Motivation
+
+Our gene prediction with BRAKER3 produced a low number of gene models (~11.5k). We suspect this
+might be because GeneMark-ET/StringTie are too aggressive with exons/genes that are close to each
+other, merging them and messing up the gene models. We will use the isoform-level RNA-seq data to
+confirm some of the gene models.
+
+%% Cell type:code id: tags:
+
+``` python
+import numpy as np
+import pandas as pd
+
+from tqdm import tqdm
+```
+
+%% Cell type:markdown id: tags:
+
+I extracted all transcript lines from the corresponding gff/gtf files and used bedtools to intersect
+them:
+
+```bash
+$ ISOSEQ=/lisc/scratch/zoology/pycnogonum/transcriptome/isoseq/isoseq_confirmed.gff
+$ BRAKER=/lisc/scratch/zoology/pycnogonum/genome/draft/braker/braker/braker_proposed.gtf
+$ bedtools intersect -wao -b $ISOSEQ -a $BRAKER > brak_iso.bed
+```
+
+%% Cell type:code id: tags:
+
+``` python
+gtf = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes"]
+
+# iso2bra = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/iso_brak.bed", sep="\t", header=None)
+# iso2bra.columns = ["iso_" + x for x in gtf] + ["bra_" + x for x in gtf] + ["overlap"]
+# iso2bra["iso_gene"] = iso2bra["iso_attributes"].str.extract(r'gene_id "(.*?)";')
+# iso2bra["bra_gene"] = iso2bra["bra_attributes"].str.split("\.").str[0]
+# iso2bra["bra_gene"] = iso2bra["bra_gene"].replace("", np.nan)
+
+bra2iso = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/brak_iso.bed", sep="\t", header=None)
+bra2iso.columns = ["bra_" + x for x in gtf] + ["iso_" + x for x in gtf] + ["overlap"]
+bra2iso["bra_gene"] = bra2iso["bra_attributes"].str.split(".").str[0]
+bra2iso["iso_gene"] = bra2iso["iso_attributes"].str.extract(r'gene_id "(.*?)";')
+```
+
+%% Cell type:markdown id: tags:
+
+Now to find overlaps:
+
+%% Cell type:code id: tags:
+
+``` python
+def contained(df, a_start="bra_start", a_end="bra_end", b_start="iso_start", b_end="iso_end"):
+    """Checks if sequence A is completely contained within sequence B or vice versa.
+
+    Parameters
+    ----------
+    df : pd.DataFrame
+        A dataframe of a bedtools intersect file.
+    a_start : str, optional
+        the start position of sequence A, by default "bra_start"
+    a_end : str, optional
+        the end position of sequence B, by default "bra_end"
+    b_start : str, optional
+        The start position of sequence B, by default "iso_start"
+    b_end : str, optional
+        the end position of sequence B, by default "iso_end"
+
+    Returns
+    -------
+    bool
+        returns True if the sequence is contained, False otherwise
+    """
+    if df["bra_strand"] != df["iso_strand"]:
+        return False # make sure that we are comparing genes on the same strand
+    if df[b_start] < 0 or df[b_end] < 0 or df[a_start] < 0 or df[a_end] < 0:
+        return False
+    # first check if the braker gene is completely contained within the isoseq one
+    if df[a_start] >= df[b_start] and df[a_end] <= df[b_end]:
+        return True
+    # then check if the braker gene completely contains the isoseq one
+    if df[a_start] <= df[b_start] and df[a_end] >= df[b_end]:
+        return True
+    return False
+```
+
+%% Cell type:code id: tags:
+
+``` python
+genes = bra2iso.groupby("bra_gene")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def calc_overlap(df):
+    bra_len = df.iloc[0]["bra_end"] - df.iloc[0]["bra_start"]
+    # print(bra_len)
+    seq = np.zeros(bra_len, dtype=bool)
+    for i, transcript in df.iterrows():
+        start = transcript["iso_start"] - transcript["bra_start"]
+        start = np.max((0, start))
+        end = transcript["iso_end"] - transcript["bra_start"]
+        # print(end, len(seq))
+        end = np.min([bra_len, end])
+        seq[start:end] = True
+    return np.sum(seq) / bra_len
+```
+
+%% Cell type:code id: tags:
+
+``` python
+braker_unique = []
+significant = []
+unknown = []
+truncate = []
+
+
+for gene in tqdm(genes.groups):
+    try:
+        transcripts = genes.get_group(gene)
+    except KeyError:
+        unknown.append(gene)
+        continue
+    # first see how many candidate ISO-seq genes this BRAKER gene overlaps with
+    # if there is zero, then it means no possible corresponding ISO-seq gene was found
+    no_iso = transcripts.groupby("iso_gene").size().shape[0] # how many ISO-seq transcripts match this BRAKER gene?
+    if no_iso == 0:
+        braker_unique.append(gene)
+        continue
+
+    # if there is one it means we have an 1-to-1 correspondence
+    if no_iso == 1:
+        # try to verify that most BRAKER isoforms are contained within the ISO-seq isoforms
+        # (or vice versa)
+        no_contained = transcripts.apply(contained, axis=1).sum()
+        if no_contained / transcripts.shape[0] >= 0.4:
+            significant.append(gene)
+            continue
+
+        # test if the ISO-seq isoforms cumulatively cover at least 50% of the BRAKER gene
+        total_overlap = transcripts.groupby("bra_attributes").apply(calc_overlap, include_groups=False)
+        if np.mean(total_overlap) >= 0.5:
+            significant.append(gene)
+            continue
+
+    # if we have arrived here, all else failed and this gene needs to go for manual curation
+    unknown.append(gene)
+```
+
+%% Output
+
+    100%|██████████| 11451/11451 [00:03<00:00, 2913.12it/s]
+
+%% Cell type:code id: tags:
+
+``` python
+len(significant), len(braker_unique), len(unknown), np.sum([len(significant), len(braker_unique), len(unknown)])
+```
+
+%% Output
+
+    (7591, 3596, 264, np.int64(11451))
+
+%% Cell type:markdown id: tags:
+
+At this point the strategy should be to copy the ISO-seq and "braker_unique" gene models from their
+respective GTF files to a new, merged one. This should already take us from the 11,451 BRAKER gene
+models to 8,904 + 3,442 = 12,346 gene models. We can then screen the unknown modules further.
+
+%% Cell type:code id: tags:
+
+``` python
+isoseq_loc = "/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff"
+braker_loc = "/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gtf"
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def gene_string(gene_id, start, stop, last_line):
+    chunks = last_line.split("\t")
+
+    seqid = chunks[0]
+    source = chunks[1]
+    seq_type = "gene"
+    score = "."
+    strand = chunks[6]
+    phase = "."
+    attributes = f"ID={gene_id};gene_id={gene_id};"
+    string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
+    return string
+
+def transcript_string(gene_id, transcript_id, line):
+    chunks = line.split("\t")
+
+    seqid = chunks[0]
+    source = chunks[1]
+    seq_type = "mRNA"
+    start = chunks[3]
+    stop = chunks[4]
+    score = chunks[5]
+    strand = chunks[6]
+    phase = "."
+    attributes = f"ID={transcript_id};Parent={gene_id}"
+    string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
+    return string
+
+def exon_string(transcript_id, line, exon_counter):
+    chunks = line.split("\t")
+
+    seqid = chunks[0]
+    source = chunks[1]
+    start = chunks[3]
+    stop = chunks[4]
+    score = chunks[5]
+    strand = chunks[6]
+
+    seq_type = "exon"
+    phase = "."
+    attributes = f"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}"
+    exon_string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
+
+    seq_type = "CDS"
+    phase="0"
+    attributes = f"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}"
+    CDS_string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
+
+    return exon_string + CDS_string
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def write_gene(current_gene_id, gene_lines, edit, start, end):
+    chunks = gene_lines[0].split("\t")
+    # write the gene line first
+    gene = gene_string(current_gene_id, start, end, gene_lines[0])
+    edit.write(gene)
+    # now write each of the transcripts
+    exon_counter = 1
+    for l in gene_lines:
+        chunks = l.split("\t")
+        if chunks[2] == "transcript":
+            transcript_id = chunks[8].split(";")[1].split(" ")[2][1:-1]
+            transcript = transcript_string(current_gene_id, transcript_id, l)
+            edit.write(transcript)
+        else:
+            exon = exon_string(transcript_id, l, exon_counter)
+            edit.write(exon)
+            exon_counter += 1
+
+    edit.write("###\n")
+```
+
+%% Cell type:code id: tags:
+
+``` python
+gene_lines = []
+current_gene_id = None
+start = np.inf
+end = 0
+
+with open("/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff", "r") as isoseq:
+    with open("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/isoseq.gff", "w") as edit:
+        for line in isoseq:
+            if line.startswith("#"):
+                edit.write(line)
+                continue
+            chunks = line.split("\t")
+            entry_type = chunks[2]
+            # print(line)
+            if entry_type == "transcript":
+                gene_id = chunks[8].split(";")[0].split(" ")[1][1:-1]
+                if gene_id != current_gene_id:
+                    if current_gene_id is None: # first gene
+                        current_gene_id = gene_id
+                        gene_lines.append(line)
+                        continue
+                    # new gene! Write it out:
+                    write_gene(current_gene_id, gene_lines, edit, start, end)
+                    # reset the gene
+                    current_gene_id = gene_id
+                    gene_lines = []
+                    start = int(chunks[3])
+                    end = int(chunks[4])
+                else:
+                    start = min(start, int(chunks[3]))
+                    end = max(end, int(chunks[4]))
+            gene_lines.append(line)
+        # write last gene
+        write_gene(current_gene_id, gene_lines, edit, start, end)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+gene_lines = []
+current_gene_id = None
+
+with open("/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gff3", "r") as braker:
+    with open("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/braker.gff", "w") as edit:
+        for line in braker:
+            chunks = line.split("\t")
+            entry_type = chunks[2]
+
+            if entry_type in ["start_codon", "stop_codon", "intron"]:
+                continue
+
+            if entry_type == "gene":
+                current_gene_id = chunks[8].split(";")[0].split("=")[1]
+                line = line.rstrip() + "gene_id=" + current_gene_id + ";\n"
+
+            if current_gene_id in braker_unique:
+                edit.write(line)
+```
+
+%% Cell type:code id: tags:
+
+``` python
+import matplotlib.pyplot as plt
+```
+
+%% Cell type:code id: tags:
+
+``` python
+def plot_overlap(df):
+    # plot the overlapping regions as straight lines, color by transcript ID
+    start = np.min((df["bra_start"].min(), df["iso_start"].min()))
+    end = np.max((df["bra_end"].max(), df["iso_end"].max()))
+
+    fig, ax = plt.subplots()
+
+    # for i, transcript in df.iterrows():
+    #     ax.hlines(i*2, transcript["iso_start"], transcript["iso_end"], label=transcript["iso_gene"])
+    #     ax.hlines(i*2+1, transcript["bra_start"], transcript["bra_end"], label=transcript["bra_gene"])
+    ax.set_xlim(start-1000, end+1000)
+    ax.legend()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# a function that plots all the unique transcripts in a dataframe, coloring them by gene ID
+def plot_overlap(df):
+    # first find out the minimum start and maximum end positions of all the real BRAKER genes
+    # (i.e. not the -1 values)
+    bra_start_min = np.min(df["bra_start"][df["bra_start"] > 0])
+    bra_end_max = np.max(df["bra_end"][df["bra_end"] > 0])
+    # now do the same for the iso-seq genes
+    iso_start_min = np.min(df["iso_start"][df["iso_start"] > 0])
+    iso_end_max = np.max(df["iso_end"][df["iso_end"] > 0])
+    # the plot should start 1000 bp before the minimum start and end 1000 bp after the maximum end
+    start = np.min((bra_start_min, iso_start_min)) - 1000
+    end = np.max((bra_end_max, iso_end_max)) + 1000
+
+    fig, ax = plt.subplots()
+
+    # get a list of all the unique gene IDs
+    gene_list = np.unique(df["iso_gene"].tolist() + df["bra_gene"].tolist())
+    # generate a color map for the gene IDs - this should be always less than 10, so we can take
+    # the Set1 colormap
+    colors = plt.cm.Set1(np.linspace(0, 1, len(gene_list)))
+    # create a dictionary that maps gene IDs to colors
+    cmap = dict(zip(gene_list, colors))
+
+    # each transcript should be a straight line, colored by the gene ID. Let's do BRAKER first.
+    braker = df.groupby("bra_attributes").first()
+    for i, row in braker.iterrows():
+        ax.hlines(i, row["bra_start"], row["bra_end"], color=cmap[row["bra_gene"]])
+
+    # now do the same for the iso-seq genes
+    iso = df.groupby("iso_attributes").first()
+    for i, row in iso.iterrows():
+        try:
+            ax.hlines(i, row["iso_start"], row["iso_end"], color=cmap[row["iso_gene"]])
+        except KeyError:
+            # there might be some empty IDs here if there is no overlap for this transcript
+            # ax.hlines(i, start, end, linestyles="dashed", color="black")
+            continue
+
+    # set the limits of the plot
+    ax.set_xlim(start, end)
+    # keep track of what chromosome we're in in case we want to go to the genome browser
+    ax.set_title(f"{df['bra_seqname'].iloc[0]}")
+    # save the plot for manual inspection
+    plt.savefig(f"figs/{df['bra_gene'].iloc[0]}.png")
+    # close plot to save memory
+    plt.close()
+```
+
+%% Cell type:code id: tags:
+
+``` python
+for gene in tqdm(unknown):
+    keep = bra2iso["bra_gene"] == gene
+    plot_overlap(bra2iso[keep])
+```
+
+%% Output
+
+    100%|██████████| 264/264 [00:14<00:00, 18.59it/s]
+
+%% Cell type:markdown id: tags:
+
+Manual inspection of the ISO-seq/BRAKER overlap for these 264 genes identifies another 49 that need
+further examination. Those were aligned against uniref90 to check whether the BRAKER portions
+correspond to real genes or are spurious.
--- a/06-annotation/annot2-RNA-extract_uncovered.sh
+++ b/06-annotation/annot2-RNA-extract_uncovered.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=extr_unannot
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=2G
+#SBATCH --time=2:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.err
+
+module load bedtools
+
+GFF=/lisc/scratch/zoology/pycnogonum/genome/draft/annot_merge/merged.gff3
+OUTPUT=$1
+cd "$OUTPUT" || exit
+
+bedtools intersect -v -a Aligned.out.bam -b $GFF > unannotated.bam
\ No newline at end of file
--- a/06-annotation/annot2-RNA-extracts.sh
+++ b/06-annotation/annot2-RNA-extracts.sh
+#!/usr/bin/env bash
+
+EXTRACT=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot2-RNA-extract_uncovered.sh
+
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+sbatch $EXTRACT "$BASE/ZYGOTE/"
+sbatch $EXTRACT "$BASE/EARLY_CLEAVAGE/"
+sbatch $EXTRACT "$BASE/EMBRYO0_1/"
+sbatch $EXTRACT "$BASE/EMBRYO3_5/"
+sbatch $EXTRACT "$BASE/EMBRYO9_10/"
+sbatch $EXTRACT "$BASE/EMBRYO3/"
+sbatch $EXTRACT "$BASE/INSTAR1/"
+sbatch $EXTRACT "$BASE/INSTAR2/"
+sbatch $EXTRACT "$BASE/INSTAR3/"
+sbatch $EXTRACT "$BASE/INSTAR4/"
+sbatch $EXTRACT "$BASE/INSTAR5/"
+sbatch $EXTRACT "$BASE/INSTAR6/"
+sbatch $EXTRACT "$BASE/JUV1/"
+sbatch $EXTRACT "$BASE/SUBADULT/"
\ No newline at end of file
--- a/06-annotation/annot2-braker.sh
+++ b/06-annotation/annot2-braker.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_braker
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=10
+#SBATCH --mem=40G
+#SBATCH --time=20:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
+
+module load conda
+conda activate singularity
+
+# container location
+BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
+# path to the output directory, where all the magic should happen
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker2
+mkdir "$OUTPUT" -p || exit
+cd "$OUTPUT" || exit
+
+# copy AUGUSTUS config files for BRAKER
+cp -r ~/repos/Augustus/config .
+
+# define input files
+# the genome assembly
+GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
+# the RNA-seq dataset BAMs
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
+ZYGOTE="$BASE/ZYGOTE/unannotated.bam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/unannotated.bam"
+EMBRYO0_1="$BASE/EMBRYO0_1/unannotated.bam"
+EMBRYO3_5="$BASE/EMBRYO3_5/unannotated.bam"
+EMBRYO9_10="$BASE/EMBRYO9_10/unannotated.bam"
+EMBRYO3="$BASE/EMBRYO3/unannotated.bam"
+INSTAR1="$BASE/INSTAR1/unannotated.bam"
+INSTAR2="$BASE/INSTAR2/unannotated.bam"
+INSTAR3="$BASE/INSTAR3/unannotated.bam"
+INSTAR4="$BASE/INSTAR4/unannotated.bam"
+INSTAR5="$BASE/INSTAR5/unannotated.bam"
+INSTAR6="$BASE/INSTAR6/unannotated.bam"
+JUV1="$BASE/JUV1/unannotated.bam"
+SUBADULT="$BASE/SUBADULT/unannotated.bam"
+# the Arthropoda OrthoDB protein sequences
+ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
+
+# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
+# in case it is needed.
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
+--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+--genome=$GENOME \
+--prot_seq=$ARTHROPODA \
+--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
+--softmasking \
+--threads 10
\ No newline at end of file
--- a/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh
+++ b/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=extr_unannot
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=500M
+#SBATCH --time=17:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.err
+
+module load perl/5.40.0
+module load transdecoder/5.7.1
+
+DENOVO=$1
+OUTPUT=$2
+
+# 228.06user
+# 747.90user
+# say 1000s = 16-17min
+
+mkdir -p "$OUTPUT" || exit 1
+cd "$OUTPUT" || exit 1
+
+# first we need to extract ORFs
+TransDecoder.LongOrfs -t "$DENOVO"
+# now predict the coding regions
+TransDecoder.Predict -t "$DENOVO"
+# now convert the resulting .cds file to a single-line fasta
+# almost instantaneous
+perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' Trinity.fasta.transdecoder.cds > oneline.fasta
+# extract complete ORFs. Almost instantaneous.
+grep -A1 complete oneline.fasta > complete.fasta
\ No newline at end of file
--- a/06-annotation/annot3-RNA-01-filter_assemblies.sh
+++ b/06-annotation/annot3-RNA-01-filter_assemblies.sh
+#!/usr/bin/env bash
+
+FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-extract_complete_ORFs.sh
+
+IN=/lisc/project/zoology/pycnogonum/transcriptome/development
+
+EMBRYO3_5_IN=$IN/embryonic_stage3-4/Trinity.fasta
+INSTAR1_IN=$IN/instarI-protonymphon/Trinity.fasta
+INSTAR2_IN=$IN/instar_II/Trinity.fasta
+INSTAR3_IN=$IN/instar_III/Trinity.fasta
+INSTAR4_IN=$IN/instar_IV/Trinity.fasta
+INSTAR5_IN=$IN/instar_V/Trinity.fasta
+INSTAR6_IN=$IN/instar_VI/Trinity.fasta
+JUV1_IN=$IN/juvenile_I/Trinity.fasta
+SUBADULT_IN=$IN/subadult/Trinity.fasta
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_5_OUT=$OUT/EMBRYO3/
+INSTAR1_OUT=$OUT/INSTAR1/
+INSTAR2_OUT=$OUT/INSTAR2/
+INSTAR3_OUT=$OUT/INSTAR3/
+INSTAR4_OUT=$OUT/INSTAR4/
+INSTAR5_OUT=$OUT/INSTAR5/
+INSTAR6_OUT=$OUT/INSTAR6/
+JUV1_OUT=$OUT/JUV1/
+SUBADULT_OUT=$OUT/SUBADULT/
+
+sbatch $FILTER $EMBRYO3_5_IN $EMBRYO3_5_OUT
+sbatch $FILTER $INSTAR1_IN $INSTAR1_OUT
+sbatch $FILTER $INSTAR2_IN $INSTAR2_OUT
+sbatch $FILTER $INSTAR3_IN $INSTAR3_OUT
+sbatch $FILTER $INSTAR4_IN $INSTAR4_OUT
+sbatch $FILTER $INSTAR5_IN $INSTAR5_OUT
+sbatch $FILTER $INSTAR6_IN $INSTAR6_OUT
+sbatch $FILTER $JUV1_IN $JUV1_OUT
+sbatch $FILTER $SUBADULT_IN $SUBADULT_OUT
\ No newline at end of file
--- a/06-annotation/annot3-RNA-02-map_assemblies.sh
+++ b/06-annotation/annot3-RNA-02-map_assemblies.sh
+#!/usr/bin/env bash
+
+MAP=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-02-map_assembly_single.sh
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_5=$OUT/EMBRYO3
+INSTAR1=$OUT/INSTAR1
+INSTAR2=$OUT/INSTAR2
+INSTAR3=$OUT/INSTAR3
+INSTAR4=$OUT/INSTAR4
+INSTAR5=$OUT/INSTAR5
+INSTAR6=$OUT/INSTAR6
+JUV1=$OUT/JUV1
+SUBADULT=$OUT/SUBADULT
+
+sbatch $MAP $EMBRYO3_5/complete.fasta $EMBRYO3_5
+sbatch $MAP $INSTAR1/complete.fasta $INSTAR1
+sbatch $MAP $INSTAR2/complete.fasta $INSTAR2
+sbatch $MAP $INSTAR3/complete.fasta $INSTAR3
+sbatch $MAP $INSTAR4/complete.fasta $INSTAR4
+sbatch $MAP $INSTAR5/complete.fasta $INSTAR5
+sbatch $MAP $INSTAR6/complete.fasta $INSTAR6
+sbatch $MAP $JUV1/complete.fasta $JUV1
+sbatch $MAP $SUBADULT/complete.fasta $SUBADULT
\ No newline at end of file
--- a/06-annotation/annot3-RNA-02-map_assembly_single.sh
+++ b/06-annotation/annot3-RNA-02-map_assembly_single.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=mmap_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=3
+#SBATCH --mem=10G
+#SBATCH --time=5:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-map-txome-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-map-txome-%j.err
+
+# with 1% of data and 3 CPUs:
+# [M::main] Real time: 448.742 sec; CPU: 1260.678 sec; Peak RSS: 1.830 GB
+# 1251.43user 9.26system 7:28.76elapsed 280%CPU (0avgtext+0avgdata 1919296maxresident)k
+
+# with 10% of data and 3 CPUs:
+# [M::main] Real time: 1528.465 sec; CPU: 3772.846 sec; Peak RSS: 2.758 GB
+# 3743.86user 29.02system 25:28.51elapsed 246%CPU (0avgtext+0avgdata 2892008maxresident)k
+# BUT THAT WAS WITH THE WRONG ARGUMENT ORDER, it is much faster the other way round
+
+module load conda
+conda activate minimap2-2.28
+
+TXOME=$1
+OUTPUT=$2
+DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+
+cd "$TMPDIR" || exit
+minimap2 -ax splice:hq $DRAFT "$TXOME" > "$OUTPUT"/txome_assembly.sam
\ No newline at end of file
--- a/06-annotation/annot3-RNA-03-pack_sam.sh
+++ b/06-annotation/annot3-RNA-03-pack_sam.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_filter
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --mem=2G
+#SBATCH --time=5:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-filter-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-filter-%j.err
+
+module load samtools
+
+TXOME=$1
+
+cd "$TXOME" || exit
+samtools view -@2 -F 0x900 -q 30 -S -b txome_assembly.sam > txome_assembly.hd.bam
+
+
+module load conda
+conda activate agat-1.4.0
+
+agat_convert_minimap2_bam2gff.pl -i txome_assembly.hd.bam -o txome_assembly.gff
\ No newline at end of file
--- a/06-annotation/annot3-RNA-03-pack_sams.sh
+++ b/06-annotation/annot3-RNA-03-pack_sams.sh
+#!/usr/bin/env bash
+
+FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-03-pack_sam.sh
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_OUT=$OUT/EMBRYO3/
+INSTAR1_OUT=$OUT/INSTAR1/
+INSTAR2_OUT=$OUT/INSTAR2/
+INSTAR3_OUT=$OUT/INSTAR3/
+INSTAR4_OUT=$OUT/INSTAR4/
+INSTAR5_OUT=$OUT/INSTAR5/
+INSTAR6_OUT=$OUT/INSTAR6/
+JUV1_OUT=$OUT/JUV1/
+SUBADULT_OUT=$OUT/SUBADULT/
+
+sbatch $FILTER $EMBRYO3_OUT
+sbatch $FILTER $INSTAR1_OUT
+sbatch $FILTER $INSTAR2_OUT
+sbatch $FILTER $INSTAR3_OUT
+sbatch $FILTER $INSTAR4_OUT
+sbatch $FILTER $INSTAR5_OUT
+sbatch $FILTER $INSTAR6_OUT
+sbatch $FILTER $JUV1_OUT
+sbatch $FILTER $SUBADULT_OUT
\ No newline at end of file
--- a/06-annotation/annot3-RNA-04-denovo_mining.ipynb
+++ b/06-annotation/annot3-RNA-04-denovo_mining.ipynb
--- a/06-annotation/install_braker.md
+++ b/06-annotation/install_braker.md
--- a/06-annotation/prep-RNA-bam.sh
+++ b/06-annotation/prep-RNA-bam.sh
+#!/usr/bin/env bash
+
+CONVERTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-sam2bam.sh
+
+BASE="/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome"
+
+ZYGOTE="$BASE/ZYGOTE/Aligned.out.sam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.sam"
+EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.sam"
+EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.sam"
+EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.sam"
+EMBRYO3="$BASE/EMBRYO3/Aligned.out.sam"
+INSTAR1="$BASE/INSTAR1/Aligned.out.sam"
+INSTAR2="$BASE/INSTAR2/Aligned.out.sam"
+INSTAR3="$BASE/INSTAR3/Aligned.out.sam"
+INSTAR4="$BASE/INSTAR4/Aligned.out.sam"
+INSTAR5="$BASE/INSTAR5/Aligned.out.sam"
+INSTAR6="$BASE/INSTAR6/Aligned.out.sam"
+JUV1="$BASE/JUV1/Aligned.out.sam"
+SUBADULT="$BASE/SUBADULT/Aligned.out.sam"
+
+sbatch $CONVERTER $ZYGOTE
+sbatch $CONVERTER $EARLY_CLEAVAGE
+sbatch $CONVERTER $EMBRYO0_1
+sbatch $CONVERTER $EMBRYO3_5
+sbatch $CONVERTER $EMBRYO9_10
+sbatch $CONVERTER $EMBRYO3
+sbatch $CONVERTER $INSTAR1
+sbatch $CONVERTER $INSTAR2
+sbatch $CONVERTER $INSTAR3
+sbatch $CONVERTER $INSTAR4
+sbatch $CONVERTER $INSTAR5
+sbatch $CONVERTER $INSTAR6
+sbatch $CONVERTER $JUV1
+sbatch $CONVERTER $SUBADULT
\ No newline at end of file
--- a/06-annotation/prep-RNA-index_bam.sh
+++ b/06-annotation/prep-RNA-index_bam.sh
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=converter
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=6
+#SBATCH --mem=16G
+#SBATCH --time=15:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-indexbam-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-indexbam-%j.err
+
+module load samtools
+
+INPUT=$1
+BASENAME=$(dirname "$INPUT")
+OUTPUT=$BASENAME/Aligned.out.sorted.bam
+
+cd "$TMPDIR" || exit
+samtools sort -@ 8 "$INPUT" > "$OUTPUT"
+samtools index -@ 8 -b "$OUTPUT"
\ No newline at end of file
--- a/06-annotation/prep-RNA-map-all.sh
+++ b/06-annotation/prep-RNA-map-all.sh
+#!/usr/bin/env bash
+
+MAPPER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-map-single.sh
+ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/
+
+# early development
+BASE="/lisc/scratch/zoology/pycnogonum/raw/early_RNA"
+ZYGOTE="$BASE/Plit35_S8"
+EARLY_CLEAVAGE="$BASE/Plit31_S6"
+EMBRYO0_1="$BASE/Plit40_S11"
+EMBRYO3_5="$BASE/Plit32_S7"
+EMBRYO9_10="$BASE/Plit36_S9"
+# INSTAR2_5="$BASE/Plit39_S10"
+
+sbatch $MAPPER $ASSEMBLY ZYGOTE "$ZYGOTE"_L001
+sbatch $MAPPER $ASSEMBLY EARLY_CLEAVAGE "$EARLY_CLEAVAGE"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO0_1 "$EMBRYO0_1"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO3_5 "$EMBRYO3_5"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO9_10 "$EMBRYO9_10"_L001
+
+# instars
+BASE="/lisc/scratch/zoology/pycnogonum/raw/HTT33DSX5_4_R15615_20230713/demultiplexed"
+
+EMBRYO3="$BASE/235255/235255_S49"
+INSTAR1="$BASE/235253/235253_S47"
+INSTAR2="$BASE/235256/235256_S50"
+INSTAR3="$BASE/235257/235257_S51"
+INSTAR4="$BASE/235258/235258_S52"
+INSTAR5="$BASE/235259/235259_S53"
+INSTAR6="$BASE/235254/235254_S48"
+JUV1="$BASE/235260/235260_S54"
+SUBADULT="$BASE/235261/235261_S55"
+
+sbatch $MAPPER $ASSEMBLY EMBRYO3 "$EMBRYO3"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR1 "$INSTAR1"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR2 "$INSTAR2"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR3 "$INSTAR3"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR4 "$INSTAR4"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR5 "$INSTAR5"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR6 "$INSTAR6"_L004
+sbatch $MAPPER $ASSEMBLY JUV1 "$JUV1"_L004
+sbatch $MAPPER $ASSEMBLY SUBADULT "$SUBADULT"_L004
\ No newline at end of file