diff --git a/06-annotation/README.md b/06-annotation/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e55eefdc3595759fd25dc0b8540f6a59ad41064f
--- /dev/null
+++ b/06-annotation/README.md
@@ -0,0 +1,315 @@
+# Annotation
+
+Code in this folder covers the annotation of the genome with gene models, mostly for protein-coding
+genes. The starting point is the scaffolded, decontaminated draft assembly.
+
+## Annotating protein-coding genes on the _Pycnogonum_ genome
+
+### BRAKER3 with developmental RNA-seq data
+
+BRAKER3 is one of the most widely used tools for eukaryotic gene prediction, and one that
+consistently performs well in benchmarks. However, installing the software is a non-trivial
+endeavour on an HPC system without admin privileges. We provide a summary of our experience here,
+in the hope that it is useful to other prospective users.
+
+- [Installing BRAKER3 in a shared system without admin access](install_braker.md)
+
+We generated RNA-seq data for a time series spanning development from late embryogenesis until the
+subadult stage (embryo 3-4, instars 1 to 6, juvenile 1, subadult), and our collaborators provided
+data for even earlier timepoints (zygote, early cleavage, embryo 3-5). These data were mapped to the
+draft genome (average: 72.8% uniquely mapping, 15% multimapping, i.e. pretty good completeness) and
+the resulting BAM files were used as input to BRAKER3.
+
+- [`prep-RNA-mkindex.sh`](prep-RNA-mkindex.sh) - before mapping the RNA we need to index the draft
+  genome so that STAR can actually map to it.
+- [`prep-RNA-map-all.sh`](prep-RNA-map-all.sh) - main script that will submit all mapping jobs.
+- [`prep-RNA-map-single.sh`](prep-RNA-map-single.sh) - worker script that will map the RNA reads
+  from a developmental timepoint to the draft genome.
+- [`prep-RNA-bam.sh`](prep-RNA-bam.sh) - main script that will submit the SAM-to-BAM conversion jobs
+  to the cluster.
+- [`prep-RNA-sam2bam.sh`](prep-RNA-sam2bam.sh) - worker script that will convert a SAM file to a BAM
+  file in the same location.
+- [`prep-RNA-index_bam.sh`](prep-RNA-index_bam.sh) - index the resulting BAM files
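+
+The exact parameters live in the scripts above; a minimal sketch of the indexing and mapping steps,
+with hypothetical file names, looks roughly like this:
+
+```bash
+# build the STAR index for the draft genome (see prep-RNA-mkindex.sh for the real parameters)
+$ STAR --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles draft.fasta --runThreadN 8
+# map one developmental timepoint (see prep-RNA-map-single.sh); file names are placeholders
+$ STAR --genomeDir star_index --readFilesIn INSTAR3_R1.fastq.gz INSTAR3_R2.fastq.gz --readFilesCommand zcat --runThreadN 8 --outFileNamePrefix INSTAR3/
+# convert/sort and index the alignment (see prep-RNA-sam2bam.sh and prep-RNA-index_bam.sh)
+$ samtools sort -@ 8 -o INSTAR3/Aligned.out.bam INSTAR3/Aligned.out.sam
+$ samtools index INSTAR3/Aligned.out.bam
+```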
+
+BRAKER3 also takes as input a FASTA file with protein sequences expected to be found in the genome.
+For this, we used the Arthropoda partition from [orthoDB](https://www.orthodb.org).
+
+BRAKER3 was run from a container on the Vienna Life Science Cluster (LiSC).
+
+- [`annot1-braker.sh`](./annot1-braker.sh)
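+
+The core of that script is the containerised BRAKER3 call, abridged here ($GENOME, $ARTHROPODA, and
+the comma-separated list of BAM files are defined in the script):
+
+```bash
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" braker3.sif braker.pl \
+    --AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+    --genome=$GENOME \
+    --prot_seq=$ARTHROPODA \
+    --bam="$ZYGOTE","$EARLY_CLEAVAGE",...,"$SUBADULT" \
+    --softmasking \
+    --threads 18
+```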
+
+This predicted 11,451 gene models. However, manual inspection of the RNA-seq mapping on the draft
+genome revealed that BRAKER3 had been rather conservative in its predictions, and that many loci
+that looked like they contained exon structures did not receive any annotation.
+
+### Using Iso-seq collapsed transcripts as gene models
+
+We generated a long-read transcriptome using PacBio Iso-seq technology. After collapsing isoforms by
+mapping onto the draft genome, we were left with 8,904 transcript families, which we interpreted as
+genes. Comparing the Iso-seq "genes" with the BRAKER3 models revealed multiple cases where BRAKER3
+had merged neighbouring gene models that lay too close to each other.
+
+We decided to combine the Iso-seq and BRAKER3 annotation by giving precedence to Iso-seq. We
+filtered the BRAKER3 gene models to remove any that overlapped (same strand, any overlap) with the
+Iso-seq genes, and were left with 3,596 gene models, raising our total to 3,596 + 8,904 = 12,500.
+
+- [Map](annot1-isoseq-map.sh) the Iso-seq collapsed transcripts to the draft genome
+- [isoseq_confirm.ipynb](./annot1-isoseq_confirm.ipynb) - cross-reference the Iso-seq mapping
+  locations with BRAKER predictions and produce a new GFF file that contains their union.
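+
+A minimal command-line sketch of the same filter, assuming gene-level GFF files for both annotations
+(file names are placeholders; the notebook implements the actual, more careful comparison):
+
+```bash
+$ awk '$3 == "gene"' braker.gff3 > braker_genes.gff3
+$ awk '$3 == "gene"' isoseq.gff3 > isoseq_genes.gff3
+# -s: same strand only; -v: keep BRAKER genes with no Iso-seq overlap
+$ bedtools intersect -v -s -a braker_genes.gff3 -b isoseq_genes.gff3 > braker_unique_genes.gff3
+```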
+
+Unfortunately, the Iso-seq transcriptome we generated was part of a multiplexed run. This meant that
+we "lost" almost 60% of our effective sequencing depth to other organisms, making the _Pycnogonum_
+Iso-seq less effective for gene annotation. Manual inspection confirmed what the numbers suggested:
+there were still plenty of loci with RNA-seq peaks that looked like _bona fide_ genes but received
+no annotation from either PacBio or BRAKER3.
+
+We consider Iso-seq+BRAKER3 "round 1" of the genome annotation. In the final GFF file, gene models
+derived from Iso-seq have "PacBio" in the `source` column, and their IDs are of the form `PB.XXXX`,
+where `X` are digits. BRAKER3 gene models have "AUGUSTUS" listed as their `source`, and their IDs
+have the form `gXXXXX`, with `X` again being digits.
+
+### Re-using unannotated loci for a second round of annotation
+
+There were several indications that the genome annotation was not yet complete. First, BUSCO
+completeness of the gene models was lower than that of the entire genome; second, the mapping rates
+of the RNA-seq data were much lower for the gene models (avg. uniquely mapping: 48.69%) than for the
+entire genome (avg. uniquely mapping: 72.8%); finally, manual inspection easily identified loci that
+looked like genes but had no annotation.
+
+We proceeded to a second round of annotation. We filtered the RNA-seq data to only keep the mapped
+reads that did _not_ overlap with any of the gene models of the first BRAKER/Iso-seq round:
+
+- [annot2-RNA-extracts.sh](./annot2-RNA-extracts.sh) - main script that will submit worker scripts
+  for each transcriptome.
+- [`annot2-RNA-extract_uncovered.sh`](annot2-RNA-extract_uncovered.sh) - worker script that, for a
+  bulk transcriptome, extracts the reads that map to un-modelled regions.
+
+...and supplied the same orthoDB arthropod partition as protein hints:
+
+- [annot2-braker.sh](./annot2-braker.sh)
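+
+Per timepoint, the filtering step boils down to a single `bedtools` call (as in
+[`annot2-RNA-extract_uncovered.sh`](./annot2-RNA-extract_uncovered.sh)); the resulting BAM files
+then replace the full alignments as BRAKER3 input:
+
+```bash
+# keep only the reads that do not overlap any round-1 gene model
+$ bedtools intersect -v -a Aligned.out.bam -b merged.gff3 > unannotated.bam
+```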
+
+As the protein hints overlapped with round 1, BRAKER3 was expected to also produce gene models that
+overlap with round 1. To exclude these safely, we used _exons_ to test for gene overlap. Using
+entire gene models would have discarded genes that overlap other genes on the opposite strand, as
+well as genes that lie in another gene's intron (such as the ChAT/VAChT genes, conserved in
+_Platynereis_).
+
+This task was mostly achieved on the command line, using a combination of `bedtools intersect` to
+identify overlaps between GFF files, `AGAT` to convert from GTF to GFF, and bash utility tools
+`grep`, `cut`, `sort`, and `sed` to extract and lightly manipulate lines from the resulting files.
+
+We consider this "round 2" of the genome annotation. In the final GFF file, gene models from round 2
+have "AGAT" as `source` for gene/mRNA entries or "AUGUSTUS" for exons/CDS. Gene IDs are in the form
+`r2_gXXXX`, with `r2` standing for "round 2" and the rest being the usual BRAKER3 gene ID. Round 2
+produced an additional 2,223 gene models, bringing our total to 14,723.
+
+<details>
+<summary>Click here for details</summary>
+
+First extract the round 2 exons that don't overlap:
+
+```bash
+$ grep exon braker/braker.gtf > braker_exons.gtf
+$ grep exon ../annot_merge/merged.gff3 > draft_exons.gtf
+$ bedtools intersect -v -b draft_exons.gtf -a braker_exons.gtf > unique_exons.gtf
+```
+
+Keep track of which genes these come from:
+
+```bash
+$ cat unique_exons.gtf | cut -f9 | cut -d" " -f 4 | sort -u > keep_candidates.txt
+```
+
+At the same time, there might be round 2 genes that are a bit bigger than round 1 genes, and thus
+mostly overlap but have some unique exons. We'd like to avoid those:
+
+```bash
+$ bedtools intersect -wa -b draft_exons.gtf -a braker_exons.gtf > shared_exons.gtf
+$ cat shared_exons.gtf | cut -f9 | cut -f4 -d" " | sort -u > overlap.txt
+```
+
+We can now filter the candidate genes by those that have some overlap:
+
+```bash
+$ grep -v -f overlap.txt keep_candidates.txt > keep_genes.txt
+```
+
+And finally, we can filter the GTF file:
+    
+```bash
+$ grep -f keep_genes.txt braker/braker.gtf > braker2_unique.gtf
+```
+
+Now, before we can use this GTF file we need to make the gene names unique so that they don't clash
+with the round 1 gene names. We can do this by adding a prefix to the gene names:
+
+```bash
+$ sed -r 's/(g[[:digit:]])/r2_\1/g' braker2_unique.gtf > braker2_unique_renamed.gtf
+```
+
+Finally, we can convert to a GFF3 file and append to our existing annotation:
+
+```bash
+$ module load conda
+$ conda activate agat-1.4.0
+$ agat_convert_sp_gxf2gxf.pl -g braker2_unique_renamed.gtf -o ./braker2_unique_renamed.gff3
+```
+
+Let's copy that over to the annotation merge site:
+
+```bash 
+$ cp braker2_unique_renamed.gff3 ../annot_merge/
+```
+
+Now we can remove the lines that contain introns or start/stop codons:
+
+```bash
+$ cd ../annot_merge
+$ grep -v -E -i "(intron)|(codon)" braker2_unique_renamed.gff3 > braker2_unique_renamed_nocodon_intron.gff3
+```
+
+</details>
+
+### Using _de novo_ transcriptomes as additional evidence
+
+Despite the improvements, manual inspection continued to suggest that there were loci with real
+RNA-seq peaks and exon-like structures that did not receive gene models. We reasoned that loci that
+represented real genes would be more likely to be present in multiple time points. Since BRAKER3 had
+already seen these loci and didn't identify them as genes, we hypothesized that the signal-to-noise
+ratio might have played a role. Instead of using raw reads we therefore turned to _de novo_
+assembled transcripts, and tried to find loci where transcripts from multiple timepoints mapped onto
+the draft genome.
+
+We used [Trinity](https://github.com/trinityrnaseq/trinityrnaseq) to perform _de novo_ assemblies
+for the in-house RNA-seq datasets, as they had been sequenced at much greater depth than the
+transcriptomes provided by our collaborators, giving the tool a better chance of assembling real
+sequences (also refer to the [corresponding section](../05-transcriptomes/README.md)). We kept
+complete ORFs, as annotated by [TransDecoder](https://github.com/TransDecoder/TransDecoder), and
+mapped those against the draft genome.
+
+- [`filter_assemblies.sh`](./annot3-RNA-01-filter_assemblies.sh) - main script that submits
+  filtering jobs to the cluster.
+- [`extract_complete_ORFs.sh`](annot3-RNA-01-extract_complete_ORFs.sh) - worker script that
+  extracts coding regions representing complete ORFs.
+- [`map_assemblies.sh`](./annot3-RNA-02-map_assemblies.sh) - main script that submits mapping jobs.
+- [`map_assembly_single.sh`](annot3-RNA-02-map_assembly_single.sh) - worker script that maps the
+  complete ORFs to the draft genome.
+- [`pack_sams.sh`](./annot3-RNA-03-pack_sams.sh) - main script that submits formatting jobs to the
+  cluster.
+- [`pack_sam.sh`](annot3-RNA-03-pack_sam.sh) - worker script that extracts reads that mapped well
+  and saves them in GFF3 format.
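+
+Per timepoint, these worker scripts boil down to the following steps (file names are placeholders;
+the exact calls and filters are in the scripts above):
+
+```bash
+# extract ORFs and keep only the complete ones (annot3-RNA-01-extract_complete_ORFs.sh)
+$ TransDecoder.LongOrfs -t Trinity.fasta
+$ TransDecoder.Predict -t Trinity.fasta
+$ perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' Trinity.fasta.transdecoder.cds > oneline.fasta
+$ grep -A1 complete oneline.fasta > complete.fasta
+# map the complete ORFs to the draft genome (annot3-RNA-02-map_assembly_single.sh)
+$ minimap2 -ax splice:hq draft.fasta complete.fasta > txome_assembly.sam
+# keep primary alignments with MAPQ >= 30 and convert to GFF (annot3-RNA-03-pack_sam.sh)
+$ samtools view -F 0x900 -q 30 -b txome_assembly.sam > txome_assembly.hd.bam
+$ agat_convert_minimap2_bam2gff.pl -i txome_assembly.hd.bam -o txome_assembly.gff
+```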
+
+We chose to use the instar 3 transcriptome, which had 99% BUSCO completeness (metazoa), as a
+reference. We used `bedtools intersect` to find all draft genome loci where an instar 3 transcript
+mapped and overlapped (at 90% of its length) with a mapped transcript from another time point (also
+at 90% of its length), and excluded all loci that overlapped with gene models from rounds 1 and 2.
+We then converted the `cDNA_match` parts of each locus to exons. Whenever the different _de novo_
+transcriptomes disagreed about exon boundaries, we picked the most common boundary.
+
+- [mining de novo assembled transcriptomes](annot3-RNA-04-denovo_mining.ipynb)
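+
+The core of the overlap test, as run in the notebook (directory prefixes omitted):
+
+```bash
+$ bedtools intersect -f 0.9 -F 0.9 -wao -a INSTAR3.gff -b EMBRYO3.gff INSTAR1.gff INSTAR2.gff INSTAR4.gff INSTAR5.gff INSTAR6.gff JUV1.gff SUBADULT.gff > overlap.gff
+```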
+
+While this approach is less sensitive than BRAKER3 and cannot distinguish between isoforms, it is
+fairly robust at finding gene and exon boundaries, at least as far as the coding regions are
+concerned.
+
+We consider this "round 3" of the genome annotation. In the final GFF file, gene models from round 3
+have "Trinity" as `source`. Gene IDs are in the form `at_DNXXXX`, with `at` standing for "assembled
+transcriptomes", `DN` being a tribute to Trinity transcript naming, and `X` being digits. Round 3
+produced an additional 774 gene models, bringing our total to 15,497.
+
+### Consolidating results
+
+Finally, we can form the GFF3 file and sort it:
+
+```bash
+$ cat isoseq.gff > merged.gff3
+$ cat braker.gff >> merged.gff3
+$ cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3
+$ cat denovo_txomes/overlap_translated.gff3 >> merged.gff3
+$ cat ../trnascan/trnascan.gff3 >> merged.gff3
+$ module load genometools/
+$ gt gff3 -tidy -retainids -o merged_sorted.gff3 -force merged.gff3
+```
+
+### Testing
+
+First, convert the GFF to GTF for convenience:
+
+```bash
+$ gt gff3_to_gtf -o merged_sorted.gtf merged_sorted.gff3
+```
+
+Now we can use [AGAT](https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html) to
+extract transcripts from the genome:
+
+```bash
+$ agat_sp_extract_sequences.pl -g merged_sorted.gff3 -f ../draft_softmasked.fasta -t exon --merge -o transcripts.fa
+```
+
+We can then use this file as input for TransDecoder...
+
+```bash
+$ TransDecoder.LongOrfs -t transcripts.fa
+$ TransDecoder.Predict -t transcripts.fa
+```
+
+...and finally look at BUSCO completeness:
+
+```bash
+$ conda deactivate
+$ conda activate busco-5.7.1
+$ busco -i transcripts.fa.transdecoder.pep -l metazoa -m protein -o metazoa -r -c 4 --offline --download_path ../../busco/busco_downloads/
+```
+
+Which returns:
+
+```
+---------------------------------------------------
+|Results from dataset metazoa_odb10                |
+---------------------------------------------------
+|C:96.5%[S:40.7%,D:55.8%],F:1.5%,M:2.0%,n:954      |
+|920    Complete BUSCOs (C)                        |
+|388    Complete and single-copy BUSCOs (S)        |
+|532    Complete and duplicated BUSCOs (D)         |
+|14    Fragmented BUSCOs (F)                       |
+|20    Missing BUSCOs (M)                          |
+|954    Total BUSCO groups searched                |
+---------------------------------------------------
+```
+
+This is very, very close to what we had for the entire genome (except for the high prevalence of
+duplicated BUSCOs, which can be explained by the fact that we included all isoforms). In particular,
+it is worth noting that, for the entire genome, we had 1.6% fragmented and 1.7% missing BUSCOs.
+Taken together, this suggests that we have a fairly complete set of gene models.
+
+We can do the same with the arthropod BUSCO set:
+
+```bash
+$ busco -i transcripts.fa.transdecoder.pep -l arthropoda -m protein -o arthropoda -r -c 4 --offline --download_path ../../busco/busco_downloads/
+```
+
+Which returns:
+
+```
+---------------------------------------------------
+|Results from dataset arthropoda_odb10             |
+---------------------------------------------------
+|C:95.8%[S:37.3%,D:58.5%],F:2.0%,M:2.2%,n:1013     |
+|971    Complete BUSCOs (C)                        |
+|378    Complete and single-copy BUSCOs (S)        |
+|593    Complete and duplicated BUSCOs (D)         |
+|20    Fragmented BUSCOs (F)                       |
+|22    Missing BUSCOs (M)                          |
+|1013    Total BUSCO groups searched               |
+---------------------------------------------------
+```
+
+The results are very similar, again suggesting that we have a fairly complete set of gene models.
+
+## tRNAscan
+
+We also used tRNAscan-SE to predict tRNAs in the draft genome.
+
+- [run the tool](annot-trnascan.sh)
+- [export in GFF3 format](tRNAscan_to_gff3.ipynb)
\ No newline at end of file
diff --git a/06-annotation/annot-trnascan.sh b/06-annotation/annot-trnascan.sh
new file mode 100644
index 0000000000000000000000000000000000000000..68d13a6c99bb685c57bfa4d58a5616df3e29a29c
--- /dev/null
+++ b/06-annotation/annot-trnascan.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_trnascan
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --mem=500M
+#SBATCH --time=1:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-trnascan-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-trnascan-%j.err
+
+module load trnascan
+
+# Testing was performed with a FASTA file with 60 chars/line.
+
+# | CPUs | input size (lines) | real time | max memory | CPU util. |
+# |------|--------------------|-----------|------------|-----------|
+# | 1    | 78753              | 45s       | 57.5Mb     | 97%       | 
+# | 2    | 78753              | 31s       | 97Mb       | 142%      | 
+# | 4    | 78753              | 30s       | 16.8Mb     | 182%      | (testing)
+# | 1    | 787530             | 7:04      | 172Mb      | 98%       |
+# | 2    | 787530             | 4:58      | 225Mb      | 145%      |
+# | 4    | 787530             | 4:12      | 336Mb      | 188%      |
+# |------|--------------------|-----------|------------|-----------|
+# | 2    | 7875382            | 53:40     | 304Mb      | 60%       | (run)
+
+ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/trnascan/
+mkdir "$OUTPUT" -p || exit
+
+cd "$OUTPUT" || exit
+
+tRNAscan-SE \
+-o trnascan.out \
+-f trnascan.fasta \
+-b trnascan.bed \
+--stats trnascan.stats \
+--progress \
+--hitsrc \
+--thread 2 \
+"$ASSEMBLY"
\ No newline at end of file
diff --git a/06-annotation/annot1-braker.sh b/06-annotation/annot1-braker.sh
new file mode 100644
index 0000000000000000000000000000000000000000..74c07ed17334508b95b96bfbf5b45fcaf037f9cb
--- /dev/null
+++ b/06-annotation/annot1-braker.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_braker
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=18
+#SBATCH --mem=40G
+#SBATCH --time=20:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
+
+module load conda
+conda activate singularity
+
+# container location
+BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
+# path to the output directory, where all the magic should happen
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker
+mkdir "$OUTPUT" -p || exit
+cd "$OUTPUT" || exit
+
+# copy AUGUSTUS config files for BRAKER
+cp -r ~/repos/Augustus/config .
+
+# define input files
+# the genome assembly
+GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
+# the RNA-seq dataset BAMs
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
+ZYGOTE="$BASE/ZYGOTE/Aligned.out.bam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.bam"
+EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.bam"
+EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.bam"
+EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.bam"
+EMBRYO3="$BASE/EMBRYO3/Aligned.out.bam"
+INSTAR1="$BASE/INSTAR1/Aligned.out.bam"
+INSTAR2="$BASE/INSTAR2/Aligned.out.bam"
+INSTAR3="$BASE/INSTAR3/Aligned.out.bam"
+INSTAR4="$BASE/INSTAR4/Aligned.out.bam"
+INSTAR5="$BASE/INSTAR5/Aligned.out.bam"
+INSTAR6="$BASE/INSTAR6/Aligned.out.bam"
+JUV1="$BASE/JUV1/Aligned.out.bam"
+SUBADULT="$BASE/SUBADULT/Aligned.out.bam"
+# the Arthropoda OrthoDB protein sequences
+ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
+
+# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
+# in case it is needed.
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
+--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+--genome=$GENOME \
+--prot_seq=$ARTHROPODA \
+--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
+--softmasking \
+--threads 18
\ No newline at end of file
diff --git a/06-annotation/annot1-isoseq-map.sh b/06-annotation/annot1-isoseq-map.sh
new file mode 100644
index 0000000000000000000000000000000000000000..628e64f612c6cd77aa42043fe8f890a981a11948
--- /dev/null
+++ b/06-annotation/annot1-isoseq-map.sh
@@ -0,0 +1,18 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=mmap_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=15G
+#SBATCH --time=15:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-mmap-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-mmap-%j.err
+
+module load conda
+conda activate minimap2-2.28
+
+DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+ISOSEQ=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/isoseq/isoseq.fastq
+/usr/bin/time minimap2 -ax splice -t16 -uf -C5 $DRAFT $ISOSEQ | samtools view -bS -o isoseq.bam
\ No newline at end of file
diff --git a/06-annotation/annot1-isoseq_confirm.ipynb b/06-annotation/annot1-isoseq_confirm.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..1894b7dc75403f2fba166965600d016b9fa1b473
--- /dev/null
+++ b/06-annotation/annot1-isoseq_confirm.ipynb
@@ -0,0 +1,515 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Confirming gene models with isoform-level RNA-seq data\n",
+    "\n",
+    "## Motivation\n",
+    "\n",
+    "Our gene prediction with BRAKER3 produced a low number of gene models (~11.5k). We suspect this\n",
+    "might be because GeneMark-ET/StringTie are too aggressive with exons/genes that are close to each\n",
+    "other, merging them and messing up the gene models. We will use the isoform-level RNA-seq data to\n",
+    "confirm some of the gene models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from tqdm import tqdm"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "I extracted all transcript lines from the corresponding gff/gtf files and used bedtools to intersect\n",
+    "them:\n",
+    "\n",
+    "```bash\n",
+    "$ ISOSEQ=/lisc/scratch/zoology/pycnogonum/transcriptome/isoseq/isoseq_confirmed.gff\n",
+    "$ BRAKER=/lisc/scratch/zoology/pycnogonum/genome/draft/braker/braker/braker_proposed.gtf\n",
+    "$ bedtools intersect -wao -b $ISOSEQ -a $BRAKER > brak_iso.bed\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gtf = [\"seqname\", \"source\", \"feature\", \"start\", \"end\", \"score\", \"strand\", \"frame\", \"attributes\"]\n",
+    "\n",
+    "# iso2bra = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/iso_brak.bed\", sep=\"\\t\", header=None)\n",
+    "# iso2bra.columns = [\"iso_\" + x for x in gtf] + [\"bra_\" + x for x in gtf] + [\"overlap\"]\n",
+    "# iso2bra[\"iso_gene\"] = iso2bra[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')\n",
+    "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_attributes\"].str.split(\"\\.\").str[0]\n",
+    "# iso2bra[\"bra_gene\"] = iso2bra[\"bra_gene\"].replace(\"\", np.nan)\n",
+    "\n",
+    "bra2iso = pd.read_csv(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/brak_iso.bed\", sep=\"\\t\", header=None)\n",
+    "bra2iso.columns = [\"bra_\" + x for x in gtf] + [\"iso_\" + x for x in gtf] + [\"overlap\"]\n",
+    "bra2iso[\"bra_gene\"] = bra2iso[\"bra_attributes\"].str.split(\".\").str[0]\n",
+    "bra2iso[\"iso_gene\"] = bra2iso[\"iso_attributes\"].str.extract(r'gene_id \"(.*?)\";')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now to find overlaps:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def contained(df, a_start=\"bra_start\", a_end=\"bra_end\", b_start=\"iso_start\", b_end=\"iso_end\"):\n",
+    "    \"\"\"Checks if sequence A is completely contained within sequence B or vice versa.\n",
+    "\n",
+    "    Parameters\n",
+    "    ----------\n",
+    "    df : pd.DataFrame\n",
+    "        A dataframe of a bedtools intersect file.\n",
+    "    a_start : str, optional\n",
+    "        the start position of sequence A, by default \"bra_start\"\n",
+    "    a_end : str, optional\n",
+    "        the end position of sequence B, by default \"bra_end\"\n",
+    "    b_start : str, optional\n",
+    "        The start position of sequence B, by default \"iso_start\"\n",
+    "    b_end : str, optional\n",
+    "        the end position of sequence B, by default \"iso_end\"\n",
+    "\n",
+    "    Returns\n",
+    "    -------\n",
+    "    bool\n",
+    "        returns True if the sequence is contained, False otherwise\n",
+    "    \"\"\"\n",
+    "    if df[\"bra_strand\"] != df[\"iso_strand\"]:\n",
+    "        return False # make sure that we are comparing genes on the same strand\n",
+    "    if df[b_start] < 0 or df[b_end] < 0 or df[a_start] < 0 or df[a_end] < 0:\n",
+    "        return False\n",
+    "    # first check if the braker gene is completely contained within the isoseq one\n",
+    "    if df[a_start] >= df[b_start] and df[a_end] <= df[b_end]:\n",
+    "        return True\n",
+    "    # then check if the braker gene completely contains the isoseq one\n",
+    "    if df[a_start] <= df[b_start] and df[a_end] >= df[b_end]:\n",
+    "        return True\n",
+    "    return False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genes = bra2iso.groupby(\"bra_gene\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def calc_overlap(df):\n",
+    "    bra_len = df.iloc[0][\"bra_end\"] - df.iloc[0][\"bra_start\"]\n",
+    "    # print(bra_len)\n",
+    "    seq = np.zeros(bra_len, dtype=bool)\n",
+    "    for i, transcript in df.iterrows():\n",
+    "        start = transcript[\"iso_start\"] - transcript[\"bra_start\"]\n",
+    "        start = np.max((0, start))\n",
+    "        end = transcript[\"iso_end\"] - transcript[\"bra_start\"]\n",
+    "        # print(end, len(seq))\n",
+    "        end = np.min([bra_len, end])\n",
+    "        seq[start:end] = True\n",
+    "    return np.sum(seq) / bra_len"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 11451/11451 [00:03<00:00, 2913.12it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "braker_unique = []\n",
+    "significant = []\n",
+    "unknown = []\n",
+    "truncate = []\n",
+    "\n",
+    "\n",
+    "for gene in tqdm(genes.groups):\n",
+    "    try:\n",
+    "        transcripts = genes.get_group(gene)\n",
+    "    except KeyError:\n",
+    "        unknown.append(gene)\n",
+    "        continue\n",
+    "    # first see how many candidate ISO-seq genes this BRAKER gene overlaps with\n",
+    "    # if there is zero, then it means no possible corresponding ISO-seq gene was found\n",
+    "    no_iso = transcripts.groupby(\"iso_gene\").size().shape[0] # how many ISO-seq transcripts match this BRAKER gene?\n",
+    "    if no_iso == 0: \n",
+    "        braker_unique.append(gene)\n",
+    "        continue\n",
+    "    \n",
+    "    # if there is one it means we have an 1-to-1 correspondence\n",
+    "    if no_iso == 1:\n",
+    "        # try to verify that most BRAKER isoforms are contained within the ISO-seq isoforms\n",
+    "        # (or vice versa)\n",
+    "        no_contained = transcripts.apply(contained, axis=1).sum()\n",
+    "        if no_contained / transcripts.shape[0] >= 0.4:\n",
+    "            significant.append(gene)\n",
+    "            continue\n",
+    "\n",
+    "        # test if the ISO-seq isoforms cumulatively cover at least 50% of the BRAKER gene\n",
+    "        total_overlap = transcripts.groupby(\"bra_attributes\").apply(calc_overlap, include_groups=False)\n",
+    "        if np.mean(total_overlap) >= 0.5:\n",
+    "            significant.append(gene)\n",
+    "            continue\n",
+    "\n",
+    "    # if we have arrived here, all else failed and this gene needs to go for manual curation\n",
+    "    unknown.append(gene)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(7591, 3596, 264, np.int64(11451))"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(significant), len(braker_unique), len(unknown), np.sum([len(significant), len(braker_unique), len(unknown)])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "At this point the strategy should be to copy the ISO-seq and \"braker_unique\" gene models from their\n",
+    "respective GTF files to a new, merged one. This should already take us from the 11,451 BRAKER gene\n",
+    "models to 8,904 + 3,442 = 12,346 gene models. We can then screen the unknown modules further."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "isoseq_loc = \"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\"\n",
+    "braker_loc = \"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gtf\""
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def gene_string(gene_id, start, stop, last_line):\n",
+    "    chunks = last_line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    seq_type = \"gene\"\n",
+    "    score = \".\"\n",
+    "    strand = chunks[6]\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={gene_id};gene_id={gene_id};\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def transcript_string(gene_id, transcript_id, line):\n",
+    "    chunks = line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    seq_type = \"mRNA\"\n",
+    "    start = chunks[3]\n",
+    "    stop = chunks[4]\n",
+    "    score = chunks[5]\n",
+    "    strand = chunks[6]\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id};Parent={gene_id}\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def exon_string(transcript_id, line, exon_counter):\n",
+    "    chunks = line.split(\"\\t\")\n",
+    "\n",
+    "    seqid = chunks[0]\n",
+    "    source = chunks[1]\n",
+    "    start = chunks[3]\n",
+    "    stop = chunks[4]\n",
+    "    score = chunks[5]\n",
+    "    strand = chunks[6]\n",
+    "\n",
+    "    seq_type = \"exon\"\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}\"\n",
+    "    exon_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    seq_type = \"CDS\"\n",
+    "    phase=\"0\"\n",
+    "    attributes = f\"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}\"\n",
+    "    CDS_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    return exon_string + CDS_string"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def write_gene(current_gene_id, gene_lines, edit, start, end):\n",
+    "    chunks = gene_lines[0].split(\"\\t\")\n",
+    "    # write the gene line first\n",
+    "    gene = gene_string(current_gene_id, start, end, gene_lines[0])\n",
+    "    edit.write(gene)\n",
+    "    # now write each of the transcripts\n",
+    "    exon_counter = 1\n",
+    "    for l in gene_lines:\n",
+    "        chunks = l.split(\"\\t\")\n",
+    "        if chunks[2] == \"transcript\":\n",
+    "            transcript_id = chunks[8].split(\";\")[1].split(\" \")[2][1:-1]\n",
+    "            transcript = transcript_string(current_gene_id, transcript_id, l)\n",
+    "            edit.write(transcript)\n",
+    "        else:\n",
+    "            exon = exon_string(transcript_id, l, exon_counter)\n",
+    "            edit.write(exon)\n",
+    "            exon_counter += 1\n",
+    "\n",
+    "    edit.write(\"###\\n\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_lines = []\n",
+    "current_gene_id = None\n",
+    "start = np.inf\n",
+    "end = 0\n",
+    "\n",
+    "with open(\"/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff\", \"r\") as isoseq:\n",
+    "    with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/isoseq.gff\", \"w\") as edit:\n",
+    "        for line in isoseq:\n",
+    "            if line.startswith(\"#\"):\n",
+    "                edit.write(line)\n",
+    "                continue\n",
+    "            chunks = line.split(\"\\t\")\n",
+    "            entry_type = chunks[2]\n",
+    "            # print(line)\n",
+    "            if entry_type == \"transcript\":\n",
+    "                gene_id = chunks[8].split(\";\")[0].split(\" \")[1][1:-1]\n",
+    "                if gene_id != current_gene_id:\n",
+    "                    if current_gene_id is None: # first gene\n",
+    "                        current_gene_id = gene_id\n",
+    "                        gene_lines.append(line)\n",
+    "                        continue\n",
+    "                    # new gene! Write it out:\n",
+    "                    write_gene(current_gene_id, gene_lines, edit, start, end)\n",
+    "                    # reset the gene\n",
+    "                    current_gene_id = gene_id\n",
+    "                    gene_lines = []\n",
+    "                    start = int(chunks[3])\n",
+    "                    end = int(chunks[4])\n",
+    "                else:\n",
+    "                    start = min(start, int(chunks[3]))\n",
+    "                    end = max(end, int(chunks[4]))\n",
+    "            gene_lines.append(line)\n",
+    "        # write last gene\n",
+    "        write_gene(current_gene_id, gene_lines, edit, start, end)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_lines = []\n",
+    "current_gene_id = None\n",
+    "\n",
+    "with open(\"/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gff3\", \"r\") as braker:\n",
+    "    with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/braker.gff\", \"w\") as edit:\n",
+    "        for line in braker:\n",
+    "            chunks = line.split(\"\\t\")\n",
+    "            entry_type = chunks[2]\n",
+    "\n",
+    "            if entry_type in [\"start_codon\", \"stop_codon\", \"intron\"]:\n",
+    "                continue\n",
+    "\n",
+    "            if entry_type == \"gene\":\n",
+    "                current_gene_id = chunks[8].split(\";\")[0].split(\"=\")[1]\n",
+    "                line = line.rstrip() + \"gene_id=\" + current_gene_id + \";\\n\"\n",
+    "            \n",
+    "            if current_gene_id in braker_unique:\n",
+    "                edit.write(line)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def plot_overlap(df):\n",
+    "    # plot the overlapping regions as straight lines, color by transcript ID\n",
+    "    start = np.min((df[\"bra_start\"].min(), df[\"iso_start\"].min()))\n",
+    "    end = np.max((df[\"bra_end\"].max(), df[\"iso_end\"].max()))\n",
+    "\n",
+    "    fig, ax = plt.subplots()\n",
+    "\n",
+    "    # for i, transcript in df.iterrows():\n",
+    "    #     ax.hlines(i*2, transcript[\"iso_start\"], transcript[\"iso_end\"], label=transcript[\"iso_gene\"])\n",
+    "    #     ax.hlines(i*2+1, transcript[\"bra_start\"], transcript[\"bra_end\"], label=transcript[\"bra_gene\"])\n",
+    "    ax.set_xlim(start-1000, end+1000)\n",
+    "    ax.legend()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# a function that plots all the unique transcripts in a dataframe, coloring them by gene ID\n",
+    "def plot_overlap(df):\n",
+    "    # first find out the minimum start and maximum end positions of all the real BRAKER genes\n",
+    "    # (i.e. not the -1 values)\n",
+    "    bra_start_min = np.min(df[\"bra_start\"][df[\"bra_start\"] > 0])\n",
+    "    bra_end_max = np.max(df[\"bra_end\"][df[\"bra_end\"] > 0])\n",
+    "    # now do the same for the iso-seq genes\n",
+    "    iso_start_min = np.min(df[\"iso_start\"][df[\"iso_start\"] > 0])\n",
+    "    iso_end_max = np.max(df[\"iso_end\"][df[\"iso_end\"] > 0])\n",
+    "    # the plot should start 1000 bp before the minimum start and end 1000 bp after the maximum end\n",
+    "    start = np.min((bra_start_min, iso_start_min)) - 1000\n",
+    "    end = np.max((bra_end_max, iso_end_max)) + 1000\n",
+    "\n",
+    "    fig, ax = plt.subplots()\n",
+    "\n",
+    "    # get a list of all the unique gene IDs\n",
+    "    gene_list = np.unique(df[\"iso_gene\"].tolist() + df[\"bra_gene\"].tolist())\n",
+    "    # generate a color map for the gene IDs - this should be always less than 10, so we can take\n",
+    "    # the Set1 colormap\n",
+    "    colors = plt.cm.Set1(np.linspace(0, 1, len(gene_list)))\n",
+    "    # create a dictionary that maps gene IDs to colors\n",
+    "    cmap = dict(zip(gene_list, colors))\n",
+    "\n",
+    "    # each transcript should be a straight line, colored by the gene ID. Let's do BRAKER first.\n",
+    "    braker = df.groupby(\"bra_attributes\").first()\n",
+    "    for i, row in braker.iterrows():\n",
+    "        ax.hlines(i, row[\"bra_start\"], row[\"bra_end\"], color=cmap[row[\"bra_gene\"]])\n",
+    "\n",
+    "    # now do the same for the iso-seq genes\n",
+    "    iso = df.groupby(\"iso_attributes\").first()\n",
+    "    for i, row in iso.iterrows():\n",
+    "        try:\n",
+    "            ax.hlines(i, row[\"iso_start\"], row[\"iso_end\"], color=cmap[row[\"iso_gene\"]])\n",
+    "        except KeyError:\n",
+    "            # there might be some empty IDs here if there is no overlap for this transcript\n",
+    "            # ax.hlines(i, start, end, linestyles=\"dashed\", color=\"black\")\n",
+    "            continue\n",
+    "    \n",
+    "    # set the limits of the plot\n",
+    "    ax.set_xlim(start, end)\n",
+    "    # keep track of what chromosome we're in in case we want to go to the genome browser\n",
+    "    ax.set_title(f\"{df['bra_seqname'].iloc[0]}\")\n",
+    "    # save the plot for manual inspection\n",
+    "    plt.savefig(f\"figs/{df['bra_gene'].iloc[0]}.png\")\n",
+    "    # close plot to save memory\n",
+    "    plt.close()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 264/264 [00:14<00:00, 18.59it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "for gene in tqdm(unknown):\n",
+    "    keep = bra2iso[\"bra_gene\"] == gene\n",
+    "    plot_overlap(bra2iso[keep])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Manual inspection of the ISO-seq/BRAKER overlap for these 264 genes identifies another 49 that need\n",
+    "further examination. Those were aligned against uniref90 to check whether the BRAKER portions\n",
+    "correspond to real genes or are spurious."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ascc24",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/06-annotation/annot2-RNA-extract_uncovered.sh b/06-annotation/annot2-RNA-extract_uncovered.sh
new file mode 100644
index 0000000000000000000000000000000000000000..ab9f0fcc1d986672e1163c07850a6446d2af8883
--- /dev/null
+++ b/06-annotation/annot2-RNA-extract_uncovered.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=extr_unannot
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=2G
+#SBATCH --time=2:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.err
+
+module load bedtools
+
+GFF=/lisc/scratch/zoology/pycnogonum/genome/draft/annot_merge/merged.gff3
+OUTPUT=$1
+cd "$OUTPUT" || exit
+
+bedtools intersect -v -a Aligned.out.bam -b $GFF > unannotated.bam
\ No newline at end of file
diff --git a/06-annotation/annot2-RNA-extracts.sh b/06-annotation/annot2-RNA-extracts.sh
new file mode 100644
index 0000000000000000000000000000000000000000..6c2ee1c46fd9be49eb498bc2f3ed2d1fe972dd70
--- /dev/null
+++ b/06-annotation/annot2-RNA-extracts.sh
@@ -0,0 +1,20 @@
+#!/usr/bin/env bash
+
+EXTRACT=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot2-RNA-extract_uncovered.sh
+
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+sbatch $EXTRACT "$BASE/ZYGOTE/"
+sbatch $EXTRACT "$BASE/EARLY_CLEAVAGE/"
+sbatch $EXTRACT "$BASE/EMBRYO0_1/"
+sbatch $EXTRACT "$BASE/EMBRYO3_5/"
+sbatch $EXTRACT "$BASE/EMBRYO9_10/"
+sbatch $EXTRACT "$BASE/EMBRYO3/"
+sbatch $EXTRACT "$BASE/INSTAR1/"
+sbatch $EXTRACT "$BASE/INSTAR2/"
+sbatch $EXTRACT "$BASE/INSTAR3/"
+sbatch $EXTRACT "$BASE/INSTAR4/"
+sbatch $EXTRACT "$BASE/INSTAR5/"
+sbatch $EXTRACT "$BASE/INSTAR6/"
+sbatch $EXTRACT "$BASE/JUV1/"
+sbatch $EXTRACT "$BASE/SUBADULT/"
\ No newline at end of file
diff --git a/06-annotation/annot2-braker.sh b/06-annotation/annot2-braker.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f90d87970caa68c55feae981984bdcf085c2179c
--- /dev/null
+++ b/06-annotation/annot2-braker.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_braker
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=10
+#SBATCH --mem=40G
+#SBATCH --time=20:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
+
+module load conda
+conda activate singularity
+
+# container location
+BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
+# path to the output directory, where all the magic should happen
+OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker2
+mkdir "$OUTPUT" -p || exit
+cd "$OUTPUT" || exit
+
+# copy AUGUSTUS config files for BRAKER
+cp -r ~/repos/Augustus/config .
+
+# define input files
+# the genome assembly
+GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
+# the RNA-seq dataset BAMs
+BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
+ZYGOTE="$BASE/ZYGOTE/unannotated.bam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/unannotated.bam"
+EMBRYO0_1="$BASE/EMBRYO0_1/unannotated.bam"
+EMBRYO3_5="$BASE/EMBRYO3_5/unannotated.bam"
+EMBRYO9_10="$BASE/EMBRYO9_10/unannotated.bam"
+EMBRYO3="$BASE/EMBRYO3/unannotated.bam"
+INSTAR1="$BASE/INSTAR1/unannotated.bam"
+INSTAR2="$BASE/INSTAR2/unannotated.bam"
+INSTAR3="$BASE/INSTAR3/unannotated.bam"
+INSTAR4="$BASE/INSTAR4/unannotated.bam"
+INSTAR5="$BASE/INSTAR5/unannotated.bam"
+INSTAR6="$BASE/INSTAR6/unannotated.bam"
+JUV1="$BASE/JUV1/unannotated.bam"
+SUBADULT="$BASE/SUBADULT/unannotated.bam"
+# the Arthropoda OrthoDB protein sequences
+ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
+
+# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
+# in case it is needed.
+singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
+--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+--genome=$GENOME \
+--prot_seq=$ARTHROPODA \
+--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
+--softmasking \
+--threads 10
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh b/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh
new file mode 100644
index 0000000000000000000000000000000000000000..89bcdcbf726e2222aa22fca6e9243edc3b4498ce
--- /dev/null
+++ b/06-annotation/annot3-RNA-01-extract_complete_ORFs.sh
@@ -0,0 +1,34 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=extr_unannot
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=500M
+#SBATCH --time=17:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.err
+
+module load perl/5.40.0
+module load transdecoder/5.7.1
+
+DENOVO=$1
+OUTPUT=$2
+
+# 228.06user
+# 747.90user
+# say 1000s = 16-17min
+
+mkdir -p "$OUTPUT" || exit 1
+cd "$OUTPUT" || exit 1
+
+# first we need to extract ORFs
+TransDecoder.LongOrfs -t "$DENOVO"
+# now predict the coding regions
+TransDecoder.Predict -t "$DENOVO"
+# now convert the resulting .cds file to a single-line fasta
+# almost instantaneous
+perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' Trinity.fasta.transdecoder.cds > oneline.fasta
+# extract complete ORFs. Almost instantaneous.
+grep -A1 complete oneline.fasta > complete.fasta
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-01-filter_assemblies.sh b/06-annotation/annot3-RNA-01-filter_assemblies.sh
new file mode 100644
index 0000000000000000000000000000000000000000..fcaddea2962360afd4e8b7f08f71f1223743558c
--- /dev/null
+++ b/06-annotation/annot3-RNA-01-filter_assemblies.sh
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+
+FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-01-extract_complete_ORFs.sh
+
+IN=/lisc/project/zoology/pycnogonum/transcriptome/development
+
+EMBRYO3_5_IN=$IN/embryonic_stage3-4/Trinity.fasta
+INSTAR1_IN=$IN/instarI-protonymphon/Trinity.fasta
+INSTAR2_IN=$IN/instar_II/Trinity.fasta
+INSTAR3_IN=$IN/instar_III/Trinity.fasta
+INSTAR4_IN=$IN/instar_IV/Trinity.fasta
+INSTAR5_IN=$IN/instar_V/Trinity.fasta
+INSTAR6_IN=$IN/instar_VI/Trinity.fasta
+JUV1_IN=$IN/juvenile_I/Trinity.fasta
+SUBADULT_IN=$IN/subadult/Trinity.fasta
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_5_OUT=$OUT/EMBRYO3/
+INSTAR1_OUT=$OUT/INSTAR1/
+INSTAR2_OUT=$OUT/INSTAR2/
+INSTAR3_OUT=$OUT/INSTAR3/
+INSTAR4_OUT=$OUT/INSTAR4/
+INSTAR5_OUT=$OUT/INSTAR5/
+INSTAR6_OUT=$OUT/INSTAR6/
+JUV1_OUT=$OUT/JUV1/
+SUBADULT_OUT=$OUT/SUBADULT/
+
+sbatch $FILTER $EMBRYO3_5_IN $EMBRYO3_5_OUT
+sbatch $FILTER $INSTAR1_IN $INSTAR1_OUT
+sbatch $FILTER $INSTAR2_IN $INSTAR2_OUT
+sbatch $FILTER $INSTAR3_IN $INSTAR3_OUT
+sbatch $FILTER $INSTAR4_IN $INSTAR4_OUT
+sbatch $FILTER $INSTAR5_IN $INSTAR5_OUT
+sbatch $FILTER $INSTAR6_IN $INSTAR6_OUT
+sbatch $FILTER $JUV1_IN $JUV1_OUT
+sbatch $FILTER $SUBADULT_IN $SUBADULT_OUT
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-02-map_assemblies.sh b/06-annotation/annot3-RNA-02-map_assemblies.sh
new file mode 100644
index 0000000000000000000000000000000000000000..469933f1d05d8c1ae8f9da25f3501eea04c595ce
--- /dev/null
+++ b/06-annotation/annot3-RNA-02-map_assemblies.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+
+MAP=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-02-map_assembly_single.sh
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_5=$OUT/EMBRYO3
+INSTAR1=$OUT/INSTAR1
+INSTAR2=$OUT/INSTAR2
+INSTAR3=$OUT/INSTAR3
+INSTAR4=$OUT/INSTAR4
+INSTAR5=$OUT/INSTAR5
+INSTAR6=$OUT/INSTAR6
+JUV1=$OUT/JUV1
+SUBADULT=$OUT/SUBADULT
+
+sbatch $MAP $EMBRYO3_5/complete.fasta $EMBRYO3_5
+sbatch $MAP $INSTAR1/complete.fasta $INSTAR1
+sbatch $MAP $INSTAR2/complete.fasta $INSTAR2
+sbatch $MAP $INSTAR3/complete.fasta $INSTAR3
+sbatch $MAP $INSTAR4/complete.fasta $INSTAR4
+sbatch $MAP $INSTAR5/complete.fasta $INSTAR5
+sbatch $MAP $INSTAR6/complete.fasta $INSTAR6
+sbatch $MAP $JUV1/complete.fasta $JUV1
+sbatch $MAP $SUBADULT/complete.fasta $SUBADULT
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-02-map_assembly_single.sh b/06-annotation/annot3-RNA-02-map_assembly_single.sh
new file mode 100644
index 0000000000000000000000000000000000000000..2e4612f4b15daaf6d0dafeb4ca070785eadcb4a3
--- /dev/null
+++ b/06-annotation/annot3-RNA-02-map_assembly_single.sh
@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=mmap_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=3
+#SBATCH --mem=10G
+#SBATCH --time=5:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-map-txome-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-map-txome-%j.err
+
+# with 1% of data and 3 CPUs:
+# [M::main] Real time: 448.742 sec; CPU: 1260.678 sec; Peak RSS: 1.830 GB
+# 1251.43user 9.26system 7:28.76elapsed 280%CPU (0avgtext+0avgdata 1919296maxresident)k
+
+# with 10% of data and 3 CPUs:
+# [M::main] Real time: 1528.465 sec; CPU: 3772.846 sec; Peak RSS: 2.758 GB
+# 3743.86user 29.02system 25:28.51elapsed 246%CPU (0avgtext+0avgdata 2892008maxresident)k
+# BUT THAT WAS WITH THE WRONG ARGUMENT ORDER, it is much faster the other way round
+
+module load conda
+conda activate minimap2-2.28
+
+TXOME=$1
+OUTPUT=$2
+DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
+
+cd "$TMPDIR" || exit
+minimap2 -ax splice:hq $DRAFT "$TXOME" > "$OUTPUT"/txome_assembly.sam
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-03-pack_sam.sh b/06-annotation/annot3-RNA-03-pack_sam.sh
new file mode 100644
index 0000000000000000000000000000000000000000..a81b66e6147734600d90558a0a14da257214fb3c
--- /dev/null
+++ b/06-annotation/annot3-RNA-03-pack_sam.sh
@@ -0,0 +1,24 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_filter
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=2
+#SBATCH --mem=2G
+#SBATCH --time=5:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-filter-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-filter-%j.err
+
+module load samtools
+
+TXOME=$1
+
+cd "$TXOME" || exit
+samtools view -@2 -F 0x900 -q 30 -S -b txome_assembly.sam > txome_assembly.hd.bam
+
+
+module load conda
+conda activate agat-1.4.0
+
+agat_convert_minimap2_bam2gff.pl -i txome_assembly.hd.bam -o txome_assembly.gff
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-03-pack_sams.sh b/06-annotation/annot3-RNA-03-pack_sams.sh
new file mode 100644
index 0000000000000000000000000000000000000000..149d0554548707ffa6335c8d8858b761a9981ca4
--- /dev/null
+++ b/06-annotation/annot3-RNA-03-pack_sams.sh
@@ -0,0 +1,25 @@
+#!/usr/bin/env bash
+
+FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-03-pack_sam.sh
+
+OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
+
+EMBRYO3_OUT=$OUT/EMBRYO3/
+INSTAR1_OUT=$OUT/INSTAR1/
+INSTAR2_OUT=$OUT/INSTAR2/
+INSTAR3_OUT=$OUT/INSTAR3/
+INSTAR4_OUT=$OUT/INSTAR4/
+INSTAR5_OUT=$OUT/INSTAR5/
+INSTAR6_OUT=$OUT/INSTAR6/
+JUV1_OUT=$OUT/JUV1/
+SUBADULT_OUT=$OUT/SUBADULT/
+
+sbatch $FILTER $EMBRYO3_OUT
+sbatch $FILTER $INSTAR1_OUT
+sbatch $FILTER $INSTAR2_OUT
+sbatch $FILTER $INSTAR3_OUT
+sbatch $FILTER $INSTAR4_OUT
+sbatch $FILTER $INSTAR5_OUT
+sbatch $FILTER $INSTAR6_OUT
+sbatch $FILTER $JUV1_OUT
+sbatch $FILTER $SUBADULT_OUT
\ No newline at end of file
diff --git a/06-annotation/annot3-RNA-04-denovo_mining.ipynb b/06-annotation/annot3-RNA-04-denovo_mining.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..0e096cb616e3a0c7d42ad4569d286eff1d52ad6c
--- /dev/null
+++ b/06-annotation/annot3-RNA-04-denovo_mining.ipynb
@@ -0,0 +1,426 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Mining the _de novo_ transcriptomes to improve genome annotation\n",
+    "\n",
+    "An ISO-seq transcriptome and two rounds of BRAKER3 have produced close to 14,500 gene models for the\n",
+    "_P. litorale_ genome. However, manual inspection of the RNA-seq peaks in the genome browser revealed\n",
+    "that there are still plenty of loci with plausible gene locations that did not receive a gene model.\n",
+    "\n",
+    "To alleviate that I will try to use the _de novo_ transcriptomes that were generated from the same \n",
+    "RNA-seq data that was used for the genome annotation. The idea is to map onto the draft genome and \n",
+    "keep the transcripts that don't overlap with any gene models, then gather evidence from multiple\n",
+    "transcriptomes to generate gene models."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash\n",
+    "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/\n",
+    "cat isoseq.gff > merged.gff3\n",
+    "cat braker.gff >> merged.gff3\n",
+    "cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3\n",
+    "\n",
+    "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes\n",
+    "DRAFT=/Volumes/scratch/pycnogonum/genome/draft/annot_merge/merged.gff3\n",
+    "BASE=/Volumes/scratch/pycnogonum/genome/draft/transcriptome\n",
+    "\n",
+    "EMBRYO3=$BASE/EMBRYO3/txome_assembly.gff\n",
+    "INSTAR1=$BASE/INSTAR1/txome_assembly.gff\n",
+    "INSTAR2=$BASE/INSTAR2/txome_assembly.gff\n",
+    "INSTAR3=$BASE/INSTAR3/txome_assembly.gff\n",
+    "INSTAR4=$BASE/INSTAR4/txome_assembly.gff\n",
+    "INSTAR5=$BASE/INSTAR5/txome_assembly.gff\n",
+    "INSTAR6=$BASE/INSTAR6/txome_assembly.gff\n",
+    "JUV1=$BASE/JUV1/txome_assembly.gff\n",
+    "SUBADULT=$BASE/SUBADULT/txome_assembly.gff\n",
+    "\n",
+    "bedtools intersect -v -a $EMBRYO3 -b $DRAFT | sed s/TRINITY/EMBRYO3/g > EMBRYO3.gff\n",
+    "bedtools intersect -v -a $INSTAR1 -b $DRAFT | sed s/TRINITY/INSTAR1/g > INSTAR1.gff\n",
+    "bedtools intersect -v -a $INSTAR2 -b $DRAFT | sed s/TRINITY/INSTAR2/g > INSTAR2.gff\n",
+    "bedtools intersect -v -a $INSTAR3 -b $DRAFT | sed s/TRINITY/INSTAR3/g > INSTAR3.gff\n",
+    "bedtools intersect -v -a $INSTAR4 -b $DRAFT | sed s/TRINITY/INSTAR4/g > INSTAR4.gff\n",
+    "bedtools intersect -v -a $INSTAR5 -b $DRAFT | sed s/TRINITY/INSTAR5/g > INSTAR5.gff\n",
+    "bedtools intersect -v -a $INSTAR6 -b $DRAFT | sed s/TRINITY/INSTAR6/g > INSTAR6.gff\n",
+    "bedtools intersect -v -a $JUV1 -b $DRAFT | sed s/TRINITY/JUV1/g > JUV1.gff\n",
+    "bedtools intersect -v -a $SUBADULT -b $DRAFT | sed s/TRINITY/SUBADULT/g > SUBADULT.gff"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now we can look at the overlap of all of these GFF files, again using `bedtools intersect`. Since\n",
+    "this is already tenuous evidence for gene models we should be as strict as possible."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%bash \n",
+    "cd /Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes\n",
+    "\n",
+    "bedtools intersect -f 0.9 -F 0.9 -wao -a INSTAR3.gff -b EMBRYO3.gff INSTAR1.gff INSTAR2.gff INSTAR4.gff INSTAR5.gff INSTAR6.gff JUV1.gff SUBADULT.gff > overlap.gff"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tqdm import tqdm \n",
+    "\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from matplotlib import pyplot as plt\n",
+    "from matplotlib.lines import Line2D"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "file_order = {\n",
+    "    \"1\": \"EMBRYO3\",\n",
+    "    \"2\": \"INSTAR1\",\n",
+    "    \"3\": \"INSTAR2\",\n",
+    "    \"4\": \"INSTAR4\",\n",
+    "    \"5\": \"INSTAR5\",\n",
+    "    \"6\": \"INSTAR6\",\n",
+    "    \"7\": \"JUV1\",\n",
+    "    \"8\": \"SUBADULT\",\n",
+    "}\n",
+    "\n",
+    "# assign the tab10 colors to each stage\n",
+    "colors = {\n",
+    "    \"EMBRYO3\": \"tab:blue\",\n",
+    "    \"INSTAR1\": \"tab:orange\",\n",
+    "    \"INSTAR2\": \"tab:green\",\n",
+    "    \"INSTAR4\": \"tab:red\",\n",
+    "    \"INSTAR5\": \"tab:purple\",\n",
+    "    \"INSTAR6\": \"tab:brown\",\n",
+    "    \"JUV1\": \"tab:pink\",\n",
+    "    \"SUBADULT\": \"tab:gray\",\n",
+    "}\n",
+    "\n",
+    "legend_elements = [\n",
+    "    Line2D([0], [0], marker='o', color='w', label='EMBRYO3', markerfacecolor='tab:blue', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='INSTAR1', markerfacecolor='tab:orange', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='INSTAR2', markerfacecolor='tab:green', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='INSTAR4', markerfacecolor='tab:red', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='INSTAR5', markerfacecolor='tab:purple', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='INSTAR6', markerfacecolor='tab:brown', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='JUV1', markerfacecolor='tab:pink', markersize=8),\n",
+    "    Line2D([0], [0], marker='o', color='w', label='SUBADULT', markerfacecolor='tab:gray', markersize=8),\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/var/folders/md/d6lwwbv97xb6g6ddypntnprh0000gp/T/ipykernel_61764/3506524708.py:1: DtypeWarning: Columns (9,15) have mixed types. Specify dtype option on import or set low_memory=False.\n",
+      "  overlap = pd.read_csv('/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap.gff', sep='\\t', header=None)\n"
+     ]
+    }
+   ],
+   "source": [
+    "overlap = pd.read_csv('/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap.gff', sep='\\t', header=None)\n",
+    "overlap = overlap.drop(columns=[1, 11])\n",
+    "overlap[\"instar3_match\"] = overlap[8].str.split(';').str[0].str.split('=').str[1].str.split(\".\").str[0]\n",
+    "overlap[\"target_match\"] = overlap[18].str.split(';').str[0].str.split('=').str[1].str.split(\".\").str[0]\n",
+    "overlap[\"target_dataset\"] = overlap[9].astype(str).replace(file_order)\n",
+    "overlap[\"target_id\"] = overlap[\"target_dataset\"].astype(str) + \"_\" + overlap[\"target_match\"].astype(str)\n",
+    "# we should only keep lines that match similar things (cDNA_match or cDNA_match_part)\n",
+    "keep = overlap[2] == overlap[12]\n",
+    "overlap = overlap[keep]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(2635957, 22)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "overlap.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gene_level_matches = overlap[overlap[2] == 'cDNA_match']['instar3_match'].unique()\n",
+    "overlap = overlap[overlap[\"instar3_match\"].isin(gene_level_matches)]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(531784, 22)"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "overlap.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "overview = overlap[overlap[2] == \"cDNA_match\"].groupby(\"instar3_match\").apply(lambda x: x[\"target_dataset\"].value_counts()>0, include_groups=False).unstack()\n",
+    "overview[\"total_score\"] = overview.sum(axis=1)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAkAAAAGdCAYAAAD60sxaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjkuMCwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy80BEi2AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAneUlEQVR4nO3dfXSU9Z3//9fklhAyCUGSQCHhRjCEG1nAwgi2FVJSyHJQOF20USKwdXUDAilUqFRcqAbpijdb7uxiAksRy1ZsxXKPhVWC3AjIzR7uvAmaTMIpkCHxkITk+v3hYX47Bb8lw5VcST7PxznXOc51XTPz/lhOeXrNNYnLsixLAAAABglxegAAAIDGRgABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAME6Y0wM0BXV1dSouLlZMTIxcLpfT4wAAgFtgWZauXLmijh07KiSkftd0CCBJxcXF6ty5s9NjAACAIJw/f16dOnWq13MIIEkxMTGSvvkX6Ha7HZ4GAADcCp/Pp86dO/v/Hq8PAkjyf+zldrsJIAAAmplgbl/hJmgAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABgnzOkBAAAwXZc57zk9Qr19vijT6RFuC1eAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcRwPoueeek8vlCthSU1P9x69evaqcnBy1a9dObdq00fjx41VaWhrwGkVFRcrMzFTr1q2VkJCg2bNn69q1a429FAAA0IyEOT1A7969tWPHDv/jsLD/f6SZM2fqvffe04YNGxQbG6upU6dq3Lhx+vDDDyVJtbW1yszMVFJSkvbu3auSkhJNnDhR4eHheuGFFxp9LQAAoHlwPIDCwsKUlJR0w/7y8nKtWrVK69at0/DhwyVJ+fn56tWrl/bt26chQ4Zo27ZtOnnypHbs2KHExET1799fCxcu1NNPP63nnntOERERjb0cAADQDDh+D9CZM2fUsWNHdevWTVlZWSoqKpIkHTp0SDU1NUpPT/efm5qaquTkZBUWFkqSCgsL1bdvXyUmJvrPycjIkM/n04kTJ771PauqquTz+QI2AABgDkcDaPDgwSooKNCWLVu0fPlyffbZZ7rvvvt05coVeb1eRUREKC4uLuA5iYmJ8nq9kiSv1xsQP9ePXz/2bfLy8hQbG+vfOnfubO/CAABAk+boR2CjRo3y/3O/fv00ePBgpaSk6Pe//72ioqIa7H3nzp2r3Nxc/2Ofz0cEAQBgEMc/Avu/4uLi1LNnT509e1ZJSUmqrq7W5cuXA84pLS313zOUlJR0w7fCrj++2X1F10VGRsrtdgdsAADAHE0qgCoqKnTu3Dl16NBBAwcOVHh4uHbu3Ok/furUKRUVFcnj8UiSPB6Pjh07prKyMv8527dvl9vtVlpaWqPPDwAAmgdHPwKbNWuWxowZo5SUFBUXF2v+/PkKDQ3Vww8/rNjYWE2ZMkW5ubmKj4+X2+3WtGnT5PF4NGTIEEnSyJEjlZaWpkcffVSLFy+W1+vVvHnzlJOTo8jISCeXBgAAmjBHA+jLL7/Uww8/rL/+9a9q3769hg0bpn379ql9+/aSpJdfflkhISEaP368qqqqlJGRoWXLlvmfHxoaqk2bNunJJ5+Ux+NRdHS0srOztWDBAqeWBAAAmgGXZVmW00M4zefzKTY2VuXl5dwPBABodF3mvOf0CPX2+aJMp0e4rb+/m9Q9QAAAAI2BAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYJwmE0CLFi2Sy+XSjBkz/PuuXr2qnJwctWvXTm3atNH48eNVWloa8LyioiJlZmaqdevWSkhI0OzZs3Xt2rVGnh4AADQnTSKADhw4oJUrV6pfv34B+2fOnKl3331XGzZs0O7du1VcXKxx48b5j9fW1iozM1PV1dXau3evVq9erYKCAj377LONvQQAANCMOB5AFRUVysrK0m9/+1u1bdvWv7+8vFyrVq3SkiVLNHz4cA0cOFD5+fnau3ev9u3bJ0natm2bTp48qbVr16p///4aNWqUFi5cqKVLl6q6utqpJQEAgCbO8QDKyclRZmam0tPTA/YfOnRINTU1AftTU1OVnJyswsJCSVJhYaH69u2rxMRE/zkZGRny+Xw6ceJE4ywAAAA0O2FOvvn69ev18ccf68CBAzcc83q9ioiIUFxcXMD+xMREeb1e/zn/N36uH79+7NtUVVWpqqrK/9jn8wW7BAAA0Aw5dgXo/Pnzmj59un73u9+pVatWjfreeXl5io2N9W+dO3du1PcHAADOciyADh06pLKyMg0YMEBhYWEKCwvT7t279dprryksLEyJiYmqrq7W5cuXA55XWlqqpKQkSVJSUtIN3wq7/vj6OTczd+5clZeX+7fz58/buzgAANCkORZAI0aM0LFjx3TkyBH/NmjQIGVlZfn/OTw8XDt37vQ/59SpUyoqKpLH45EkeTweHTt2TGVlZf5ztm/fLrfbrbS0tG9978jISLnd7oANAACYw7F7gGJiYtSnT5+AfdHR0WrXrp1//5QpU5Sbm6v4+Hi53W5NmzZNHo9HQ4YMkSSNHDlSaWlpevTRR7V48WJ5vV7NmzdPOTk5ioyMbPQ1AQCA5sHRm6D/npdfflkhISEaP368qqqqlJGRoWXLlvmPh4aGatOmTXryySfl8XgUHR2t7OxsLViwwMGpAQBAU+eyLMtyegin+Xw+xcbGqry8nI/DAACNrsuc95weod4+X5Tp9Ai39fe34z8HCAAAoLERQAAAwDgEEAAAMA4BBAAAjEMAAQAA4zTpr8EDAMA3pNAQuAIEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAA
CMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwTlAB9Omnn9o9BwAAQKMJKoDuvPNO3X///Vq7dq2uXr1q90wAAAANKqgA+vjjj9WvXz/l5uYqKSlJ//Iv/6L9+/fbPRsAAECDCCqA+vfvr1dffVXFxcV64403VFJSomHDhqlPnz5asmSJLly4YPecAAAAtrmtm6DDwsI0btw4bdiwQS+++KLOnj2rWbNmqXPnzpo4caJKSkrsmhMAAMA2txVABw8e1L/+67+qQ4cOWrJkiWbNmqVz585p+/btKi4u1tixY+2aEwAAwDZhwTxpyZIlys/P16lTpzR69GitWbNGo0ePVkjINz3VtWtXFRQUqEuXLnbOCgAAYIugAmj58uWaPHmyHnvsMXXo0OGm5yQkJGjVqlW3NRwAAEBDCCqAzpw583fPiYiIUHZ2djAvDwAA0KCCugcoPz9fGzZsuGH/hg0btHr16tseCgAAoCEFdQUoLy9PK1euvGF/QkKCHn/8ca78GKbLnPecHqHePl+U6fQIAAAHBRVARUVF6tq16w37U1JSVFRUdNtDtSTEAQAATU9QH4ElJCTok08+uWH/0aNH1a5du9seCgAAoCEFFUAPP/ywnnrqKb3//vuqra1VbW2tdu3apenTp+uhhx6ye0YAAABbBfUR2MKFC/X5559rxIgRCgv75iXq6uo0ceJEvfDCC7YOCAAAYLegAigiIkJvvfWWFi5cqKNHjyoqKkp9+/ZVSkqK3fMBAADYLqgAuq5nz57q2bOnXbMAAAA0iqACqLa2VgUFBdq5c6fKyspUV1cXcHzXrl22DAcAANAQggqg6dOnq6CgQJmZmerTp49cLpfdcwEAADSYoAJo/fr1+v3vf6/Ro0fbPQ8AAECDC+pr8BEREbrzzjvtngUAAKBRBBVAP/vZz/Tqq6/Ksiy75wEAAGhwQX0E9sEHH+j999/X5s2b1bt3b4WHhwccf/vtt20ZDgAAoCEEdQUoLi5ODz74oL7//e/rjjvuUGxsbMB2q5YvX65+/frJ7XbL7XbL4/Fo8+bN/uNXr15VTk6O2rVrpzZt2mj8+PEqLS0NeI2ioiJlZmaqdevWSkhI0OzZs3Xt2rVglgUAAAwR1BWg/Px8W968U6dOWrRokXr06CHLsrR69WqNHTtWhw8fVu/evTVz5ky999572rBhg2JjYzV16lSNGzdOH374oaRvvo6fmZmppKQk7d27VyUlJZo4caLCw8P5idQAAOBbBXUFSJKuXbumHTt2aOXKlbpy5Yokqbi4WBUVFbf8GmPGjNHo0aPVo0cP9ezZU88//7zatGmjffv2qby8XKtWrdKSJUs0fPhwDRw4UPn5+dq7d6/27dsnSdq2bZtOnjyptWvXqn///ho1apQWLlyopUuXqrq6OtilAQCAFi6oAPriiy/Ut29fjR07Vjk5Obpw4YIk6cUXX9SsWbOCGqS2tlbr169XZWWlPB6PDh06pJqaGqWnp/vPSU1NVXJysgoLCyVJhYWF6tu3rxITE/3nZGRkyOfz6cSJE9/6XlVVVfL5fAEbAAAwR1ABNH36dA0aNEiXLl1SVFSUf/+DDz6onTt31uu1jh07pjZt2igyMlJPPPGENm7cqLS0NHm9XkVERCguLi7g/MTERHm9XkmS1+sNiJ/rx68f+zZ5eXkB9yx17ty5XjMDAIDmLah7gP7nf/5He/fuVURERMD+Ll266KuvvqrXa9111106cuSIysvL9d///d/Kzs7W7t27gxnrls2dO1e5ubn+xz6fjwgCAMAgQQVQXV2damtrb9j/5ZdfKiYmpl6v9X9/qOLAgQN14MABvfrqq5owYYKqq6t1+fLlgKtApaWlSkpKkiQlJSVp//79Aa93/Vti18+5mcjISEVGRtZrTgAA0HIE9RHYyJEj9corr/gfu1wuVVRUaP78+bf96zHq6upUVVWlgQMHKjw8POAjtVOnTqmoqEgej0eS5PF4dOzYMZWVlfnP2b59u9xut9LS0m5rDgAA0HIFdQXopZdeUkZGhtLS0nT16lX95Cc/0ZkzZ3THHXfozTffvOXXmTt3rkaNGqXk5GRduXJF69at01/+8hdt3bpVsbGxmjJlinJzcxUfHy+3261p06bJ4/FoyJAhkr4JsbS0ND366KNavHixvF6v5s2bp5ycHK7wAACAbxVUAHXq1ElHjx7V+vXr9cknn6iiokJTpkxRVlZWwE3Rf09ZWZkmTpyokpISxcbGql+/ftq6dat++MMfSpJefvllhYSEaPz48aqqqlJGRoaWLVvmf35oaKg2bdqkJ598Uh6PR9HR0crOztaCBQuCWRYAADBEUAEkSWFhYXrkkUdu681XrVr1/zzeqlUrLV26VEuXLv3Wc1JSUvTnP//5tuYAAABmCSqA1qxZ8/88PnHixKCGAZqiLnPec3qEevt8UabTIwBAkxZUAE2fPj3gcU1Njb7++mtFRESodevWBBAAAGjSgvoW2KVLlwK2iooKnTp1SsOGDavXTdAAAABOCPp3gf2tHj16aNGiRTdcHQIAAGhqbAsg6Zsbo4uLi+18SQAAANsFdQ/Qn/70p4DHlmWppKREv/nNbzR06FBbBgPQOLjJG4CJggqgBx54IOCxy+VS+/btNXz4cL300kt2zAUAtiHyAPytoH8XGAAAQHNl6z1AAAAAzUFQV4Byc3Nv+dwlS5YE8xYAAAANJqgAOnz4sA4fPqyamhrdddddkqTTp08rNDRUAwYM8J/ncrnsmRIAAMBGQQXQmDFjFBMTo9WrV6tt27aSvvnhiJMmTdJ9992nn/3sZ7YOCQAAYKeg7gF66aWXlJeX548fSWrbtq1+9atf8S0wAADQ5AUVQD6fTxcuXLhh/4ULF3TlypXbHgoAAKAhBRVADz74oCZNmqS3335bX375pb788kv94Q9/0JQpUzRu3Di7ZwQAALBVUPcArVixQrNmzdJPfvIT1dTUfPNCYWGaMmWKfv3rX9s6IAAAgN2CCqDWrVtr2bJl+vWvf61z585Jkrp3767o6GhbhwMAAGgIt/WDEEtKSlRSUqIePXooOjpalmXZNRcAAECDCSqA/vrXv2rEiBHq2bOnRo8erZKSEknSlClT+Ao8AABo8oIKoJkzZyo8PFxFRUVq3bq1f/+ECRO0ZcsW24YDAABoCEHdA7Rt2zZt3bpVnTp1Ctjfo0cPffHFF7YMBgAA0FCCugJUWVkZcOXnuosXLyoyMvK2hwIAAGhIQQXQfffdpzVr1vgfu1wu1dXVafHixbr//vttGw4AAKAhBPUR2OLFizVixAgdPHhQ1dXV+vnPf64TJ07o4sWL+vDDD+2eEQAAwFZBXQHq06ePTp8+rWHDhmns2LGqrKzUuHHjdPjwYXXv3t3uGQEAAGxV7ytANTU1+tGPfqQVK1bomWeeaYiZAAAAGlS9Ayg8PFyffPJJQ8wCAAhClznvOT1CvX2+KNPpEWC4oD4Ce+SRR7Rq1Sq7ZwEAAGgUQd0Efe3aN
b3xxhvasWOHBg4ceMPvAFuyZIktwwEAADSEegXQp59+qi5duuj48eMaMGCAJOn06dMB57hcLvumAwAAaAD1CqAePXqopKRE77//vqRvfvXFa6+9psTExAYZDgAAoCHU6x6gv/1t75s3b1ZlZaWtAwEAADS0oG6Cvu5vgwgAAKA5qFcAuVyuG+7x4Z4fAADQ3NTrHiDLsvTYY4/5f+Hp1atX9cQTT9zwLbC3337bvgkBAABsVq8Ays7ODnj8yCOP2DoMAABAY6hXAOXn5zfUHAAAAI3mtm6CBgAAaI4IIAAAYBwCCAAAGIcAAgAAxiGAAACAcQggAABgHAIIAAAYhwACAADGIYAAAIBxCCAAAGAcAggAABiHAAIAAMYhgAAAgHEIIAAAYBwCCAAAGMfRAMrLy9M999yjmJgYJSQk6IEHHtCpU6cCzrl69apycnLUrl07tWnTRuPHj1dpaWnAOUVFRcrMzFTr1q2VkJCg2bNn69q1a425FAAA0Iw4GkC7d+9WTk6O9u3bp+3bt6umpkYjR45UZWWl/5yZM2fq3Xff1YYNG7R7924VFxdr3Lhx/uO1tbXKzMxUdXW19u7dq9WrV6ugoEDPPvusE0sCAADNQJiTb75ly5aAxwUFBUpISNChQ4f0ve99T+Xl5Vq1apXWrVun4cOHS5Ly8/PVq1cv7du3T0OGDNG2bdt08uRJ7dixQ4mJierfv78WLlyop59+Ws8995wiIiKcWBoAAGjCmtQ9QOXl5ZKk+Ph4SdKhQ4dUU1Oj9PR0/zmpqalKTk5WYWGhJKmwsFB9+/ZVYmKi/5yMjAz5fD6dOHHipu9TVVUln88XsAEAAHM0mQCqq6vTjBkzNHToUPXp00eS5PV6FRERobi4uIBzExMT5fV6/ef83/i5fvz6sZvJy8tTbGysf+vcubPNqwEAAE1ZkwmgnJwcHT9+XOvXr2/w95o7d67Ky8v92/nz5xv8PQEAQNPh6D1A102dOlWbNm3Snj171KlTJ//+pKQkVVdX6/LlywFXgUpLS5WUlOQ/Z//+/QGvd/1bYtfP+VuRkZGKjIy0eRUAAKC5cPQKkGVZmjp1qjZu3Khdu3apa9euAccHDhyo8PBw7dy507/v1KlTKioqksfjkSR5PB4dO3ZMZWVl/nO2b98ut9uttLS0xlkIAABoVhy9ApSTk6N169bpj3/8o2JiYvz37MTGxioqKkqxsbGaMmWKcnNzFR8fL7fbrWnTpsnj8WjIkCGSpJEjRyotLU2PPvqoFi9eLK/Xq3nz5iknJ4erPAAA4KYcDaDly5dLkn7wgx8E7M/Pz9djjz0mSXr55ZcVEhKi8ePHq6qqShkZGVq2bJn/3NDQUG3atElPPvmkPB6PoqOjlZ2drQULFjTWMgAAQDPjaABZlvV3z2nVqpWWLl2qpUuXfus5KSkp+vOf/2znaAAAoAVrMt8CAwAAaCwEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOI4G0J49ezRmzBh17NhRLpdL77zzTsBxy7L07LPPqkOHDoqKilJ6errOnDkTcM7FixeVlZUlt9utuLg4TZkyRRUVFY24CgAA0Nw4GkCVlZW6++67tXTp0pseX7x4sV577TWtWLFCH330kaKjo5WRkaGrV6/6z8nKytKJEye0fft2bdq0SXv27NHjjz/eWEsAAADNUJiTbz5q1CiNGjXqpscsy9Irr7yiefPmaezYsZKkNWvWKDExUe+8844eeugh/e///q+2bNmiAwcOaNCgQZKk//iP/9Do0aP17//+7+rYsWOjrQUAADQfTfYeoM8++0xer1fp6en+fbGxsRo8eLAKCwslSYWFhYqLi/PHjySlp6crJCREH3300be+dlVVlXw+X8AGAADM0WQDyOv1SpISExMD9icmJvqPeb1eJSQkBBwPCwtTfHy8/5ybycvLU2xsrH/r3LmzzdMDAICmrMkGUEOaO3euysvL/dv58+edHgkAADSiJhtASUlJkqTS0tKA/aWlpf5jSUlJKisrCzh+7do1Xbx40X/OzURGRsrtdgdsAADAHE02gLp27aqkpCTt3LnTv8/n8+mjjz6Sx+ORJHk8Hl2+fFmHDh3yn7Nr1y7V1dVp8ODBjT4zAABoHhz9FlhFRYXOnj3rf/zZZ5/pyJEjio+PV3JysmbMmKFf/epX6tGjh7p27apf/vKX6tixox544AFJUq9evfSjH/1IP/3pT7VixQrV1NRo6tSpeuihh/gGGAAA+FaOBtDBgwd1//33+x/n5uZKkrKzs1VQUKCf//znqqys1OOPP67Lly9r2LBh2rJli1q1auV/zu9+9ztNnTpVI0aMUEhIiMaPH6/XXnut0dcCAACaD0cD6Ac/+IEsy/rW4y6XSwsWLNCCBQu+9Zz4+HitW7euIcYDAAAtVJO9BwgAAKChEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4BBAAADAOAQQAAIzTYgJo6dKl6tKli1q1aqXBgwdr//79To8EAACaqBYRQG+99ZZyc3M1f/58ffzxx7r77ruVkZGhsrIyp0cDAABNUIsIoCVLluinP/2pJk2apLS0NK1YsUKtW7fWG2+84fRoAACgCQpzeoDbVV1drUOHDmnu3Ln+fSEhIUpPT1dhYeFNn1NVVaWqqir/4/LyckmSz+ezfb66qq9tf82GVt9/Dy19jayv6eHPaCDW1/TwZ7RxZ7Asq/5Ptpq5r776ypJk7d27N2D/7Nmzre9+97s3fc78+fMtSWxsbGxsbGwtYDt37ly9+6HZXwEKxty5c5Wbm+t/XFdXp4sXL6pdu3ZyuVwOTnbrfD6fOnfurPPnz8vtdjs9ju1Y
X/PX0tfY0tcntfw1sr7mr7y8XMnJyYqPj6/3c5t9AN1xxx0KDQ1VaWlpwP7S0lIlJSXd9DmRkZGKjIwM2BcXF9dQIzYot9vdYv9gS6yvJWjpa2zp65Na/hpZX/MXElL/W5qb/U3QERERGjhwoHbu3OnfV1dXp507d8rj8Tg4GQAAaKqa/RUgScrNzVV2drYGDRqk7373u3rllVdUWVmpSZMmOT0aAABoglpEAE2YMEEXLlzQs88+K6/Xq/79+2vLli1KTEx0erQGExkZqfnz59/wUV5Lwfqav5a+xpa+Pqnlr5H1NX+3s0aXZQXz3TEAAIDmq9nfAwQAAFBfBBAAADAOAQQAAIxDAAEAAOMQQM3Mnj17NGbMGHXs2FEul0vvvPOO0yPZKi8vT/fcc49iYmKUkJCgBx54QKdOnXJ6LNssX75c/fr18/9gMo/Ho82bNzs9VoNZtGiRXC6XZsyY4fQotnnuuefkcrkCttTUVKfHstVXX32lRx55RO3atVNUVJT69u2rgwcPOj2Wbbp06XLD/4Yul0s5OTlOj2aL2tpa/fKXv1TXrl0VFRWl7t27a+HChcH9vqwm6sqVK5oxY4ZSUlIUFRWle++9VwcOHKjXa7SIr8GbpLKyUnfffbcmT56scePGOT2O7Xbv3q2cnBzdc889unbtmn7xi19o5MiROnnypKKjo50e77Z16tRJixYtUo8ePWRZllavXq2xY8fq8OHD6t27t9Pj2erAgQNauXKl+vXr5/Qotuvdu7d27NjhfxwW1nL+r/TSpUsaOnSo7r//fm3evFnt27fXmTNn1LZtW6dHs82BAwdUW1vrf3z8+HH98Ic/1I9//GMHp7LPiy++qOXLl2v16tXq3bu3Dh48qEmTJik2NlZPPfWU0+PZ4p//+Z91/Phx/dd//Zc6duyotWvXKj09XSdPntR3vvOdW3uRev/2MDQZkqyNGzc6PUaDKisrsyRZu3fvdnqUBtO2bVvrP//zP50ew1ZXrlyxevToYW3fvt36/ve/b02fPt3pkWwzf/586+6773Z6jAbz9NNPW8OGDXN6jEY1ffp0q3v37lZdXZ3To9giMzPTmjx5csC+cePGWVlZWQ5NZK+vv/7aCg0NtTZt2hSwf8CAAdYzzzxzy6/DR2Bo0srLyyUpqF9019TV1tZq/fr1qqysbHG/tiUnJ0eZmZlKT093epQGcebMGXXs2FHdunVTVlaWioqKnB7JNn/60580aNAg/fjHP1ZCQoL+4R/+Qb/97W+dHqvBVFdXa+3atZo8eXKz+WXYf8+9996rnTt36vTp05Kko0eP6oMPPtCoUaMcnswe165dU21trVq1ahWwPyoqSh988MEtv07LuW6LFqeurk4zZszQ0KFD1adPH6fHsc2xY8fk8Xh09epVtWnTRhs3blRaWprTY9lm/fr1+vjjj+v9eXxzMXjwYBUUFOiuu+5SSUmJ/u3f/k333Xefjh8/rpiYGKfHu22ffvqpli9frtzcXP3iF7/QgQMH9NRTTykiIkLZ2dlOj2e7d955R5cvX9Zjjz3m9Ci2mTNnjnw+n1JTUxUaGqra2lo9//zzysrKcno0W8TExMjj8WjhwoXq1auXEhMT9eabb6qwsFB33nnnrb+Q3Zem0HjUwj8Ce+KJJ6yUlBTr/PnzTo9iq6qqKuvMmTPWwYMHrTlz5lh33HGHdeLECafHskVRUZGVkJBgHT161L+vpX0E9rcuXbpkud3uFvMxZnh4uOXxeAL2TZs2zRoyZIhDEzWskSNHWv/4j//o9Bi2evPNN61OnTpZb775pvXJJ59Ya9asseLj462CggKnR7PN2bNnre9973uWJCs0NNS65557rKysLCs1NfWWX4MrQGiSpk6dqk2bNmnPnj3q1KmT0+PYKiIiwv9fKQMHDtSBAwf06quvauXKlQ5PdvsOHTqksrIyDRgwwL+vtrZWe/bs0W9+8xtVVVUpNDTUwQntFxcXp549e+rs2bNOj2KLDh063HBFslevXvrDH/7g0EQN54svvtCOHTv09ttvOz2KrWbPnq05c+booYcekiT17dtXX3zxhfLy8lrMVbzu3btr9+7dqqyslM/nU4cOHTRhwgR169btll+De4DQpFiWpalTp2rjxo3atWuXunbt6vRIDa6urk5VVVVOj2GLESNG6NixYzpy5Ih/GzRokLKysnTkyJEWFz+SVFFRoXPnzqlDhw5Oj2KLoUOH3vCjJ06fPq2UlBSHJmo4+fn5SkhIUGZmptOj2Orrr79WSEjgX++hoaGqq6tzaKKGEx0drQ4dOujSpUvaunWrxo4de8vP5QpQM1NRURHwX5qfffaZjhw5ovj4eCUnJzs4mT1ycnK0bt06/fGPf1RMTIy8Xq8kKTY2VlFRUQ5Pd/vmzp2rUaNGKTk5WVeuXNG6dev0l7/8RVu3bnV6NFvExMTccL9WdHS02rVr12Lu45o1a5bGjBmjlJQUFRcXa/78+QoNDdXDDz/s9Gi2mDlzpu6991698MIL+qd/+ift379fr7/+ul5//XWnR7NVXV2d8vPzlZ2d3aJ+jIEkjRkzRs8//7ySk5PVu3dvHT58WEuWLNHkyZOdHs02W7dulWVZuuuuu3T27FnNnj1bqampmjRp0q2/SIN9QIcG8f7771uSbtiys7OdHs0WN1ubJCs/P9/p0WwxefJkKyUlxYqIiLDat29vjRgxwtq2bZvTYzWolnYP0IQJE6wOHTpYERER1ne+8x1rwoQJ1tmzZ50ey1bvvvuu1adPHysyMtJKTU21Xn/9dadHst3WrVstSdapU6ecHsV2Pp/Pmj59upWcnGy1atXK6tatm/XMM89YVVVVTo9mm7feesvq1q2bFRERYSUlJVk5OTnW5cuX6/UaLstqQT8aEgAA4BZwDxAAADAOAQQAAIxDAAEAAOMQQAAAwDgEEAAAMA4BBAAAjEMAAQAA4xBAAADAOAQQAAAwDgEEAACMQwABAADjEEAAAMA4/x8aGBSYY3Q3ogAAAABJRU5ErkJggg==",
+      "text/plain": [
+       "<Figure size 640x480 with 1 Axes>"
+      ]
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "overview[\"total_score\"].plot(kind='hist', bins=np.arange(9)+0.6, color='tab:blue', width=0.8);"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def find_gene_boundaries(df):\n",
+    "    entire = df[df[2] == \"cDNA_match\"]\n",
+    "    if entire.shape[0] == 0:\n",
+    "        return None\n",
+    "    start = entire[3].unique()\n",
+    "    end = entire[4].unique()\n",
+    "    if len(start) > 1 or len(end) > 1:\n",
+    "        return ValueError(\"More than one start or end\")\n",
+    "    return (df[0].iloc[0], start[0], end[0])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def find_exon_boundaries(df):\n",
+    "    start = df[13].value_counts().idxmax()\n",
+    "    end = df[14].value_counts().idxmax()\n",
+    "    return (start, end)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def gene_string(gene_id, start, stop, seqid, source, strand, support):\n",
+    "    seq_type = \"gene\"\n",
+    "    score = \".\"\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={gene_id};gene_id={gene_id};support={support}\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def transcript_string(gene_id, transcript_id, seqid, source, start, stop, score, strand):\n",
+    "    seq_type = \"mRNA\"\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id};Parent={gene_id}\"\n",
+    "    string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return string\n",
+    "\n",
+    "def exon_string(transcript_id, exon_counter, seqid, source, start, stop, score, strand):\n",
+    "    seq_type = \"exon\"\n",
+    "    phase = \".\"\n",
+    "    attributes = f\"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}\"\n",
+    "    exon_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    seq_type = \"CDS\"\n",
+    "    phase=\"0\"\n",
+    "    attributes = f\"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}\"\n",
+    "    CDS_string = f\"{seqid}\\t{source}\\t{seq_type}\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    \n",
+    "    return exon_string + CDS_string\n",
+    "\n",
+    "def contained(gene, catalogue):\n",
+    "    scaffold, start, stop = gene\n",
+    "    for seen in catalogue:\n",
+    "        if scaffold != seen[0]:\n",
+    "            continue\n",
+    "        if start >= seen[1] and stop <= seen[2]:\n",
+    "            return True\n",
+    "        if start == seen[1] or stop == seen[2]:\n",
+    "            return True\n",
+    "    return False"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "100%|██████████| 2686/2686 [02:12<00:00, 20.34it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "# test = overlap[overlap[\"instar3_match\"] == \"match262\"]\n",
+    "problematic = []\n",
+    "seen = []\n",
+    "\n",
+    "with open(\"/Volumes/scratch/pycnogonum/genome/draft/annot_merge/denovo_txomes/overlap_translated.gff3\", \"w\") as gff:\n",
+    "    for i, match in enumerate(tqdm(gene_level_matches)):\n",
+    "        gene = overlap[overlap[\"instar3_match\"] == match]\n",
+    "        pseudochrom, start, stop = find_gene_boundaries(gene)\n",
+    "        if contained([pseudochrom, start, stop], seen):\n",
+    "            continue\n",
+    "        seen.append([pseudochrom, start, stop])\n",
+    "        parts = gene[gene[2] == \"cDNA_match_part\"].copy()\n",
+    "        if parts.shape[0] == 0:\n",
+    "            problematic.append(match)\n",
+    "            continue\n",
+    "        parts[\"boundary\"] = parts[3].astype(str) + \"-\" + parts[4].astype(str)\n",
+    "        exons = parts.groupby(\"boundary\").apply(lambda x: find_exon_boundaries(x), include_groups=False)\n",
+    "\n",
+    "        seqid = gene[0].values[0]\n",
+    "        gene_id = f\"at_DN{i+1:04d}\"\n",
+    "        source = \"Trinity\"\n",
+    "        strand = gene[6].values[0]\n",
+    "        support = len(gene[\"target_dataset\"].unique())\n",
+    "        gene_gff = gene_string(gene_id, start, stop, seqid, source, strand, support)\n",
+    "        gff.write(gene_gff)\n",
+    "        transcript_gff = transcript_string(gene_id, gene_id+\".t1\", seqid, source, start, stop, \".\", strand)\n",
+    "        gff.write(transcript_gff)\n",
+    "        for exon in exons:\n",
+    "            exon_gff = exon_string(gene_id + \".t1\", i, seqid, source, exon[0], exon[1], \".\", strand)\n",
+    "            gff.write(exon_gff)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['match20947',\n",
+       " 'match31308',\n",
+       " 'match31516',\n",
+       " 'match31653',\n",
+       " 'match32973',\n",
+       " 'match36716']"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "problematic"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "780"
+      ]
+     },
+     "execution_count": 16,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(seen)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/06-annotation/install_braker.md b/06-annotation/install_braker.md
new file mode 100644
index 0000000000000000000000000000000000000000..ae68eec064208e4a002a4b45716cd2227e8b842a
--- /dev/null
+++ b/06-annotation/install_braker.md
@@ -0,0 +1,361 @@
+# Installing BRAKER3 on LISC
+
+#### Accurate as of 2023-12-21
+
+## Motivation
+
+BRAKER3 is one of the premier pipelines for eukaryotic gene prediction. Unfortunately, it is also
+one of the most difficult to install and run, especially in an HPC environment and _especially_
+without root access. This is because BRAKER3 is a pipeline that combines several different tools,
+needs several obscure dependencies, and requires a lot of configuration.
+
+Since I foresee a lot of gene prediction in my future, I decided to take the time to document my
+progress installing and running the software so that future me (and maybe current/future you) can
+benefit the next time this is needed.
+
+## Installation
+
+My main sources are
+* the [BRAKER3 GitHub page](https://github.com/Gaius-Augustus/BRAKER)
+* the [BRAKER3 DockerHub page](https://hub.docker.com/r/teambraker/braker3)
+* the [LISC HOWTO run containers page](https://lisc.univie.ac.at/wiki/content/working_environment/applications/software_as_container#singularity_and_apptainer)
+
+### Step 1: Get BRAKER image
+
+We are going to be working exclusively with Singularity, since Docker is not really optimised for 
+HPC environments. There is no dedicated Singularity image for BRAKER3, but there is a Docker image
+that Singularity can use. Let's pull that.
+
+First I will create an appropriate directory structure for all of this:
+
+```
+>niko@laptop$ ssh lisc
+>niko@lisc:~$ mkdir containers
+>niko@lisc:~$ cd containers
+```
+
+The LISC website suggests the `apptainer` env, which should contain the newest version of
+apptainer/singularity. Unfortunately, this does not seem to be the case, so I will use the
+`singularity` env instead.
+
+```
+>niko@lisc:~/containers$ module load conda
+>niko@lisc:~/containers$ conda activate singularity
+```
+
+Pull the image:
+
+```
+>niko@lisc:~/containers$ singularity build braker3.sif docker://teambraker/braker3:latest
+```
+
+<details>
+  <summary>output</summary>
+  ```
+    (lots of text but the end should look similar to this:)
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/url-loader/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2pdf} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2png} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:46  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-embed/node_modules/.bin/vl2vg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/csv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/csv2tsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/dsv2dsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/dsv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2csv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2dsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/json2tsv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topo2geo} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topomerge} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/topoquantize} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/tsv2csv} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-loader/node_modules/.bin/tsv2json} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geo2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geograticule} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geoproject} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geoquantize} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-projection/node_modules/.bin/geostitch} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2pdf} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2png} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2svg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/vega-themes/node_modules/.bin/vl2vg} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/JSONStream} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/envinfo} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/handlebars} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/js-yaml} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/marked} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/mime} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/mkdirp} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/pino} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/verdaccio/node_modules/request/node_modules/.bin/uuid} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/warning/node_modules/.bin/loose-envify} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/.bin/acorn} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/.bin/browserslist} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/terser-webpack-plugin/node_modules/.bin/terser} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack/node_modules/terser-webpack-plugin/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/acorn} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/mkdirp} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-bundle-analyzer/node_modules/.bin/opener} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-cli/node_modules/.bin/import-local-fixture} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/webpack-cli/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/worker-loader/node_modules/.bin/webpack} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  warn rootless{opt/conda/share/jupyter/lab/staging/node_modules/yarn-deduplicate/node_modules/.bin/semver} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:16:47  info unpack layer: sha256:326593c72c7e03cb9871c467596fa18fc4b639718a0d9b6401fa9d10c651dfa2
+    2023/12/21 11:16:54  info unpack layer: sha256:5c0048263e6eb546f9aca8c9de2e27e5f27ab5d0f65b11ec910ec6c425aa89f3
+    2023/12/21 11:16:54  info unpack layer: sha256:179f754036e2ed1733341fa46bb1701649ad0cb671cbb1c27024e82c28a7c739
+    2023/12/21 11:16:54  info unpack layer: sha256:e0f250156230802be0037b5aab71a25983bc15fd6939ddfe860f8ca769aae171
+    2023/12/21 11:16:59  info unpack layer: sha256:722137bce0015e70b1c769e21a9cee9e2db0dad25023d40b7be5de4f6cd36043
+    2023/12/21 11:16:59  info unpack layer: sha256:afd4f74c025028308b4f876a35509b1c8e66fa6cae794cd7cd756fcf8b9b1538
+    2023/12/21 11:16:59  info unpack layer: sha256:1c44777d0dc4b79af795a0c0e36ff5c06de0ebf44d720d2362b736256cc6608e
+    2023/12/21 11:16:59  info unpack layer: sha256:d1b34f82f8cdd5b8a1a4b02092856245514c8d9a8f4c6967ec3cec874d1b8829
+    2023/12/21 11:17:00  info unpack layer: sha256:7da56a27f920dac47ec7df9eabd6bd4512b38968bfce850ee97ead3a85060fca
+    2023/12/21 11:17:01  info unpack layer: sha256:911b3fb18e1c56358fae9bede839bfc53d2f7c23074805b8e50682f3d127f72e
+    2023/12/21 11:17:02  info unpack layer: sha256:a8aaeed7aa6f26e7a85826f0c81a19764c544860568bed2565f89aa5a8eccd91
+    2023/12/21 11:17:03  info unpack layer: sha256:4e55a6c82cf9816efd77f43fda4782801df62d729f37e64cf646c8c830c0202d
+    2023/12/21 11:17:03  info unpack layer: sha256:4e8420ef9a6d15a7586ac7a35b91a85cba159711c14b6b3d6c12c60f802749c8
+    2023/12/21 11:17:03  info unpack layer: sha256:d2fcf73697ea247644f6b122f58d8388fbfa87727ef8640485cced96a764a24e
+    2023/12/21 11:17:03  info unpack layer: sha256:1079f418230549427d87bfba97ed9f97f16ad8383d036bf380d48674f6ffbd43
+    2023/12/21 11:17:10  warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/74/32/89d2fc016984c6e2e7b9772ef32b1223783f6a7c6bf7c01cedf6b7198fcbdc968d0b79375939cda8169ccce2297c2dd698619b0b73fdbebe8638776523d1} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/a0/42/8e68d25c890945d9f40cfa48df32f539fdb939bb3d2042a5fd60e02d47221c144de430001d63b6e0739ec4f32eb6a889f55b4b5595b4c88f7ddbe1bf78ce} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/b7/f5/6b32b6a074deca06db865ddc0e078371033360075e39ce72e3a92ef0404452feed6e253ab67af36c76240b67aa82b76fdb60d594daa8cdfc5ccddd55815c} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/c8/df/f4f9c4e46618c643c0951ecc7c3d15413af8b6a43f2bfb0d6236d0b02530812035365425fe75b9f42b732f9fd9208a4f97f6281f0b7761f368ef30799b22} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{home/jovyan/.npm/_cacache/content-v2/sha512/ff/46/32caf6c684788c1988aefa35dd857d3ebbc978564c6eaf7d91a8395d2b034695845f8dcf346925064e7f4c91f51e7a3637449a48cb0fcc7dc6cac7f66078} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libblas.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libcblas.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libcrypto.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libgfortran.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libgfortran.so.5} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/liblapack.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libopenblas.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libopenblas.so.0} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:10  warn rootless{opt/conda/lib/libssl.so} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:17  warn rootless{opt/conda/ssl/cert.pem} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:17  warn rootless{opt/conda/ssl/misc/tsget} ignoring (usually) harmless EPERM on setxattr "user.rootlesscontainers"
+    2023/12/21 11:17:17  info unpack layer: sha256:72bfc406f45c190828445af56635120ccd43b4cc0ff96b1f4aed725fc31f3dab
+    2023/12/21 11:17:20  info unpack layer: sha256:1cc47d4e243f5f6f7538d9d8c1849f08ae21c306410b0d8a8310a6d374a84a4d
+    INFO:    Creating SIF file...
+    INFO:    Build complete: braker3.sif
+  ```
+</details>
+
+This should result in a file called `braker3.sif` in the current directory.
+
+At this point, we should be able to run BRAKER3. Let's test it out:
+
+```bash
+>niko@lisc:~/containers$ singularity exec braker3.sif braker.pl
+```
+
+This prints BRAKER's help message, so it seems to work.
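+
+To record which BRAKER release is actually inside the image (handy for the methods section later),
+the version can be queried the same way; I am assuming here that the container's `braker.pl`
+supports the usual `--version` switch:
+
+```bash
+>niko@lisc:~/containers$ singularity exec braker3.sif braker.pl --version
+```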
+
+### Step 2: GeneMark-ETP license
+
+There are additional instructions for getting BRAKER3 to run, since a valid license for GeneMark-ETP
+is required. BRAKER suggests you run the following command:
+
+```bash
+>niko@lisc:~/containers$ singularity exec braker3.sif braker.pl print_braker3_setup.py
+```
+
+but for me this hangs _forever_, so I looked for the script itself. While it is not in the
+repository any more, you can still find it in an old
+[commit](https://github.com/Gaius-Augustus/BRAKER/commit/dfd688adcd71e96fb02cc784e2b05b307874377d).
+I extrapolated from the script that this is what has to happen:
+
+Go to the [GeneMark website](http://exon.gatech.edu/GeneMark/license_download.cgi), and pretend you
+want to download `GeneMark-ES/ET/EP ver 4.71_lic`. This will take you to a page where you can choose
+to download the software (ignore it) and the key. Copy the link for the 64-bit key.
+
+Then, on LISC, do the following:
+
+```bash
+>niko@lisc:~/containers$ cd ~
+>niko@lisc:~$ wget <link> # this is the link for the 64 bit key
+>niko@lisc:~$ gunzip gm_key_64.gz
+>niko@lisc:~$ mv gm_key_64 ~/.gm_key
+```
+
+This should be enough to get BRAKER3 with GeneMark-ETP to run.
+
+### Step 3: Test run & Augustus config
+
+Let's make a test directory:
+
+```bash
+>niko@lisc:~/containers$ mkdir test
+>niko@lisc:~/containers$ cd test
+```
+
+A critical part of the BRAKER3 pipeline is Augustus, another notoriously finicky piece of software.
+If we don't say anything, Augustus will try to overwrite its default config files, somewhere in the
+depths of the container, where it doesn't have write access. We need to provide a valid config
+collection to Augustus in a writable location. It seems like the most prudent way to do this is to
+clone the Augustus repository somewhere safe and then copy the config folder to the test directory:
+
+```bash
+>niko@lisc:~/containers/test$ cd ~/repos
+>niko@lisc:~/repos$ git clone https://github.com/Gaius-Augustus/Augustus.git
+>niko@lisc:~/repos$ cd -
+>niko@lisc:~/containers/test$ cp -r ~/repos/Augustus/config .
+```
+
+In the future, in the scripts we write for SLURM, we will have to copy the config folder to the
+corresponding results directory on `/scratch/` or directly into `$TMPDIR`.
+
+Now we can run BRAKER3. I am modifying the final call from the [test3
+script](https://github.com/Gaius-Augustus/BRAKER/blob/master/example/singularity-tests/test3.sh),
+and benefitting from [najoshi](https://github.com/najoshi)'s [helpful GitHub
+comment](https://github.com/Gaius-Augustus/BRAKER/issues/609#issuecomment-1571506735):
+
+```bash
+>niko@lisc:~/containers/test$ singularity exec -B ${PWD} $HOME/containers/braker3.sif braker.pl \
+--AUGUSTUS_CONFIG_PATH=${PWD}/config \
+--genome=/opt/BRAKER/example/genome.fa \
+--prot_seq=/opt/BRAKER/example/proteins.fa \
+--bam=/opt/BRAKER/example/RNAseq.bam \
+--softmasking \
+--threads 8 \
+--gm_max_intergenic 10000 \
+--skipOptimize
+```
+
+This took less than 10 minutes for me, and the output should look something like this:
+
+```bash
+INFO:    Converting SIF file to temporary sandbox...
+# Thu Dec 21 14:42:44 2023:Both protein and RNA-Seq data in input detected. BRAKER will be executed in ETP mode (BRAKER3).
+#*********
+# Thu Dec 21 14:42:44 2023: Log information is stored in file /lisc/user/papadopoulos/test/braker/braker.log
+#*********
+# WARNING: Number of reliable training genes is low (114). Recommended are at least 600 genes
+#*********
+INFO:    Cleaning up image...
+```
+
+## Preparing the run
+
+It is now time to compose a script for our real-life use cases. We aim to run BRAKER with RNA-seq
+and protein data, so we need to produce BRAKER's desired input files first.
+
+A good reference for this is the recent [TSEBRA
+tutorial](https://github.com/KatharinaHoff/BRAKER-TSEBRA-Workshop/blob/main/GenomeAnnotation.ipynb)
+offered by Katharina Hoff. This tutorial is pretty detailed, and if a user made it this far they can
+probably follow it. I will just outline the steps I took.
+
+#### Step 0: repeat masking
+
+Repeats are a big problem for gene prediction; they may coincidentally look like protein-coding
+genes, or may be actual protein-coding genes (such as transposases); neither case is usually what
+we are interested in. A genome should be repeat-masked prior to gene prediction. Paraphrasing from
+the tutorial:
+
+```bash
+T=72 # you need a large number of threads and fast i/o storage
+GENOME=/path/to/your/genome.fa # the fasta file of your genome
+DB=some_db_name_that_fits_to_species
+
+BuildDatabase -name ${DB} ${GENOME}
+RepeatModeler -database ${DB} -pa ${T} -LTRStruct
+RepeatMasker -pa ${T} -lib ${DB}-families.fa -xsmall ${GENOME}
+```
+
+The `-xsmall` flag is important, as it means we are _soft-masking_ the genome, i.e. replacing the
+repeats with lower-case letters instead of Ns.
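+
+To get a rough feeling for how much of the assembly ended up soft-masked, counting the lower-case
+bases is enough. A minimal sketch, assuming RepeatMasker's default output naming (`${GENOME}.masked`
+next to the input):
+
+```bash
+# drop FASTA headers, delete upper-case bases/Ns/newlines, count the remaining (masked) characters
+grep -v ">" ${GENOME}.masked | tr -d 'ACGTN\n' | wc -c
+```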
+
+#### Step 1: RNA-seq alignment
+
+Spliced alignments of bulk RNA-seq data are invaluable for finding protein-coding genes, as they
+can pinpoint the location of exons, introns, and UTRs. I am using STAR for this, but any other
+aligner should work as well. The tutorial uses HISAT2.
+
+First create an index for your genome:
+
+```bash
+# --runThreadN: choose an appropriate number of threads
+# --genomeDir: the directory where you want the index to be generated
+# --genomeFastaFiles: the input genome FASTA file
+# --genomeSAindexNbases: 12 or 13 makes little difference here, either works
+STAR --runThreadN 30 \
+--runMode genomeGenerate \
+--genomeDir "$ASSEMBLY" \
+--genomeFastaFiles "$FASTA" \
+--genomeSAindexNbases 13
+```
+
+Then align your reads:
+
+```bash
+# --runThreadN: choose an appropriate number of threads
+# --outSAMstrandField intronMotif: very important for BRAKER (adds the XS tag to spliced reads)
+# --genomeDir: the directory that contains the STAR index
+# --readFilesIn: Read 1 and Read 2 of your paired-end RNA-seq data
+# --readFilesCommand: only needed if your reads are gzipped
+STAR --runThreadN 30 \
+--outSAMstrandField intronMotif \
+--genomeDir "$ASSEMBLY" \
+--readFilesIn "$R1" "$R2" \
+--readFilesCommand gunzip -c
+
+This will generate a file called `Aligned.out.sam` in the current directory. We need to convert that
+to BAM format and sort it:
+
+```bash
+samtools view -bSh Aligned.out.sam > Aligned.out.bam
+samtools sort -@ 8 Aligned.out.bam > Aligned.out.sorted.bam
+```
+
+#### Step 2: protein alignment
+
+For that we need a protein database. I am using the [OrthoDB
+database](https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/); in my case, it's Arthropods,
+but use whatever fits.
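+
+As a sketch, downloading and unpacking the Arthropoda partition looks roughly like this (the exact
+file name on the server is an assumption; check the partitioned_odb11 listing linked above):
+
+```bash
+# fetch the Arthropoda protein set from the partitioned OrthoDB v11 release
+wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Arthropoda.fa.gz
+gunzip Arthropoda.fa.gz # yields Arthropoda.fa, to be passed to BRAKER via --prot_seq
+```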
+
+## Running BRAKER3 on LISC
+
+I used the following script, with a genome of ~530Mb and 8 RNA-seq datasets. It took about 20h to
+run. RAM should not be the bottleneck for BRAKER3, as most of the steps are not that intensive (I am
+also using pre-mapped RNA-seq data, which alleviates the situation somewhat). There are some
+guesstimates of performance on the AUGUSTUS repository that could help you decide how much
+RAM/runtime to ask for. I was laughably wrong in my estimates.
+
+```bash
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=pycno_braker
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=32 # 60% CPU efficiency for me, no idea about peak consumption
+#SBATCH --mem=100G # ended up needing 30G
+#SBATCH --time=10-0 # ended up needing 20h
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=your.email@univie.ac.at
+#SBATCH --output=/path/to/logfile-%j.out
+#SBATCH --error=/path/to/logfile-%j.err
+
+# load the conda module and activate the singularity environment
+module load conda
+conda activate singularity
+
+# container location
+BRAKER=~/containers/braker3.sif # or wherever you put it; I use absolute paths to be safe
+# path to the output directory, where all the magic should happen
+OUTPUT=/path/to/output/directory
+mkdir "$OUTPUT" -p || exit
+cd "$OUTPUT" || exit
+
+# copy AUGUSTUS config files for BRAKER
+cp -r ~/repos/Augustus/config . # again, amend directories as needed
+
+# define input files
+# the genome assembly
+GENOME=/path/to/some/assembly.fasta # mine was about 530Mbase (and a file size of 514Mbyte)
+# the BAMs of the RNA-seq datasets to use as references for AUGUSTUS
+BASE=/some/base/directory/that/contains/all/of/them/
+DEV_TIMEPOINT1="$BASE/DEV_TIMEPOINT1/Aligned.out.bam"
+DEV_TIMEPOINT2="$BASE/DEV_TIMEPOINT2/Aligned.out.bam"
+DEV_TIMEPOINT3="$BASE/DEV_TIMEPOINT3/Aligned.out.bam"
+# the corresponding OrthoDB protein sequences
+ORTHODB=/path/to/fasta/file.fa
+
+# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
+# in case it is needed.
+singularity exec -B "$PWD,/lisc/scratch/partition/of/your/institute/where/all/your/files/live/,$TMPDIR" "$BRAKER" braker.pl \
+--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
+--genome=$GENOME \
+--prot_seq=$ORTHODB \
+--bam="$DEV_TIMEPOINT1","$DEV_TIMEPOINT2","$DEV_TIMEPOINT3" \
+--softmasking \
+--threads 32
+```
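+
+Once the run finishes, BRAKER should have written its results to a `braker/` subdirectory of the
+output folder (the default working directory name; adjust if you set `--workingdir`). A quick sanity
+check on the number of predicted gene models, assuming the usual `braker.gtf` output with its
+AUGUSTUS-style `gene` feature rows, could look like this:
+
+```bash
+# count predicted gene models in BRAKER3's final output
+awk '$3 == "gene"' "$OUTPUT"/braker/braker.gtf | wc -l
+```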
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-bam.sh b/06-annotation/prep-RNA-bam.sh
new file mode 100644
index 0000000000000000000000000000000000000000..6e65b2e89b665a45132b9276cd316b37a383464f
--- /dev/null
+++ b/06-annotation/prep-RNA-bam.sh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+
+CONVERTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-sam2bam.sh
+
+BASE="/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome"
+
+ZYGOTE="$BASE/ZYGOTE/Aligned.out.sam"
+EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.sam"
+EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.sam"
+EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.sam"
+EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.sam"
+EMBRYO3="$BASE/EMBRYO3/Aligned.out.sam"
+INSTAR1="$BASE/INSTAR1/Aligned.out.sam"
+INSTAR2="$BASE/INSTAR2/Aligned.out.sam"
+INSTAR3="$BASE/INSTAR3/Aligned.out.sam"
+INSTAR4="$BASE/INSTAR4/Aligned.out.sam"
+INSTAR5="$BASE/INSTAR5/Aligned.out.sam"
+INSTAR6="$BASE/INSTAR6/Aligned.out.sam"
+JUV1="$BASE/JUV1/Aligned.out.sam"
+SUBADULT="$BASE/SUBADULT/Aligned.out.sam"
+
+sbatch $CONVERTER $ZYGOTE
+sbatch $CONVERTER $EARLY_CLEAVAGE
+sbatch $CONVERTER $EMBRYO0_1
+sbatch $CONVERTER $EMBRYO3_5
+sbatch $CONVERTER $EMBRYO9_10
+sbatch $CONVERTER $EMBRYO3
+sbatch $CONVERTER $INSTAR1
+sbatch $CONVERTER $INSTAR2
+sbatch $CONVERTER $INSTAR3
+sbatch $CONVERTER $INSTAR4
+sbatch $CONVERTER $INSTAR5
+sbatch $CONVERTER $INSTAR6
+sbatch $CONVERTER $JUV1
+sbatch $CONVERTER $SUBADULT
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-index_bam.sh b/06-annotation/prep-RNA-index_bam.sh
new file mode 100644
index 0000000000000000000000000000000000000000..2f770f30b4e6c692df0b1ac66cb59011fd370807
--- /dev/null
+++ b/06-annotation/prep-RNA-index_bam.sh
@@ -0,0 +1,21 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=converter
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=6
+#SBATCH --mem=16G
+#SBATCH --time=15:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-indexbam-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-indexbam-%j.err
+
+module load samtools
+
+INPUT=$1
+BASENAME=$(dirname "$INPUT")
+OUTPUT=$BASENAME/Aligned.out.sorted.bam
+
+cd "$TMPDIR" || exit
+samtools sort -@ 6 "$INPUT" > "$OUTPUT"
+samtools index -@ 6 -b "$OUTPUT"
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-map-all.sh b/06-annotation/prep-RNA-map-all.sh
new file mode 100644
index 0000000000000000000000000000000000000000..85389b5b4aa3a92654631ce91ed2fabe78793014
--- /dev/null
+++ b/06-annotation/prep-RNA-map-all.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+
+MAPPER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-map-single.sh
+ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/
+
+# early development
+BASE="/lisc/scratch/zoology/pycnogonum/raw/early_RNA"
+ZYGOTE="$BASE/Plit35_S8"
+EARLY_CLEAVAGE="$BASE/Plit31_S6"
+EMBRYO0_1="$BASE/Plit40_S11"
+EMBRYO3_5="$BASE/Plit32_S7"
+EMBRYO9_10="$BASE/Plit36_S9"
+# INSTAR2_5="$BASE/Plit39_S10"
+
+sbatch $MAPPER $ASSEMBLY ZYGOTE "$ZYGOTE"_L001
+sbatch $MAPPER $ASSEMBLY EARLY_CLEAVAGE "$EARLY_CLEAVAGE"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO0_1 "$EMBRYO0_1"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO3_5 "$EMBRYO3_5"_L001
+sbatch $MAPPER $ASSEMBLY EMBRYO9_10 "$EMBRYO9_10"_L001
+
+# instars
+BASE="/lisc/scratch/zoology/pycnogonum/raw/HTT33DSX5_4_R15615_20230713/demultiplexed"
+
+EMBRYO3="$BASE/235255/235255_S49"
+INSTAR1="$BASE/235253/235253_S47"
+INSTAR2="$BASE/235256/235256_S50"
+INSTAR3="$BASE/235257/235257_S51"
+INSTAR4="$BASE/235258/235258_S52"
+INSTAR5="$BASE/235259/235259_S53"
+INSTAR6="$BASE/235254/235254_S48"
+JUV1="$BASE/235260/235260_S54"
+SUBADULT="$BASE/235261/235261_S55"
+
+sbatch $MAPPER $ASSEMBLY EMBRYO3 "$EMBRYO3"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR1 "$INSTAR1"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR2 "$INSTAR2"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR3 "$INSTAR3"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR4 "$INSTAR4"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR5 "$INSTAR5"_L004
+sbatch $MAPPER $ASSEMBLY INSTAR6 "$INSTAR6"_L004
+sbatch $MAPPER $ASSEMBLY JUV1 "$JUV1"_L004
+sbatch $MAPPER $ASSEMBLY SUBADULT "$SUBADULT"_L004
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-map-single.sh b/06-annotation/prep-RNA-map-single.sh
new file mode 100644
index 0000000000000000000000000000000000000000..ead0fda848349f5be4928230ed97890324c21b4e
--- /dev/null
+++ b/06-annotation/prep-RNA-map-single.sh
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=star_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=12
+#SBATCH --mem=32G
+#SBATCH --time=2:00:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-star-worker-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-star-worker-%j.err
+
+# GeneMark-ETP utilizes Stringtie2 to assemble RNA-Seq data, which requires that the aligned reads
+# (BAM files) contain the XS (strand) tag for spliced reads. Therefore, when aligning with STAR we
+# must use the --outSAMstrandField intronMotif option.
+
+# rule minimap2_txome_to_genome:
+#     input:
+#         assem = config["REF"],
+#         txome = config["TXOME"]
+#     output:
+#         bam = config["tool"] + "/output/bams/txome_to_ref.filtered.sorted.bam"
+#     threads: workflow.cores - 1
+#     params:
+#         sort_threads = int(workflow.cores/5)
+#     shell:
+#         """
+#         minimap2 -t {threads} -ax splice {input.assem} {input.txome} | \
+#            samtools view -F 4 -q 10 -hb -@ {params.sort_threads} - | \
+#            samtools sort -@ {params.sort_threads} - > {output.bam}
+#         """
+
+# rule minimap2_long_to_genome:
+#     input:
+#         assem = config["REF"],
+#         txome = config["LONGREADS"]
+#     output:
+#         bam = config["tool"] + "/output/bams/long_to_ref.filtered.sorted.bam"
+#     threads: workflow.cores - 1
+#     params:
+#         sort_threads = int(workflow.cores/5)
+#     shell:
+#         """
+#         minimap2 -t {threads} -ax splice:hq {input.assem} {input.txome} | \
+#            samtools view -F 4 -q 10 -hb -@ {params.sort_threads} - | \
+#            samtools sort -@ {params.sort_threads} - > {output.bam}
+#         """
+
+module load ngstools/star/2.7.11b
+
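+# usage (as submitted by prep-RNA-map-all.sh):
+#   sbatch prep-RNA-map-single.sh <assembly_dir> <stage_name> <fastq_prefix>
+# where <fastq_prefix>_R1_001.fastq.gz and <fastq_prefix>_R2_001.fastq.gz are the paired reads.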
+ASSEMBLY=$1
+STAGENAME=$2
+STAGE=$3
+OUTPUT=$ASSEMBLY/transcriptome/$STAGENAME
+
+mkdir "$OUTPUT" -p || exit
+
+R1="$STAGE"_R1_001.fastq.gz
+R2="$STAGE"_R2_001.fastq.gz
+
+cd "$TMPDIR" || exit
+
+# use as many threads as CPUs requested from SLURM (--cpus-per-task=12)
+STAR --runThreadN 12 \
+--outSAMstrandField intronMotif \
+--genomeDir "$ASSEMBLY" \
+--readFilesIn "$R1" "$R2" \
+--readFilesCommand gunzip -c
+
+cp -r "$TMPDIR"/Log.* "$OUTPUT"/
+cp -r "$TMPDIR"/Aligned.* "$OUTPUT"/
+
+# remove $TMPDIR
+rm -rf "$TMPDIR"
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-mkindex.sh b/06-annotation/prep-RNA-mkindex.sh
new file mode 100644
index 0000000000000000000000000000000000000000..45cf65d4687a85d32c058bfab2eed9ed6a67d46c
--- /dev/null
+++ b/06-annotation/prep-RNA-mkindex.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=star_pycno
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=30
+#SBATCH --mem=20G
+#SBATCH --time=15:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-star-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-star-%j.err
+
+# GeneMark-ETP utilizes Stringtie2 to assemble RNA-Seq data, which requires that the aligned reads
+# (BAM files) contain the XS (strand) tag for spliced reads. Therefore, when aligning with STAR we
+# must use the --outSAMstrandField intronMotif option.
+
+module load ngstools/star/2.7.11b
+
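+# usage: sbatch prep-RNA-mkindex.sh <draft_assembly.fasta>
+# The STAR index is written into the directory containing the FASTA file.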
+FASTA=$1
+ASSEMBLY=$(dirname "$FASTA") # path to the genome directory where the STAR index will be generated
+OUTPUT=$ASSEMBLY/transcriptome
+
+mkdir "$OUTPUT" -p || exit
+
+# first go into the assembly directory and create the STAR index
+cd "$ASSEMBLY" || exit
+
+# create the STAR index
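+# --genomeSAindexNbases is lowered from the default of 14, following the STAR manual's
+# recommendation of min(14, log2(GenomeLength)/2 - 1) for small genomes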
+STAR --runThreadN 30 \
+--runMode genomeGenerate \
+--genomeDir "$ASSEMBLY" \
+--genomeFastaFiles "$FASTA" \
+--genomeSAindexNbases 13
\ No newline at end of file
diff --git a/06-annotation/prep-RNA-sam2bam.sh b/06-annotation/prep-RNA-sam2bam.sh
new file mode 100644
index 0000000000000000000000000000000000000000..17f6d97658b634517f9d57d8253e30ee16525991
--- /dev/null
+++ b/06-annotation/prep-RNA-sam2bam.sh
@@ -0,0 +1,19 @@
+#!/usr/bin/env bash
+#
+#SBATCH --job-name=converter
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=100M
+#SBATCH --time=90:00
+#SBATCH --mail-type=ALL
+#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
+#SBATCH --output=/lisc/user/papadopoulos/log/pycno-sam2bam-%j.out
+#SBATCH --error=/lisc/user/papadopoulos/log/pycno-sam2bam-%j.err
+
+module load samtools
+
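+# usage: sbatch prep-RNA-sam2bam.sh <path/to/Aligned.out.sam>
+# The BAM file is written next to the input SAM file.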
+INPUT=$1
+BASENAME=$(dirname "$INPUT")
+OUTPUT=$BASENAME/Aligned.out.bam
+
+samtools view -bSh --threads 4 "$INPUT" > "$OUTPUT"
\ No newline at end of file
diff --git a/06-annotation/tRNAscan_to_gff3.ipynb b/06-annotation/tRNAscan_to_gff3.ipynb
new file mode 100644
index 0000000000000000000000000000000000000000..ee76b0889c9f00c5c04eca53cc93aeedeb45e6a4
--- /dev/null
+++ b/06-annotation/tRNAscan_to_gff3.ipynb
@@ -0,0 +1,146 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Converting tRNAscan-SE output to GFF3 format\n",
+    "\n",
+    "tRNAscan-SE produces multiple output files:\n",
+    "\n",
+    "- trnascan.bed: a BED file with the coordinates of the tRNAs\n",
+    "- trnascan.fasta: a FASTA file with the sequences of the tRNAs (and their coordinates in the headers)\n",
+    "- trnascan.stats: a human-readable file with summary statistics from the program output\n",
+    "- trnascan.out: a human-readable table with lots of metadata for each tRNA\n",
+    "\n",
+    "For our resulting GFF3 file we would like to keep all of the predicted tRNAs that \n",
+    "\n",
+    "1. are not pseudogenes\n",
+    "2. don't have introns\n",
+    "3. don't overlap with any protein-coding genes\n",
+    "\n",
+    "Info for points 1 and 2 can be found in the `trnascan.out` file, while point 3 requires a separate\n",
+    "GFF3 file with the protein-coding genes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scan_loc = \"/Volumes/scratch/pycnogonum/genome/draft/trnascan/\"\n",
+    "out = scan_loc + 'trnascan.out'\n",
+    "\n",
+    "trna = pd.read_csv(out, sep='\\t', header=None, skiprows=3)\n",
+    "trna.columns = ['seqname', 'tRNA', 'start', 'end', 'type', 'anti', 'intron_start', 'intron_end', 'inf score', 'origin', 'note']\n",
+    "\n",
+    "trna['id'] = trna['seqname'].str.strip() + '.tRNA' + trna['tRNA'].astype(str) + '-' + trna['type'] + trna['anti']\n",
+    "trna.set_index('id', inplace=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For point 3 we need to check the overlap of the tRNAs with the protein-coding genes. We can use\n",
+    "`bedtools intersect` and the draft gff file:\n",
+    "\n",
+    "```bash\n",
+    "$ bedtools intersect -wao -v -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 > no_overlap.bed\n",
+    "$ bedtools intersect -wao -S -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 | grep -e \"-1\" -v > overlap_otherstrand.bed\n",
+    "$ cat overlap_otherstrand.bed > keep.bed\n",
+    "$ cat no_overlap.bed >> keep.bed\n",
+    "$ cut -d\"        \" -f4,6 keep.bed | sort -u > keep_unique.txt # the delimiter should be a tab character; try ctrl+v tab\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trna_ids = pd.read_table(scan_loc + \"keep_unique.txt\", header=None, index_col=0)\n",
+    "trna_ids.columns = ['strand']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "keep = trna.join(trna_ids, how='inner')\n",
+    "not_pseudo = keep['note'].fillna(\"\") == \"\"\n",
+    "not_intron = (keep['intron_start'] == 0) & (keep['intron_end'] == 0)\n",
+    "\n",
+    "gff = keep[not_pseudo & not_intron]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def gff_string(index, name, row):\n",
+    "    seqid = row['seqname'].strip()\n",
+    "    source = \"tRNAscan-SE\"\n",
+    "    seq_type = \"tRNA\"\n",
+    "    start = min(row['start'], row['end'])\n",
+    "    stop = max(row['start'], row['end'])\n",
+    "    score = \".\"\n",
+    "    strand = row['strand']\n",
+    "    phase = \".\"\n",
+    "    # :Gly-GCC-1-1\n",
+    "    attributes = f\"ID=tRNA{index+1};gene_id={name};name=tRNA:{row['type']}-{row['anti']};score={row['inf score']}\"\n",
+    "    gene = f\"{seqid}\\t{source}\\tgene\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    attributes = f\"ID=tRNAscan{index+1};name=tRNA:{row['type']}-{row['anti']}\"\n",
+    "    trna = f\"{seqid}\\t{source}\\ttRNA\\t{start}\\t{stop}\\t{score}\\t{strand}\\t{phase}\\t{attributes};\\n\"\n",
+    "    return gene + trna"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(scan_loc + 'trnascan.gff3', 'w') as f:\n",
+    "    for i, (name, row) in enumerate(gff.iterrows()):\n",
+    "        f.write(gff_string(i, name, row))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ascc24",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.19"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}