# Repeats, annotation, and possible contamination
After the trials and tribulations of genome assembly, scaffolding, repeat masking, and basic
(protein-coding) gene annotation, we finally reached the point where the genome could be useful. At
the same time, this was another checkpoint where we could compare the genome to other known
metazoan/arthropod/chelicerate genome assemblies to get a sense of whether we had something
reasonable or whether we might need to go back to the drawing board.
We got three red flags, each of which on its own would not have prompted much further investigation,
but together they were very concerning:
1. RepeatMasker/RepeatModeler suggested that 60% of the (rather small) genome was repetitive, and
that 45% of that was "unclassified" repeats. While lineage-specific repeats are common, _that
many_ of them certainly aren't.
2. In the process of investigating the repeats (read: BLASTing them against NCBI `nr` for funsies),
we were surprised to find that some of them returned hits in sea anemones. In particular, two
persistent results belonged to _Metridium senile_, a sea anemone that we actually feed the _P.
litorale_ larvae in our culture. While the matches were partial and the e-values were barely
significant, this was too much of a coincidence.
3. We used a set of bulk RNA-seq data from various developmental timepoints to annotate the genome
with `BRAKER`. Despite the extremely good coverage of development, BRAKER only identified 11.4k
protein-coding genes, a far cry from the number we'd usually expect from a metazoan genome.
We have three objectives:
1. Make a decision re: repeats. Are they real? Can we come up with an explanation for why we have so
many unclassified repeats?
2. Is the _Metridium_ contamination real?
3. Can we find more protein-coding genes?
### 0. Data and setup
- `FLYE=/lisc/scratch/zoology/pycnogonum/genome/flye_full/assembly.fasta` - the original, unscaffolded _P. litorale_ genome assembly
- `DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta` - The scaffolded, manually edited _P. litorale_ genome assembly
- `METRIDIUM=/lisc/scratch/zoology/pycnogonum/raw/GCA_949775045.1_jaMetSeni4.1_genomic.fna` - the _M. senile_ genome assembly
- `NANOPORE=/lisc/scratch/zoology/pycnogonum/raw/20230601_0904_3C_PAK64436_4d39d6a6/basecalls/nanopore.fasta` - the ONT data that we used to assemble the _P. litorale_ genome
- `HIFI=/lisc/scratch/zoology/pycnogonum/raw/m64120_240312_152428.ccs.fasta` - PacBio HiFi data from a collaborator that weren't deep enough for an assembly
- `REPEATS=/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa` - the predicted repeat families from RepeatModeler
### 1. Repeats
We need to figure out if the repeats are real. We will go back to the raw reads that we used for
assembly and ask if they already contain repeats. If they do, and if the percentage of reads that
contain repeats is high, we can be more confident that the results of RepeatMasker are not an
artifact.
First, map the reads against the repeat families (the repeat library serves as the mapping reference):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/repeats
$ bwa mem -x ont2d $REPEATS $NANOPORE > $OUTPUT/ont.sam
$ bwa mem -x pacbio $REPEATS $HIFI > $OUTPUT/pb.sam
```
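`bwa mem` expects its reference (here, the repeat library) to be indexed. In case the index hasn't
been built yet, that would be a one-off step along these lines:
```bash
$ bwa index $REPEATS
```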
See the [ONT](check_repeats-ont.sh) and [PacBio](check_repeats-pb.sh) scripts for
details.
Let's first have a look at the ONT data:
```bash
$ samtools flagstat $OUTPUT/ont.sam
$
$ 97,952,355 + 0 in total (QC-passed reads + QC-failed reads)
$ 30,502,516 + 0 primary
$ 0 + 0 secondary
$ 67,449,839 + 0 supplementary
$ 0 + 0 duplicates
$ 0 + 0 primary duplicates
$ 85,877,260 + 0 mapped (87.67% : N/A)
$ 18,427,421 + 0 primary mapped (60.41% : N/A)
$ 0 + 0 paired in sequencing
$ 0 + 0 read1
$ 0 + 0 read2
$ 0 + 0 properly paired (N/A : N/A)
$ 0 + 0 with itself and mate mapped
$ 0 + 0 singletons (N/A : N/A)
$ 0 + 0 with mate mapped to a different chr
$ 0 + 0 with mate mapped to a different chr (mapQ>=5)
```
Let's break this down:
- ~98 million total alignments are produced, from...
- 30.5 million reads (primary).
- The read-level map rate is 60.4% (primary mapped: 18.4 million reads). Since our mapping reference
is the _P. litorale_ repeat library, this means that 60.4% of the reads carry at least one repeat
element. However, there are also...
- 67.4m supplementary alignments, i.e. additional split alignments of reads that already have a
primary alignment. If our reference were a real genome, this category would denote chimeric reads or
split alignments. However, since our reference is the (rather short) repeat families, this
essentially means that every read has an average of 2-3 repeats mapping to it.
The 60.4% map rate is surprisingly similar to the proportion of the genome that was marked as
repeats by RepeatMasker/RepeatModeler (60%). This suggests that the repeats might be real.
Second, let's have a look at the PacBio HiFi data, which serve as validation:
```bash
$ samtools flagstat $OUTPUT/pb.sam
$
$ 11,081,282 + 0 in total (QC-passed reads + QC-failed reads)
$ 1,715,700 + 0 primary
$ 0 + 0 secondary
$ 9,365,582 + 0 supplementary
$ 0 + 0 duplicates
$ 0 + 0 primary duplicates
$ 10,739,929 + 0 mapped (96.92% : N/A)
$ 1,374,347 + 0 primary mapped (80.10% : N/A)
$ 0 + 0 paired in sequencing
$ 0 + 0 read1
$ 0 + 0 read2
$ 0 + 0 properly paired (N/A : N/A)
$ 0 + 0 with itself and mate mapped
$ 0 + 0 singletons (N/A : N/A)
$ 0 + 0 with mate mapped to a different chr
$ 0 + 0 with mate mapped to a different chr (mapQ>=5)
```
Let's break this down:
- 11 million total alignments are produced, from...
- 1.7 million reads (primary).
- The read-level map rate is 80.1% (primary mapped: 1.37 million reads), so 4/5 of the PacBio reads
have at least one repeat element on them. However, there are also...
- 9.37m supplementary alignments, i.e. additional split alignments, suggesting an average of 8-9
repeats per read. The average PacBio read is about three times as long as the average ONT read
(5,892nt vs 1,848nt), so this makes sense.
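For reference, the average read lengths quoted above can be recomputed from the raw FASTA files with
a bit of `awk` (a sketch; any read-length summary tool would do the same job):
```bash
$ # mean read length = total bases / number of reads
$ awk '/^>/ {n++; next} {bp += length($0)} END {printf "%.0f\n", bp / n}' $NANOPORE
$ awk '/^>/ {n++; next} {bp += length($0)} END {printf "%.0f\n", bp / n}' $HIFI
```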
#### Takeaways:
The repeats seem to be real. There seem to be multiple repeats present on each read.
Ways to improve this analysis:
- repeat analysis with short reads
- filter MQ ≥ 30 alignments and count the number of alignments per read
- analyse the CIGAR strings to see what portion of each read is matched by high-quality alignments to
repeat content (see the sketch below for the latter two points)
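A sketch of the latter two points (output file names are made up; `-F 0x104` drops unmapped and
secondary records but keeps supplementary ones, since those are exactly the extra repeat hits we want
to count):
```bash
$ # number of MQ>=30 alignments per read, sorted by count
$ samtools view -q 30 -F 0x104 $OUTPUT/ont.sam | cut -f1 | sort | uniq -c | sort -k1,1nr > ont_mq30_per_read.txt
$ # matched bases (M/=/X CIGAR operations) per alignment; summing these per read and dividing by the
$ # read length (taken from the FASTA) would give the repeat-covered fraction of each read
$ samtools view -q 30 -F 0x104 $OUTPUT/ont.sam | awk '{
    cigar = $6; matched = 0
    while (match(cigar, /[0-9]+[MIDNSHPX=]/)) {
        op = substr(cigar, RSTART, RLENGTH)
        if (op ~ /[M=X]$/) matched += substr(op, 1, RLENGTH - 1)
        cigar = substr(cigar, RSTART + RLENGTH)
    }
    print $1, matched
}' > ont_mq30_matched_bases.txt
```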
### 2a. _Metridium_ contamination
Map the ONT reads onto the _Metridium_ genome using the [backmap
script](../nanopore/eval-backmap-ont.sh) we wrote for assembly QC:
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/metridium_cont
$ sbatch eval-backmap-ont.sh $METRIDIUM $OUTPUT
```
The first thing to ask here is how many reads mapped well to the _Metridium_ genome. This
information is included in the `.stats` file produced by `backmap`; the SN section of that file is
reproduced below for closer inspection. The important takeaway is that ~290k of the 30.5m reads
mapped to the _Metridium_ genome with mapping quality >0 (419,599 mapped minus 129,520 at MQ0, i.e.
0.95%). This is worrying and should be investigated further.
<details>
<summary>Summary Numbers from `samtools stats`</summary>
```bash
SN raw total sequences: 30502516 # excluding supplementary and secondary reads
SN filtered sequences: 0
SN sequences: 30502516
SN is sorted: 1
SN 1st fragments: 30502516
SN last fragments: 0
SN reads mapped: 419599
SN reads mapped and paired: 0 # paired-end technology bit set + both mates mapped
SN reads unmapped: 30082917
SN reads properly paired: 0 # proper-pair bit set
SN reads paired: 0 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 129520 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 1413288
SN supplementary alignments: 1742261
SN total length: 56368688445 # ignores clipping
SN total first fragment length: 56368688445 # ignores clipping
SN total last fragment length: 0 # ignores clipping
SN bases mapped: 3057365257 # ignores clipping
SN bases mapped (cigar): 356727603 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 79904074 # from NM fields
SN error rate: 2.239918e-01 # mismatches / bases mapped (cigar)
SN average length: 1848
SN average first fragment length: 1848
SN average last fragment length: 0
SN maximum length: 1533915
SN maximum first fragment length: 1533915
SN maximum last fragment length: 0
SN average quality: 34.7
SN insert size average: 0.0
SN insert size standard deviation: 0.0
SN inward oriented pairs: 0
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
SN percentage of properly paired reads (%): 0.0
```
</details>
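For reference, that MQ>0 count can also be pulled directly from the sorted BAM; `-F 0x904` excludes
unmapped, secondary, and supplementary records, so only primary alignments with mapping quality of at
least 1 are counted:
```bash
$ samtools view -c -q 1 -F 0x904 $OUTPUT/GCA_949775045.1_jaMetSeni4.1_genomic.fna.ont.sort.bam
```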
We can then extract only the high-quality (mapping quality ≥ 30) primary alignments (excluding
secondary alignments with `-F 0x100`):
```bash
$ cd $OUTPUT
$ samtools view -q 30 -F 0x100 GCA_949775045.1_jaMetSeni4.1_genomic.fna.ont.sort.bam | cut -f1 > primary_mq30.txt
```
This leaves us with only 32,490 reads.
At the same time, since `flye` doesn't keep track of which reads were used for contig building, we
need to map the reads back to the assembly, which I'd already done for QC reasons anyway. However,
we should be focusing on high-quality alignments that are not secondary or supplementary (0x100 and
0x800 flags):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/backmap/ont
$ sbatch eval-backmap-ont.sh $DRAFT $OUTPUT
$ samtools view -F 0x900 -q 30 -S $OUTPUT/draft.fasta.ont.sort.bam > $OUTPUT/draft.fasta.ont.sort.sam
$ BACKMAPPED_ONT=/lisc/scratch/zoology/pycnogonum/genome/draft/backmap/ont/draft.fasta.ont.sort.sam
```
(feel free to give `samtools view` more threads with `-@` to speed it up.)
We can combine the two outputs to see where the high-confidence _Metridium_ reads are concentrated
on the _P. litorale_ assembly:
```bash
$ LC_ALL=C fgrep -f primary_mq30.txt $BACKMAPPED_ONT > metridium_scaffolds.sam
```
This leaves us with 10,170 putative _Metridium_ reads that also map well to the _P. litorale_
draft. We can then count how many times each scaffold was hit by _Metridium_ reads:
```bash
$ cat metridium_scaffolds.sam | cut -f3 | sort | uniq -c | sort -k1,1nr > metridium_scaffolds.sam_summary
```
The summary file is available in the [repository](./metridium_scaffolds.sam_summary). However, the
most important message becomes clear if we just inspect the top hits:
```bash
$ 2478 pseudochrom_16
$ 623 pseudochrom_22
$ 373 scaffold_1844
$ 355 pseudochrom_38
$ 272 pseudochrom_13
$ 238 pseudochrom_7
$ 233 pseudochrom_24
$ 214 pseudochrom_8
$ 213 pseudochrom_17
$ 174 pseudochrom_23
```
This looks a bit concerning. We need to go back to the contigs that were used to build the scaffolds
and do the same. However, if the offending contigs are highly repetitive and the "metridial" reads
are also highly repetitive, this is less concerning, and can be explained by the similarity of
low-complexity regions or tandem repeats.
Mapping scaffolds back to contigs is not straightforward because of the manual editing of the Hi-C
map. It might be faster to just look at the locations of the putative _Metridium_ reads on the
current assembly and try to make sense of that.
We cross-referenced the predicted repeats of the _P. litorale_ genome with the mapping locations of
the putative _Metridium_ reads. Out of the 10,170 putative _Metridium_ reads, 6,039 (60%) are at
least 25% repetitive, and 4,593 are at least 50% repetitive (see relevant
[notebook](metridium.ipynb)). This leaves us with maybe 5,000 problematic reads out of a total of
30.5 million.
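The gist of that calculation can also be reproduced on the command line with `bedtools coverage`,
assuming the read alignments (e.g. converted with `bedtools bamtobed`) and the repeat annotation are
available as BED files; the file names below are hypothetical:
```bash
$ # bedtools coverage appends, for each read alignment interval: the number of overlapping repeats,
$ # the bases covered by repeats, the interval length, and the covered fraction (last column)
$ bedtools coverage -a metridium_reads_on_draft.bed -b repeats.bed > read_repeat_coverage.bed
$ # reads whose alignment is at least 25% covered by annotated repeats
$ awk '$NF >= 0.25' read_repeat_coverage.bed | cut -f4 | sort -u | wc -l
```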
For context, we can do the same thing with contigs, referring back to the original flye assembly
(before scaffolding):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/flye_full/backmap_ont
$ sbatch eval-backmap-ont.sh $FLYE $OUTPUT
$ samtools view -N primary_mq30.txt -F 0x900 -q 30 metridium_locs.sam > metridium_contigs.sam
$ cat metridium_contigs.sam | cut -f3 | sort | uniq -c | sort -k1,1nr > metridium_contigs.sam_summary
```
This file is also [uploaded](./metridium_contigs.sam_summary). Let's have a look at the top hits:
```bash
$ head metridium_contigs.sam_summary
$
$ 2197 contig_8782
$ 324 contig_21131
$ 151 contig_28132
$ 146 contig_6939
$ 123 contig_19945
$ 94 contig_11056
$ 79 contig_5985
$ 78 contig_19987
$ 77 contig_4985
$ 76 contig_4041
```
If we go back to `flye`'s assembly information and look up this contig's path through the repeat
graph, the result is very illuminating:
```bash
contig_8782 209547 86 N N 1 * 10576,-3609,-3607,8782,20233,-24417,-24416,-28783,13730,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605
```
This contig's path starts out normally, but very quickly devolves into a self-repeat: the same edge
(3605) is traversed over and over.
Overall, it looks like _Metridium_ contamination is not a real danger; rather, we should start
contemplating other possibilities to explain these weak sequence similarities, such as horizontal
gene transfer.
### 2b. Other contaminants
Widespread _Metridium_ contamination seems to be out of the question, but this doesn't mean that
other contaminants can't be present. To check for that, we implemented an approach similar in spirit
to `Kraken2`. Briefly, we aligned the draft assembly against UniRef90 with `mmseqs2` and used the
`taxonomy` module to assign kingdom-level taxonomy information to each hit; we then excluded
contigs/scaffolds that did not overwhelmingly contain metazoan genes.
First we need to download UniRef90 and also build the `mmseqs2` database from the genome sequences.
- [mmseqs-download_uniref90.sh](mmseqs-download_uniref90.sh)
- [mmseqs-prepdb.sh](mmseqs-prepdb.sh)
These can be run in parallel.
Then we do the actual alignment/taxonomy assignment:
- [mmseqs-align.sh](mmseqs-align.sh)
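If one doesn't want to babysit the queue, the three jobs can be chained with Slurm dependencies,
e.g.:
```bash
$ DOWNLOAD=$(sbatch --parsable mmseqs-download_uniref90.sh)
$ PREP=$(sbatch --parsable mmseqs-prepdb.sh)
$ sbatch --dependency=afterok:$DOWNLOAD:$PREP mmseqs-align.sh
```
`--parsable` makes `sbatch` print just the job ID, and `afterok` starts the alignment only once both
preparation jobs have finished successfully.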
We then used a custom notebook to parse the result and save taxonomy information per
pseudochromosome/scaffold:
- [contamination.ipynb](contamination.ipynb)
Finally, we used the outputs of this notebook to filter out the contamination from the assembly:
- [filter_draft.ipynb](filter_draft.ipynb)
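As an aside, the same filtering could also be done on the command line, for example with `seqkit
grep` (assuming `seqkit` is available; `exclude.txt` would hold the flagged scaffold names, i.e. the
index column of the suspect table):
```bash
$ cut -f1 scaffolds_taxonomic_distribution_suspect.tsv | tail -n +2 > exclude.txt
$ seqkit grep -v -f exclude.txt plit_q_0_50000_0.5FracBest_output.fasta > draft_filtered.fasta
```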
#!/usr/bin/env bash
#
#SBATCH --job-name=read_repeats
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000M
#SBATCH --time=20:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-scan-ont-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-scan-ont-%j.err
# tests:
# 610050 FASTA rows (~1% of file)
# /usr/bin/time bwa mem -x ont2d -t 4 ../repeat_modeller/pycno-families.fa onepercent.fasta > test.sam
# needed 3.2min with 388% CPU usage and 359MB RAM
# 6100500 FASTA rows (~10% of file)
# /usr/bin/time bwa mem -x ont2d -t 4 ../repeat_modeller/pycno-families.fa tenpercent.fasta > test.sam
# needed 54min with 384% CPU usage and 2GB RAM
# first run timed out at 10h (--time=10:00:00) and processed 17,467,334 / 30,502,516 reads, so we
# need at least 17.5h. Going with 20h to be safe. At least now I know the real memory consumption.
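# (extrapolation: 10h * 30,502,516 / 17,467,334 ≈ 17.5h)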
# second run went OOM with 3.5G of RAM, so I'm going with 5G now.
module load bwa/0.7.18
NANOPORE="/lisc/scratch/zoology/pycnogonum/raw/20230601_0904_3C_PAK64436_4d39d6a6/basecalls/202306050/nanopore.fasta"
REPEATS="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa"
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/"
bwa mem -x ont2d -t 4 $REPEATS $NANOPORE > $OUTPUT/ont.sam # thread count matches --cpus-per-task
#!/usr/bin/env bash
#
#SBATCH --job-name=read_repeats
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=300M
#SBATCH --time=80:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-scan-pb-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-scan-pb-%j.err
# tests:
# 34314 FASTA rows (1% of file)
# /usr/bin/time bwa mem -x pacbio -t 4 ../repeat_modeller/pycno-families.fa onepercent.fasta > test.sam
# needed 49s with 392% CPU usage and 271MB RAM
# 343140 FASTA rows (10% of file)
# /usr/bin/time bwa mem -x pacbio -t 4 ../repeat_modeller/pycno-families.fa tenpercent.fasta > test.sam
# needed 480s with 392% CPU usage and 271MB RAM
module load bwa/0.7.18
HIFI="/lisc/scratch/zoology/pycnogonum/raw/m64120_240312_152428.ccs.fasta"
REPEATS="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa"
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/"
bwa mem -x pacbio -t 4 $REPEATS $HIFI > $OUTPUT/pb.sam
%% Cell type:markdown id: tags:
# Contamination analysis
I don't want to run [ContScout](https://github.com/h836472/ContScout/) because it requires insane
amounts of RAM/storage. Instead, I will align my draft genome against UniRef90 with mmseqs2, append
taxonomic information to the better hits, and calculate the taxonomic makeup of each scaffold.
We start by [downloading the UniRef90 database](mmseqs-download_uniref90.sh) and creating a mmseqs2
index. While it is running, we will [prepare the draft genome](mmseqs-prepdb.sh) for alignment. When
the download finishes, we can [align the draft genome](mmseqs-align.sh) against UniRef90, add
taxonomic information, and save the result in a .tsv file, which we can afterwards read and parse
at our leisure.
%% Cell type:code id: tags:
``` python
import datetime
print(datetime.datetime.today().date().isoformat())
```
%% Cell type:code id: tags:
``` python
from tqdm import tqdm
import os
import pandas as pd
```
%% Cell type:code id: tags:
``` python
columns = ["query", "target", "alnScore", "seqIdentity", "eVal", "qStart", "qEnd", "qLen", "tStart",
"tEnd", "tLen", "queryOrfStart", "queryOrfEnd", "dbOrfStart", "dbOrfEnd", "taxid",
"level", "level_value", "taxonomy"]
```
%% Cell type:markdown id: tags:
The end result is close to 20Gb. To parse it, we first need to know how many lines it has.
%% Cell type:code id: tags:
``` python
m8 = "/Volumes/scratch/pycnogonum/genome/draft/contamination/contam_tax.m8"
output = "/Volumes/scratch/pycnogonum/genome/draft/contamination/chromosomes/"
```
%% Cell type:code id: tags:
``` python
%%bash -s "$m8" --out lines
m8=$1
wc -l $m8
```
%% Cell type:code id: tags:
``` python
lines
```
%% Cell type:code id: tags:
``` python
total_lines = int(lines.split()[0])
```
%% Cell type:markdown id: tags:
Now we will parse the file. Since we don't want to keep 20Gb in memory, we will extract only the
relevant information: what we need is the taxonomic information of the hits _per
scaffold/chromosome_. We will keep the corresponding columns of the `.m8` file and save them in
separate files per chromosome/scaffold, for later processing. This is somewhat time-consuming, but
can just run in the background, and we will only need to do it once*.
\* unless, of course, we change the genome, which has already happened once :D
%% Cell type:code id: tags:
``` python
chromosome = ""
with open(m8) as f:
    for i in tqdm(range(total_lines)):
        line = f.readline().strip().split("\t")
        query = line[0]  # the scaffold/pseudochromosome this hit belongs to
        if chromosome == "":
            # first line of the file: open the output file for the first sequence
            chromosome = query
            out = open(f"{output}/{chromosome}.tsv", "w")
        if query != chromosome:
            # hits are grouped by query, so a new query name means a new output file
            out.close()
            chromosome = query
            out = open(f"{output}/{chromosome}.tsv", "w")
        out.write("\t".join(line) + "\n")
    out.close()
```
%% Cell type:code id: tags:
``` python
def approximate_taxonomic_distribution(sequence_path, columns, resolution=1000):
    """
    Approximate the taxonomic distribution of a scaffold/pseudochromosome by aggregating hits within
    ORFs. Essentially breaks the sequence into bins of size `resolution` but only uses detected ORFs
    instead of blindly scanning with a sliding window. Assigns the taxon of the hit with the highest
    score to each bin.

    Parameters
    ----------
    sequence_path : str
        Path to the sequence file.
    columns : list
        Column names of the sequence file.
    resolution : int
        Resolution of the approximation (default: 1000).

    Returns
    -------
    pd.Series
        Approximated taxonomic distribution; contains the absolute number of hits for each taxon
        (Metazoa, unknown, Viridiplantae, uc_Bacteria, Fungi, various viruses, uc_Archaea,
        uc_Eukaryota).
    """
    raw = pd.read_csv(sequence_path, sep="\t", header=None)
    raw.columns = columns
    # bin the ORF coordinates of each hit to the requested resolution
    raw["queryOrfStart_approx"] = raw["queryOrfStart"] // resolution
    raw["queryOrfEnd_approx"] = raw["queryOrfEnd"] // resolution
    # keep one representative hit per bin (first per start bin, then per end bin)
    first_pass = raw.groupby("queryOrfStart_approx").first().sort_values("queryOrfEnd", ascending=False)
    second_pass = first_pass.groupby("queryOrfEnd_approx").first().sort_values("queryOrfEnd", ascending=False)
    return second_pass["taxonomy"].value_counts()
```
%% Cell type:code id: tags:
``` python
dir_contents = os.listdir(f"{output}")
sequences = [s for s in dir_contents if s.endswith('.tsv')]
result = {}
for sequence in tqdm(sequences):
    result[sequence] = approximate_taxonomic_distribution(f"{output}/{sequence}", columns)
```
%% Cell type:code id: tags:
``` python
df = pd.concat(result.values(), axis=1).fillna(0)
df.columns = [k.split(".")[0] for k in result.keys()]
# normalize each column by the sum of the column
perc = df.div(df.sum(axis=0), axis=1)
df = df.T
perc = perc.T
```
%% Cell type:code id: tags:
``` python
# df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t")
df = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t", header=0, index_col=0)
```
%% Cell type:code id: tags:
``` python
viruses = df.columns[df.columns.str.contains("vir")]
df['viruses'] = df[viruses].sum(axis=1)
df.drop(columns=viruses, inplace=True)
```
%% Cell type:code id: tags:
``` python
df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_collapsed_vir.tsv", sep="\t")
```
%% Cell type:code id: tags:
``` python
suspect = df.loc[perc["Metazoa"] < 0.9]
```
%% Cell type:code id: tags:
``` python
suspect.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_suspect.tsv", sep="\t")
```
%% Cell type:code id: tags:
``` python
import os
import pandas as pd
```
%% Cell type:code id: tags:
``` python
draft = "/Volumes/scratch/pycnogonum/genome/draft/GAP_sort_scaffolds_by_hic_insert/output/assembly/plit_q_0_50000_0.5FracBest_output.fasta"
suspect = "/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_suspect.tsv"
filtered_draft = "/Volumes/scratch/pycnogonum/genome/draft/contamination/plit_q_0_50000_0.5FracBest_output_filtered.fasta"
```
%% Cell type:code id: tags:
``` python
exclude = pd.read_csv(suspect, sep="\t", index_col=0).index.values
```
%% Cell type:code id: tags:
``` python
exclude
```
%% Output
array(['scaffold_1542', 'scaffold_373', 'scaffold_405', ...,
'scaffold_6737', 'scaffold_4062', 'scaffold_1888'], dtype=object)
%% Cell type:code id: tags:
``` python
copying = False
with open(draft, 'r') as f:
    with open(filtered_draft, 'w') as out:
        for line in f:
            if line.startswith('>'):
                # FASTA header: check whether this sequence is on the exclusion list
                seq = line[1:].split()[0]
                if seq in exclude:
                    copying = False
                else:
                    copying = True
                    out.write(line)
            else:
                if copying:
                    out.write(line)
```
%% Cell type:code id: tags:
``` python
%%bash -s "$filtered_draft" "$draft"
draft=$2
filtered_draft=$1
grep ">" $draft | wc -l
grep ">" $filtered_draft | wc -l
```
%% Output
11648
10258
2478 pseudochrom_16
623 pseudochrom_22
373 scaffold_1844
355 pseudochrom_38
272 pseudochrom_13
238 pseudochrom_7
233 pseudochrom_24
214 pseudochrom_8
213 pseudochrom_17
174 pseudochrom_23
164 pseudochrom_19
160 scaffold_875
146 scaffold_126
140 pseudochrom_21
134 pseudochrom_9
132 pseudochrom_59
127 pseudochrom_1
116 pseudochrom_10
116 pseudochrom_15
115 pseudochrom_35
107 pseudochrom_37
107 pseudochrom_54
104 pseudochrom_30
101 pseudochrom_3
98 pseudochrom_2
97 pseudochrom_58
94 pseudochrom_28
92 pseudochrom_11
92 pseudochrom_26
88 pseudochrom_12
87 pseudochrom_27
86 pseudochrom_31
85 pseudochrom_6
84 pseudochrom_39
83 pseudochrom_33
83 pseudochrom_57
83 scaffold_5103
80 scaffold_152
78 pseudochrom_41
75 pseudochrom_44
75 pseudochrom_47
74 pseudochrom_48
71 pseudochrom_20
71 pseudochrom_45
70 scaffold_601
68 pseudochrom_14
68 pseudochrom_50
66 pseudochrom_36
63 pseudochrom_32
63 scaffold_6579
61 pseudochrom_34
60 pseudochrom_4
59 pseudochrom_25
55 pseudochrom_18
51 scaffold_1883
50 pseudochrom_29
50 pseudochrom_43
48 pseudochrom_40
47 pseudochrom_51
47 pseudochrom_56
38 pseudochrom_49
37 scaffold_139
35 pseudochrom_42
33 pseudochrom_46
33 pseudochrom_5
24 pseudochrom_55
23 scaffold_10842
23 scaffold_2738
22 scaffold_365
19 scaffold_1105
17 scaffold_342
17 scaffold_4829
16 scaffold_1377
13 scaffold_7325
11 scaffold_2532
11 scaffold_3335
10 scaffold_246
10 scaffold_296
10 scaffold_5270
8 scaffold_2824
8 scaffold_3119
8 scaffold_3681
7 scaffold_462
7 scaffold_8788
6 scaffold_1578
6 scaffold_3633
6 scaffold_8483
5 scaffold_10507
5 scaffold_254
5 scaffold_5366
5 scaffold_6651
5 scaffold_6748
4 scaffold_2128
4 scaffold_817
4 scaffold_9451
3 scaffold_1281
3 scaffold_1615
3 scaffold_1732
3 scaffold_5394
3 scaffold_5725
3 scaffold_7012
2 scaffold_10435
2 scaffold_11049
2 scaffold_11619
2 scaffold_11825
2 scaffold_12344
2 scaffold_1553
2 scaffold_1642
2 scaffold_195
2 scaffold_204
2 scaffold_2520
2 scaffold_356
2 scaffold_3880
2 scaffold_5554
2 scaffold_6479
2 scaffold_804
1 scaffold_10482
1 scaffold_10792
1 scaffold_11196
1 scaffold_11454
1 scaffold_115
1 scaffold_120
1 scaffold_12273
1 scaffold_12277
1 scaffold_12312
1 scaffold_12338
1 scaffold_1238
1 scaffold_1291
1 scaffold_135
1 scaffold_136
1 scaffold_141
1 scaffold_1484
1 scaffold_158
1 scaffold_1639
1 scaffold_166
1 scaffold_1686
1 scaffold_175
1 scaffold_1756
1 scaffold_1779
1 scaffold_185
1 scaffold_1946
1 scaffold_2174
1 scaffold_2249
1 scaffold_2265
1 scaffold_232
1 scaffold_242
1 scaffold_2471
1 scaffold_2474
1 scaffold_248
1 scaffold_249
1 scaffold_252
1 scaffold_2536
1 scaffold_268
1 scaffold_2829
1 scaffold_2836
1 scaffold_2865
1 scaffold_294
1 scaffold_2969
1 scaffold_3003
1 scaffold_306
1 scaffold_325
1 scaffold_3299
1 scaffold_3487
1 scaffold_3502
1 scaffold_3736
1 scaffold_437
1 scaffold_449
1 scaffold_4666
1 scaffold_4697
1 scaffold_484
1 scaffold_4875
1 scaffold_502
1 scaffold_5241
1 scaffold_525
1 scaffold_5360
1 scaffold_549
1 scaffold_554
1 scaffold_5543
1 scaffold_573
1 scaffold_6082
1 scaffold_6518
1 scaffold_6880
1 scaffold_7139
1 scaffold_7200
1 scaffold_725
1 scaffold_7415
1 scaffold_7589
1 scaffold_7631
1 scaffold_7836
1 scaffold_8100
1 scaffold_8207
1 scaffold_8239
1 scaffold_8555
1 scaffold_8989
1 scaffold_8991
1 scaffold_920
1 scaffold_9571
1 scaffold_9603
1 scaffold_9690
#!/usr/bin/env bash
#
#SBATCH --job-name=mmseqs2-align
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --time=36:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-align-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-align-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/contamination/"
UNIREF90="/lisc/scratch/zoology/db/uniref90/uniref90"
cd "$OUTPUT" || exit 1
mmseqs search asm $UNIREF90 contam "$TMPDIR" --threads 32 --search-type 2 --split-memory-limit 90G
mmseqs addtaxonomy $UNIREF90 contam contam_tax --threads 32 --lca-ranks kingdom
mmseqs createtsv asm $UNIREF90 contam_tax contam_tax.m8 --threads 32
#!/usr/bin/env bash
#
#SBATCH --job-name=uniref90
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=50G
#SBATCH --time=1:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-uniref90-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-uniref90-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/db/uniref90/"
mkdir -p "$OUTPUT" || exit 1
cd "$OUTPUT" || exit 1
mmseqs databases UniRef90 "$OUTPUT"/uniref90 "$TMPDIR" --threads 16
#!/usr/bin/env bash
#
#SBATCH --job-name=mmseqs2-index
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=10G
#SBATCH --time=10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-prep-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-prep-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/contamination"
ASSEMBLY="/lisc/scratch/zoology/pycnogonum/genome/draft/GAP_sort_scaffolds_by_hic_insert/output/assembly/plit_q_0_50000_0.5FracBest_output.fasta"
mkdir -p "$OUTPUT" || exit 1
cd "$OUTPUT" || exit 1
mmseqs createdb $ASSEMBLY $OUTPUT/asm
mmseqs createindex $OUTPUT/asm "$TMPDIR" --threads 8 --search-type 2