# Repeats, annotation, and possible contamination
After the trials and tribulations of genome assembly, scaffolding, repeat masking, and basic
(protein-coding) gene annotation, we finally reached the point where the genome could be useful. At
the same time, this was another checkpoint where we could compare the genome to other known
metazoan/arthropod/chelicerate genome assemblies to get a sense of whether we had something
reasonable or whether we might need to go back to the drawing board.
We got three red flags, each of which on its own would not have prompted much further investigation,
but together they were very concerning:
1. RepeatMasker/RepeatModeler suggested that 60% of the (rather small) genome was repetitive, and
that 45% of that was "unclassified" repeats. While lineage-specific repeats are common, _that
many_ of them certainly aren't.
2. In the process of investigating the repeats (read: BLASTing them against NCBI `nr` for funsies),
we were surprised to find that some of them returned hits in sea anemones. In particular, two
persistent results belonged to _Metridium senile_, a sea anemone that we actually feed the _P.
litorale_ larvae in our culture. While the matches were partial and the e-values were barely
significant, this was too much of a coincidence.
3. We used a set of bulk RNA-seq data from various developmental timepoints to annotate the genome
with `BRAKER`. Despite the extremely good coverage of development, BRAKER only identified 11.4k
protein-coding genes, a far cry from the number we'd usually expect from a metazoan genome.
We have three objectives:
1. Make a decision re: repeats. Are they real? Can we come up with an explanation for why we have so
many unclassified repeats?
2. Is the _Metridium_ contamination real?
3. Can we find more protein-coding genes?
### 0. Data and setup
- `FLYE=/lisc/scratch/zoology/pycnogonum/genome/flye_full/assembly.fasta` - the original, unscaffolded _P. litorale_ genome assembly
- `DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta` - The scaffolded, manually edited _P. litorale_ genome assembly
- `METRIDIUM=/lisc/scratch/zoology/pycnogonum/raw/GCA_949775045.1_jaMetSeni4.1_genomic.fna` - the _M. senile_ genome assembly
- `NANOPORE=/lisc/scratch/zoology/pycnogonum/raw/20230601_0904_3C_PAK64436_4d39d6a6/basecalls/nanopore.fasta` - the ONT data that we used to assemble the _P. litorale_ genome
- `HIFI=/lisc/scratch/zoology/pycnogonum/raw/m64120_240312_152428.ccs.fasta` - PacBio HiFi data from a collaborator that weren't deep enough for an assembly
- `REPEATS=/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa` - the predicted repeat families from RepeatModeler
### 1. Repeats
We need to figure out if the repeats are real. We will go back to the raw reads that we used for
assembly and ask if they already contain repeats. If they do, and if the percentage of reads that
contain repeats is high, we can be more confident that the results of RepeatMasker are not an
artifact.
First, map the reads against the repeat families (the repeat library serves as the mapping reference):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/repeats
$ bwa mem -x ont2d $REPEATS $NANOPORE > $OUTPUT/ont.sam
$ bwa mem -x pacbio $REPEATS $HIFI > $OUTPUT/pb.sam
```
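`bwa mem` expects its reference (here, the repeat library) to be indexed. In case the index hasn't
been built yet, that would be a one-off step along these lines:
```bash
$ bwa index $REPEATS
```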
See the [ONT](check_repeats-ont.sh) and [PacBio](check_repeats-pb.sh) scripts for
details.
Let's first have a look at the ONT data:
```bash
$ samtools flagstat $OUTPUT/ont.sam
$
$ 97,952,355 + 0 in total (QC-passed reads + QC-failed reads)
$ 30,502,516 + 0 primary
$ 0 + 0 secondary
$ 67,449,839 + 0 supplementary
$ 0 + 0 duplicates
$ 0 + 0 primary duplicates
$ 85,877,260 + 0 mapped (87.67% : N/A)
$ 18,427,421 + 0 primary mapped (60.41% : N/A)
$ 0 + 0 paired in sequencing
$ 0 + 0 read1
$ 0 + 0 read2
$ 0 + 0 properly paired (N/A : N/A)
$ 0 + 0 with itself and mate mapped
$ 0 + 0 singletons (N/A : N/A)
$ 0 + 0 with mate mapped to a different chr
$ 0 + 0 with mate mapped to a different chr (mapQ>=5)
```
Let's break this down:
- ~98 million total alignments are produced, from...
- 30.5 million reads (primary).
- The read-level map rate is 60.4% (primary mapped: 18.4 million reads). Since our mapping reference
is the _P. litorale_ repeat library, this means that 60.4% of the reads carry at least one repeat
element. However, there are also...
- 67.4m supplementary alignments, i.e. additional split alignments of reads that already have a
primary alignment. If our reference were a real genome, this category would denote chimeric reads or
split alignments. However, since our reference is the (rather short) repeat families, this
essentially means that every read has an average of 2-3 repeats mapping to it.
The 60.4% map rate is surprisingly similar to the proportion of the genome that was marked as
repeats by RepeatMasker/RepeatModeler (60%). This suggests that the repeats might be real.
Second, let's have a look at the PacBio HiFi data, which serve as validation:
```bash
$ samtools flagstat $OUTPUT/pb.sam
$
$ 11,081,282 + 0 in total (QC-passed reads + QC-failed reads)
$ 1,715,700 + 0 primary
$ 0 + 0 secondary
$ 9,365,582 + 0 supplementary
$ 0 + 0 duplicates
$ 0 + 0 primary duplicates
$ 10,739,929 + 0 mapped (96.92% : N/A)
$ 1,374,347 + 0 primary mapped (80.10% : N/A)
$ 0 + 0 paired in sequencing
$ 0 + 0 read1
$ 0 + 0 read2
$ 0 + 0 properly paired (N/A : N/A)
$ 0 + 0 with itself and mate mapped
$ 0 + 0 singletons (N/A : N/A)
$ 0 + 0 with mate mapped to a different chr
$ 0 + 0 with mate mapped to a different chr (mapQ>=5)
```
Let's break this down:
- 11 million total alignments are produced, from...
- 1.7 million reads (primary).
- The read-level map rate is 80.1% (primary mapped: 1.37 million reads), so 4/5 of the PacBio reads
have at least one repeat element on them. However, there are also...
- 9.37m supplementary alignments, i.e. additional split alignments, suggesting an average of 8-9
repeats per read. The average PacBio read is about three times as long as the average ONT read
(5,892nt vs 1,848nt), so this makes sense.
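For reference, the average read lengths quoted above can be recomputed from the raw FASTA files with
a bit of `awk` (a sketch; any read-length summary tool would do the same job):
```bash
$ # mean read length = total bases / number of reads
$ awk '/^>/ {n++; next} {bp += length($0)} END {printf "%.0f\n", bp / n}' $NANOPORE
$ awk '/^>/ {n++; next} {bp += length($0)} END {printf "%.0f\n", bp / n}' $HIFI
```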
#### Takeaways:
The repeats seem to be real. There seem to be multiple repeats present on each read.
Ways to improve this analysis:
- repeat analysis with short reads
- filter MQ ≥ 30 alignments and count the number of alignments per read
- analyse the CIGAR strings to see what portion of each read is matched by high-quality alignments to
repeat content (see the sketch below for the latter two points)
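A sketch of the latter two points (output file names are made up; `-F 0x104` drops unmapped and
secondary records but keeps supplementary ones, since those are exactly the extra repeat hits we want
to count):
```bash
$ # number of MQ>=30 alignments per read, sorted by count
$ samtools view -q 30 -F 0x104 $OUTPUT/ont.sam | cut -f1 | sort | uniq -c | sort -k1,1nr > ont_mq30_per_read.txt
$ # matched bases (M/=/X CIGAR operations) per alignment; summing these per read and dividing by the
$ # read length (taken from the FASTA) would give the repeat-covered fraction of each read
$ samtools view -q 30 -F 0x104 $OUTPUT/ont.sam | awk '{
    cigar = $6; matched = 0
    while (match(cigar, /[0-9]+[MIDNSHPX=]/)) {
        op = substr(cigar, RSTART, RLENGTH)
        if (op ~ /[M=X]$/) matched += substr(op, 1, RLENGTH - 1)
        cigar = substr(cigar, RSTART + RLENGTH)
    }
    print $1, matched
}' > ont_mq30_matched_bases.txt
```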
### 2a. _Metridium_ contamination
Map the ONT reads onto the _Metridium_ genome using the [backmap
script](../nanopore/eval-backmap-ont.sh) we wrote for assembly QC:
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/metridium_cont
$ sbatch eval-backmap-ont.sh $METRIDIUM $OUTPUT
```
The first thing to ask here is how many reads mapped well to the _Metridium_ genome. This
information is included in the `.stats` file produced by `backmap`; the SN section of that file is
reproduced below for closer inspection. The important takeaway is that ~290k of the 30.5m reads
mapped to the _Metridium_ genome with mapping quality >0 (419,599 mapped minus 129,520 at MQ0, i.e.
0.95%). This is worrying and should be investigated further.
<details>
<summary>Summary Numbers from `samtools stats`</summary>
```bash
SN raw total sequences: 30502516 # excluding supplementary and secondary reads
SN filtered sequences: 0
SN sequences: 30502516
SN is sorted: 1
SN 1st fragments: 30502516
SN last fragments: 0
SN reads mapped: 419599
SN reads mapped and paired: 0 # paired-end technology bit set + both mates mapped
SN reads unmapped: 30082917
SN reads properly paired: 0 # proper-pair bit set
SN reads paired: 0 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 129520 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 1413288
SN supplementary alignments: 1742261
SN total length: 56368688445 # ignores clipping
SN total first fragment length: 56368688445 # ignores clipping
SN total last fragment length: 0 # ignores clipping
SN bases mapped: 3057365257 # ignores clipping
SN bases mapped (cigar): 356727603 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 79904074 # from NM fields
SN error rate: 2.239918e-01 # mismatches / bases mapped (cigar)
SN average length: 1848
SN average first fragment length: 1848
SN average last fragment length: 0
SN maximum length: 1533915
SN maximum first fragment length: 1533915
SN maximum last fragment length: 0
SN average quality: 34.7
SN insert size average: 0.0
SN insert size standard deviation: 0.0
SN inward oriented pairs: 0
SN outward oriented pairs: 0
SN pairs with other orientation: 0
SN pairs on different chromosomes: 0
SN percentage of properly paired reads (%): 0.0
```
</details>
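For reference, that MQ>0 count can also be pulled directly from the sorted BAM; `-F 0x904` excludes
unmapped, secondary, and supplementary records, so only primary alignments with mapping quality of at
least 1 are counted:
```bash
$ samtools view -c -q 1 -F 0x904 $OUTPUT/GCA_949775045.1_jaMetSeni4.1_genomic.fna.ont.sort.bam
```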
We can then extract only the high-quality (mapping quality ≥ 30) primary alignments (excluding
secondary alignments with `-F 0x100`):
```bash
$ cd $OUTPUT
$ samtools view -q 30 -F 0x100 GCA_949775045.1_jaMetSeni4.1_genomic.fna.ont.sort.bam | cut -f1 > primary_mq30.txt
```
This leaves us with only 32,490 reads.
At the same time, since `flye` doesn't keep track of which reads were used for contig building, we
need to map the reads back to the assembly, which I'd already done for QC reasons anyway. However,
we should be focusing on high-quality alignments that are not secondary or supplementary (0x100 and
0x800 flags):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/backmap/ont
$ sbatch eval-backmap-ont.sh $DRAFT $OUTPUT
$ samtools view -F 0x900 -q 30 -S $OUTPUT/draft.fasta.ont.sort.bam > $OUTPUT/draft.fasta.ont.sort.sam
$ BACKMAPPED_ONT=/lisc/scratch/zoology/pycnogonum/genome/draft/backmap/ont/draft.fasta.ont.sort.sam
```
(feel free to give `samtools view` more threads with `-@` to speed it up.)
We can combine the two outputs to see where the high-confidence _Metridium_ reads are concentrated
on the _P. litorale_ assembly:
```bash
$ LC_ALL=C fgrep -f primary_mq30.txt $BACKMAPPED_ONT > metridium_scaffolds.sam
```
This leaves us with 10,170 putative _Metridium_ reads that also map well to the _P. litorale_
draft. We can then count how many times each scaffold was hit by _Metridium_ reads:
```bash
$ cat metridium_scaffolds.sam | cut -f3 | sort | uniq -c | sort -k1,1nr > metridium_scaffolds.sam_summary
```
The summary file is available in the [repository](./metridium_scaffolds.sam_summary). However, the
most important message becomes clear if we just inspect the top hits:
```bash
$ 2478 pseudochrom_16
$ 623 pseudochrom_22
$ 373 scaffold_1844
$ 355 pseudochrom_38
$ 272 pseudochrom_13
$ 238 pseudochrom_7
$ 233 pseudochrom_24
$ 214 pseudochrom_8
$ 213 pseudochrom_17
$ 174 pseudochrom_23
```
This looks a bit concerning. We need to go back to the contigs that were used to build the scaffolds
and do the same. However, if the offending contigs are highly repetitive and the "metridial" reads
are also highly repetitive, this is less concerning, and can be explained by the similarity of
low-complexity regions or tandem repeats.
Mapping scaffolds back to contigs is not straightforward because of the manual editing of the Hi-C
map. It might be faster to just look at the locations of the putative _Metridium_ reads on the
current assembly and try to make sense of that.
We cross-referenced the predicted repeats of the _P. litorale_ genome with the mapping locations of
the putative _Metridium_ reads. Out of the 10,170 putative _Metridium_ reads, 6,039 (60%) are at
least 25% repetitive, and 4,593 are at least 50% repetitive (see relevant
[notebook](metridium.ipynb)). This leaves us with maybe 5,000 problematic reads out of a total of
30.5 million.
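The gist of that calculation can also be reproduced on the command line with `bedtools coverage`,
assuming the read alignments (e.g. converted with `bedtools bamtobed`) and the repeat annotation are
available as BED files; the file names below are hypothetical:
```bash
$ # bedtools coverage appends, for each read alignment interval: the number of overlapping repeats,
$ # the bases covered by repeats, the interval length, and the covered fraction (last column)
$ bedtools coverage -a metridium_reads_on_draft.bed -b repeats.bed > read_repeat_coverage.bed
$ # reads whose alignment is at least 25% covered by annotated repeats
$ awk '$NF >= 0.25' read_repeat_coverage.bed | cut -f4 | sort -u | wc -l
```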
For context, we can do the same thing with contigs, referring back to the original flye assembly
(before scaffolding):
```bash
$ OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/flye_full/backmap_ont
$ sbatch eval-backmap-ont.sh $FLYE $OUTPUT
$ samtools view -N primary_mq30.txt -F 0x900 -q 30 metridium_locs.sam > metridium_contigs.sam
$ cat metridium_contigs.sam | cut -f3 | sort | uniq -c | sort -k1,1nr > metridium_contigs.sam_summary
```
This file is also [uploaded](./metridium_contigs.sam_summary). Let's have a look at the top hits:
```bash
$ head metridium_contigs.sam_summary
$
$ 2197 contig_8782
$ 324 contig_21131
$ 151 contig_28132
$ 146 contig_6939
$ 123 contig_19945
$ 94 contig_11056
$ 79 contig_5985
$ 78 contig_19987
$ 77 contig_4985
$ 76 contig_4041
```
If we go back to `flye`'s assembly information and look up this contig's path through the repeat
graph, the result is very illuminating:
```bash
contig_8782 209547 86 N N 1 * 10576,-3609,-3607,8782,20233,-24417,-24416,-28783,13730,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605,3605
```
This contig's path starts out normally, but very quickly devolves into a self-repeat: the same edge
(3605) is traversed over and over.
Overall, it looks like _Metridium_ contamination is not a real danger; rather, we should start
contemplating other possibilities to explain these weak sequence similarities, such as horizontal
gene transfer.
### 2b. Other contaminants
Widespread _Metridium_ contamination seems to be out of the question, but this doesn't mean that
other contaminants can't be present. To check for that, we implemented an approach similar in spirit
to `Kraken2`. Briefly, we aligned the draft assembly against UniRef90 with `mmseqs2` and used the
`taxonomy` module to assign kingdom-level taxonomy information to each hit; we then excluded
contigs/scaffolds that did not overwhelmingly contain metazoan genes.
First we need to download UniRef90 and also build the `mmseqs2` database from the genome sequences.
- [mmseqs-download_uniref90.sh](mmseqs-download_uniref90.sh)
- [mmseqs-prepdb.sh](mmseqs-prepdb.sh)
These can be run in parallel.
Then we do the actual alignment/taxonomy assignment:
- [mmseqs-align.sh](mmseqs-align.sh)
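If one doesn't want to babysit the queue, the three jobs can be chained with Slurm dependencies,
e.g.:
```bash
$ DOWNLOAD=$(sbatch --parsable mmseqs-download_uniref90.sh)
$ PREP=$(sbatch --parsable mmseqs-prepdb.sh)
$ sbatch --dependency=afterok:$DOWNLOAD:$PREP mmseqs-align.sh
```
`--parsable` makes `sbatch` print just the job ID, and `afterok` starts the alignment only once both
preparation jobs have finished successfully.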
We then used a custom notebook to parse the result and save taxonomy information per
pseudochromosome/scaffold:
- [contamination.ipynb](contamination.ipynb)
Finally, we used the outputs of this notebook to filter out the contamination from the assembly:
- [filter_draft.ipynb](filter_draft.ipynb)
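As an aside, the same filtering could also be done on the command line, for example with `seqkit
grep` (assuming `seqkit` is available; `exclude.txt` would hold the flagged scaffold names, i.e. the
index column of the suspect table):
```bash
$ cut -f1 scaffolds_taxonomic_distribution_suspect.tsv | tail -n +2 > exclude.txt
$ seqkit grep -v -f exclude.txt plit_q_0_50000_0.5FracBest_output.fasta > draft_filtered.fasta
```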
#!/usr/bin/env bash
#
#SBATCH --job-name=read_repeats
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=5000M
#SBATCH --time=20:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-scan-ont-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-scan-ont-%j.err
# tests:
# 610050 FASTA rows (~1% of file)
# /usr/bin/time bwa mem -x ont2d -t 4 ../repeat_modeller/pycno-families.fa onepercent.fasta > test.sam
# needed 3.2min with 388% CPU usage and 359MB RAM
# 6100500 FASTA rows (~10% of file)
# /usr/bin/time bwa mem -x ont2d -t 4 ../repeat_modeller/pycno-families.fa tenpercent.fasta > test.sam
# needed 54min with 384% CPU usage and 2GB RAM
# first run timed out at 10h (--time=10:00:00) and processed 17,467,334 / 30,502,516 reads, so we
# need at least 17.5h. Going with 20h to be safe. At least now I know the real memory consumption.
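# (extrapolation: 10h * 30,502,516 / 17,467,334 ≈ 17.5h)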
# second run went OOM with 3.5G of RAM, so I'm going with 5G now.
module load bwa/0.7.18
NANOPORE="/lisc/scratch/zoology/pycnogonum/raw/20230601_0904_3C_PAK64436_4d39d6a6/basecalls/202306050/nanopore.fasta"
REPEATS="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa"
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/"
bwa mem -x ont2d -t 4 $REPEATS $NANOPORE > $OUTPUT/ont.sam # thread count matches --cpus-per-task
#!/usr/bin/env bash
#
#SBATCH --job-name=read_repeats
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=300M
#SBATCH --time=80:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-scan-pb-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-scan-pb-%j.err
# tests:
# 34314 FASTA rows (1% of file)
# /usr/bin/time bwa mem -x pacbio -t 4 ../repeat_modeller/pycno-families.fa onepercent.fasta > test.sam
# needed 49s with 392% CPU usage and 271MB RAM
# 343140 FASTA rows (10% of file)
# /usr/bin/time bwa mem -x pacbio -t 4 ../repeat_modeller/pycno-families.fa tenpercent.fasta > test.sam
# needed 480s with 392% CPU usage and 271MB RAM
module load bwa/0.7.18
HIFI="/lisc/scratch/zoology/pycnogonum/raw/m64120_240312_152428.ccs.fasta"
REPEATS="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/repeat_modeller/pycno-families.fa"
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/repeats/"
bwa mem -x pacbio -t 4 $REPEATS $HIFI > $OUTPUT/pb.sam
%% Cell type:markdown id: tags:
# Contamination analysis
I don't want to run [ContScout](https://github.com/h836472/ContScout/) because it requires insane
amounts of RAM/storage. Instead, I will align my draft genome against UniRef90 with mmseqs2, append
taxonomic information to the better hits, and calculate the taxonomic makeup of each scaffold.
We start by [downloading the UniRef90 database](mmseqs-download_uniref90.sh) and creating a mmseqs2
index. While it is running, we will [prepare the draft genome](mmseqs-prepdb.sh) for alignment. When
the download finishes, we can [align the draft genome](mmseqs-align.sh) against UniRef90, add
taxonomic information, and save the result in a .tsv file, which we can afterwards read and parse
at our leisure.
%% Cell type:code id: tags:
``` python
import datetime
print(datetime.datetime.today().date().isoformat())
```
%% Cell type:code id: tags:
``` python
from tqdm import tqdm
import os
import pandas as pd
```
%% Cell type:code id: tags:
``` python
columns = ["query", "target", "alnScore", "seqIdentity", "eVal", "qStart", "qEnd", "qLen", "tStart",
"tEnd", "tLen", "queryOrfStart", "queryOrfEnd", "dbOrfStart", "dbOrfEnd", "taxid",
"level", "level_value", "taxonomy"]
```
%% Cell type:markdown id: tags:
The end result is close to 20Gb. To parse it, we first need to know how many lines it has.
%% Cell type:code id: tags:
``` python
m8 = "/Volumes/scratch/pycnogonum/genome/draft/contamination/contam_tax.m8"
output = "/Volumes/scratch/pycnogonum/genome/draft/contamination/chromosomes/"
```
%% Cell type:code id: tags:
``` python
%%bash -s "$m8" --out lines
m8=$1
wc -l $m8
```
%% Cell type:code id: tags:
``` python
lines
```
%% Cell type:code id: tags:
``` python
total_lines = int(lines.split()[0])
```
%% Cell type:markdown id: tags:
Now we will parse the file. Since we don't want to keep 20Gb in memory, we will extract only the
relevant information: what we need is the taxonomic information of the hits _per
scaffold/chromosome_. We will keep the corresponding columns of the `.m8` file and save them in
separate files per chromosome/scaffold, for later processing. This is somewhat time-consuming, but
can just run in the background, and we will only need to do it once*.
\* unless, of course, we change the genome, which has already happened once :D
%% Cell type:code id: tags:
``` python
chromosome = ""
with open(m8) as f:
    for i in tqdm(range(total_lines)):
        line = f.readline().strip().split("\t")
        query = line[0]  # the scaffold/pseudochromosome this hit belongs to
        if chromosome == "":
            # first line of the file: open the output file for the first sequence
            chromosome = query
            out = open(f"{output}/{chromosome}.tsv", "w")
        if query != chromosome:
            # hits are grouped by query, so a new query name means a new output file
            out.close()
            chromosome = query
            out = open(f"{output}/{chromosome}.tsv", "w")
        out.write("\t".join(line) + "\n")
    out.close()
```
%% Cell type:code id: tags:
``` python
def approximate_taxonomic_distribution(sequence_path, columns, resolution=1000):
    """
    Approximate the taxonomic distribution of a scaffold/pseudochromosome by aggregating hits within
    ORFs. Essentially breaks the sequence into bins of size `resolution` but only uses detected ORFs
    instead of blindly scanning with a sliding window. Assigns the taxon of the hit with the highest
    score to each bin.

    Parameters
    ----------
    sequence_path : str
        Path to the sequence file.
    columns : list
        Column names of the sequence file.
    resolution : int
        Resolution of the approximation (default: 1000).

    Returns
    -------
    pd.Series
        Approximated taxonomic distribution; contains the absolute number of hits for each taxon
        (Metazoa, unknown, Viridiplantae, uc_Bacteria, Fungi, various viruses, uc_Archaea,
        uc_Eukaryota).
    """
    raw = pd.read_csv(sequence_path, sep="\t", header=None)
    raw.columns = columns
    # bin the ORF coordinates of each hit to the requested resolution
    raw["queryOrfStart_approx"] = raw["queryOrfStart"] // resolution
    raw["queryOrfEnd_approx"] = raw["queryOrfEnd"] // resolution
    # keep one representative hit per bin (first per start bin, then per end bin)
    first_pass = raw.groupby("queryOrfStart_approx").first().sort_values("queryOrfEnd", ascending=False)
    second_pass = first_pass.groupby("queryOrfEnd_approx").first().sort_values("queryOrfEnd", ascending=False)
    return second_pass["taxonomy"].value_counts()
```
%% Cell type:code id: tags:
``` python
dir_contents = os.listdir(f"{output}")
sequences = [s for s in dir_contents if s.endswith('.tsv')]
result = {}
for sequence in tqdm(sequences):
    result[sequence] = approximate_taxonomic_distribution(f"{output}/{sequence}", columns)
```
%% Cell type:code id: tags:
``` python
df = pd.concat(result.values(), axis=1).fillna(0)
df.columns = [k.split(".")[0] for k in result.keys()]
# normalize each column by the sum of the column
perc = df.div(df.sum(axis=0), axis=1)
df = df.T
perc = perc.T
```
%% Cell type:code id: tags:
``` python
# df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t")
df = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t", header=0, index_col=0)
```
%% Cell type:code id: tags:
``` python
viruses = df.columns[df.columns.str.contains("vir")]
df['viruses'] = df[viruses].sum(axis=1)
df.drop(columns=viruses, inplace=True)
```
%% Cell type:code id: tags:
``` python
df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_collapsed_vir.tsv", sep="\t")
```
%% Cell type:code id: tags:
``` python
suspect = df.loc[perc["Metazoa"] < 0.9]
```
%% Cell type:code id: tags:
``` python
suspect.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_suspect.tsv", sep="\t")
```
%% Cell type:code id: tags:
``` python
import os
import pandas as pd
```
%% Cell type:code id: tags:
``` python
draft = "/Volumes/scratch/pycnogonum/genome/draft/GAP_sort_scaffolds_by_hic_insert/output/assembly/plit_q_0_50000_0.5FracBest_output.fasta"
suspect = "/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_suspect.tsv"
filtered_draft = "/Volumes/scratch/pycnogonum/genome/draft/contamination/plit_q_0_50000_0.5FracBest_output_filtered.fasta"
```
%% Cell type:code id: tags:
``` python
exclude = pd.read_csv(suspect, sep="\t", index_col=0).index.values
```
%% Cell type:code id: tags:
``` python
exclude
```
%% Output
array(['scaffold_1542', 'scaffold_373', 'scaffold_405', ...,
'scaffold_6737', 'scaffold_4062', 'scaffold_1888'], dtype=object)
%% Cell type:code id: tags:
``` python
copying = False
with open(draft, 'r') as f:
    with open(filtered_draft, 'w') as out:
        for line in f:
            if line.startswith('>'):
                # FASTA header: check whether this sequence is on the exclusion list
                seq = line[1:].split()[0]
                if seq in exclude:
                    copying = False
                else:
                    copying = True
                    out.write(line)
            else:
                if copying:
                    out.write(line)
```
%% Cell type:code id: tags:
``` python
%%bash -s "$filtered_draft" "$draft"
draft=$2
filtered_draft=$1
grep ">" $draft | wc -l
grep ">" $filtered_draft | wc -l
```
%% Output
11648
10258
2478 pseudochrom_16
623 pseudochrom_22
373 scaffold_1844
355 pseudochrom_38
272 pseudochrom_13
238 pseudochrom_7
233 pseudochrom_24
214 pseudochrom_8
213 pseudochrom_17
174 pseudochrom_23
164 pseudochrom_19
160 scaffold_875
146 scaffold_126
140 pseudochrom_21
134 pseudochrom_9
132 pseudochrom_59
127 pseudochrom_1
116 pseudochrom_10
116 pseudochrom_15
115 pseudochrom_35
107 pseudochrom_37
107 pseudochrom_54
104 pseudochrom_30
101 pseudochrom_3
98 pseudochrom_2
97 pseudochrom_58
94 pseudochrom_28
92 pseudochrom_11
92 pseudochrom_26
88 pseudochrom_12
87 pseudochrom_27
86 pseudochrom_31
85 pseudochrom_6
84 pseudochrom_39
83 pseudochrom_33
83 pseudochrom_57
83 scaffold_5103
80 scaffold_152
78 pseudochrom_41
75 pseudochrom_44
75 pseudochrom_47
74 pseudochrom_48
71 pseudochrom_20
71 pseudochrom_45
70 scaffold_601
68 pseudochrom_14
68 pseudochrom_50
66 pseudochrom_36
63 pseudochrom_32
63 scaffold_6579
61 pseudochrom_34
60 pseudochrom_4
59 pseudochrom_25
55 pseudochrom_18
51 scaffold_1883
50 pseudochrom_29
50 pseudochrom_43
48 pseudochrom_40
47 pseudochrom_51
47 pseudochrom_56
38 pseudochrom_49
37 scaffold_139
35 pseudochrom_42
33 pseudochrom_46
33 pseudochrom_5
24 pseudochrom_55
23 scaffold_10842
23 scaffold_2738
22 scaffold_365
19 scaffold_1105
17 scaffold_342
17 scaffold_4829
16 scaffold_1377
13 scaffold_7325
11 scaffold_2532
11 scaffold_3335
10 scaffold_246
10 scaffold_296
10 scaffold_5270
8 scaffold_2824
8 scaffold_3119
8 scaffold_3681
7 scaffold_462
7 scaffold_8788
6 scaffold_1578
6 scaffold_3633
6 scaffold_8483
5 scaffold_10507
5 scaffold_254
5 scaffold_5366
5 scaffold_6651
5 scaffold_6748
4 scaffold_2128
4 scaffold_817
4 scaffold_9451
3 scaffold_1281
3 scaffold_1615
3 scaffold_1732
3 scaffold_5394
3 scaffold_5725
3 scaffold_7012
2 scaffold_10435
2 scaffold_11049
2 scaffold_11619
2 scaffold_11825
2 scaffold_12344
2 scaffold_1553
2 scaffold_1642
2 scaffold_195
2 scaffold_204
2 scaffold_2520
2 scaffold_356
2 scaffold_3880
2 scaffold_5554
2 scaffold_6479
2 scaffold_804
1 scaffold_10482
1 scaffold_10792
1 scaffold_11196
1 scaffold_11454
1 scaffold_115
1 scaffold_120
1 scaffold_12273
1 scaffold_12277
1 scaffold_12312
1 scaffold_12338
1 scaffold_1238
1 scaffold_1291
1 scaffold_135
1 scaffold_136
1 scaffold_141
1 scaffold_1484
1 scaffold_158
1 scaffold_1639
1 scaffold_166
1 scaffold_1686
1 scaffold_175
1 scaffold_1756
1 scaffold_1779
1 scaffold_185
1 scaffold_1946
1 scaffold_2174
1 scaffold_2249
1 scaffold_2265
1 scaffold_232
1 scaffold_242
1 scaffold_2471
1 scaffold_2474
1 scaffold_248
1 scaffold_249
1 scaffold_252
1 scaffold_2536
1 scaffold_268
1 scaffold_2829
1 scaffold_2836
1 scaffold_2865
1 scaffold_294
1 scaffold_2969
1 scaffold_3003
1 scaffold_306
1 scaffold_325
1 scaffold_3299
1 scaffold_3487
1 scaffold_3502
1 scaffold_3736
1 scaffold_437
1 scaffold_449
1 scaffold_4666
1 scaffold_4697
1 scaffold_484
1 scaffold_4875
1 scaffold_502
1 scaffold_5241
1 scaffold_525
1 scaffold_5360
1 scaffold_549
1 scaffold_554
1 scaffold_5543
1 scaffold_573
1 scaffold_6082
1 scaffold_6518
1 scaffold_6880
1 scaffold_7139
1 scaffold_7200
1 scaffold_725
1 scaffold_7415
1 scaffold_7589
1 scaffold_7631
1 scaffold_7836
1 scaffold_8100
1 scaffold_8207
1 scaffold_8239
1 scaffold_8555
1 scaffold_8989
1 scaffold_8991
1 scaffold_920
1 scaffold_9571
1 scaffold_9603
1 scaffold_9690
#!/usr/bin/env bash
#
#SBATCH --job-name=mmseqs2-align
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --time=36:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-align-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-align-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/contamination/"
UNIREF90="/lisc/scratch/zoology/db/uniref90/uniref90"
cd "$OUTPUT" || exit 1
mmseqs search asm $UNIREF90 contam "$TMPDIR" --threads 32 --search-type 2 --split-memory-limit 90G
mmseqs addtaxonomy $UNIREF90 contam contam_tax --threads 32 --lca-ranks kingdom
mmseqs createtsv asm $UNIREF90 contam_tax contam_tax.m8 --threads 32
#!/usr/bin/env bash
#
#SBATCH --job-name=uniref90
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=50G
#SBATCH --time=1:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-uniref90-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-uniref90-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/db/uniref90/"
mkdir -p "$OUTPUT" || exit 1
cd "$OUTPUT" || exit 1
mmseqs databases UniRef90 "$OUTPUT"/uniref90 "$TMPDIR" --threads 16
#!/usr/bin/env bash
#
#SBATCH --job-name=mmseqs2-index
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=10G
#SBATCH --time=10:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/mmseqs-prep-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/mmseqs-prep-%j.err
module load mmseqs2
OUTPUT="/lisc/scratch/zoology/pycnogonum/genome/draft/contamination"
ASSEMBLY="/lisc/scratch/zoology/pycnogonum/genome/draft/GAP_sort_scaffolds_by_hic_insert/output/assembly/plit_q_0_50000_0.5FracBest_output.fasta"
mkdir -p "$OUTPUT" || exit 1
cd "$OUTPUT" || exit 1
mmseqs createdb $ASSEMBLY $OUTPUT/asm
mmseqs createindex $OUTPUT/asm "$TMPDIR" --threads 8 --search-type 2