Skip to content
Snippets Groups Projects
Commit 2e4251db authored by Niko (Nikolaos) Papadopoulos's avatar Niko (Nikolaos) Papadopoulos
Browse files

genome annotation

parent 119cf2be
Branches
Tags
No related merge requests found
Showing
with 2101 additions and 0 deletions
File added
# Annotation
Code in this folder covers the annotation of the genome with gene models, mostly for protein-coding
genes. The starting point is the scaffolded, decontaminated draft assembly.
## Annotating protein-coding genes on the _Pycnogonum_ genome
### BRAKER3 with developmental RNA-seq data
BRAKER3 is one of the most widely used tools for eukaryotic gene prediction, and one that
consistently performs well in benchmarks. However, installing the software is a non-trivial
endeavour in a HPC system without admin privileges. We provide here a summary of our experience in
the hopes that it is useful to other hopeful users.
- [Installing BRAKER3 in a shared system without admin access](install_braker.md)
We generated RNA-seq data for a time series spanning development from late embryogenesis until the
subadult stage (embryo 3-4, instars 1 to 6, juvenile 1, subadult), and our collaborators provided
data for even earlier timepoints (zygote, early cleavage, embryo 3-5). This data was mapped to the
draft genome (average: 72.8% uniquely mapping, 15% multimapping, so pretty good completeness) and
the resulting BAM files used as input to BRAKER3.
- [`prep-RNA-mkindex.sh`](prep-RNA-mkindex.sh) - before mapping the RNA we need to index the draft
genome so that STAR can actually map to it.
- [`prep-RNA-map-all.sh`](prep-RNA-map-all.sh) - main script that will submit all mapping jobs.
- [`prep-RNA-map-single.sh`](prep-RNA-map-single.sh) - worker script that will map the RNA reads
from a developmental timepoint to the draft genome.
- [`prep-RNA-bam.sh`](prep-RNA-bam.sh) - main script that will submit the SAM-to-BAM conversion jobs
to the cluster.
- [`prep-RNA-sam2bam.sh`](prep-RNA-sam2bam.sh) - worker script that will convert a SAM file to a BAM
file in the same location.
- [`prep-RNA-index_bam.sh`](prep-RNA-index_bam.sh) - index the resulting BAM files
BRAKER3 also expects a FASTA file with protein sequences expected to be found in the genome. For
this, we used the Arthropoda partition from [orthoDB](https://www.orthodb.org).
BRAKER3 was run from a container on the Vienna LiSC (Life Science Cluster; LiSC for short).
- [annotate/annot-braker.sh](./annot1-braker.sh)
This predicted 11,451 gene models. However, manual inspection of the RNA-seq mapping on the draft
genome revealed that BRAKER3 had been rather conservative in its predictions, and that many loci
that looked like they contained exon structures did not receive any annotation.
### using Iso-seq collapsed transcripts as gene models
We generated a long-read transcriptome using PacBio Iso-seq technology. After collapsing isoforms by
mapping onto the draft genome, we were left with 8,904 transcript families, which we interpreted as
genes. By comparing the Iso-seq "genes" with the BRAKER3 models we were able to identify multiple
cases where BRAKER3 occasionally merged gene models if they were too close to each other.
We decided to combine the Iso-seq and BRAKER3 annotation by giving precedence to Iso-seq. We
filtered the BRAKER3 gene models to remove any that overlapped (same strand, any overlap) with the
Iso-seq genes, and were left with 3,596 gene models, raising our total to 3,596 + 8,904 = 12,500.
- [Map](annot1-isoseq-map.sh) the Iso-seq collapsed transcripts to the draft genome
- [isoseq_confirm.ipynb](./annot1-isoseq_confirm.ipynb) - cross-reference the Iso-seq mapping
locations with BRAKER predictions and produce a new GFF file that contains their union.
Unfortunately, the Iso-seq transcriptome we generated was part of a multiplexed run. This meant that
we "lost" almost 60% of our effective sequence depth on other organisms, making the _Pycnogonum_
Iso-seq less effective for gene annotation. Manual inspection confirmed what the numbers suggested:
there were still plenty of loci with RNA-seq peaks that looked like _bona fide_ genes but received
no annotation from either PacBio or BRAKER3.
We consider Iso-seq+BRAKER3 "round 1" of the genome annotation. In the final GFF file, gene models
that are derived from Iso-seq are have "PacBio" in the `source` column, and their IDs are in the
form `PB.XXXX`, where `X` are digits. BRAKER3 gene models have "AUGUSTUS" listed as their `source`
and their IDs have the form `gXXXXX`, with `X` again being digits.
### Re-using unannotated loci for a second round of annotation
There were a couple of indications that the genome annotation was not yet complete. First, BUSCO
completeness of the gene models was lower than the entire genome; second, the mapping rates of the
RNA-seq data was much lower for gene models (avg. uniquely mapping: 48.69%) compared to the entire
genome (avg. uniquely mapping: 72.8%); finally, manual inspection easily identified loci that looked
like genes but had no annotation.
We proceeded to a second round of annotation. We filtered the RNA-seq data to only keep the mapped
reads that did _not_ overlap with any of the gene models of the first BRAKER/Iso-seq round:
- [annot2-RNA-extracts.sh](./annot2-RNA-extracts.sh) - main script that will submit worker scripts
for each transcriptome.
- [text](annot2-RNA-extract_uncovered.sh) - worker script that, for a bulk transcriptome will
extract the reads that map on un-modelled regions.
...and supplied the same orthoDB arthropod partition as protein hints:
- [annot2-braker.sh](./annot2-braker.sh)
As the protein hints overlapped with round 1, it was to be expected that BRAKER3 would also produce
gene models that would overlap with round 1. To safely exclude these cases we used _exons_ to test
for gene overlap. Using entire gene models would preclude genes that overlap with other genes (but
on the other strand), or genes that lie in another gene's intron (such as the CHAT/VCHAT genes,
conserved in _Platynereis_).
This task was mostly achieved on the command line, using a combination of `bedtools intersect` to
identify overlaps between GFF files, `AGAT` to convert from GTF to GFF, and bash utility tools
`grep`, `cut`, `sort`, and `sed` to extract and lightly manipulate lines from the resulting files.
We consider this "round 2" of the genome annotation. In the final GFF file, gene models from round 2
have "AGAT" as `source` for gene/mRNA entries or "AUGUSTUS" for exons/CDS. Gene IDs are in the form
`r2_gXXXX`, with `r2` standing for "round 2" and the rest being the usual BRAKER3 gene ID. Round 2
produced an additional 2,223 gene models, bringing our total to 14,723.
<details>
<summary>Click here for details</summary>
First extract the round 2 exons that don't overlap:
```bash
$ grep exon braker/braker.gtf > braker_exons.gtf
$ grep exon ../annot_merge/merged.gff3 > draft_exons.gtf
$ bedtools intersect -v -b draft_exons.gtf -a braker_exons.gtf > unique_exons.gtf
```
Keep track of which genes these come from:
```bash
$ cat unique_exons.gtf | cut -f9 | cut -d" " -f 4 | sort -u > keep_candidates.txt
```
At the same time, there might be round 2 genes that are a bit bigger than round 1 genes, and thus
mostly overlap but have some unique exons. We'd like to avoid those:
```bash
$ bedtools intersect -wa -b draft_exons.gtf -a braker_exons.gtf > shared_exons.gtf
$ cat shared_exons.gtf | cut -f9 | cut -f4 -d" " | sort -u > overlap.txt
```
We can now filter the candidate genes by those that have some overlap:
```bash
$ grep -v -f overlap.txt keep_candidates.txt > keep_genes.txt
```
And finally, we can filter the GTF file:
```bash
$ grep -f keep_genes.txt braker/braker.gtf > braker2_unique.gtf
```
Now, before we can use this GTF file we need to make the gene names unique so that they don't
overlap with the round 1 gene names. We can do this by adding a suffix to the gene names:
```bash
$ sed -r 's/(g[[:digit:]])/r2_\1/g' braker2_unique.gtf > braker2_unique_renamed.gtf
```
Finally, we can convert to a GFF3 file and append to our existing annotation:
```bash
$ module load conda
$ conda activate agat-1.4.0
$ agat_convert_sp_gxf2gxf.pl -g braker2_unique_renamed.gtf -o ./braker2_unique_renamed.gff3
```
Let's copy that over to the annotation merge site:
```bash
$ cp braker2_unique_renamed.gff3 ../annot_merge/braker2_unique.gff3
```
Now we can remove the lines that contain introns or start/stop codons:
```bash
$ cd ../annot_merge
$ grep -v -E -i "(intron)|(codon)" braker2_unique_renamed.gff3 > braker2_unique_renamed_nocodon_intron.gff3
```
</details>
### Using _de novo_ transcriptomes as additional evidence
Despite the improvements, manual inspection continued to suggest that there were loci with real
RNA-seq peaks and exon-like structures that did not receive gene models. We reasoned that loci that
represented real genes would be more likely to be present in multiple time points. Since BRAKER3 had
already seen these loci and didn't identify them as genes, we hypothesized that the signal-to-noise
ratio might have played a role. Instead of using raw reads we therefore turned to _de novo_
assembled transcripts, and tried to find loci where transcripts from multiple timepoints mapped onto
the draft genome.
We used [Trinity](https://github.com/trinityrnaseq/trinityrnaseq) to perform _de novo_ assemblies
for the in-house RNA-seq datasets, as they had been sequenced at a much deeper level than the
transcriptomes provided by our collaborators, giving the tool better chances to assemble real
sequences (also refer to the [corresponding section](../05-transcriptomes/README.md)) We kept
complete ORFs, as annotated by [TransDecoder](https://github.com/TransDecoder/TransDecoder), and
mapped those against the draft genome.
- [`filter_assemblies.sh`](./annot3-RNA-01-filter_assemblies.sh) - main script that submits
filtering jobs to the cluster.
- [`extract_complete_ORFs.sh`](annot3-RNA-01-extract_complete_ORFs.sh) - worker script that will
extract coding regions that represent complete ORFs
- [`map_assemblies.sh`](./annot3-RNA-02-map_assemblies.sh) - main script that will submit
mapping jobs.
- [map complete ORFs to genome](annot3-RNA-02-map_assembly_single.sh) - worker script
- [`pack_sams.sh`](./annot3-RNA-03-pack_sams.sh) - main script that will submit formatting jobs to
the cluster.
- [convert to GFF3](annot3-RNA-03-pack_sam.sh) - worker script that will extract reads that mapped
well and save them in GFF3 format.
We chose to use the instar 3 transcriptome, which had 99% BUSCO completeness (metazoa), as a
reference. We used `bedtools intersect` to find all draft genome loci where an instar 3 transcript
mapped and overlapped (at 90% of its length) with a mapped transcript from another time point (also
at 90% of its length), and excluded all loci that overlapped with gene models from rounds 1 and 2.
We then translated the cDNA match parts of each locus to exons. Whenever there was disagreement
between the different _de novo_ transcriptomes about exon boundaries, we picked the most common exon
boundary.
- [mining de novo assembled transcriptomes](annot3-RNA-04-denovo_mining.ipynb)
While this approach is less sensitive than BRAKER3 and is incapable of distinguishing between
isoforms, it is fairly robust in finding gene and exon boundaries, at least so far as the coding
regions are concerned.
We consider this "round 3" of the genome annotation. In the final GFF file, gene models from round 3
have "Trinity" as `source`. Gene IDs are in the form `at_DNXXXX`, with `at` standing for "assembled
transcriptomes", `DN` being a tribute to Trinity transcript naming, and `X` being digits. Round 3
produced an additional 774 gene models, bringing our total to 15,497.
### Consolidating results
Finally, we can form the GFF3 file and sort it:
```bash
$ cat isoseq.gff > merged.gff3
$ cat braker.gff >> merged.gff3
$ cat braker2_unique_renamed_nocodon_intron.gff3 >> merged.gff3
$ cat denovo_txomes/overlap_translated.gff3 >> merged.gff3
$ cat ../trnascan/trnascan.gff3 >> merged.gff3
$ module load genometools/
$ gt gff3 -tidy -retainids -o merged_sorted.gff3 -force merged.gff3
```
### Testing
First, convert the GFF to GTF for convenience:
```bash
$ gt gff3_to_gtf -o merged_sorted.gtf merged_sorted.gff3
```
Now we can use [AGAT](https://agat.readthedocs.io/en/latest/tools/agat_sp_extract_sequences.html) to
extract transcripts from the genome:
```bash
$ agat_sp_extract_sequences.pl -g merged_sorted.gff3 -f ../draft_softmasked.fasta -t exon --merge -o transcripts.fa
```
We can then use this file as input for TransDecoder...
```bash
$ TransDecoder.LongOrfs -t transcripts.fa
$ TransDecoder.Predict -t transcripts.fa
```
...and finally look at BUSCO completeness:
```bash
$ conda deactivate
$ conda activate busco-5.7.1
$ busco -i transcripts.fa.transdecoder.pep -l metazoa -m protein -o metazoa -r -c 4 --offline --download_path ../../busco/busco_downloads/
```
Which returns:
```
---------------------------------------------------
|Results from dataset metazoa_odb10 |
---------------------------------------------------
|C:96.5%[S:40.7%,D:55.8%],F:1.5%,M:2.0%,n:954 |
|920 Complete BUSCOs (C) |
|388 Complete and single-copy BUSCOs (S) |
|532 Complete and duplicated BUSCOs (D) |
|14 Fragmented BUSCOs (F) |
|20 Missing BUSCOs (M) |
|954 Total BUSCO groups searched |
---------------------------------------------------
```
This is very, very close to what we had for the entire genome (except for the high prevalence of
duplicated BUSCOs, which can be explained by the fact that we included all isoforms). In particular,
it is worth noting that, for the entire genome, we had 1.6% fragmented and 1.7% missing BUSCOs.
Taken together, this suggests that we have a fairly complete set of gene models.
We can do the same with the arthropod BUSCO set:
```bash
$ busco -i transcripts.fa.transdecoder.pep -l arthropoda -m protein -o arthropoda -r -c 4 --offline --download_path ../../busco/busco_downloads/
```
Which returns:
```
---------------------------------------------------
|Results from dataset arthropoda_odb10 |
---------------------------------------------------
|C:95.8%[S:37.3%,D:58.5%],F:2.0%,M:2.2%,n:1013 |
|971 Complete BUSCOs (C) |
|378 Complete and single-copy BUSCOs (S) |
|593 Complete and duplicated BUSCOs (D) |
|20 Fragmented BUSCOs (F) |
|22 Missing BUSCOs (M) |
|1013 Total BUSCO groups searched |
---------------------------------------------------
```
Very similar results; again suggesting that we have a fairly complete set of gene models.
## tRNAscan
We also used tRNAscan to predict tRNAs in the draft genome.
- [run the tool](annot-trnascan.sh)
- [export in GFF3 format](tRNAscan_to_gff3.ipynb)
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=pycno_trnascan
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=500M
#SBATCH --time=1:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-trnascan-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-trnascan-%j.err
module load trnascan
# Testing was performed with a FASTA file with 60 chars/line.
# | CPUs | input size (lines) | real time | max memory | CPU util. |
# |------|--------------------|-----------|------------|-----------|
# | 1 | 78753 | 45s | 57.5Mb | 97% |
# | 2 | 78753 | 31s | 97Mb | 142% |
# | 4 | 78753 | 30s | 16.8Mb | 182% | (testing)
# | 1 | 787530 | 7:04 | 172Mb | 98% |
# | 2 | 787530 | 4:58 | 225Mb | 145% |
# | 4 | 787530 | 4:12 | 336Mb | 188% |
# |------|--------------------|-----------|------------|-----------|
# | 2 | 7875382 | 53:40 | 304Mb | 60% | (run)
ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/trnascan/
mkdir "$OUTPUT" -p || exit
cd "$OUTPUT" || exit
tRNAscan-SE \
-o trnascan.out \
-f trnascan.fasta \
-b trnascan.bed \
--stats trnascan.stats \
--progress \
--hitsrc \
--thread 2 \
"$ASSEMBLY"
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=pycno_braker
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=18
#SBATCH --mem=40G
#SBATCH --time=20:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
module load conda
conda activate singularity
# container location
BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
# path to the output directory, where all the magic should happen
OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker
mkdir "$OUTPUT" -p || exit
cd "$OUTPUT" || exit
# copy AUGUSTUS config files for BRAKER
cp -r ~/repos/Augustus/config .
# define input files
# the genome assembly
GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
# the RNA-seq dataset BAMs
BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
ZYGOTE="$BASE/ZYGOTE/Aligned.out.bam"
EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.bam"
EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.bam"
EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.bam"
EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.bam"
EMBRYO3="$BASE/EMBRYO3/Aligned.out.bam"
INSTAR1="$BASE/INSTAR1/Aligned.out.bam"
INSTAR2="$BASE/INSTAR2/Aligned.out.bam"
INSTAR3="$BASE/INSTAR3/Aligned.out.bam"
INSTAR4="$BASE/INSTAR4/Aligned.out.bam"
INSTAR5="$BASE/INSTAR5/Aligned.out.bam"
INSTAR6="$BASE/INSTAR6/Aligned.out.bam"
JUV1="$BASE/JUV1/Aligned.out.bam"
SUBADULT="$BASE/SUBADULT/Aligned.out.bam"
# the Arthropoda OrthoDB protein sequences
ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
# in case it is needed.
singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
--genome=$GENOME \
--prot_seq=$ARTHROPODA \
--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
--softmasking \
--threads 18
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=mmap_pycno
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=15G
#SBATCH --time=15:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-mmap-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-mmap-%j.err
module load conda
conda activate minimap2-2.28
DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
ISOSEQ=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/isoseq/isoseq.fastq
/usr/bin/time minimap2 -ax splice -t16 -uf -C5 $DRAFT $ISOSEQ | samtools view -bS -o isoseq.bam
\ No newline at end of file
%% Cell type:markdown id: tags:
# Confirming gene models with isoform-level RNA-seq data
## Motivation
Our gene prediction with BRAKER3 produced a low number of gene models (~11.5k). We suspect this
might be because GeneMark-ET/StringTie are too aggressive with exons/genes that are close to each
other, merging them and messing up the gene models. We will use the isoform-level RNA-seq data to
confirm some of the gene models.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
from tqdm import tqdm
```
%% Cell type:markdown id: tags:
I extracted all transcript lines from the corresponding gff/gtf files and used bedtools to intersect
them:
```bash
$ ISOSEQ=/lisc/scratch/zoology/pycnogonum/transcriptome/isoseq/isoseq_confirmed.gff
$ BRAKER=/lisc/scratch/zoology/pycnogonum/genome/draft/braker/braker/braker_proposed.gtf
$ bedtools intersect -wao -b $ISOSEQ -a $BRAKER > brak_iso.bed
```
%% Cell type:code id: tags:
``` python
gtf = ["seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attributes"]
# iso2bra = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/iso_brak.bed", sep="\t", header=None)
# iso2bra.columns = ["iso_" + x for x in gtf] + ["bra_" + x for x in gtf] + ["overlap"]
# iso2bra["iso_gene"] = iso2bra["iso_attributes"].str.extract(r'gene_id "(.*?)";')
# iso2bra["bra_gene"] = iso2bra["bra_attributes"].str.split("\.").str[0]
# iso2bra["bra_gene"] = iso2bra["bra_gene"].replace("", np.nan)
bra2iso = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/brak_iso.bed", sep="\t", header=None)
bra2iso.columns = ["bra_" + x for x in gtf] + ["iso_" + x for x in gtf] + ["overlap"]
bra2iso["bra_gene"] = bra2iso["bra_attributes"].str.split(".").str[0]
bra2iso["iso_gene"] = bra2iso["iso_attributes"].str.extract(r'gene_id "(.*?)";')
```
%% Cell type:markdown id: tags:
Now to find overlaps:
%% Cell type:code id: tags:
``` python
def contained(df, a_start="bra_start", a_end="bra_end", b_start="iso_start", b_end="iso_end"):
"""Checks if sequence A is completely contained within sequence B or vice versa.
Parameters
----------
df : pd.DataFrame
A dataframe of a bedtools intersect file.
a_start : str, optional
the start position of sequence A, by default "bra_start"
a_end : str, optional
the end position of sequence B, by default "bra_end"
b_start : str, optional
The start position of sequence B, by default "iso_start"
b_end : str, optional
the end position of sequence B, by default "iso_end"
Returns
-------
bool
returns True if the sequence is contained, False otherwise
"""
if df["bra_strand"] != df["iso_strand"]:
return False # make sure that we are comparing genes on the same strand
if df[b_start] < 0 or df[b_end] < 0 or df[a_start] < 0 or df[a_end] < 0:
return False
# first check if the braker gene is completely contained within the isoseq one
if df[a_start] >= df[b_start] and df[a_end] <= df[b_end]:
return True
# then check if the braker gene completely contains the isoseq one
if df[a_start] <= df[b_start] and df[a_end] >= df[b_end]:
return True
return False
```
%% Cell type:code id: tags:
``` python
genes = bra2iso.groupby("bra_gene")
```
%% Cell type:code id: tags:
``` python
def calc_overlap(df):
bra_len = df.iloc[0]["bra_end"] - df.iloc[0]["bra_start"]
# print(bra_len)
seq = np.zeros(bra_len, dtype=bool)
for i, transcript in df.iterrows():
start = transcript["iso_start"] - transcript["bra_start"]
start = np.max((0, start))
end = transcript["iso_end"] - transcript["bra_start"]
# print(end, len(seq))
end = np.min([bra_len, end])
seq[start:end] = True
return np.sum(seq) / bra_len
```
%% Cell type:code id: tags:
``` python
braker_unique = []
significant = []
unknown = []
truncate = []
for gene in tqdm(genes.groups):
try:
transcripts = genes.get_group(gene)
except KeyError:
unknown.append(gene)
continue
# first see how many candidate ISO-seq genes this BRAKER gene overlaps with
# if there is zero, then it means no possible corresponding ISO-seq gene was found
no_iso = transcripts.groupby("iso_gene").size().shape[0] # how many ISO-seq transcripts match this BRAKER gene?
if no_iso == 0:
braker_unique.append(gene)
continue
# if there is one it means we have an 1-to-1 correspondence
if no_iso == 1:
# try to verify that most BRAKER isoforms are contained within the ISO-seq isoforms
# (or vice versa)
no_contained = transcripts.apply(contained, axis=1).sum()
if no_contained / transcripts.shape[0] >= 0.4:
significant.append(gene)
continue
# test if the ISO-seq isoforms cumulatively cover at least 50% of the BRAKER gene
total_overlap = transcripts.groupby("bra_attributes").apply(calc_overlap, include_groups=False)
if np.mean(total_overlap) >= 0.5:
significant.append(gene)
continue
# if we have arrived here, all else failed and this gene needs to go for manual curation
unknown.append(gene)
```
%% Output
100%|██████████| 11451/11451 [00:03<00:00, 2913.12it/s]
%% Cell type:code id: tags:
``` python
len(significant), len(braker_unique), len(unknown), np.sum([len(significant), len(braker_unique), len(unknown)])
```
%% Output
(7591, 3596, 264, np.int64(11451))
%% Cell type:markdown id: tags:
At this point the strategy should be to copy the ISO-seq and "braker_unique" gene models from their
respective GTF files to a new, merged one. This should already take us from the 11,451 BRAKER gene
models to 8,904 + 3,442 = 12,346 gene models. We can then screen the unknown modules further.
%% Cell type:code id: tags:
``` python
isoseq_loc = "/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff"
braker_loc = "/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gtf"
```
%% Cell type:code id: tags:
``` python
def gene_string(gene_id, start, stop, last_line):
chunks = last_line.split("\t")
seqid = chunks[0]
source = chunks[1]
seq_type = "gene"
score = "."
strand = chunks[6]
phase = "."
attributes = f"ID={gene_id};gene_id={gene_id};"
string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
return string
def transcript_string(gene_id, transcript_id, line):
chunks = line.split("\t")
seqid = chunks[0]
source = chunks[1]
seq_type = "mRNA"
start = chunks[3]
stop = chunks[4]
score = chunks[5]
strand = chunks[6]
phase = "."
attributes = f"ID={transcript_id};Parent={gene_id}"
string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
return string
def exon_string(transcript_id, line, exon_counter):
chunks = line.split("\t")
seqid = chunks[0]
source = chunks[1]
start = chunks[3]
stop = chunks[4]
score = chunks[5]
strand = chunks[6]
seq_type = "exon"
phase = "."
attributes = f"ID={transcript_id}.exon{exon_counter};Parent={transcript_id}"
exon_string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
seq_type = "CDS"
phase="0"
attributes = f"ID={transcript_id}.CDS{exon_counter};Parent={transcript_id}"
CDS_string = f"{seqid}\t{source}\t{seq_type}\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{attributes};\n"
return exon_string + CDS_string
```
%% Cell type:code id: tags:
``` python
def write_gene(current_gene_id, gene_lines, edit, start, end):
chunks = gene_lines[0].split("\t")
# write the gene line first
gene = gene_string(current_gene_id, start, end, gene_lines[0])
edit.write(gene)
# now write each of the transcripts
exon_counter = 1
for l in gene_lines:
chunks = l.split("\t")
if chunks[2] == "transcript":
transcript_id = chunks[8].split(";")[1].split(" ")[2][1:-1]
transcript = transcript_string(current_gene_id, transcript_id, l)
edit.write(transcript)
else:
exon = exon_string(transcript_id, l, exon_counter)
edit.write(exon)
exon_counter += 1
edit.write("###\n")
```
%% Cell type:code id: tags:
``` python
gene_lines = []
current_gene_id = None
start = np.inf
end = 0
with open("/Volumes/scratch/pycnogonum/transcriptome/isoseq/collapsed.gff", "r") as isoseq:
with open("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/isoseq.gff", "w") as edit:
for line in isoseq:
if line.startswith("#"):
edit.write(line)
continue
chunks = line.split("\t")
entry_type = chunks[2]
# print(line)
if entry_type == "transcript":
gene_id = chunks[8].split(";")[0].split(" ")[1][1:-1]
if gene_id != current_gene_id:
if current_gene_id is None: # first gene
current_gene_id = gene_id
gene_lines.append(line)
continue
# new gene! Write it out:
write_gene(current_gene_id, gene_lines, edit, start, end)
# reset the gene
current_gene_id = gene_id
gene_lines = []
start = int(chunks[3])
end = int(chunks[4])
else:
start = min(start, int(chunks[3]))
end = max(end, int(chunks[4]))
gene_lines.append(line)
# write last gene
write_gene(current_gene_id, gene_lines, edit, start, end)
```
%% Cell type:code id: tags:
``` python
gene_lines = []
current_gene_id = None
with open("/Volumes/scratch/pycnogonum/genome/draft/braker/braker/braker.gff3", "r") as braker:
with open("/Volumes/scratch/pycnogonum/genome/draft/annot_merge/braker.gff", "w") as edit:
for line in braker:
chunks = line.split("\t")
entry_type = chunks[2]
if entry_type in ["start_codon", "stop_codon", "intron"]:
continue
if entry_type == "gene":
current_gene_id = chunks[8].split(";")[0].split("=")[1]
line = line.rstrip() + "gene_id=" + current_gene_id + ";\n"
if current_gene_id in braker_unique:
edit.write(line)
```
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
```
%% Cell type:code id: tags:
``` python
def plot_overlap(df):
# plot the overlapping regions as straight lines, color by transcript ID
start = np.min((df["bra_start"].min(), df["iso_start"].min()))
end = np.max((df["bra_end"].max(), df["iso_end"].max()))
fig, ax = plt.subplots()
# for i, transcript in df.iterrows():
# ax.hlines(i*2, transcript["iso_start"], transcript["iso_end"], label=transcript["iso_gene"])
# ax.hlines(i*2+1, transcript["bra_start"], transcript["bra_end"], label=transcript["bra_gene"])
ax.set_xlim(start-1000, end+1000)
ax.legend()
```
%% Cell type:code id: tags:
``` python
# a function that plots all the unique transcripts in a dataframe, coloring them by gene ID
def plot_overlap(df):
# first find out the minimum start and maximum end positions of all the real BRAKER genes
# (i.e. not the -1 values)
bra_start_min = np.min(df["bra_start"][df["bra_start"] > 0])
bra_end_max = np.max(df["bra_end"][df["bra_end"] > 0])
# now do the same for the iso-seq genes
iso_start_min = np.min(df["iso_start"][df["iso_start"] > 0])
iso_end_max = np.max(df["iso_end"][df["iso_end"] > 0])
# the plot should start 1000 bp before the minimum start and end 1000 bp after the maximum end
start = np.min((bra_start_min, iso_start_min)) - 1000
end = np.max((bra_end_max, iso_end_max)) + 1000
fig, ax = plt.subplots()
# get a list of all the unique gene IDs
gene_list = np.unique(df["iso_gene"].tolist() + df["bra_gene"].tolist())
# generate a color map for the gene IDs - this should be always less than 10, so we can take
# the Set1 colormap
colors = plt.cm.Set1(np.linspace(0, 1, len(gene_list)))
# create a dictionary that maps gene IDs to colors
cmap = dict(zip(gene_list, colors))
# each transcript should be a straight line, colored by the gene ID. Let's do BRAKER first.
braker = df.groupby("bra_attributes").first()
for i, row in braker.iterrows():
ax.hlines(i, row["bra_start"], row["bra_end"], color=cmap[row["bra_gene"]])
# now do the same for the iso-seq genes
iso = df.groupby("iso_attributes").first()
for i, row in iso.iterrows():
try:
ax.hlines(i, row["iso_start"], row["iso_end"], color=cmap[row["iso_gene"]])
except KeyError:
# there might be some empty IDs here if there is no overlap for this transcript
# ax.hlines(i, start, end, linestyles="dashed", color="black")
continue
# set the limits of the plot
ax.set_xlim(start, end)
# keep track of what chromosome we're in in case we want to go to the genome browser
ax.set_title(f"{df['bra_seqname'].iloc[0]}")
# save the plot for manual inspection
plt.savefig(f"figs/{df['bra_gene'].iloc[0]}.png")
# close plot to save memory
plt.close()
```
%% Cell type:code id: tags:
``` python
for gene in tqdm(unknown):
keep = bra2iso["bra_gene"] == gene
plot_overlap(bra2iso[keep])
```
%% Output
100%|██████████| 264/264 [00:14<00:00, 18.59it/s]
%% Cell type:markdown id: tags:
Manual inspection of the ISO-seq/BRAKER overlap for these 264 genes identifies another 49 that need
further examination. Those were aligned against uniref90 to check whether the BRAKER portions
correspond to real genes or are spurious.
#!/usr/bin/env bash
#
#SBATCH --job-name=extr_unannot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=2:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-subset_bam-%j.err
module load bedtools
GFF=/lisc/scratch/zoology/pycnogonum/genome/draft/annot_merge/merged.gff3
OUTPUT=$1
cd "$OUTPUT" || exit
bedtools intersect -v -a Aligned.out.bam -b $GFF > unannotated.bam
\ No newline at end of file
#!/usr/bin/env bash
EXTRACT=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot2-RNA-extract_uncovered.sh
BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
sbatch $EXTRACT "$BASE/ZYGOTE/"
sbatch $EXTRACT "$BASE/EARLY_CLEAVAGE/"
sbatch $EXTRACT "$BASE/EMBRYO0_1/"
sbatch $EXTRACT "$BASE/EMBRYO3_5/"
sbatch $EXTRACT "$BASE/EMBRYO9_10/"
sbatch $EXTRACT "$BASE/EMBRYO3/"
sbatch $EXTRACT "$BASE/INSTAR1/"
sbatch $EXTRACT "$BASE/INSTAR2/"
sbatch $EXTRACT "$BASE/INSTAR3/"
sbatch $EXTRACT "$BASE/INSTAR4/"
sbatch $EXTRACT "$BASE/INSTAR5/"
sbatch $EXTRACT "$BASE/INSTAR6/"
sbatch $EXTRACT "$BASE/JUV1/"
sbatch $EXTRACT "$BASE/SUBADULT/"
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=pycno_braker
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=40G
#SBATCH --time=20:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-braker-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-braker-%j.err
module load conda
conda activate singularity
# container location
BRAKER=/lisc/user/papadopoulos/containers/braker3.sif
# path to the output directory, where all the magic should happen
OUTPUT=/lisc/scratch/zoology/pycnogonum/genome/draft/braker2
mkdir "$OUTPUT" -p || exit
cd "$OUTPUT" || exit
# copy AUGUSTUS config files for BRAKER
cp -r ~/repos/Augustus/config .
# define input files
# the genome assembly
GENOME=/lisc/scratch/zoology/pycnogonum/genome/draft/draft_softmasked.fasta
# the RNA-seq dataset BAMs
BASE=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome
ZYGOTE="$BASE/ZYGOTE/unannotated.bam"
EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/unannotated.bam"
EMBRYO0_1="$BASE/EMBRYO0_1/unannotated.bam"
EMBRYO3_5="$BASE/EMBRYO3_5/unannotated.bam"
EMBRYO9_10="$BASE/EMBRYO9_10/unannotated.bam"
EMBRYO3="$BASE/EMBRYO3/unannotated.bam"
INSTAR1="$BASE/INSTAR1/unannotated.bam"
INSTAR2="$BASE/INSTAR2/unannotated.bam"
INSTAR3="$BASE/INSTAR3/unannotated.bam"
INSTAR4="$BASE/INSTAR4/unannotated.bam"
INSTAR5="$BASE/INSTAR5/unannotated.bam"
INSTAR6="$BASE/INSTAR6/unannotated.bam"
JUV1="$BASE/JUV1/unannotated.bam"
SUBADULT="$BASE/SUBADULT/unannotated.bam"
# the Arthropoda OrthoDB protein sequences
ARTHROPODA=/lisc/scratch/zoology/db/Arthropoda.fa
# now call the container. Important to mount the current directory, /scratch/, and the $TMPDIR,
# in case it is needed.
singularity exec -B "$PWD,/lisc/scratch/zoology/,$TMPDIR" "$BRAKER" braker.pl \
--AUGUSTUS_CONFIG_PATH="${PWD}"/config \
--genome=$GENOME \
--prot_seq=$ARTHROPODA \
--bam="$ZYGOTE","$EARLY_CLEAVAGE","$EMBRYO0_1","$EMBRYO3_5","$EMBRYO9_10","$EMBRYO3","$INSTAR1","$INSTAR2","$INSTAR3","$INSTAR4","$INSTAR5","$INSTAR6","$JUV1","$SUBADULT" \
--softmasking \
--threads 10
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=extr_unannot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=500M
#SBATCH --time=17:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-extract_orfs-%j.err
module load perl/5.40.0
module load transdecoder/5.7.1
DENOVO=$1
OUTPUT=$2
# 228.06user
# 747.90user
# say 1000s = 16-17min
mkdir -p "$OUTPUT" || exit 1
cd "$OUTPUT" || exit 1
# first we need to extract ORFs
TransDecoder.LongOrfs -t "$DENOVO"
# now predict the coding regions
TransDecoder.Predict -t "$DENOVO"
# now convert the resulting .cds file to a single-line fasta
# almost instantaneous
perl -pe '$. > 1 and /^>/ ? print "\n" : chomp' Trinity.fasta.transdecoder.cds > oneline.fasta
# extract complete ORFs. Almost instantaneous.
grep -A1 complete oneline.fasta > complete.fasta
\ No newline at end of file
#!/usr/bin/env bash
FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-extract_complete_ORFs.sh
IN=/lisc/project/zoology/pycnogonum/transcriptome/development
EMBRYO3_5_IN=$IN/embryonic_stage3-4/Trinity.fasta
INSTAR1_IN=$IN/instarI-protonymphon/Trinity.fasta
INSTAR2_IN=$IN/instar_II/Trinity.fasta
INSTAR3_IN=$IN/instar_III/Trinity.fasta
INSTAR4_IN=$IN/instar_IV/Trinity.fasta
INSTAR5_IN=$IN/instar_V/Trinity.fasta
INSTAR6_IN=$IN/instar_VI/Trinity.fasta
JUV1_IN=$IN/juvenile_I/Trinity.fasta
SUBADULT_IN=$IN/subadult/Trinity.fasta
OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
EMBRYO3_5_OUT=$OUT/EMBRYO3/
INSTAR1_OUT=$OUT/INSTAR1/
INSTAR2_OUT=$OUT/INSTAR2/
INSTAR3_OUT=$OUT/INSTAR3/
INSTAR4_OUT=$OUT/INSTAR4/
INSTAR5_OUT=$OUT/INSTAR5/
INSTAR6_OUT=$OUT/INSTAR6/
JUV1_OUT=$OUT/JUV1/
SUBADULT_OUT=$OUT/SUBADULT/
sbatch $FILTER $EMBRYO3_5_IN $EMBRYO3_5_OUT
sbatch $FILTER $INSTAR1_IN $INSTAR1_OUT
sbatch $FILTER $INSTAR2_IN $INSTAR2_OUT
sbatch $FILTER $INSTAR3_IN $INSTAR3_OUT
sbatch $FILTER $INSTAR4_IN $INSTAR4_OUT
sbatch $FILTER $INSTAR5_IN $INSTAR5_OUT
sbatch $FILTER $INSTAR6_IN $INSTAR6_OUT
sbatch $FILTER $JUV1_IN $JUV1_OUT
sbatch $FILTER $SUBADULT_IN $SUBADULT_OUT
\ No newline at end of file
#!/usr/bin/env bash
MAP=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-02-map_assembly_single.sh
OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
EMBRYO3_5=$OUT/EMBRYO3
INSTAR1=$OUT/INSTAR1
INSTAR2=$OUT/INSTAR2
INSTAR3=$OUT/INSTAR3
INSTAR4=$OUT/INSTAR4
INSTAR5=$OUT/INSTAR5
INSTAR6=$OUT/INSTAR6
JUV1=$OUT/JUV1
SUBADULT=$OUT/SUBADULT
sbatch $MAP $EMBRYO3_5/complete.fasta $EMBRYO3_5
sbatch $MAP $INSTAR1/complete.fasta $INSTAR1
sbatch $MAP $INSTAR2/complete.fasta $INSTAR2
sbatch $MAP $INSTAR3/complete.fasta $INSTAR3
sbatch $MAP $INSTAR4/complete.fasta $INSTAR4
sbatch $MAP $INSTAR5/complete.fasta $INSTAR5
sbatch $MAP $INSTAR6/complete.fasta $INSTAR6
sbatch $MAP $JUV1/complete.fasta $JUV1
sbatch $MAP $SUBADULT/complete.fasta $SUBADULT
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=mmap_pycno
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=3
#SBATCH --mem=10G
#SBATCH --time=5:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-map-txome-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-map-txome-%j.err
# with 1% of data and 3 CPUs:
# [M::main] Real time: 448.742 sec; CPU: 1260.678 sec; Peak RSS: 1.830 GB
# 1251.43user 9.26system 7:28.76elapsed 280%CPU (0avgtext+0avgdata 1919296maxresident)k
# with 10% of data and 3 CPUs:
# [M::main] Real time: 1528.465 sec; CPU: 3772.846 sec; Peak RSS: 2.758 GB
# 3743.86user 29.02system 25:28.51elapsed 246%CPU (0avgtext+0avgdata 2892008maxresident)k
# BUT THAT WAS WITH THE WRONG ARGUMENT ORDER, it is much faster the other way round
module load conda
conda activate minimap2-2.28
TXOME=$1
OUTPUT=$2
DRAFT=/lisc/scratch/zoology/pycnogonum/genome/draft/draft.fasta
cd "$TMPDIR" || exit
minimap2 -ax splice:hq $DRAFT "$TXOME" > "$OUTPUT"/txome_assembly.sam
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=pycno_filter
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=2G
#SBATCH --time=5:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-filter-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-filter-%j.err
module load samtools
TXOME=$1
cd "$TXOME" || exit
samtools view -@2 -F 0x900 -q 30 -S -b txome_assembly.sam > txome_assembly.hd.bam
module load conda
conda activate agat-1.4.0
agat_convert_minimap2_bam2gff.pl -i txome_assembly.hd.bam -o txome_assembly.gff
\ No newline at end of file
#!/usr/bin/env bash
FILTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/annot3-RNA-03-pack_sam.sh
OUT=/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome/
EMBRYO3_OUT=$OUT/EMBRYO3/
INSTAR1_OUT=$OUT/INSTAR1/
INSTAR2_OUT=$OUT/INSTAR2/
INSTAR3_OUT=$OUT/INSTAR3/
INSTAR4_OUT=$OUT/INSTAR4/
INSTAR5_OUT=$OUT/INSTAR5/
INSTAR6_OUT=$OUT/INSTAR6/
JUV1_OUT=$OUT/JUV1/
SUBADULT_OUT=$OUT/SUBADULT/
sbatch $FILTER $EMBRYO3_OUT
sbatch $FILTER $INSTAR1_OUT
sbatch $FILTER $INSTAR2_OUT
sbatch $FILTER $INSTAR3_OUT
sbatch $FILTER $INSTAR4_OUT
sbatch $FILTER $INSTAR5_OUT
sbatch $FILTER $INSTAR6_OUT
sbatch $FILTER $JUV1_OUT
sbatch $FILTER $SUBADULT_OUT
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
#!/usr/bin/env bash
CONVERTER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-sam2bam.sh
BASE="/lisc/scratch/zoology/pycnogonum/genome/draft/transcriptome"
ZYGOTE="$BASE/ZYGOTE/Aligned.out.sam"
EARLY_CLEAVAGE="$BASE/EARLY_CLEAVAGE/Aligned.out.sam"
EMBRYO0_1="$BASE/EMBRYO0_1/Aligned.out.sam"
EMBRYO3_5="$BASE/EMBRYO3_5/Aligned.out.sam"
EMBRYO9_10="$BASE/EMBRYO9_10/Aligned.out.sam"
EMBRYO3="$BASE/EMBRYO3/Aligned.out.sam"
INSTAR1="$BASE/INSTAR1/Aligned.out.sam"
INSTAR2="$BASE/INSTAR2/Aligned.out.sam"
INSTAR3="$BASE/INSTAR3/Aligned.out.sam"
INSTAR4="$BASE/INSTAR4/Aligned.out.sam"
INSTAR5="$BASE/INSTAR5/Aligned.out.sam"
INSTAR6="$BASE/INSTAR6/Aligned.out.sam"
JUV1="$BASE/JUV1/Aligned.out.sam"
SUBADULT="$BASE/SUBADULT/Aligned.out.sam"
sbatch $CONVERTER $ZYGOTE
sbatch $CONVERTER $EARLY_CLEAVAGE
sbatch $CONVERTER $EMBRYO0_1
sbatch $CONVERTER $EMBRYO3_5
sbatch $CONVERTER $EMBRYO9_10
sbatch $CONVERTER $EMBRYO3
sbatch $CONVERTER $INSTAR1
sbatch $CONVERTER $INSTAR2
sbatch $CONVERTER $INSTAR3
sbatch $CONVERTER $INSTAR4
sbatch $CONVERTER $INSTAR5
sbatch $CONVERTER $INSTAR6
sbatch $CONVERTER $JUV1
sbatch $CONVERTER $SUBADULT
\ No newline at end of file
#!/usr/bin/env bash
#
#SBATCH --job-name=converter
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=6
#SBATCH --mem=16G
#SBATCH --time=15:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=nikolaos.papadopoulos@univie.ac.at
#SBATCH --output=/lisc/user/papadopoulos/log/pycno-indexbam-%j.out
#SBATCH --error=/lisc/user/papadopoulos/log/pycno-indexbam-%j.err
module load samtools
INPUT=$1
BASENAME=$(dirname "$INPUT")
OUTPUT=$BASENAME/Aligned.out.sorted.bam
cd "$TMPDIR" || exit
samtools sort -@ 8 "$INPUT" > "$OUTPUT"
samtools index -@ 8 -b "$OUTPUT"
\ No newline at end of file
#!/usr/bin/env bash
MAPPER=/lisc/user/papadopoulos/repos/pycno-seq/annotate/prep-RNA-map-single.sh
ASSEMBLY=/lisc/scratch/zoology/pycnogonum/genome/draft/
# early development
BASE="/lisc/scratch/zoology/pycnogonum/raw/early_RNA"
ZYGOTE="$BASE/Plit35_S8"
EARLY_CLEAVAGE="$BASE/Plit31_S6"
EMBRYO0_1="$BASE/Plit40_S11"
EMBRYO3_5="$BASE/Plit32_S7"
EMBRYO9_10="$BASE/Plit36_S9"
# INSTAR2_5="$BASE/Plit39_S10"
sbatch $MAPPER $ASSEMBLY ZYGOTE "$ZYGOTE"_L001
sbatch $MAPPER $ASSEMBLY EARLY_CLEAVAGE "$EARLY_CLEAVAGE"_L001
sbatch $MAPPER $ASSEMBLY EMBRYO0_1 "$EMBRYO0_1"_L001
sbatch $MAPPER $ASSEMBLY EMBRYO3_5 "$EMBRYO3_5"_L001
sbatch $MAPPER $ASSEMBLY EMBRYO9_10 "$EMBRYO9_10"_L001
# instars
BASE="/lisc/scratch/zoology/pycnogonum/raw/HTT33DSX5_4_R15615_20230713/demultiplexed"
EMBRYO3="$BASE/235255/235255_S49"
INSTAR1="$BASE/235253/235253_S47"
INSTAR2="$BASE/235256/235256_S50"
INSTAR3="$BASE/235257/235257_S51"
INSTAR4="$BASE/235258/235258_S52"
INSTAR5="$BASE/235259/235259_S53"
INSTAR6="$BASE/235254/235254_S48"
JUV1="$BASE/235260/235260_S54"
SUBADULT="$BASE/235261/235261_S55"
sbatch $MAPPER $ASSEMBLY EMBRYO3 "$EMBRYO3"_L004
sbatch $MAPPER $ASSEMBLY INSTAR1 "$INSTAR1"_L004
sbatch $MAPPER $ASSEMBLY INSTAR2 "$INSTAR2"_L004
sbatch $MAPPER $ASSEMBLY INSTAR3 "$INSTAR3"_L004
sbatch $MAPPER $ASSEMBLY INSTAR4 "$INSTAR4"_L004
sbatch $MAPPER $ASSEMBLY INSTAR5 "$INSTAR5"_L004
sbatch $MAPPER $ASSEMBLY INSTAR6 "$INSTAR6"_L004
sbatch $MAPPER $ASSEMBLY JUV1 "$JUV1"_L004
sbatch $MAPPER $ASSEMBLY SUBADULT "$SUBADULT"_L004
\ No newline at end of file
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Please to comment