Skip to content
Snippets Groups Projects
Commit 5f2777fb authored by Niko (Nikolaos) Papadopoulos's avatar Niko (Nikolaos) Papadopoulos
Browse files

context for analysis notebooks

parent 2e4251db
No related branches found
No related tags found
No related merge requests found
# Analysis
The code in this folder concerns downstream analysis after genome assembly and annotation.
### _Pycnogonum litorale_ in the chelicerate/arthropod context
To gain a first understanding of the _P. litorale_ genome we compared it to publicly available
genome assemblies for arthropods in general and chelicerates in particular using common metrics.
This notebook produced most of the material for figures 2 and 3 in the manuscript.
- [The genome assembly of _P. litorale_ in the broader arthropod/chelicerate context](genomic_context.ipynb)
### Hox genes
We used manually curated CDS's for _P. litorale_ Hox genes to locate them in the genome. The
putative CDS's overlap with predicted gene models and form a single cluster on pseudochromosome 56.
All Hox genes are present, except for _Hox9/abdominalA_, for which contradicting data exists for
pycnogonids. This notebook produced material for figure 4 and suppl. figure 3 of the manuscript.
- [Exploring Hox genes](hoxfinder.ipynb)
### EggNOG-mapper output
A brief look at the functional annotation provided by EggNOG-mapper for the predicted pycnogonid
peptide sequences, broken down by prediction quality (also see how the annotation [was
produced](../06-annotation/README.md)). We find that annotation quality correlates with annotation
confidence (Iso-seq > BRAKER round 1 > BRAKER round 2 > de-novo).
- [functional annotation overview](emapper_output.ipynb)
\ No newline at end of file
%% Cell type:markdown id: tags:
# Parsing functional annotation
We used EggNOG-mapper to annotate the functional annotation of the proteins. The output of
EggNOG-mapper is a tab-separated file that contains information about the seed ortholog of each
query, as well as information about the corresponding orthogroup and its phylogenetic context. Here
we will parse this output to reduce it to one description per gene, and see how well the different
annotation sources are represented.
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
```
%% Cell type:markdown id: tags:
First read the data. Remember that emapper output has four comment lines in the header and another
three in the footer:
%% Cell type:code id: tags:
``` python
emapper_loc = "/Volumes/scratch/pycnogonum/genome/draft/annot_merge/out.emapper.annotations"
emapper = pd.read_csv(emapper_loc, sep="\t", skiprows=4, skipfooter=3, engine="python")
```
%% Cell type:markdown id: tags:
We need to extract the gene ID from the protein ID. This is mostly dot-separated, but PacBio IsoSeq
genes have more dots in their names, so we need to parse those ddifferently.
%% Cell type:code id: tags:
``` python
def parse_gene_id(x):
if "PB" in x:
parts = x.split(".")
return ".".join(parts[:2])
elif x.startswith("r2") or x.startswith("g") or x.startswith("at"):
return x.split(".")[0]
else:
return ValueError("Unknown gene ID format")
```
%% Cell type:markdown id: tags:
apply the parsing on each row...
%% Cell type:code id: tags:
``` python
emapper["gene"] = emapper["#query"].apply(parse_gene_id)
```
%% Cell type:markdown id: tags:
...and now group by gene ID and keep the best-scoring isoform/CDS for each gene.
%% Cell type:code id: tags:
``` python
summary = emapper.groupby("gene").apply(lambda x: x.sort_values("score", ascending=False).head(1), include_groups=False)
```
%% Cell type:markdown id: tags:
Next we need to extract the annotation type from the gene IDs, something we can easily do by the
start of the gene ID.
* P: PacBio IsoSeq (round 1)
* g: BRAKER3 (round 1)
* r: BRAKER3 (round 2)
* a: assembled transcriptomes (round 3)
We will keep track of these in a dictionary to make calculations easier. First, let's find out how
many genes we have from each source.
```bash
$ grep " gene " merged_sorted.gff3 | grep PB | wc -l
8904
$ grep " gene " merged_sorted.gff3 | grep r2 | wc -l
2223
$ grep " gene " merged_sorted.gff3 | grep at | wc -l
774
$ grep " gene " merged_sorted.gff3 | grep "=g" | wc -l
3596
```
%% Cell type:code id: tags:
``` python
translate = {
"P": "PacBio",
"g": "BRAKER3_r1",
"r": "BRAKER3_r2",
"a": "Trinity",
}
totals = {
"PacBio": 8904,
"BRAKER3_r1": 3596,
"BRAKER3_r2": 2223,
"Trinity": 774
}
rounds = {
"PacBio": 1,
"BRAKER3_r1": 1,
"BRAKER3_r2": 2,
"Trinity": 3
}
```
%% Cell type:code id: tags:
``` python
annotated = pd.DataFrame(summary["#query"].str[:1].value_counts())
annotated.index = [translate[x] for x in annotated.index]
annotated["totals"] = [totals[x] for x in annotated.index]
annotated["round"] = [rounds[x] for x in annotated.index]
```
%% Cell type:markdown id: tags:
We can now see what percentage of gene models find support for each annotation source.
%% Cell type:code id: tags:
``` python
(annotated["count"] / annotated["totals"] * 100).round(2)
```
%% Output
PacBio 73.69
BRAKER3_r1 57.40
BRAKER3_r2 51.73
Trinity 31.27
dtype: float64
%% Cell type:markdown id: tags:
As expected, the more trustworthy annotations find more orthologs. The same pattern holds if we
aggregate the round 1 sources:
%% Cell type:code id: tags:
``` python
tmp = annotated.groupby("round").sum()
(tmp["count"] / tmp["totals"] * 100).round(2)
```
%% Output
round
1 69.00
2 51.73
3 31.27
dtype: float64
%% Cell type:markdown id: tags:
Write out the reduced annotation file for downstream analysis.
%% Cell type:code id: tags:
``` python
summary_loc = "/Volumes/scratch/pycnogonum/genome/draft/annot_merge/out.emapper.best.annotations"
summary.to_csv(summary_loc, sep="\t", index=False)
```
%% Cell type:code id: tags:
``` python
annotated["count"].sum()
```
%% Output
np.int64(10017)
%% Cell type:code id: tags:
``` python
annotated["totals"].sum()
```
%% Output
np.int64(15497)
This diff is collapsed.
This diff is collapsed.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment