Skip to content
Snippets Groups Projects
Commit fdb02d43 authored by Niko (Nikolaos) Papadopoulos's avatar Niko (Nikolaos) Papadopoulos
Browse files

add eggNOG-mapper preferred name to GFF

parent 500a24a6
No related branches found
No related tags found
No related merge requests found
%% Cell type:markdown id: tags:
# Adding functional annotation from EggNOG-mapper
%% Cell type:code id: tags:
``` python
from tqdm import tqdm
import pandas as pd
import numpy as np
```
%% Cell type:code id: tags:
%% Cell type:markdown id: tags:
``` python
best_per_gene = '/Users/npapadop/Documents/data/pycnogonum/draft/out.emapper.best.annotations'
emapper = pd.read_csv(best_per_gene, sep='\t', header=0)
```
### util functions
We are going to need three helper functions:
- extract the gene ID from the `#query` field of the EggNOG-mapper output
- break up the content of the attributes field of the GFF file into a dictionary
- find the correct protein name for a gene ID
%% Cell type:code id: tags:
``` python
def parse_gene_id(x):
"""Extract gene ID from a string
Parameters
----------
x : str
A protein ID from the eggNOG-mapper output.
Returns
-------
str
will return the gene ID in the format of 'PB.X' (PacBio genes) or 'gX' (BRAKER round 1) or 'r2_gX' (BRAKER round 2) or 'at_DNX (de-novo transcriptome-assembled genes)'
"""
if 'PB' in x:
parts = x.split('.')
return '.'.join(parts[:2])
elif x.startswith('r2') or x.startswith('g') or x.startswith('at'):
return x.split('.')[0]
else:
return ValueError('Unknown gene ID format')
```
%% Cell type:markdown id: tags:
apply the parsing on each row...
%% Cell type:code id: tags:
``` python
emapper['gene'] = emapper['#query'].apply(parse_gene_id)
```
%% Cell type:code id: tags:
``` python
emapper.set_index('gene', inplace=True)
```
%% Cell type:code id: tags:
``` python
emapper.loc['PB.3']
```
%% Output
#query PB.3.1.p1
seed_ortholog 136037.KDR12978
evalue 0.0
score 570.0
eggNOG_OGs COG0462@1|root,KOG1503@2759|Eukaryota,38F3E@33...
max_annot_lvl 33208|Metazoa
COG_category EF
Description ribose phosphate diphosphokinase activity. It ...
Preferred_name PRPSAP1
GOs GO:0001501,GO:0002189,GO:0003674,GO:0004857,GO...
EC -
KEGG_ko -
KEGG_Pathway -
KEGG_Module -
KEGG_Reaction -
KEGG_rclass -
BRITE -
KEGG_TC -
CAZy -
BiGG_Reaction -
PFAMs Pribosyl_synth,Pribosyltran_N
Name: PB.3, dtype: object
%% Cell type:code id: tags:
``` python
def parse_attributes(x):
'''Parses a semi-colon separated string into a dictionary
Parameters
----------
x : str
a semicolon-separated string that holds attributes
'''
attributes = x.split(';')
if attributes[-1] == '':
attributes.pop()
return {attr.split('=')[0]: attr.split('=')[1] for attr in attributes}
```
%% Cell type:code id: tags:
``` python
emapper.loc['PB.1']['Preferred_name']
def find_protein(gene_id, lookup): # expects a protein-coding gene as input
"""Generate the correct protein name for a gene ID. Expects a protein-coding gene as input (will not work correctly for tRNA or rRNA genes).
Parameters
----------
gene_id : str
a P. litorale protein-coding gene ID
lookup : pd.DataFrame
the eggNOG-mapper output file, filtered to maximum one entry per gene ID
Returns
-------
str
the gene symbol (if available) or "Uncharacterised protein {gene_id}" if no gene symbol is available.
"""
if gene_id in lookup.index:
name = lookup.loc[gene_id]['Preferred_name']
if name != '-':
return name
return f'Uncharacterised protein {gene_id}'
```
%% Output
%% Cell type:markdown id: tags:
'-'
Read the emapper output (filtered to best hit per gene ID) and extract the gene ID in a new column;
then set that as the index of the dataframe.
%% Cell type:code id: tags:
``` python
def find_protein(gene_id, lookup): # expects a protein-coding gene as input
if gene_id in lookup.index:
name = lookup.loc[gene_id]['Preferred_name']
if name != '-':
return name
return f'Uncharacterised protein {gene_id}'
best_per_gene = '/Users/npapadop/Documents/data/pycnogonum/draft/out.emapper.best.annotations'
emapper = pd.read_csv(best_per_gene, sep='\t', header=0)
emapper['gene'] = emapper['#query'].apply(parse_gene_id)
emapper.set_index('gene', inplace=True)
```
%% Cell type:markdown id: tags:
Loop over the entire GFF file and decorate all the (putative) protein-coding entries to include the
gene symbol (name) in some form:
- Gene: `name=Hox3;gene_name=Hox3`
- mRNA: `name=Hox3 isoform 1;gene_name=Hox3`
- CDS: `gene_name=Hox3`
- exon: `gene_name=Hox3`
All putative mRNAs should have an isoform name (even if it is just `1`); all exons and CDS's should
include the gene name as well, so that lazy GFF readers that grep parts of the GFF (like
`CellRanger`) still have access to the functional annotation.
%% Cell type:code id: tags:
``` python
gff_loc = '/Volumes/scratch/pycnogonum/genome/submission/merged_sorted.gff'
named_loc = '/Volumes/scratch/pycnogonum/genome/submission/merged_sorted_named.gff'
gff_loc = '/Volumes/scratch/pycnogonum/genome/submission/merged_sorted.gff3'
named_loc = '/Volumes/scratch/pycnogonum/genome/submission/merged_sorted_named.gff3'
with open(gff_loc, 'r') as gff:
with open(named_loc, 'w') as named:
gene = ''
mRNA = ''
for line in gff:
for line in tqdm(gff.readlines()):
line = line.strip()
conditions_skip = line.startswith('#') or 'tRNA' in line or 'name=' in line
if not conditions_skip:
seq_id, source, feature_type, start, end, score, strand, phase, attributes = line.split('\t')
attributes = parse_attributes(attributes)
if feature_type == 'gene':
gene = attributes['ID']
name = find_protein(gene, emapper)
line = f'{line}name={name} (predicted)'
name = f'{name} (predicted)'
line = f'{line}name={name}'
if feature_type == 'mRNA':
mRNA = attributes['ID']
isoform = mRNA.split('.')[-1]
line = f'{line}name={name} (predicted) isoform {isoform}'
line = f'{line}name={name} isoform {isoform};gene_name={name}'
if feature_type == 'CDS' or feature_type == 'exon':
line = f'{line}gene_name={name}'
named.write(line + '\n')
```
%% Output
100%|██████████| 779251/779251 [00:04<00:00, 183551.40it/s]
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment