" conditions_skip = line.startswith('#') or 'tRNA' in line or 'name=' in line\n",
" if not conditions_skip:\n",
...
...
@@ -190,7 +183,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "ascc24",
+"display_name": "jupyterhub-5.1.0",
"language": "python",
"name": "python3"
},
...
...
@@ -204,7 +197,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.9.19"
+"version": "3.12.4"
}
},
"nbformat": 4,
...
...
%% Cell type:markdown id: tags:
# Adding functional annotation from EggNOG-mapper
%% Cell type:code id: tags:
``` python
from tqdm import tqdm
# from tqdm import tqdm # install it for nice progress bars
import pandas as pd
```
%% Cell type:markdown id: tags:
### util functions
We are going to need three helper functions:
- extract the gene ID from the `#query` field of the EggNOG-mapper output
- break up the content of the attributes field of the GFF file into a dictionary
- find the correct protein name for a gene ID
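
As an illustration of the second helper, here is a minimal sketch of how an attributes field could be turned into a dictionary. It assumes GFF3-style `key=value` pairs separated by semicolons; the function name `attributes_to_dict_sketch` and the example string are hypothetical, and the helper actually used in this notebook may differ.

``` python
def attributes_to_dict_sketch(attributes):
    """Split a GFF3-style attributes string into a dict (illustrative only)."""
    result = {}
    for field in attributes.strip().split(';'):
        field = field.strip()
        if not field:
            continue  # skip empty fields, e.g. after a trailing ';'
        key, sep, value = field.partition('=')
        if not sep:
            continue  # skip fields without '=' (assumption: not valid key=value pairs)
        result[key] = value
    return result


# hypothetical attributes string, not taken from the real annotation
example = 'ID=g1.t1;Parent=g1;'
print(attributes_to_dict_sketch(example))  # {'ID': 'g1.t1', 'Parent': 'g1'}
```

Using `str.partition` rather than `str.split('=')` keeps any additional `=` characters inside the value intact.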
%% Cell type:code id: tags:
``` python
def parse_gene_id(x):
    """Extract gene ID from a string

    Parameters
    ----------
    x : str
        A protein ID from the eggNOG-mapper output.

    Returns
    -------
    str
        The gene ID in the format 'PB.X' (PacBio genes), 'gX' (BRAKER round 1), 'r2_gX' (BRAKER round 2), or 'at_DNX' (de-novo transcriptome-assembled genes).