Commit c7d11977 in zoology/plit-genome, authored 3 months ago by Niko (Nikolaos) Papadopoulos
updated README for submission
Parent: 7c6fd65d
Tag: v1.2
Changes: 1 changed file, 08-submission/README.md (52 additions, 8 deletions)
# Submission
Utility scripts for data upload to ENA.
- [EMBL format conversion](convert_to_embl.sh) for an annotated genome (Fasta + GFF3) - inspired
  from a [prokka issue](https://github.com/tseemann/prokka/issues/145), leading to
  [GFF3toEMBL](https://github.com/sanger-pathogens/gff3toembl) and eventually to the real solution,
  [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3).
- A [Python script](txome_manifest.py) to batch-create manifest files for de novo transcriptome
  submissions to ENA.
Utility scripts for data upload to ENA. This was very difficult to figure out.
## Raw data submission
Constructing manifest files for the raw data was fairly straightforward following the
[ENA documentation](https://ena-docs.readthedocs.io/en/latest/index.html), especially the part
concerning [raw reads submission](https://ena-docs.readthedocs.io/en/latest/submit/reads.html).
## Assembled transcriptome submission
Since we had a lot of time points, we wrote a custom [Python script](txome_manifest.py) to
batch-create manifest files for all the _de novo_ assembled transcriptomes. This was also relatively
straightforward, especially with the transcriptome assembly
[submission guide](https://ena-docs.readthedocs.io/en/latest/submit/assembly/transcriptome.html)
at hand.
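
The batch creation described above can be sketched roughly as follows. This is not the content of `txome_manifest.py`. The manifest field names are assumed from the ENA transcriptome submission guide and should be checked against it, and the function names, accessions, and file names are hypothetical placeholders.

```python
from pathlib import Path

# Field names assumed from the ENA transcriptome submission guide;
# verify them against the guide before submitting anything.
MANIFEST_FIELDS = (
    "STUDY", "SAMPLE", "ASSEMBLYNAME", "ASSEMBLY_TYPE",
    "PROGRAM", "PLATFORM", "FASTA",
)


def render_manifest(values: dict) -> str:
    """Render one manifest as 'FIELD<TAB>value' lines in a fixed order."""
    missing = [field for field in MANIFEST_FIELDS if field not in values]
    if missing:
        raise ValueError(f"missing manifest fields: {missing}")
    return "".join(f"{field}\t{values[field]}\n" for field in MANIFEST_FIELDS)


def write_manifests(shared: dict, per_sample: dict, outdir: Path) -> list:
    """Write one manifest per time point and return the paths written.

    shared: fields identical for every assembly (hypothetical split).
    per_sample: time point name -> fields specific to that assembly.
    """
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, specific in sorted(per_sample.items()):
        path = outdir / f"{name}.manifest.txt"
        # Sample-specific values override shared ones on key collisions.
        path.write_text(render_manifest({**shared, **specific}))
        written.append(path)
    return written
```

Keeping the shared fields (study, platform, assembler) apart from the per-time-point fields means a correction to the study accession only has to be made once before regenerating every manifest.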
## Assembled genome with functional annotation
This was a pain and took two to three weeks to get right, and we still weren't able to include
functional information like predicted functions, PFAM domains, or EC numbers.
ENA requires that annotated genome assemblies be submitted as EMBL files. To control the process, we
wrote a [Makefile](./Makefile). This was an excellent decision to make early in the submission
procedure, as we ended up having to fine-tune and re-run the entire thing multiple times. The steps
are:
1. [compose](gff-01-compose_gff.sh) the GFF by combining the protein coding gene models (Isoseq,
   BRAKER3 round 1, BRAKER3 round 2, de novo assembled transcriptomes) and the tRNA prediction. We
   also manually replaced three gene models that were incorrectly split up with an earlier BRAKER3
   gene model. At this stage, we officially renamed pseudochromosomes 54-59 to 52-57 and sorted the
   resulting GFF file.
2. [annotate](gff-02-functional_annot.ipynb) the GFF by adding information from EggNOG-mapper. We
   need a [wrapper script](gff-02-functional_annot.sh) for that, since using conda environments is
   rather tricky with `make`. The information added, if present, is the proposed gene symbol,
   function description, EC number, and PFAM domains identified.
3. [conform](gff-03-ENA_conform.sh) the GFF to ENA standards. This step flags short introns (less
   than 10 bases, presumably created by polymerase slippage or post-transcriptional modifications)
   with the `pseudo` tag, ensuring that the ENA validator doesn't complain about them. A previous
   version (commented out) [extracted](gff-03-build_kill_list.py) the offending exons into a new GFF
   file; this approach was abandoned because it had to extract entire genes, which was beside the
   point.
4. [convert](gff-04-convert_to_embl.sh) the GFF to EMBL format by combining it with the FASTA file.
   Uses [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3) (and a lot of the useful information
   provided in its solved and unsolved GitHub issues, the true source of most useful information
   about ENA). Before running the real script, one has to run `EMBLmyGFF3` with the
   `--expose-translations` option and modify the resulting .json file to suppress exons (as they
   otherwise trigger the "abutting features cannot be adjacent" error that the validator throws).
5. [submit](gff-05-submit_to_ENA.sh) the EMBL file to ENA. The manifest file and the chromosome file
   still have to be written separately.
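
The short-intron check from step 3 can be illustrated with a minimal sketch. It is not the logic of `gff-03-ENA_conform.sh`: the function name is hypothetical, and it only finds the offending introns for one transcript, leaving out how the `pseudo` tag is written back to the GFF.

```python
MIN_INTRON_LENGTH = 10  # threshold from step 3: introns shorter than this get flagged


def short_introns(exons, min_length=MIN_INTRON_LENGTH):
    """Return (start, end) for every intron shorter than min_length.

    exons: (start, end) pairs of a single transcript, 1-based and
    inclusive as in GFF3. Exons may be given in any order.
    """
    ordered = sorted(exons)
    flagged = []
    # Walk over consecutive exon pairs; the gap between them is an intron.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        intron_start = prev_end + 1
        intron_end = next_start - 1
        length = intron_end - intron_start + 1
        # A length of 0 means abutting exons, which is a different problem
        # (see step 4), so only strictly positive gaps count as introns.
        if 0 < length < min_length:
            flagged.append((intron_start, intron_end))
    return flagged
```

For example, exons at 100-200, 205-300 and 400-500 contain one 4-base intron (201-204) that would be flagged, while the 99-base intron between the second and third exon would not.
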
Big thanks to [Jacques Dainat](https://github.com/Juke34), who wrote and maintains
[AGAT](https://github.com/NBISweden/AGAT) and [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3),
and who diligently answers questions and resolves issues.