From c7d11977378013022ca75d129738ff2653fb3847 Mon Sep 17 00:00:00 2001 From: Niko <nikolaos.papadopoulos@univie.ac.at> Date: Mon, 9 Dec 2024 20:15:07 +0100 Subject: [PATCH] updated README for submission --- 08-submission/README.md | 60 +++++++++++++++++++++++++++++++++++------ 1 file changed, 52 insertions(+), 8 deletions(-) diff --git a/08-submission/README.md b/08-submission/README.md index fbcdcea..22eac28 100644 --- a/08-submission/README.md +++ b/08-submission/README.md @@ -1,10 +1,54 @@ # Submission -Utility scripts for data upload to ENA. - -- [EMBL format conversion](convert_to_embl.sh) for an annotated genome (Fasta + GFF3) - inspired - from a [prokka issue](https://github.com/tseemann/prokka/issues/145), leading to - [GFF3toEMBL](https://github.com/sanger-pathogens/gff3toembl) and eventually to the real solution, - [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3). -- A [Python script](txome_manifest.py) to batch-create manifest files for de novo transcriptome - submissions to ENA. \ No newline at end of file +Utility scripts for data upload to ENA. This was very difficult to figure out. + +## Raw data submission + +Constructing manifest files for the raw data was fairly straightforward following the +[ENA documentation](https://ena-docs.readthedocs.io/en/latest/index.html), especially the part +concerning [raw reads submission](https://ena-docs.readthedocs.io/en/latest/submit/reads.html). + +## Assembled transcriptome submission + +Since we had a lot of time points, we wrote a custom [Python script](txome_manifest.py) to +batch-create manifest files for all the _de novo_ assembled transcriptomes. This was also relatively +straightforward, especially the transcriptome assembly [submission +guide](https://ena-docs.readthedocs.io/en/latest/submit/assembly/transcriptome.html). + +## Assembled genome with functional annotation + +This was a pain to get right, and required two-three weeks to get right; and we still weren't able +to include functional information like predicted functions, PFAM domains, or EC numbers. + +ENA requires that annotated genome assemblies be submitted as EMBL files. To control the process, we +wrote a [Makefile](./Makefile). This was an excellent decision to make early in the submission +procedure, as we ended up having to finetune and re-run the entire thing multiple times. The steps +are: + +1. [compose](gff-01-compose_gff.sh) the GFF by combining the protein coding gene models (Isoseq, + BRAKER3 round 1, BRAKER3 round 2, de novo assembled transcriptomes), and the tRNA prediction. We + also manually replaced three gene models that were incorrectly split up with an earlier BRAKER3 + gene model. At this stage, we officially renamed pseudochromosomes 54-59 to 52-57, and sort the + resulting GFF file. +2. [annotate](gff-02-functional_annot.ipynb) the GFF by adding information from EggNOG-mapper. We + need a [wrapper script](gff-02-functional_annot.sh) for that, since using conda environments is + rather tricky with `make`. The information added, if present, is the proposed gene symbol, + function description, EC number, and PFAM domains identified. +3. [conform](gff-03-ENA_conform.sh) the GFF to ENA standards. This step flags short introns (less + than 10 bases, presumably created by polymerase slippage or post-transcriptional modifications) + with the `pseudo` tag, ensuring that the ENA validator doesn't complain about them. A previous + version (commented out) [extracted](gff-03-build_kill_list.py) the offending exons into a new GFF + file; this approach was abandoned because it had to extract entire genes, which was beside the + point. +4. [convert](gff-04-convert_to_embl.sh) the GFF to EMBL format by combining it with the FASTA file. + Uses [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3) (and a lot of the useful information + provided in its solved and unsolved GitHub issues, the true source of most useful information + about ENA). Before running the real script, one has to run `EMBLmyGFF3` with the + `--expose-translations` and modify the resulting .json file in order to suppress exons (as this + will create the "abutting features cannot be adjacent" error that the validator throws). +5. [submit](gff-05-submit_to_ENA.sh) the EMBL file to ENA. The manifest file and the chromosome file + still have to be written separately. + +Big thanks to [Jacques Dainat](https://github.com/Juke34), who wrote and is maintaining +[AGAT](https://github.com/NBISweden/AGAT) and [EMBLmyGFF3](https://github.com/NBISweden/EMBLmyGFF3), +as well as diligently answering questions and resolving issues. \ No newline at end of file -- GitLab