From a40f90aa00c2d0bd92b9d5fdbf2f023afa2c0658 Mon Sep 17 00:00:00 2001
From: Niko <nikolaos.papadopoulos@univie.ac.at>
Date: Tue, 19 Nov 2024 22:07:39 +0100
Subject: [PATCH] added overview of Zenodo files

---
 zenodo.md | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 zenodo.md

diff --git a/zenodo.md b/zenodo.md
new file mode 100644
index 0000000..9dc8021
--- /dev/null
+++ b/zenodo.md
@@ -0,0 +1,165 @@
+# List of files on Zenodo
+
+Data accompanying the manuscript have been uploaded to Zenodo (DOI: 10.5281/zenodo.14185694). Here we
+briefly describe each file and group the files by purpose.
+
+## referenced supplement
+
+Supplementary material from the manuscript.
+
+- suppl-fig1-alignment.pdf
+- suppl-fig2-genomescope.png
+- suppl-fig3-abdA_araneae.png
+- suppl-fig4-genomic_context.png
+
+- suppl-table1-data_overview.tsv
+- suppl-table2-Plit_COI-alignment_distances.pdf
+- suppl-table3-isoseq.tsv
+- suppl-table4-genome_progress.tsv
+- suppl-table5-abdA.tsv
+- suppl-table6-r2_g3735-Alignment-HitTable.csv
+- suppl-table7-chelicerate_genomes.tsv
+- suppl-table8-arthropod_repeat_content.tsv
+- suppl-table9-chelicerate_repeat_content.tsv
+- suppl-table10-Pycnognonum_microRNAs.xlsx
+
+- suppl-file1-araneae_abdA.fasta
+- suppl-file2-arthropod_genomes.tsv
+
+- unref-table-mapping_rates.tsv
+
+## figures
+
+- figs.zip: archive of raw and processed figures for the manuscript in full resolution
+
+## analysis
+
+### genomic context
+
+The broader arthropod/chelicerate context for the _P. litorale_ genome assembly. Files used or generated by the [genomic_context](https://github.com/galicae/plit-genome/blob/main/07-analysis/genomic_context.ipynb) notebook.
+
+- araneae.tsv: repeat makeup of published chelicerate assemblies
+- arthropoda.tsv: genome assembly statistics for arthropod genomes. From NCBI Genomes.
+- modern_taxids.txt: list of taxonomic IDs for species; made to be submitted to NCBI Taxonomy.
+- tax_report.txt: the full taxonomic report for each query species. Contains tax IDs for the entire lineage.
+- total_repeats.tsv: total repeat content of published chelicerate genomes.
+
+### Hox genes
+
+Files concerning the Hox gene cluster analysis.
+
+- Plit_HoxGenes.fasta: putative Hox gene sequences for _P. litorale_
+- putative_hox-Alignment-HitTable.csv
+- araneae_abdA.fasta: arachnid abdA sequences from NCBI
+- araneae_abdA.m8: alignment of arachnid abdA sequences against the draft genome
+- plit.m8: MMseqs2 alignment results for the putative Hox sequences against the draft genome.
+- r2_g3735-Alignment-HitTable.csv: NCBI BLAST results (nr) for gene r2_g3735
+
+## processed (intermediate) data
+
+### 00-kmer-jellyfish.zip
+
+k-mer spectra analysis with GenomeScope and GenomeScope2.0
+
+### 00-seq-qc.zip
+
+quality control output for raw sequencing data (e.g. FastQC output)
+
+### 01-assembly
+
+- assembly_graph.gfa: Flye output
+- assembly_graph.gv: Flye output
+- assembly_info.txt: Flye output
+- assembly.fasta: Flye output
+- backmap.hifi.sort.bam.cov-hist.pdf: coverage histogram of the back-mapped PacBio data
+- backmap.ont.sort.bam.cov-hist.pdf: coverage histogram of the back-mapped ONT data
+- BUSCO.arthropoda_odb10.txt: BUSCO completeness report (arthropoda_odb10)
+- BUSCO.metazoa_odb10.txt: BUSCO completeness report (metazoa_odb10)
+- flye.log: Flye assembler log
+- quast_report.pdf: assembly QC
+
+### 02-scaffold
+
+`yahs` output files:
+
+- asm_hic.sorted.bam
+- flye-yahs.fa
+- yahs.out_scaffolds_final.agp
+- yahs.out_scaffolds_final.fa.hic
+- yahs.out_scaffolds_final.fa.assembly
+
+Juicebox (manual curation) results:
+
+- yahs.out_scaffolds_final.fa.review.assembly
+- 02-flye-yahs-juicebox.fa
+
+GAP `sort_scaffolds` pipeline outputs:
+
+- 03-flye-yahs-juicebox-merge.fasta
+- plit_q_0_50000_0.5FracBest_unseen_scaffolds.txt
+- plit_q_0_50000_0.5FracBest_insertion_stats.tsv
+- plit_q_0_50000_0.5FracBest_appended_scaffolds.tsv
+- plit_q_0_50000_0.5FracBest_inserted_scaffolds.tsv
+
+### 03-contamination
+
+Refer to the [contamination analysis](https://github.com/galicae/plit-genome/blob/main/04-contam/README.md) for details.
+
+Removing non-metazoan scaffolds from the draft genome:
+
+- plit_q_0_50000_0.5FracBest_output_filtered.fasta: input draft genome
+- contam_tax.m8: MMseqs2 alignment results of UniRef90 against the draft genome
+- scaffolds_taxonomic_distribution.tsv: summary of contam_tax.m8; number of genes from each taxonomic level per scaffold.
+- scaffolds_taxonomic_distribution_collapsed_vir.tsv: scaffolds with predominantly viral hits
+- scaffolds_taxonomic_distribution_suspect.tsv: scaffolds whose genes are <90% metazoan
+
+Checking for widespread _Metridium_ contamination:
+
+- primary_mq30.txt: list of reads with high-quality mappings (presumptively from _Metridium_)
+- metridium_scaffolds.txt_summary: number of presumptive _Metridium_ reads per draft scaffold
+- metridium_scaffolds.txt: filtered SAM file with all high-quality "_Metridium_" hits on draft scaffolds
+- metridium_contigs.sam_summary: number of presumptive _Metridium_ reads per Flye contig
+
+### 04-annotation
+
+Repeat analysis with RepeatModeler/RepeatMasker:
+
+- draft.fasta.tbl: output of RepeatMasker - summary of repeat content in tabular form
+- pb.sam.flagstats: summary of mapping the repeat families to the PacBio data.
+- pycno-families.fa: output of RepeatModeler - the sequences of the _P. litorale_ repeat families
+- draft.fasta.out.gff: output of RepeatMasker - repeat locations on the draft genome
+
+Protein coding gene annotation:
+
+- annot-01-isoseq.gff: GFF file with the gene models proposed using Iso-seq isoforms
+- annot-01-braker.gff: GFF file with the gene models proposed from round 1 of BRAKER3 using developmental transcriptomes
+- annot-02-braker.gff3: GFF file with the gene models proposed from round 2 of BRAKER3, using developmental transcriptome reads that weren't used in round 1
+- annot-03-denovo.gff3: GFF file with the gene models proposed from the de novo transcriptomes
+- deep_denovo_assemblies.zip: the de novo assembled transcriptomes from the deeply sequenced developmental time points. Also available on ENA.
+
+tRNAscan output:
+
+- trnascan.bed
+- trnascan.out
+- trnascan.fasta
+- trnascan.stats
+
+MirMachine output:
+
+- Pli_september.PRE.gff: MirMachine output with permissive threshold
+- Pli_september.PRE-1.gff: MirMachine output with strict threshold
+- Pli_september.PRE.fasta: predicted miRNA sequences
+
+## results
+
+- draft_softmasked.fasta: draft genome with repetitive regions softmasked
+- draft.fasta: draft genome in FASTA format
+- hox.gff3: positions of the Hox genes in GFF3 format
+- merged_sorted.gff3: protein-coding gene models from all rounds of annotation
+- transcripts.fa: TransDecoder-extracted putative transcripts
+- transcripts.fa.transdecoder.pep: TransDecoder predicted peptides
+- out.emapper.annotations: EggNOG-mapper functional annotation for the predicted peptides
+- out.emapper.best.annotations: filtered EggNOG-mapper annotation; best hit per gene kept
+- miRNA.fasta: FASTA sequences of predicted miRNAs
+- miRNA.lenient.gff: GFF of miRNA positions (permissive MirMachine cutoff)
+- miRNA.strict.gff: GFF of miRNA positions (strict MirMachine cutoff)
\ No newline at end of file
-- 
GitLab