Data accompanying the manuscript have been uploaded to Zenodo (DOI: 10.5281/zenodo.14185694). Below,
each file is briefly described, and the files are arranged into meaningful groups.
## referenced supplement
Supplementary material from the manuscript.
- suppl-fig1-alignment.pdf
- suppl-fig2-genomescope.png
- suppl-fig3-abdA_araneae.png
- suppl-fig4-genomic_context.png
- suppl-table1-data_overview.tsv
- suppl-table2-Plit_COI-alignment_distances.pdf
- suppl-table3-isoseq.tsv
- suppl-table4-genome_progress.tsv
- suppl-table5-abdA.tsv
- suppl-table6-r2_g3735-Alignment-HitTable.csv
- suppl-table7-chelicerate_genomes.tsv
- suppl-table8-arthropod_repeat_content.tsv
- suppl-table9-chelicerate_repeat_content.tsv
- suppl-table10-Pycnognonum_microRNAs.xlsx
- suppl-file1-araneae_abdA.fasta
- suppl-file2-arthropod_genomes.tsv
- unref-table-mapping_rates.tsv
## figures
- figs.zip: archive of raw and processed figures for the manuscript in full resolution
## analysis
### genomic context
The broader arthropod/chelicerate context for the _P. litorale_ genome assembly. Files used/generated by the [genomic_context](https://github.com/galicae/plit-genome/blob/main/07-analysis/genomic_context.ipynb) notebook.
- araneae.tsv: repeat makeup of published chelicerate assemblies
- arthropoda.tsv: genome assembly statistics for arthropod genomes. From NCBI Genomes.
- modern_taxids.txt: list of taxonomic IDs for species; made to be submitted to NCBI Taxonomy.
- tax_report.txt: the full taxonomic report for each query species. Contains tax IDs for the entire lineage.
- total_repeats.tsv: total repeat content of published chelicerate genomes.
### Hox genes
Files concerning the Hox gene cluster analysis.
- Plit_HoxGenes.fasta: putative Hox gene sequences for _P. litorale_
- putative_hox-Alignment-HitTable.csv:
- araneae_abdA.fasta: arachnid abdA sequences from NCBI
- araneae_abdA.m8: alignment of arachnid abdA sequences against the draft genome
- plit.m8: MMseqs2 alignment results for the putative Hox sequences against the draft genome.
- r2_g3735-Alignment-HitTable.csv: NCBI BLAST results (nr) for gene r2_g3735
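
The `.m8` files listed above are tab-separated alignment tables. The following is a minimal Python sketch for loading one and keeping the best hit per query. It assumes the default 12-column BLAST-tab layout that MMseqs2 writes (query, target, pident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits); if the files were produced with a custom output format, the column list needs adjusting. The names `read_m8` and `best_hit_per_query` and the demo rows are illustrative, not part of the deposit.

```python
import csv
import io
from collections import namedtuple

# Assumed column order: the default BLAST-tab (m8) layout written by MMseqs2.
M8_COLUMNS = [
    "query", "target", "pident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
Hit = namedtuple("Hit", M8_COLUMNS)


def read_m8(handle):
    """Yield one Hit per non-empty, non-comment line of a 12-column m8 table."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, target, pident, *ints, evalue, bits = row[:12]
        yield Hit(query, target, float(pident), *map(int, ints),
                  float(evalue), float(bits))


def best_hit_per_query(hits):
    """Keep the hit with the highest bit score for each query sequence."""
    best = {}
    for hit in hits:
        if hit.query not in best or hit.bits > best[hit.query].bits:
            best[hit.query] = hit
    return best


# Hypothetical demo rows; real input would be e.g. open("plit.m8").
demo = io.StringIO(
    "hoxA\tscaf1\t98.5\t200\t3\t0\t1\t200\t1000\t1199\t1e-50\t350\n"
    "hoxA\tscaf2\t80.0\t150\t30\t0\t1\t150\t500\t649\t1e-20\t120\n"
)
demo_best = best_hit_per_query(read_m8(demo))
```

Only the standard library is used here, so the sketch runs without extra dependencies.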
## processed (intermediate) data
### 00-kmer-jellyfish.zip
k-mer spectrum analysis with GenomeScope and GenomeScope 2.0
### 00-seq-qc.zip
quality control output for raw sequencing data (e.g. FastQC output)
### 01-assembly
- assembly_graph.gfa: Flye output
- assembly_graph.gv: Flye output
- assembly_info.txt: Flye output
- assembly.fasta: Flye output
- backmap.hifi.sort.bam.cov-hist.pdf: coverage histogram of the back-mapped PacBio data
- backmap.ont.sort.bam.cov-hist.pdf: coverage histogram of the back-mapped ONT data
- contam_tax.m8: Alignment results of UniRef90 against draft genome (MMseqs2)
- scaffolds_taxonomic_distribution.tsv: summary of contam_tax.m8; number of genes from each taxonomic level per scaffold.
- scaffolds_taxonomic_distribution_collapsed_vir.tsv: scaffolds with predominantly viral hits
- scaffolds_taxonomic_distribution_suspect.tsv: scaffolds whose genes are <90% metazoan
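
The "<90% metazoan" criterion for suspect scaffolds can be expressed as a small filter. This is only a sketch of the rule as described above: the input layout (a dictionary of per-scaffold gene counts keyed by taxon label), the label `"Metazoa"`, and the function name are hypothetical and do not reflect the actual columns of the `.tsv` files.

```python
def suspect_scaffolds(gene_counts, threshold=0.90, clade="Metazoa"):
    """Return scaffolds whose fraction of genes assigned to `clade` is below threshold.

    gene_counts: {scaffold: {taxon_label: number_of_genes}} (hypothetical layout).
    """
    suspects = []
    for scaffold, per_taxon in gene_counts.items():
        total = sum(per_taxon.values())
        if total == 0:
            continue  # no classified genes, nothing to judge
        if per_taxon.get(clade, 0) / total < threshold:
            suspects.append(scaffold)
    return suspects
```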
Checking for widespread _Metridium_ contamination:
- primary_mq30.txt: list of high-quality mapped reads (presumptively of _Metridium_ origin)
- metridium_scaffolds.txt_summary: number of presumptive _Metridium_ reads per draft scaffold
- metridium_scaffolds.txt: filtered SAM file with all high-quality "_Metridium_" hits on draft scaffolds
- metridium_contigs.sam_summary: number of presumptive _Metridium_ reads per Flye contig
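
Per-scaffold summaries of this kind can be derived from a SAM file by counting reads per reference sequence. The sketch below is one way to do this, not necessarily the commands used to produce the files above. It assumes a mapping-quality cutoff of 30 (suggested by the file name `primary_mq30.txt`, but not stated) and that unmapped, secondary, and supplementary alignments are excluded; the function name is illustrative.

```python
from collections import Counter

# Standard SAM flag bits
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_reads_per_reference(sam_lines, min_mapq=30):
    """Count primary alignments with MAPQ >= min_mapq per reference name."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag, rname, mapq = int(fields[1]), fields[2], int(fields[4])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if mapq >= min_mapq:
            counts[rname] += 1
    return counts
```

Passing an open file handle (for example, one opened on `metridium_scaffolds.txt`) works as well as a list of lines, since the function only iterates over its input.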
### 04-annotation
Repeat analysis with RepeatModeler/RepeatMasker:
- draft.fasta.tbl: output of RepeatMasker - summary of repeat content in tabular form
- pb.sam.flagstats: summary of mapping the repeat families to the PacBio data.
- pycno-families.fa: output of RepeatModeler - the sequences of the _P. litorale_ repeat families
- draft.fasta.out.gff: output of RepeatMasker - repeat locations on the draft genome
Protein-coding gene annotation:
- annot-01-isoseq.gff: GFF file with the gene models proposed using Iso-seq isoforms
- annot-01-braker.gff: GFF file with the gene models proposed from round 1 of BRAKER3 using developmental transcriptomes
- annot-02-braker.gff3: GFF file with the gene models proposed from round 2 of BRAKER3, using developmental transcriptome reads that weren't used in round 1
- annot-03-denovo.gff3: GFF file with the gene models proposed from the de novo transcriptomes
- deep_denovo_assemblies.zip: the de novo assembled transcriptomes from the deeply sequenced developmental time points. Also available on ENA.
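
A quick way to compare the annotation rounds listed above is to count feature types in each GFF file. The sketch below relies only on the standard nine-column, tab-separated GFF layout, with the feature type in column 3. Whether every file labels its gene models as `gene` is an assumption to check per file, and the function name is illustrative.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF features by the type given in column 3, skipping comments."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # directive, comment, or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:  # ignore anything that is not a feature row
            counts[fields[2]] += 1
    return counts


# Example use (file name taken from the list above):
# with open("annot-01-braker.gff") as handle:
#     print(count_feature_types(handle))
```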
tRNAscan output:
- trnascan.bed
- trnascan.out
- trnascan.fasta
- trnascan.stats
MirMachine output:
- Pli_september.PRE.gff: MirMachine output with permissive threshold
- Pli_september.PRE-1.gff: MirMachine output with strict threshold