We collected 132 genomes from 74 species within the Asteraceae family, in which 4,408,432 genes have been annotated (refer to the “Genome” module). We provide gene annotations from seven different perspectives, including homology with Arabidopsis and sunflower, gene and transcription factor families (PlantTFDB [1] and iTAK [2]), Gene Ontology [3], KEGG pathways (https://www.genome.jp/kegg/), and Pfam domains [4] (Table 1). Users can quickly retrieve 11,475,543 annotation records for individual genes or gene sets by searching with gene IDs, GO/KEGG/Pfam identifiers, or transcription factor/gene family names. Synteny analysis was carried out using MCScanX [5], and users can browse the alignment results among genomes in the genome synteny module.
To establish a robust pan-genome, we selected genomes from 43 Asteraceae species based on two criteria: (i) a BUSCO completeness score greater than 80% [6], and (ii) the use of RNA-seq data for gene structure prediction. The protein sequences of genes from the 43 Asteraceae genomes were collected and used as input to OrthoFinder v2.5.4 [7]. This analysis produced a pan-genome comprising 95,770 gene clusters (groups of homologous genes).
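The two selection criteria for the pan-genome can be expressed as a simple filter. The following is a minimal sketch, not the pipeline actually used: the `GenomeRecord` type, its field names, and the example values are hypothetical and serve only to illustrate the logic (BUSCO completeness strictly greater than 80%, and RNA-seq evidence used in gene structure prediction).

```python
from dataclasses import dataclass


@dataclass
class GenomeRecord:
    """Hypothetical per-genome summary; field names are illustrative."""
    name: str
    busco_complete_pct: float  # BUSCO completeness, in percent
    rnaseq_supported: bool     # True if RNA-seq was used for gene structure prediction


def select_for_pangenome(records, min_busco=80.0):
    """Keep genomes with BUSCO completeness > min_busco and RNA-seq support."""
    return [
        r for r in records
        if r.busco_complete_pct > min_busco and r.rnaseq_supported
    ]


# Made-up values, for illustration only.
records = [
    GenomeRecord("genome_A", 95.2, True),
    GenomeRecord("genome_B", 78.9, True),   # fails the BUSCO criterion
    GenomeRecord("genome_C", 91.0, False),  # fails the RNA-seq criterion
    GenomeRecord("genome_D", 80.0, True),   # exactly 80% is not "greater than 80%"
]

selected_names = [r.name for r in select_for_pangenome(records)]
print(selected_names)
```

Only the protein sequences of genomes passing both checks would then be passed on to orthology clustering.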
We used Braker3 [8] to perform gene structure annotation on 24 high-quality Asteraceae genomes that lacked gene annotations. Braker3 integrates three annotation methods: ab initio gene prediction, homologous protein-based gene prediction, and RNA-seq-based gene prediction. The homologous protein library includes protein sequences from Arabidopsis, sunflower, cultivated Chrysanthemum, and lettuce, while the transcriptome data comprise samples from various tissues and treatments, with a mapping rate exceeding 80%.
Variations, including 30,507,963 SNPs and 12,257,327 InDels, were identified from 2,392 transcriptomes. To explore the impact of these variations on gene expression, we developed the 'Genome Variations' page. This page integrates genetic variation and gene expression data, providing descriptions of effect annotations and associations between genotype and gene expression. Users can enter a gene name, gene ID, or chromosome region to identify candidate variations that correlate with gene expression using the Single-locus module.
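As an illustration of what a single-locus genotype-expression association can look like, the sketch below correlates the alternate-allele dosage at one variant (0, 1, or 2 copies per sample) with the expression level of one gene across samples. This is an assumption for illustration: the statistical method used by the Single-locus module is not specified here, and the function name and example values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Returns 0.0 when either sequence has no variance (e.g. a monomorphic site).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


# Made-up data: alternate-allele dosage and expression for six samples.
dosage = [0, 0, 1, 1, 2, 2]
expression = [1.1, 0.9, 2.0, 2.2, 3.1, 2.9]

r = pearson_r(dosage, expression)
print(f"genotype-expression correlation: r = {r:.3f}")
```

A variant whose dosage tracks expression closely (|r| near 1) would be a candidate for follow-up; a real analysis would also need a significance test and correction for multiple testing.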
SNP calling was performed using the Genome Analysis Toolkit (v4.1.4.1) [9]. SNPs from the joint genotyping were further filtered to exclude sites with a minor allele frequency (MAF) < 0.05 and sites with missing data. The annotations and effects of SNPs on gene function were predicted using SnpEff (v5.0) [10].
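The site-level filter described above can be sketched as follows. This is a minimal illustration rather than the actual filtering command: it assumes biallelic sites in diploid samples, with each genotype encoded as the number of alternate alleles (0, 1, or 2) and `None` for a missing call. The function names are hypothetical.

```python
def minor_allele_frequency(genotypes):
    """MAF at a biallelic site from diploid alternate-allele counts (0/1/2).

    Missing calls (None) are ignored; returns 0.0 if no sample is called.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1.0 - alt_freq)


def keep_site(genotypes, min_maf=0.05):
    """Keep a site only if no genotype is missing and MAF is not below min_maf."""
    if any(g is None for g in genotypes):
        return False
    return minor_allele_frequency(genotypes) >= min_maf


# Made-up genotype vectors, for illustration only.
common_site = [0, 0, 1, 2]          # alt frequency 3/8, kept
rare_site = [0] * 19 + [1]          # alt frequency 1/40 = 0.025, removed
missing_site = [0, 1, None, 2]      # contains a missing call, removed

for label, site in [("common", common_site), ("rare", rare_site), ("missing", missing_site)]:
    print(label, keep_site(site))
```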