Plant Comparative Metabolome Database
A multi-level comparison database based on predicted metabolic profiles in 530 plant species
1. Construction of plant metabolic network

The plant metabolites were predicted using genome-scale metabolic models (GEMs), which are computational models that represent all metabolic reactions occurring within a cell (Thiele & Palsson, 2010). In this study, the metabolic network for each plant was reconstructed de novo using the automated modeling tool RAVEN 2.0 (Wang et al., 2018).

RAVEN utilizes the MetaCyc and KEGG databases in the process of de novo metabolic network reconstruction. The MetaCyc-based reconstruction module facilitates the identification of species-specific reactions and predicted metabolites by comparing amino acid sequences of enzymes to those in the MetaCyc database, generating a draft model. Similarly, the KEGG-based reconstruction module employs a Hidden Markov Model (HMM) trained on genes annotated in KEGG to assess protein sequence similarities for the target species.

MetaCyc-based draft models were generated using the “getMetaCycModelForOrganism” function with default settings. KEGG-based draft models were generated using the “getKEGGModelForOrganism” function, which employs an HMM trained on eukaryotic sequences with 100% sequence identity to query the plant proteome. The resulting draft models were then integrated using the “combineMetaCycKEGGModels” function provided by RAVEN. This integration allowed us to leverage the complementary information from the MetaCyc and KEGG databases, thereby enhancing the comprehensiveness and accuracy of the final draft model. All of the above analyses were performed in MATLAB 2018a.

The complete metabolic networks were exported in Excel format, and the resulting combined model integrates both KEGG and MetaCyc identifiers for metabolites. To ensure consistency and accuracy, we reviewed and corrected the metabolite and reaction identifiers within the model and addressed reactions that lacked reaction formulas. A summary of the predicted metabolites for the 530 species is provided in the Species list.

2. Comparative analysis of metabolite similarity among different species

The Jaccard similarity coefficient was used to assess the similarity between two sets, such as the sets of metabolites predicted for two species. It is calculated by dividing the number of elements common to both sets (the intersection) by the total number of distinct elements present in either set (the union). For two sets of metabolites, A and B, the Jaccard similarity coefficient J is given by:

\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

Here, |A ∩ B| represents the number of elements in the intersection of sets A and B, and |A ∪ B| represents the number of elements in the union of sets A and B. A ratio closer to 1 indicates higher similarity, while a ratio closer to 0 indicates lower similarity between the two sets.
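
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the database's own code: the species names and metabolite identifiers are hypothetical placeholders, and returning 0.0 for two empty sets is an assumed convention.

```python
from itertools import combinations


def jaccard(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections of metabolite IDs."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treated as 0.0 here (an assumed convention).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical metabolite sets keyed by species (illustrative only).
profiles = {
    "species_1": {"CPD-A", "CPD-B", "CPD-C", "CPD-D"},
    "species_2": {"CPD-B", "CPD-C", "CPD-E"},
    "species_3": {"CPD-F"},
}

# Pairwise similarity for every unordered pair of species.
pairwise = {
    (s1, s2): jaccard(profiles[s1], profiles[s2])
    for s1, s2 in combinations(sorted(profiles), 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 3))
```

In this toy example, species_1 and species_2 share 2 of 5 distinct metabolites, giving J = 0.4, while species_3 shares none with either and scores 0.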

3. Metabolite enrichment analysis

To assess the significance of specific metabolite enrichment within plant categories, such as families, genera, or two groups of plants, we first constructed a presence-absence matrix. In this matrix, each row represents a metabolite, and each column represents a plant. If a metabolite is present in a plant, the corresponding cell is marked as 1; otherwise, it is marked as 0. This approach allows us to visualize the presence of metabolites in each plant. We then employed hypergeometric testing to determine the statistical significance of specific metabolite enrichment in different plant categories. This method calculates the probability of observing a given number of plants containing a metabolite in a particular plant category, assessing whether the metabolite is significantly enriched in that category. Specifically, we used the scipy statistical package (version 1.7.3) in Python (version 3.7.3) to calculate p-values based on the hypergeometric distribution formula.

\[ P(X=k) = \frac{\binom{M}{k} \binom{N-M}{n-k}}{\binom{N}{n}} \]

P(X=k) represents the probability that exactly k plants, out of a chosen subset of n plants, contain a specific metabolite. Here, N is the total number of plants, M is the number of plants that have a specific metabolite, and n is the number of plants in a specific category, such as a family or genus.
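
As a rough sketch of this test, the snippet below builds a tiny presence-absence matrix and evaluates the formula with the standard library only. The plant and metabolite names are hypothetical, and using the upper tail P(X ≥ k) as the enrichment p-value is an assumption, since the text does not say which tail was used. With scipy, the same point probability should correspond to `scipy.stats.hypergeom.pmf(k, N, M, n)` in the symbols used here (scipy's own documentation names these arguments M, n and N, in that order).

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(X = k): exactly k of the n category plants contain the metabolite."""
    if k < max(0, n - (N - M)) or k > min(M, n):
        return 0.0  # outside the support of the distribution
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def enrichment_pvalue(k, N, M, n):
    """Upper tail P(X >= k); an assumed choice of enrichment p-value."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(M, n) + 1))


# Hypothetical presence-absence matrix: rows = metabolites, columns = plants.
plants = ["p1", "p2", "p3", "p4", "p5", "p6"]
matrix = {
    "CPD-A": [1, 1, 1, 0, 0, 0],
    "CPD-B": [1, 0, 1, 0, 1, 0],
}
category = {"p1", "p2", "p3"}  # e.g. one family (illustrative only)

N = len(plants)    # total number of plants
n = len(category)  # number of plants in the category
results = {}
for metabolite, row in matrix.items():
    M = sum(row)  # plants containing this metabolite
    k = sum(v for plant, v in zip(plants, row) if plant in category)
    results[metabolite] = enrichment_pvalue(k, N, M, n)

for metabolite, p in results.items():
    print(metabolite, round(p, 4))
```

Here CPD-A occurs only in the three category plants, so its upper-tail probability is 1/20, whereas CPD-B is spread across the matrix and scores about 0.5.
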
4. Metabolite classification

To standardize compound classification information across different databases, we combined the classification standards of MetaCyc and ChEBI. ChEBI, which is maintained by the European Bioinformatics Institute, primarily focuses on classifying “small” chemical compounds based on their chemical structure and biological relevance. On the other hand, MetaCyc categorizes compounds based on their chemical structure and biological function, with an emphasis on functional aspects. Our approach predominantly used the MetaCyc classification system, complemented by the ChEBI classification system, to leverage the strengths of both databases. This approach ensures the accuracy and reliability of data through MetaCyc’s focus on experimentally validated metabolic pathways, while simultaneously broadening the database’s scope through ChEBI’s extensive coverage of compounds.

In light of the complexity of compound chemical structures and the resulting variation in classification depth, we standardized the database’s classification to a six-tier system. This system encompasses chemical entities, molecular entities, comprehensive biochemical molecules (e.g. esters, alcohols), fundamental biomolecules (e.g. lipids, organic acids), cellular functions and signaling molecules (e.g. glycerolipids, steroids), and the compounds themselves. This classification simplifies the database structure while ensuring data consistency and accuracy. It provides sufficient detail to describe the chemical and biological attributes of compounds without overwhelming complexity, ultimately facilitating user understanding and retrieval.
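
One way to picture a six-tier record is as a fixed-length path from the broadest tier down to the compound itself. The sketch below is purely illustrative: the field names, the tier labels and the example compound are assumptions made for this example and are not taken from the database schema.

```python
from typing import NamedTuple


class Classification(NamedTuple):
    """One compound's path through the six tiers, broadest first."""
    chemical_entity: str
    molecular_entity: str
    biochemical_class: str   # e.g. esters, alcohols
    biomolecule_class: str   # e.g. lipids, organic acids
    functional_class: str    # e.g. glycerolipids, steroids
    compound: str


def shared_depth(a, b):
    """Count the leading tiers that two classification paths share."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


# Hypothetical records; labels are placeholders, not database values.
record = Classification(
    "chemical entity", "molecular entity",
    "esters", "lipids", "glycerolipids", "example-compound",
)
other = record._replace(functional_class="steroids", compound="other-compound")

print(shared_depth(record, other))
```

Because every record has the same six levels, two compounds can be compared tier by tier; the two hypothetical records above diverge at the fifth tier.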

5. Software
Software Version Function Description Parameters/Command Lines