CottonMD

A Multiomics Database for cotton biological study


  Introduction of CottonMD


1. What is CottonMD?


            CottonMD (Cotton Multiomics Database, http://yanglab.hzau.edu.cn/CottonMD)  is a pangenome-based cotton multi-omics database.
            Here, we firstly constructed a genetic variation panel of a large-scale cotton population comprising of 4,180 accessions. Then, we built a cotton pangenome with 146,881 gene clusters for 1,521,966 genes from 25 cotton genome assemblies. Based on this pangenome, we established a pangenome-based multiomics database (designated as CottonMD) by mining and integrating the data of 25 genomes, transcriptomes (from 76 tissues and 240 individuals), genetic variations (from 4,180 accessions), phenotypic data (from 20 phenotypes), epigenomes (from six germplasms) and 768 metabolites (from four tissues).
            CottonMD provides a large amount of multiomic data and easy-to-use search tools, which will be a valuable database for future cotton genetic breeding and functional genomics research.
        

2. Home Page


        The home page introduces the basic information, construction process and functional modules in CottonMD. In addition, a search box was designed for users to quickly acquire all related multi-omics knowledge of interested genes, which is convenient and helpful to understand the role of genes from different omic levels.
        

3. Genomics

4. Population

5. Variation

6. Transcriptomics

7. Epigenetics

8. Multi-omics

9. Metabolome



  How to construct CottonMD


1. Comparative genome analysis


        Genome sequences between TM-1 reference genome and other 25 genome assemblies were compared using NUCmer program with parameters "nucmer -- maxmatch --noextend" in MUMmer4 (Marcais et al 2018). After filtering one-to-one alignments with a minimum alignment length of 50 bp using the show-diff program from MUMmer4, the left alignment blocks were used for genome browser visualization (Marcais et al 2018). As for genome browser visualization, dotplot module in JBrowse 2 and Genome_syn module in Gbrowser were embedded in CottonMD (McKay et al 2010, Buels et al 2016).
        1,826,891 genes of 25 genome assemblies were used to construct the gene index. Gene synteny of every pairs from 25 genome assemblies were detected by McScanX. 149,561 gene clusters were functionally annotated based on the with homology with Arabidopsis thaliana. The gene clusters with the same Arabidopsis thaliana gene were merged. Finally, 15,950 gene clusters were left as the gene indexes in cotton.
        

2. SNP and InDel calling


        The genome resequencing data of each accessions were mapped to the TM-1 reference genome using BWA-MEM with default parameters (Li 2013, Huang et al 2020). Then we removed the reads with the mapping quality value < 20 by SAMtools (v1.6) (Li et al 2009). SNPs and small InDels were identified using Sentieon DNAseq pipeline for each accession (Kendig et al 2019). SNPs with low mapping quality were filtered out by GATK VariantFiltration with parameters "QUAL < 30.0 || MQ < 50.0 || QD < 2" (McKenna et al 2010). All SNPs and InDels with minor allele frequencies (MAF) < 0.01 and missing > 0.1 were discarded by VCFtools software (Danecek et al 2011). As for the left SNPs and InDels, genotype imputation was performed using beagle (v5.1) (Browning B L et al 2016).
        

3. Transcritome analysis


        After clipping the adaptor sequences and removing the low-quality reads by Trimmomatic software (v0.36), the RNA-seq clean data from accessions were mapped to the TM-1 reference genome using Hisat2 (v2.1.2) with default parameters. Gene expression level was normalized using the number of fragments per kilo-base of transcript per million mapped reads (FPKM) by StringTie software (v1.3.5) with default settings. The co-expression network is obtained by calculating the Pearson correlation coefficient of pairwise gene pairs, the gene modules including the gene pairs with a Pearson coefficient greater than 0.8 were retained as a co-expression network.
        

4. Epigenome analysis


        Clipping the adaptor sequences and removing the low-quality reads were filtered out using Trimmomatic software (v0.36). As for ChIP-seq and ATAC-seq, the clean data from accessions were mapped to the TM-1 reference genome using bowtie2 (v2.3.2) with default parameters. PCR duplicated reads were removed using Picard tools (v2.19). Peaks were called using callpeak module of MACS2 software (v2.1.2) with parameters "--nomodel -q 0.05 --extsize 200 --shift -100 -g 8.84e8 --keep-dup all -B --call-summit".
        As for Hi-C, clean data of each accessions were mapped to the TM-1 reference genome using BWA-MEM with default parameters (Li 2013, Huang et al 2020). Then, Hi-C interaction matrix were created using Juice pipeline. KR normalized matrix were extracted from Hi-C format files at the resolutions of 10kb, 50kb, 100kb using Juicer_tools (v1.7.6) for JBrowse. 
        As for BS-seq data, clean data of each accessions were mapped to the TM-1 reference genome using Bismark (v0.13.0) with default parameters. Then, deduplicated reads were filtered out using deduplicate_bismark. After extracting the methylation information by Bismark methylation extractor, the degree of methylation of all genes in TM-1 genome were calculated.
        Bigwig files of all epigenome data analysis can be visited by JBrowser in tools module.
        

5. Population genetic analysis


        Population structure of all accessions was analyzed using fastStructure with K from 2 to 10 (Raj et al 2014). Principal component analysis (PCA) was performed using GCTA (Yang et al 2011).
        As for 7 subpopulation, genetic diversity (π) and Tajima's D were calculated using VCFtools with the window size of 100 kb and the step size of 20 kb. To identify domestication and improved effects, the population fixation statistics FST and XP–CLR among subpopulations were used to scan domestication-sweep regions. FST was calculated using VCFtools within 100 kb windows sliding 20 kb. Population-level FST was estimated as the average of all sliding windows. Windows with an empirical FST cutoff (top 5%) were regarded as highly differentiated regions. XP–CLR was calculated with the parameters "--maxsnps 600 --size 100000 --step 20000".
        

6. Genome-wide association study (GWAS)


        The SNPs and InDels with a minor allele frequency (MAF) of less than 5% were filtered for genome-wide association study (GWAS). GWAS were performed for 23 traits using the GEMMA. The population structure was controlled for by including the first two principal components as covariates, as well as an IBS kinship matrix derived from all variants (SNP and InDels) calculated by GEMMA. The cutoff for determining significant associations was set as -log10(1/n), where n represents the total number of variations.
        

7. Expression quantitative trait loci (eQTL) mapping


        The gene expression values used as the phenotype for eQTL mapping. Only genes expressed in more than 95% of the accessions were defined as expressed genes for eQTL mapping. Variations with MAF > 5% were used to perform eQTL mapping using GEMMA to detect the associations for variations and genes. The cutoff for determining significant associations was set as -log10(1/n), where n represents the total number of variations. Based on the distance between eQTL and targeted-genes, we subdivide all eQTL into cis-eQTL if the variation was found within 1Mb of transcription start site or transcription end site of the target gene, otherwise as trans-eQTL. In CottonMD, regulation relationship of trans-eQTL was visualized using BioCircos.js (Cui et al 2016).
        

8. Transcriptome-wide association studies (TWAS)


        TWAS in this study provided an approach to integrate GWAS and gene expression datasets to identify gene-trait associations. We conducted TWAS by EMMAX module using the gene expression data of fiber at 20 DAP with 6 phenotypic data from cis-eQTL of upstream 1 Mb to downstream 1 Mb of target-genes to compute the gene expression weights (Gusev et al 2016, Wainberg et al 2019). Models were considered 'transcriptome-wide significant' if they passed the Bonferroni correction for all genes.
        

9. Colocalization and SMR analysis


        Co-localization of GWAS and eQTL were performed to generate posterior causal probabilities for each of variants in the GWAS and eQTL analyses. All variations within 1Mb flanking regions around gene were tested for colocalization using 'COLOC' R package with default parameters (Hormozdiari et al 2016).
        SMR analysis integrates summary-level data from GWAS with data from eQTL to identify genes whose expression levels are associated with a complex trait because of pleiotropy. Online SMR analysis in CottonMD were performed using SMR software (Zhu et al 2016).
        

10. Implementation


        CottonMD (http://yanglab.hzau.edu.cn/CottonMD) was constructed based on the ThinkPHP (v5.0) framework, and runs on the Apache 2 web server (v2.4.18) with MySQL (v5.7.34) as its database engine. The database is available online without registration and optimized for Chrome (recommended), Internet Explorer, Opera, Firefox, Windows Edge and macOS Safari.
        

  References


[1] Browning B L, Browning S R. Genotype Imputation with Millions of Reference Samples. Am J Hum Genet, 2016, 98: 116-126
[2] McKay S J, Vergara I A, Stajich J E. Using the Generic Synteny Browser (GBrowse_syn). Current protocols in bioinformatics, 2010, Chapter 9: Unit-9.12
[3] Yang J, Lee S H, Goddard M E, Visscher P M. GCTA: a tool for genome-wide complex trait analysis. Am J Hum Genet, 2011, 88: 76-82
[4] Buels R, Yao E, Diesh C M, Hayes R D, Munoz-Torres M, Helt G, Goodstein D M, Elsik C G, Lewis S E, Stein L, Holmes I H. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol, 2016, 17: 66
[5] Cui Y, Chen X, Luo H, Fan Z, Luo J, He S, Yue H, Zhang P, Chen R. BioCircos.js: an interactive Circos JavaScript library for biological data visualization on web applications. Bioinformatics, 2016, 32: 1740-1742
[6] Danecek P, Auton A, Abecasis G, Albers C A, Banks E, DePristo M A, Handsaker R E, Lunter G, Marth G T, Sherry S T, McVean G, Durbin R, Group G P A. The variant call format and VCFtools. Bioinformatics, 2011, 27: 2156-2158
[7] Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx B W J H, Jansen R, de Geus E J C, Boomsma D I, Wright F A, Sullivan P F, Nikkola E, Alvarez M, Civelek M, Lusis A J, Lehtimäki T, Raitoharju E, Kähönen M, Seppälä I, Raitakari O T et al.Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet., 2016, 48: 245-252
[8] Hormozdiari F, van de Bunt M, Segre A V, Li X, Joo J W J, Bilow M, Sul J H, Sankararaman S, Pasaniuc B, Eskin E. Colocalization of GWAS and eQTL Signals Detects Target Genes. Am J Hum Genet, 2016, 99: 1245-1260
[9] Huang G, Wu Z, Percy R G, Bai M, Li Y, Frelichowski J E, Hu J, Wang K, Yu J Z, Zhu Y. Genome sequence of Gossypium herbaceum and genome updates of Gossypium arboreum and Gossypium hirsutum provide insights into cotton A-genome evolution. Nat Genet, 2020, 52: 516-524
[10] Kendig K I, Baheti S, Bockol M A, Drucker T M, Hart S N, Heldenbrand J R, Hernaez M, Hudson M E, Kalmbach M T, Klee E W, Mattson N R, Ross C A, Taschuk M, Wieben E D, Wiepert M, Wildman D E, Mainzer L S. Sentieon DNASeq Variant Calling Workflow Demonstrates Strong Computational Performance and Accuracy. Front. Genet., 2019, 10: 736
[11] Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv, 2013, 1303
[12] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009, 25: 2078-2079
[13] Marcais G, Delcher A L, Phillippy A M, Coston R, Salzberg S L, Zimin A. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol., 2018, 14: e1005944
[14] McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo M A. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res., 2010, 20: 1297-1303
[15] Raj A, Stephens M, Pritchard J K. fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics, 2014, 197: 573-589
[16] Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira A N, Knowles D A, Golan D, Ermel R, Ruusalepp A, Quertermous T, Hao K, Bjorkegren J L M, Im H K, Pasaniuc B, Rivas M A, Kundaje A. Opportunities and challenges for transcriptome-wide association studies. Nat Genet, 2019, 51: 592-599
[17] Zhu Z, Zhang F, Hu H, Bakshi A, Robinson M R, Powell J E, Montgomery G W, Goddard M E, Wray N R, Visscher P M, Yang J. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet., 2016, 48: 481-487