CottonMD

A Multiomics Database for cotton biological study

Methods

1 Comparative genome analysis

Genome sequences were compared between TM-1 reference genome and other 25 genome assemblies using the NUCmer program (v 4.0.0beta2) with parameters "nucmer --maxmatch –noextend" in MUMmer4. After filtering of the one-to-one alignments with a minimum alignment length of 50 bp using the show-diff program from MUMmer4, the remaining alignment blocks were used for genome browser visualization. For genome browser visualization, dotplot module in JBrowse2 and Genome synteny module in Gbrowser were embedded in CottonMD

A total of 1,826,891 genes from 25 genome assemblies were used to construct the gene clusters. Firstly, protein sequences of every pair from 25 genome assemblies were aligned using diamond (v 0.9.14.115, http://github.com/bbuchfink/diamond). Then, gene synteny was detected by McScan (python version). The genes with gene synteny were grouped to one cluster. Finally, 1,521,966 genes were grouped to 146,881 gene clusters.

2 SNP and InDel calling

The genome resequencing data of each accession were mapped to the TM-1 reference genome using BWA-MEM with default parameters. Then, the reads with the mapping quality value < 20 were removed by SAMtools (v 1.6). SNPs and small InDels were identified using Sentieon DNAseq pipeline for each accession. SNPs with low mapping quality were filtered out by GATK VariantFiltration with parameters "QUAL < 30.0 || MQ < 50.0 || QD < 2". All SNPs and InDels with minor allele frequencies (MAF) < 0.01 and missing rate > 0.1 were discarded by VCFtools (v0.1.16). As for the remaining SNPs and InDels, genotype imputation was performed using beagle (v 5.1).

3 Transcriptome analysis

After clipping the adaptor sequences and removing the low-quality reads by Trimmomatic software (v 0.36), the RNA-seq clean data from accessions were mapped to the TM-1 reference genome using Hisat2 (v 2.1.2) with default parameters. Gene expression level was normalized using the number of fragments per kilo-base of transcript per million mapped reads (FPKM) by StringTie software (v 1.3.5) with default settings. The co-expression network was obtained by calculating the Pearson correlation coefficient of pairwise gene expression levels, and the gene modules including the gene pairs with a Pearson coefficient of larger than 0.8 were retained as a co-expression network.

4 Epigenome analysis

The adaptor sequences were removed and the low-quality reads were filtered out using Trimmomatic (v 0.36). As for ChIP-seq and ATAC-seq, the clean data from accessions were mapped to the TM-1 reference genome using bowtie2 (v 2.3.2) with default parameters. PCR duplicated reads were removed using Picard tools (v 2.19). Peaks were called using the callpeak module of MACS2 software (v 2.1.2) with the parameters "--nomodel -q 0.05 --extsize 200 --shift -100 -g 8.84e8 --keep-dup all -B --call-summit".

As for Hi-C, clean data of each accession were mapped to the TM-1 reference genome using BWA-MEM with default parameters. Then, the Hi-C interaction matrix was created using Juicer pipeline. KR normalized matrix was extracted from Hi-C format files at the resolutions of 10 kb, 50 kb and 100 kb using Juicer_tools (v 1.7.6) for JBrowser.

As for BS-seq data, clean data of each accession were mapped to the TM-1 reference genome using Bismark (v 0.13.0) with parameter settings "-N 1, -L 30". Bigwig files of all epigenome data analysis can be visited by JBrowser in the Tools portal.

5 Population genetic analysis

SNPs and InDels were filtered based on LD using PLINK (v 1.90b4.4) with the parameters "--indep-pairwise 100 50 0.8". Variations passing filtering were used for the downstream analysis. Phylogenetic tree of 4180 cotton accessions in 4K-CP was constructed using FastTreeMP (v 2.1) with the default parameters. Population structure of all accessions was analyzed using fastStructure with K from 2 to 10. Principal component analysis (PCA) was performed using GCTA (v 1.92.4 beta2).

For each subpopulation, we calculated the level of genetic diversity (π) and Tajima's D statistic in each 100-kb interval across the cotton genome by VCFtools. We calculated the level of population differentiation between cultivated populations and landraces, wild varieties and island cotton populations by using FST with 100-kb windows sliding 20 kb by VCFtools. We also used the XP–CLR method to scan for domestication-sweep regions (--maxsnps 600 --size 50000 --step 10000).

6 Genome-wide association study (GWAS)

The SNPs and InDels with a minor allele frequency (MAF) lower than 5% were filtered for genome-wide association study (GWAS). GWAS was performed for six traits using the GEMMA (v 0.98.1). The population structure was controlled by including the first two principal components as covariates, as well as an IBS kinship matrix derived from all variants (SNPs and InDels) calculated by GEMMA. The cutoff for determining significant associations was set as -log10(1/n), where n represents the total number of variations.

7 Expression quantitative trait loci (eQTL) mapping

The gene expression values were taken as the values of the phenotype for eQTL mappinG. Only those genes expressed in more than 95% of the accessions were defined as expressed genes for eQTL mappinG. Variations with MAF > 5% were used to perform GWAS for each gene by using GEMMA to detect the associations for variations and genes. The cutoff for determining significant associations was set as -log10(1/n), where n represents the total number of variations. Then, eQTL mapping was performed as previously described. Based on the distance between eQTL and targeted-genes, we subdivided all eQTL into cis-eQTL if the variation was found within 1 Mb of the transcription start site or transcription end site of the target gene, otherwise as trans-eQTL. In CottonMD, the regulatory relationship of trans-eQTL was visualized using BioCircos.js.

8 Transcriptome-wide association studies (TWAS)

TWAS was used to integrate GWAS and gene expression datasets to identify gene-trait associations. TWAS was conducted by the EMMAX module using the gene expression data of fiber at 20 DAP with the data of six phenotypes from cis-eQTL in the region of 1 Mb upstream to 1 Mb downstream of target genes to compute the gene expression weights. Models were considered as 'transcriptome-wide significant' if they passed the Bonferroni correction for all genes.

9 Colocalization and SMR analysis

Colocalization of GWAS and eQTL results was performed to generate posterior causal probabilities for each of the variants in the GWAS and eQTL analyses. All variations within 1 Mb flanking region around the gene were tested for colocalization using the "COLOC" R package with default parameters.

SMR analysis integrated the summary-level data from GWAS with eQTL data to identify genes associated with a complex trait because of pleiotropy. SMR analysis in CottonMD was performed using the SMR software (v 1.03).