BnVIR: Brassica napus variation information resource

1. Tutorials

2. Methods

Collection of re-sequencing data from 2,311 B. napus accessions

The B. napus population used in this study comprises 2,311 B. napus accessions, including 1,259 from Asia, 929 from Europe, 60 from North America, two from South America, 38 from Oceania and four from Africa (Table 1) (Lu et al., 2019; Song et al., 2020; Tang et al., 2021; Wu et al., 2019). Three ecotypes, including spring (354 accessions), winter (756 accessions), and semi-winter (1,122 accessions) were included in the population.

Table 1. Detailed information of the 2,311 B. napus accessions

Read mapping and variation detection

The re-sequencing reads with low-quality were firstly filtered out by Trimmomatic (v0.36) (Bolger et al., 2014). Next, the remaining reads were mapped to ZS11 and Darmor-bzh v4 (Chalhoub et al., 2014) reference genome using BWA-MEM (v0.7.17) (Li and Durbin 2009) with default parameters, respectively. Reads with MAPQ lower than 10 were removed by SAMtools (v1.9) (Li et al., 2009), and reads of PCR duplication were removed by Picard (v2.17.6, http://broadinstitute.github.io/picard/). Then, SNP and InDel calling was performed with Sentieon (v201704.01). To obtain high-quality variations, raw SNPs and InDels were filtered using the VariantFiltration module of GATK (v 3.6-0) (McKenna et al., 2010) with the parameter "QUAL < 30.0 || MQ < 50.0 || QD < 2". The SNPs and InDels with a minor allele frequency (MAF) lower than 0.01 and a missing ratio higher than 30% were filtered out by VCFtools (v0.1.16) (Danecek et al., 2011), and the SNPs and InDels with heterozygosity lower than 0.5 in the population were also removed. After that, genotype imputation for the SNP set was performed using beagle (v5.1) (Browning and Browning 2007). SV identification was also conducted in the population. Finally, 8,709,516 SNPs, 1,287,483 InDels, and 93,562 SVs were obtained based on ZS11 reference. The SVs included 61,002 insertions, 31,552 deletions, 647 inversions and 361 duplications. A total of 5,781,762 SNPs and 906,714 InDels were obtained based on Darmor-bzh reference.

Haplotype analysis

Gene haplotype analysis was conducted by mainly referring to the method of Yano et al (Yano et al., 2016). Firstly, gene haplotype identification based on polymorphisms localized on the gene regions was performed by PHP codes written in house. Then, frequencies of haplotypes were estimated using PLINK (v1.90b4.4) (Purcell et al., 2007) with the parameter "--hap-freq". Haplotypes with a frequency lower than 0.01 were filtered.

Phylogenetic analysis and population structure analyses

SNPs with strong linkage disequilibrium (LD) were discarded with PLINK (v1.90b4.4) (Purcell et al., 2007) with parameter "--indep-pairwise 100 50 0.8" for phylogenetic tree construction using TreeBeST (v1.9.2) (Vilella et al., 2009) with the "treebest nj -b 1000" parameter. The population structure analysis of all accessions was performed using fastStructure (v1.0) (Raj et al., 2014) with the number of ancestry kinships (K) from 2 to 10, and K = 3 was used. Principal component analysis (PCA) was performed using GCTA (v1.92.4 beta2) (Yang et al., 2011). Linkage disequilibrium (LD) between pairs of SNPs on each chromosome was estimated as the correlation coefficient (r²) in PLINK (v1.90b4.4) (Purcell et al., 2007) and visualized using LDheatmap R package (Shin et al., 2006).

Genome-wide analysis of selection signals

To explore the candidate regions potentially effected by selection, nucleotide diversity (π) and population fixation statistics (). π was calculated with expected heterozygosity per site derived from the average number of sequence differences in three subpopulations (winter, spring and semi-winter) with VCFtools (v0.1.13) (Danecek et al., 2011) in a 500-kb sliding window with a step size of 50 kb. , which indicates genomic differentiation between populations, was calculated for each pair of subpopulations using VCFtools (v0.1.13) (Danecek et al., 2011) in a same windows size as that of π. The outliers of were determined as windows with the top 5% of values, indicating divergent selection of the regions. To explore the selection pressure of genes in B. napus genome during domestication, K_a, K_s and K_a/K_s ratio were calculated for orthologous gene pairs between B. napus and the ancestors, B. rapa and B. oleracea, with K_a/K_s_Calculator (v 2.0) (Wang et al., 2010). Selection signals were calculated based on ZS11 and Darmor-bzh reference genome, respectively.

Transcription data analysis of B. napus population

RNA-seq data of seeds from 309 B. napus accessions at 20 and 40 days after flowering (DAF) were obtained from a previous study (Tang et al., 2021). For RNA-seq analysis, adaptors and low-quality read pairs with read quality < 80 were removed using Trimmomatic (v0.38) (Bolger et al., 2014). Then, the filtered reads were mapped to the ZS11 and Darmor-bzh v4 (Chalhoub et al., 2014) reference genome using Hisat2 (v2-2.1.0) (Kim et al., 2015) with default parameters, respectively. Finally, FPKM of genes were calculated with StringTie (v1.3.3b) (Pertea et al., 2015). Moreover, RNA-seq data of multiple tissues, including root, flower, bud, leaf, silique, silique wall, were retrieved from BnTIR (Liu et al., 2021). For each tissue, samples were merged and then about 25 Gb of reads were randomly extracted for visualization in JBrowse.

Phenotype data of B. napus population

Eighteen traits of 1,703 B. napus accessions from multiple planting areas and years were collected from published studies (Liu et al., 2016; Song et al., 2020; Tang et al., 2021; Wu et al., 2019; Zhao et al., 2019) (Table 2). The best linear unbiased prediction (BLUP) values of all planting areas and years were used as final phenotypic values for further analysis.

Table 2. Summary of phenotype data

Trait	Source
Seed oil content (SOC)	(Tang et al., 2021)
Palmitic acid (C16:0)	(Zhao et al., 2019)
Stearic acid (C18:0)	(Zhao et al., 2019)
Oleic acid (C18:1)	(Zhao et al., 2019)
Linoleic acid (C18:2)	(Zhao et al., 2019)
Linolenic acid (C18:3)	(Zhao et al., 2019)
Eicosenoic acid (C20:1)	(Zhao et al., 2019)
Erucic acid (C22:1)	(Zhao et al., 2019)
Seed glucosinolate content (SGC)	(Zhao et al., 2019)
Plant height	(Liu et al., 2016)
Main inflorescence length	(Liu et al., 2016)
Main inflorescence silique number	(Liu et al., 2016)
Silique length	(Liu et al., 2016)
Thousand seed weight	(Liu et al., 2016)
Seed number per silique	(Liu et al., 2016)
Branch height	(Liu et al., 2016)
Branch number	(Liu et al., 2016)
Flowering time	(Wu et al., 2019; Song et al., 2020)

Genome-wide association study

The BLUP results of Eighteen traits and the SNPs and InDels with a minor allele frequency (MAF) higher than 0.05 were used for genome-wide association study (GWAS). GWAS was performed for the Eighteen traits with GEMMA (Zhou and Stephens 2012) (v0.98.1). The population structure was controlled by the first two principal components as covariates, as well as an IBS kinship matrix derived from SNPs and InDels calculated by GEMMA (Zhou and Stephens 2012) (v0.98.1). The cutoff for determining signiﬁcant associations was set as -lg(1/n), where n represents the total number of variations.

System architecture and software for database construction

The BnVIR was built with the software of MySQL (https://www.mysql.com), PHP 7.4 (http://www.php.net), ThinkPHP 5.0 (http://www.thinkphp.cn/) and Apache 2.4 (http://www.apache.org), and all procedures were run on a Linux Ubuntu 7.5 (https://ubuntu.com/) operation system. To construct a user-friendly web interface, JQuery (http://jquery.com) and Bootstrap (https://getbootstrap.com) were used in all database pages. A variety of graphs, such as geomap, polygenetic tree, gene structure, heatmap and violin plot, were drawn by multiple plugins (Table 3). We used the Jbrowser (http://www.jbrowse.org) (Buels et al., 2016) as a genome browser for the visualization of gene model, expression and variation in B. napus genome.

Table 3. Tools used in the database

Graph	Module	Tool
Point plot	Sample, Variation	Plotly.js (v1.2.0)
Geomap	Sample, Variation, Tools	amcharts (v4.10.19)
Ploygenetic tree	Sample, Variation	phylotree.js (v1.0.0), D3.js (v4.13.0)
Histograph	Sample, Evolution	Echarts.js (v4.3.0)
Gene structure	Variation	D3.js (v4.13.0), Highchart.js (v9.0.1)
Pie plot, Box plot, Bar plot, Violin plot	Variation	Plotly.js (v1.2.0)
Line plot	Evolution	Echarts.js (v4.3.0)
LD heatmap	Variation, Tools	LDheatmap (v0.99), PHP (v7.4)

Public data used in database

Gene index of B. napus genomes was obtained from BnTIR (http://yanglab.hzau.edu.cn/BnTIR) (Liu et al., 2021) for convenience in variation search. Reference genome, gene annotation, TE annotation of ZS11 (Song et al., 2020) and Darmor-bzh (Chalhoub et al., 2014) were used for visualization of genetic variations genome in JBrowse.

Statistic analysis in database

Differences in phenotypes and gene expression of accessions with different genotypes were determined by Wilcoxon test and Student's t test in Variation module.

3. References

[1] Bolger, A.M., Lohse, M., and Usadel, B. (2014). Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114-2120.
[2] Browning, S.R., and Browning, B.L. (2007). Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 1084-1097.
[3] Buels, R., Yao, E., Diesh, C.M., Hayes, R.D., Munoz-Torres, M., Helt, G., Goodstein, D.M., Elsik, C.G., Lewis, S.E., Stein, L., et al. (2016). JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol. 17: 66.
[4] Chalhoub, B., Denoeud, F., Liu, S., Parkin, I.A., Tang, H., Wang, X., Chiquet, J., Belcram, H., Tong, C., Samans, B., et al. (2014). Early allopolyploid evolution in the post-Neolithic Brassica napus oilseed genome.Science 345: 950-953.
[5] Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011). The variant call format and VCFtools. Bioinformatics 27:2156-2158.
[6] Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12: 357-360.
[7] Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform.Bioinformatics 25: 1754-1760.
[8] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., and Genome Project Data Processing, S. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25: 2078-9.
[9] Liu, D., Yu, L., Wei, L., Yu, P., Wang, J., Zhao, H., Zhang, Y., Zhang, S., Yang, Z., Chen, G., et al. (2021). BnTIR: an online transcriptome platform for exploring RNA-seq libraries for oil crop Brassica napus. Plant Biotechnol. J. 19: 1895-1897.
[10] Liu, S., Fan, C., Li, J., Cai, G., Yang, Q., Wu, J., Yi, X., Zhang, C., and Zhou, Y. (2016). A genome-wide association study reveals novel elite allelic variations in seed oil content of Brassica napus. Theor. Appl. Genet. 129: 1203-1215.
[11] Lu, K., Wei, L., Li, X., Wang, Y., Wu, J., Liu, M., Zhang, C., Chen, Z., Xiao, Z., Jian, H., et al. (2019). Whole-genome resequencing reveals Brassica napus origin and genetic loci involved in its improvement. Nat. Commun. 10: 1154.
[12] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., et al. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20: 1297-1303.
[13] Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T., and Salzberg, S.L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33: 290-295.
[14] Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A., Bender, D., Maller, J., Sklar, P., de Bakker, P.I., Daly, M.J., et al. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81: 559-575.
[15] Raj, A., Stephens, M., and Pritchard, J.K. (2014). fastSTRUCTURE: variational inference of population structure in large SNP data sets. Genetics 197: 573-589.
[16] Song, J.M., Guan, Z., Hu, J., Guo, C., Yang, Z., Wang, S., Liu, D., Wang, B., Lu, S., Zhou, R., et al. (2020). Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica napus. Nat. Plants 6: 34-45.
[17] Song, J.M., Liu, D.X., Xie, W.Z., Yang, Z., Guo, L., Liu, K., Yang, Q.Y., and Chen, L.L. (2021). BnPIR: Brassica napus pan-genome information resource for 1689 accessions. Plant Biotechnol. J. 19: 412-414.
[18] Tang, S., Zhao, H., Lu, S., Yu, L., Zhang, G., Zhang, Y., Yang, Q.Y., Zhou, Y., Wang, X., Ma, W., et al. (2021). Genome- and transcriptome-wide association studies provide insights into the genetic basis of natural variation of seed oil content in Brassica napus. Mol. Plant 14: 470-487.
[19] Vilella, A.J., Severin, J., Ureta-Vidal, A., Heng, L., Durbin, R., and Birney, E. (2009). EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 19: 327-335.
[20] Wang, D., Zhang, Y., Zhang, Z., Zhu, J., and Yu, J. (2010). KaKs_Calculator 2.0: a toolkit incorporating gamma-series methods and sliding window strategies. Genomics Proteomics Bioinformatics 8: 77-80.
[21] Wu, D., Liang, Z., Yan, T., Xu, Y., Xuan, L., Tang, J., Zhou, G., Lohwasser, U., Hua, S., Wang, H., et al. (2019). Whole-genome resequencing of a worldwide collection of rapeseed accessions reveals the genetic basis of ecotype divergence. Mol. Plant 12: 30-43.
[22] Yang, J., Lee, S.H., Goddard, M.E., and Visscher, P.M. (2011). GCTA: A tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88: 76-82.
[23] Zhao, Q., Wu, J., Cai, G., Yang, Q., Shahid, M., Fan, C., Zhang, C., and Zhou, Y. (2019). A novel quantitative trait locus on chromosome A9 controlling oleic acid content in Brassica napus. Plant Biotechnol. J. 17: 2313-2324.
[24] Zhou, X., and Stephens, M. (2012). Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44: 821-824.
[25] Yano, K., Yamamoto, E., Aya, K., Takeuchi, H., Lo, P.C., Hu, L., Yamasaki, M., Yoshida, S., Kitano, H., Hirano, K., et al. (2016) Genome-wide association study using whole-genome sequencing rapidly identifies new genes influencing agronomic traits in rice. Nat. Genet. 48: 927-934.

eg: AT5G10140

or FLC