Xintian Zhu, Willmar L. Leiser, Volker Hahn, Tobias Würschum
State Plant Breeding Institute, University of Hohenheim, 70593 Stuttgart, Germany
Keywords: QTL mapping; Soybean; Protein content; Oil content
ABSTRACT Soybean [Glycine max (L.) Merr.] is a global protein source and is currently expanding in Central and Northern Europe. Protein and oil content are two important quality traits that have been studied in different germplasm; however, their genetic architecture in early-maturing European soybean has not yet been investigated. In this study, we therefore performed QTL mapping for both traits using 944 recombinant inbred lines derived from eight families from a half-diallel crossing design. We identified five QTL for each trait, with the QTL on chromosomes 8, 15, and 20 being identified for both protein content and oil content. The known major QTL on chromosome 20 was detected in four families, whereas the other QTL were only found in single families. Further analyses revealed the QTL to have pleiotropic but inverse effects on both traits. The effect of the major QTL was comparable between families, illustrating that it is largely independent of the genetic background. Collectively, our results illustrate the quantitative nature of protein and oil content in early European soybean. Marker-assisted selection for the QTL is possible, but the inverse effect on protein and oil content should be kept in mind.
Soybean [Glycine max (L.) Merr.] is one of the most important crops worldwide, being used for human consumption and animal feed [1]. In 2017, the total global soybean acreage was 123.6 Mha with a production of 352.6 million tons, while in Europe soybean was only grown on 5.7 Mha, resulting in the production of 10.7 million tons [2]. Soybean seeds contain approximately 40% protein and 20% oil, and both traits are known to be negatively correlated [3-5]. In Europe, the focus is more on protein content, as Europe is highly dependent on protein imports. A reduction of this dependency on soybean imports can be achieved by increasing the acreage of leguminous crops [6]. Soybean is particularly well suited for this, and the breeding of adapted, early-maturing cultivars has led to a substantial increase in soybean acreage in Central and Northern Europe in recent years [7,8]. In addition to its importance in protein production for animal feed, protein content is also of high relevance for soybean used for human consumption [9]. In particular, the traditional Asian soy food tofu (soybean curd) is becoming increasingly popular with European consumers due to a trend towards a vegetarian and vegan lifestyle [10-12]. Thus, protein content remains an important breeding goal in European soybean breeding.
The genetic architecture underlying protein and oil content has been studied by QTL and association mapping in different germplasm [5,13-19]. To date, SoyBase lists 248 and 327 QTL related to protein and oil content in soybean, respectively (https://www.soybase.org/) [20]. Some QTL for protein and oil content were detected at the same position, suggesting either closely linked QTL or QTL with a pleiotropic effect on both traits. Two QTL, located on chromosomes 15 and 20, were identified in several studies, and the QTL on chromosome 20 was considered a major QTL with the highest proportion of explained phenotypic variance [5,21,22]. By contrast, nothing is known yet about the genetic architecture of protein and oil content in early-maturing European soybean, representing maturity groups I to 000. A better understanding of the genetic control and identification of QTL underlying the traits, however, is an important first step towards marker-assisted breeding of these two important quality traits in early soybean material.
The aim of this study was to bridge this gap by performing QTL mapping with 944 recombinant inbred lines from eight families derived from a half-diallel cross of five representative early European soybean cultivars. In particular, our objectives were to (1) perform single and multi-family QTL mapping for protein and oil content, (2) make use of the connected design to evaluate the stability of the QTL effects across genetic backgrounds, (3) compare QTL localization for both traits, and (4) draw conclusions for soybean breeding of the two quality traits.
This study was based on 944 F5:8 recombinant inbred lines (RILs) derived from eight families produced by a half-diallel mating design and developed by single-seed descent [9]. The five parental lines, 'Gallec' (P1), 'Primus' (P2), 'Protina' (P3), 'Sultana' (P4), and 'Sigalia' (P5), are representative varieties for Central Europe with good agronomic performance. The eight families, 'P1 × P2', 'P1 × P3', 'P1 × P5', 'P2 × P5', 'P2 × P3', 'P2 × P4', 'P3 × P5', and 'P4 × P5', include 104, 117, 234, 80, 117, 106, 94, and 92 individuals, respectively.
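As a small illustration of the crossing design, the following sketch enumerates a complete half-diallel among five parents and compares it with the eight families and family sizes listed above. The variable names are illustrative and not taken from the study.

```python
from itertools import combinations

parents = ["P1", "P2", "P3", "P4", "P5"]

# A complete half-diallel contains each pairwise cross once
# (no reciprocals, no selfings).
all_crosses = list(combinations(parents, 2))

# Family sizes as listed in the text.
family_sizes = {
    ("P1", "P2"): 104,
    ("P1", "P3"): 117,
    ("P1", "P5"): 234,
    ("P2", "P5"): 80,
    ("P2", "P3"): 117,
    ("P2", "P4"): 106,
    ("P3", "P5"): 94,
    ("P4", "P5"): 92,
}

# Crosses of the complete design that are not among the eight families.
missing_crosses = [c for c in all_crosses if c not in family_sizes]
total_rils = sum(family_sizes.values())

print(len(all_crosses), "possible crosses;", len(family_sizes), "families used")
print("crosses not represented:", missing_crosses)
print("total RILs:", total_rils)
```

The family sizes add up to the 944 RILs used for mapping, and two of the ten possible crosses are not represented among the eight families.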
The seed protein and oil content data have been described previously [9]. In brief, in 2014 a panel of 1008 F5:8 RILs was grown at three locations in Germany. The panel was divided into three trials according to the maturity of each genotype in the previous year, classified relative to each other as early (Trial 1), mid-early (Trial 2), and late (Trial 3). The trials had overlapping genotypes, with 38 lines grown in Trials 1 and 2, and 66 lines grown in Trials 2 and 3. Each of the three trials at each location was grown in a p-rep design with 20% of the lines grown in replication. At each location different lines were replicated, so that across the three locations 60% of the lines were replicated. All genotypes were grown in yield plots with 4 rows and 9 m² (1.5 m × 6 m) and 65 seeds m⁻². Seed protein content and seed oil content were measured as percentages using NIRS (near-infrared spectroscopy) with a Polytec 2120 spectrometer (Polytec GmbH, Waldbronn, Germany).
Best linear unbiased estimators (BLUEs) were estimated across locations, representing entry means adjusted for location and experimental design effects (replication and incomplete blocks). These BLUEs were used for QTL mapping in this study (Fig. S1). The heritability for protein and oil content was 0.85 and 0.87, respectively.
DNA samples of the soybean lines were sent to Cornell University, where genotyping-by-sequencing (GBS) was performed as described by Elshire et al. [23]. The raw sequence data in FASTQ format were processed with the Tassel 5 GBS v2 Pipeline to call SNPs [24]. Default parameters were used except in SAMtoGBSdbPlugin, where the minimum length was set to 40 (default 0), the minimum proportion to 0.9 (default 0), and the minimum mapping quality to 20 (default 0). Sequence tags were aligned to the reference genome Gmax Wm82.a2.v1 [25] with the Burrows-Wheeler Aligner [26]. After obtaining the genotype file, subsetting was done with Tassel (taxa filter: max. missing 0.8, max. heterozygosity 0.08; site filter: remove minor SNP states, minor allele frequency 0.02, max. missing 0.7, max. heterozygosity 0.04). High-quality genotypic data could be generated for 944 of the 1008 phenotyped RILs, and these were subsequently used for QTL mapping. After subsetting, genotypes were imputed with LinkImpute and converted to an ABH format with Tassel GenosToABHPlugin. For further error correction of the genotypic data, the R package ABHgenotypeR [27] was used, setting the minimum haplotype length to 5.
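The taxa and site filters described above can be sketched in a few lines of Python. This is a minimal illustration only, not the Tassel implementation: it assumes a genotype matrix coded as 0/1/2 (homozygous reference, heterozygous, homozygous alternative) with `None` for missing calls, it interprets each threshold as an upper or lower bound on the corresponding rate, and it does not model the "remove minor SNP states" step. All function and parameter names are hypothetical.

```python
def missing_rate(calls):
    """Fraction of missing (None) calls."""
    return sum(c is None for c in calls) / len(calls)


def het_rate(calls):
    """Fraction of heterozygous calls (coded 1) among non-missing calls."""
    called = [c for c in calls if c is not None]
    return sum(c == 1 for c in called) / len(called) if called else 0.0


def minor_allele_freq(calls):
    """Minor allele frequency computed from 0/1/2 codes, ignoring missing calls."""
    called = [c for c in calls if c is not None]
    if not called:
        return 0.0
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_genotypes(matrix, taxa_max_missing=0.8, taxa_max_het=0.08,
                     site_min_maf=0.02, site_max_missing=0.7,
                     site_max_het=0.04):
    """Return indices of retained lines (rows) and sites (columns).

    Threshold defaults mirror the values quoted in the text; how Tassel
    applies them internally is an assumption of this sketch.
    """
    kept_taxa = [
        i for i, row in enumerate(matrix)
        if missing_rate(row) <= taxa_max_missing and het_rate(row) <= taxa_max_het
    ]
    kept_sites = []
    if kept_taxa:
        for j in range(len(matrix[0])):
            column = [matrix[i][j] for i in kept_taxa]
            if (missing_rate(column) <= site_max_missing
                    and het_rate(column) <= site_max_het
                    and minor_allele_freq(column) >= site_min_maf):
                kept_sites.append(j)
    return kept_taxa, kept_sites


# Toy matrix (made-up values): rows are lines, columns are sites.
toy = [
    [0, 2, 0, None],
    [2, 0, 0, None],
    [0, 2, 0, 2],
    [1, 1, 1, 1],  # a highly heterozygous line
]
kept_taxa, kept_sites = filter_genotypes(toy)
print(kept_taxa, kept_sites)
```

In this toy example the fully heterozygous line is dropped by the taxa filter, and the monomorphic sites are dropped by the minor allele frequency filter.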
Linkage maps were constructed in R 3.5.1 [28] with the packages R/qtl [29] and ASMap [30], following a series of steps. First, all heterozygous sites were scored as missing. Next, the data were checked for genetic clones, duplicated markers, switched alleles, segregation distortion, and co-located markers. The physical position of the SNP markers in the genome was used to assign each obtained linkage group to the correct one of the 20 soybean chromosomes. After map construction using Kosambi's mapping function, genotypes and markers with higher numbers of double cross-overs were removed. Then, the co-located markers were again integrated into the map. Finally, single markers (or groups of up to five markers) that resulted in a large change in the estimated genetic map length were dropped.
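For readers unfamiliar with Kosambi's mapping function, the sketch below shows the standard conversion between recombination fraction and map distance in centimorgans. It illustrates the formula only and is not code from R/qtl or ASMap; the function names are hypothetical.

```python
import math


def kosambi_cm(r):
    """Map distance in cM for recombination fraction r (0 <= r < 0.5).

    Kosambi: d = 1/4 * ln((1 + 2r) / (1 - 2r)) in Morgans, i.e. 25 * ln(...) in cM.
    """
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


def kosambi_r(d_cm):
    """Inverse function: recombination fraction for a map distance in cM."""
    return 0.5 * math.tanh(2 * d_cm / 100.0)


print(round(kosambi_cm(0.10), 2))  # distance in cM for 10% recombination
print(round(kosambi_r(10.0), 4))   # recombination fraction for 10 cM
```

For small recombination fractions the map distance in cM is close to 100 × r, whereas for larger values the function accounts for crossover interference.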
A consensus map was constructed with the R package Mapfuser [31], which uses the LPmerge algorithm [32]. It consisted of 20 linkage groups (LG) with a cumulative distance of 3202.68 cM, and the length of each linkage group varied from 103.01 to 211.58 cM (Table S1). The position of markers in each single-family map and in the consensus map showed a high consistency (Fig. S2). A total of 10,893 markers were distributed across 3605 unique positions in the consensus map. The number of markers in each linkage group ranged from 221 to 1003, with an average distance of 0.89 cM between markers across all linkage groups.
Quantitative trait locus (QTL) mapping was performed by composite interval mapping (CIM) with an additive genetic model as implemented in PlabMQTL [33]. Cofactors were selected based on the modified Bayesian information criterion [34]. Scanning for putative QTL was carried out at regular intervals spaced 1 cM apart. The empirical LOD threshold corresponding to a genome-wide error rate of α ≤ 0.1 was determined by 2000 random permutations for each trait. The support interval was calculated as a LOD fall-off of 1.0, and in addition the 95% confidence interval was determined [35]. The QTL frequency distributions are derived from 200 fivefold cross-validation runs and show the frequency of QTL detection within the 1-LOD support interval. The proportion of explained genotypic variance ($p_G$) was estimated as $p_G = R^2_{adj} / h^2$, where $R^2_{adj}$ is the adjusted explained phenotypic variance and $h^2$ is the heritability of the trait.
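The logic of a permutation-based LOD threshold and of the explained genotypic variance can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain single-marker scan with two homozygous genotype classes rather than the composite interval mapping implemented in PlabMQTL, and all function names and the toy data are hypothetical.

```python
import math
import random


def lod_single_marker(genotypes, phenotypes):
    """LOD of a marker model versus the null model: (n/2) * log10(RSS0 / RSS1)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    rss_marker = 0.0
    for g in set(genotypes):
        ys = [y for x, y in zip(genotypes, phenotypes) if x == g]
        class_mean = sum(ys) / len(ys)
        rss_marker += sum((y - class_mean) ** 2 for y in ys)
    if rss_marker == 0:
        return 0.0 if rss_null == 0 else float("inf")
    return (n / 2) * math.log10(rss_null / rss_marker)


def permutation_threshold(markers, phenotypes, alpha=0.1, n_perm=2000, seed=1):
    """Empirical genome-wide threshold: the (1 - alpha) quantile of the
    maximum LOD obtained after shuffling phenotypes across lines."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    max_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        max_lods.append(max(lod_single_marker(m, shuffled) for m in markers))
    max_lods.sort()
    index = min(len(max_lods) - 1, math.ceil((1 - alpha) * len(max_lods)) - 1)
    return max_lods[index]


def explained_genotypic_variance(r2_adj, h2):
    """p_G = adjusted explained phenotypic variance divided by heritability."""
    return r2_adj / h2


# Toy data (made-up values): two markers scored on six lines.
toy_markers = [[0, 0, 1, 1, 0, 1], [0, 1, 0, 1, 0, 1]]
toy_phenotypes = [1.0, 2.0, 5.0, 6.0, 3.0, 4.5]
threshold = permutation_threshold(toy_markers, toy_phenotypes, alpha=0.1, n_perm=50)
print(threshold)
print(explained_genotypic_variance(0.5, 0.85))
```

Because the maximum LOD across all markers is recorded in each permutation, the resulting threshold controls the genome-wide rather than the per-marker error rate.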
Multi-family linkage mapping was performed with the R package mppR [36], with an additive connected model (parental model) that estimates one allele effect for each parental line and assumes that the QTL effect of a parental line is constant in all crosses [37]. The variance-covariance structure (VCOV) of the data was modeled with a homogeneous residual term (HRT) fitted by REML using ASReml-R [38], which considers residual terms to be independent and to have the same distribution. The variance of the polygenic term and the error variance of the model are merged in the unique residual variance term. The empirical LOD threshold corresponding to a genome-wide error rate of α ≤ 0.05 was derived from 1000 permutations for each trait, and the default VCOV HRT linear model was used instead of HRT fitted by REML to reduce computational time. QTL detection was done in the following steps: first, simple interval mapping was performed to select cofactors, followed by composite interval mapping. The confidence interval was computed as a -log10(P-value) drop of 1.0 from the CIM profile. For cross-validation, the LOD threshold was determined by the same strategy, and 1000 fivefold cross-validation runs were performed to obtain the QTL frequencies.
For genome-wide association mapping, we chose an additive genetic model, and mapping was done with three different models using the R package GenABEL [39]: a model including just a kinship matrix, a model with a kinship matrix plus the first and second principal coordinates, and a model with a kinship matrix plus a fixed effect defined by family. We found that the results of the three different models were highly correlated, with correlation coefficients of 0.98 or 0.99 between each other for both traits. The subsequent analyses were therefore only performed for the model incorporating a kinship matrix and a fixed effect for the family. A Bonferroni-corrected threshold of P < 0.05 was used to control for multiple testing. The proportion of explained genotypic variance (pG) was estimated by fitting significantly associated markers singly in a linear model (pG-single) or by fitting all significantly associated markers in a linear model in the order of the strength of their association. The pG values were derived from the sums of squares in these models. The allele substitution effects were derived from the regression coefficient in a linear model with only the marker under consideration.
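A Bonferroni correction simply divides the desired experiment-wide error rate by the number of tests. The sketch below illustrates this under the assumption that the number of tests equals the 10,893 markers of the consensus map; the number of markers actually tested in the association scan is not stated in the text, so the resulting values are illustrative only.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff that keeps the experiment-wide error rate at alpha."""
    return alpha / n_tests


n_markers = 10_893  # assumption: one test per marker of the consensus map
p_cutoff = bonferroni_threshold(0.05, n_markers)
minus_log10_cutoff = -math.log10(p_cutoff)

print(p_cutoff, minus_log10_cutoff)
```

Under this assumption, markers would need a -log10(P-value) slightly above 5.3 to be declared significant.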
In this study, we performed QTL mapping for protein and oil content in eight families produced by crossing five parents in a half-diallel mating design, resulting in a total of 944 F5:8 recombinant inbred lines (Fig. S1). QTL mapping within each family identified five QTL each for protein content and oil content (Table 1, Fig. 1), with a single QTL for each of the two quality traits identified in each family. All QTL were only detected in one of the eight families, except for the QTL qPC5 and qOC5 on chromosome 20, which were identified in four of the families (P1 × P2, P1 × P3, P2 × P4, P3 × P5). The QTL on chromosomes 8, 15, and 20 were identified as QTL for both protein and oil content. For protein content the proportion of genotypic variance explained by a QTL ranged from 15.64% to 63.55%, and for oil content from 17.17% to 49.82%. The major QTL for both traits was the QTL on chromosome 20, followed by the QTL on chromosome 15. qPC5 and qOC5 on chromosome 20 explained a comparably high proportion of genotypic variance in three of the four families, but approximately half of that in the fourth family, P3 × P5. The additive genetic effect of this QTL was comparable between all four families, being approximately 1% protein content and 0.4% oil content. Importantly, however, the sign was reversed, such that the allele that increases protein content decreases oil content and vice versa. The two major QTL on chromosomes 20 and 15 were detected with a high frequency in the cross-validation runs (Table 1, Figs. S3 and S4).


Fig. 1 - Genome-wide results from QTL mapping of protein content and oil content in eight families. Dashed vertical lines indicate the positions of the identified QTL.
In addition to single-family mapping, we performed multi-family mapping, i.e. QTL mapping across all eight families. This identified nine QTL for protein content, of which the three located on chromosomes 2, 15, and 20 were identical to qPC1, qPC3, and qPC5, respectively (Table S2, Fig. S5). The global proportion of explained phenotypic variance (R²) was 45.45%, and the QTL on chromosome 20 was estimated to have the highest R² with 23.93%, followed by the QTL on chromosome 15 with 5.60%. For oil content, thirteen QTL were identified that explained 47.37% of the global phenotypic variance. Among them, three QTL on chromosomes 6, 15, and 20 were identical to qOC1, qOC4, and qOC5, respectively, and the major QTL was also the one on chromosome 20 with an R² of 18.50%, followed by the QTL on chromosome 15 with 6.66%. We again observed co-localization of QTL for both traits, with QTL on chromosomes 2, 7, 9, 15, and 20 being associated with protein and oil content, the latter two being identical to the QTL pairs from the single-family mapping. As a complementary approach to the multi-family mapping, we performed genome-wide association mapping, which identified the same two major QTL on chromosomes 15 and 20 (Tables S3 and S4, Fig. S5).
For fine-mapping of the QTL identified by single-family mapping, we plotted the results from single marker regression against the physical positions of the markers on the reference genome. qPC5 and qOC5 on chromosome 20 had a comparable position in all four families, being located between 28 and 33 Mb with the peak at around 32 Mb (Table S5, Fig. 2). Regarding the other two co-located QTL, qPC2 and qOC2 were located in the same genomic region between 46 and 48 Mb on chromosome 8, and qPC3 and qOC4 on chromosome 15 between 3 and 5 Mb (Table S5, Fig. S6).
Our results identified the QTL on chromosome 20 as the major QTL for protein and oil content, and we therefore chose for each family the polymorphic markers closest to the QTL position to investigate its effects in more detail. qPC5 and qOC5 were identified in four families (P1 × P2, P1 × P3, P2 × P4, and P3 × P5), which, in line with the estimated QTL effect, showed an average difference between the two genotypic classes of around 2% for protein content and 0.8% for oil content (Fig. 3A). A significant but smaller difference was also observed in family P2 × P5, whereas the QTL did not segregate in the remaining three families. For oil content the picture was reversed, as the genotypic class with high protein content had a low oil content and vice versa. We next asked if this opposite effect on the two quality traits was a characteristic of all identified QTL. Indeed, for all other QTL the allele that increased one trait decreased the other, except perhaps for qOC1, which has an effect on oil content but no or only a weak effect on protein content (Fig. 3B).
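The correspondence between the additive effect reported for this QTL and the class differences observed here follows from the usual definition of the additive effect in inbred lines. The relation below is a sketch under the assumption that only the two homozygous genotypic classes are compared; the symbols $\bar{y}_{+}$, $\bar{y}_{-}$, and $a$ are introduced here for illustration and differences are given as absolute values.

```latex
\[
\left| \bar{y}_{+} - \bar{y}_{-} \right| = 2\,|a| ,
\qquad
|a_{\mathrm{PC}}| \approx 1\% \;\Rightarrow\; \left| \bar{y}_{+} - \bar{y}_{-} \right| \approx 2\% ,
\qquad
|a_{\mathrm{OC}}| \approx 0.4\% \;\Rightarrow\; \left| \bar{y}_{+} - \bar{y}_{-} \right| \approx 0.8\% .
\]
```

Here $\bar{y}_{+}$ and $\bar{y}_{-}$ denote the mean trait values of the two homozygous classes and $a$ the additive effect, i.e. half the difference between them.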

Fig. 2 - Physical fine-mapping of the major QTL for protein and oil content on chromosome 20. Results are shown from single marker regression for protein content (left) and oil content (right) in four families, with the physical positions of the markers on the reference genome. Circles with black margin represent the marker with the highest LOD value.
Finally, we investigated the phenotypic effects of combining selected protein content QTL on protein and oil content across all eight families. For this approach, we considered the three QTL qPC1, qPC3/qOC4, and qPC5/qOC5, which were detected by single and multi-family mapping, and identified lines in the entire panel of recombinant inbred lines with the eight different combinations of alleles at these QTL (Fig. 4). Plants carrying the allele for low protein content at all three QTL had, on average, the lowest protein content, which increased gradually and was highest in the lines carrying all three favorable alleles. For oil content this picture was reversed, as it was highest in the lines with all three alleles that decrease protein content and thus increase oil content, and then gradually decreased while protein content increased. The largest difference among the averages of all groups was 3.20% and 1.47% for protein content and oil content, respectively.
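The grouping underlying this stacking analysis can be sketched as follows: three biallelic QTL give 2³ = 8 allele combinations, and lines are grouped by their combination before averaging both traits. The function names and the toy values are hypothetical and serve only to illustrate the procedure; they are not data from this study.

```python
from itertools import product
from statistics import mean

QTL_NAMES = ("qPC1", "qPC3", "qPC5")


def allele_combinations(n_qtl=3):
    """All combinations of protein-decreasing (-) and protein-increasing (+) alleles."""
    return list(product("-+", repeat=n_qtl))


def group_means(lines):
    """Average protein and oil content per allele combination.

    `lines` is an iterable of (alleles, protein, oil), where `alleles`
    is a tuple such as ('+', '-', '+') in the order of QTL_NAMES.
    """
    groups = {}
    for alleles, protein, oil in lines:
        groups.setdefault(tuple(alleles), []).append((protein, oil))
    return {
        combo: (mean(p for p, _ in values), mean(o for _, o in values))
        for combo, values in groups.items()
    }


# Toy lines with made-up trait values (percent protein, percent oil).
toy_lines = [
    (("+", "+", "+"), 42.0, 18.0),
    (("+", "+", "+"), 44.0, 17.0),
    (("-", "-", "-"), 40.0, 19.5),
]
print(len(allele_combinations()), "allele combinations for", QTL_NAMES)
print(group_means(toy_lines))
```

Comparing the group means across combinations ordered by the number of protein-increasing alleles shows whether the effects accumulate additively.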
In Europe, soybean cultivation has just started to play an increasingly important role in agriculture and is used for animal feed as well as for human consumption. Irrespective of its use, protein content is an important breeding goal in European soybean breeding programs. Understanding the genetic architecture of the two correlated quality traits protein and oil content in early-maturing European soybean germplasm is therefore an important first step to evaluate the potential of marker-assisted selection. In this study, we therefore used a total of 944 RILs from eight families to perform QTL mapping for protein content and oil content.
In each family, only one protein content and one oil content QTL was identified, which always explained a moderate amount of the genotypic variation (Table 1). This shows that also in early-maturing European soybean the two traits must be regarded as quantitative traits, controlled by many loci, most of them with effects too small to be detected in QTL mapping. We complemented the mapping in single families by multi-family QTL mapping, as this has the potential to identify QTL that remain below the significance threshold in individual families but, if present in several families, can be detected with increased power [36]. In addition, we also performed genome-wide association mapping, which, however, appeared to have a lower QTL detection power in our panel of eight families. These methods provided further support for the QTL qPC1, qPC3, and qPC5 as well as qOC1, qOC4, and qOC5, but also identified some QTL not detected in the single families (Table S2). Owing to their highly quantitative nature, QTL for the two traits have been reported for all chromosomes of soybean. Notably, the major QTL qPC5/qOC5 on chromosome 20 was also reported as the major QTL in previous studies [5,14-16,18,19,21,40-42]. Likewise, qPC3/qOC4 on chromosome 15, which explained the second largest proportion of genotypic variance in this study, has also been reported as an important locus controlling the two traits [5,15,18,19,21,42-44]. Generally, the use of different marker systems and the size of mapping populations, which affects the QTL confidence interval, make the comparison of QTL across studies difficult. Thus, the QTL qPC1 [45,46], qPC2/qOC2 [16], qPC4 [40], qOC1 [19,47], and qOC3 [44] identified here may correspond to previously reported QTL or may be novel. Collectively, our results show that also in early-maturing European soybean the quality traits protein and oil content are complex traits, but with a genetic architecture that is at least in part shared with other soybean germplasm.

Fig. 3 - Allelic effects of identified QTL. (A) Boxplots showing the allelic effect of the major QTL on chromosome 20 on protein and oil content in all eight families. (B) Boxplots showing allelic effects of six QTL in the family in which they were detected on both protein and oil content. The PC and OC values shown are the average protein content and oil content, respectively. Segregating markers closest to the identified QTL position were used as proxy for the QTL. n indicates the number of individuals in each genotypic class. Asterisks indicate significant differences between two genotypic groups within a family at the 0.05, 0.01, and 0.001 probability levels, respectively; ns is not significant.
The QTL on chromosome 20 was identified by us and previous studies as a major locus for protein and oil content, and we therefore aimed at fine-mapping this QTL towards a future cloning of the underlying gene(s) and a more targeted utilization in soybean breeding. Across all four families in which the QTL was identified, we consistently narrowed down its position to the chromosomal region between 28 and 33 Mb on the reference genome, with the peak likely being at around 32 Mb (Fig. 2). This is the same physical region identified in other studies, which lends further support to this region harboring the causal polymorphism(s) [48]. Recent work has also suggested candidate genes for this locus [5,40,49], and Valliyodan et al. [50] reported copy number variation of three candidate genes in this region to be associated with protein content. Nevertheless, further work is required to validate the gene(s) underlying this QTL and to elaborate its/their molecular function in the regulation of protein and oil content in soybean.

Fig. 4 - Effect of QTL stacking. Boxplots showing protein and oil content of plants carrying either the protein content increasing allele (+) or the protein content decreasing allele (-) at the three QTL qPC1, qPC3, and qPC5. Protein content is shown in green and oil content in blue. n indicates the number of individuals within each group; the PC and OC values shown are the average protein content and oil content, respectively.
An advantage of this study being based on several families from a half-diallel crossing design is that it allows the QTL effects to be estimated in different genetic backgrounds. Regarding the major QTL qPC5/qOC5, we observed an effect in five families, while the other three did not segregate for the QTL. This is expected in a crossing design with connected crosses and, in combination with the estimated QTL effects, shows that the two high-protein parental lines 'Primus' and 'Protina' carry the allele for high protein content (Fig. S7). Consequently, the cross between these two parents (P2 × P3) is fixed for the high-protein allele, and both genotypic classes had a high average protein content (Fig. 3). Conversely, the crosses P1 × P5 and P4 × P5 also did not segregate for the QTL, but are fixed for the low-protein allele and, in line with the strong effect of this QTL, had a low average protein content.
The proportion of genotypic variance explained by this QTL varied in the four families in which it was detected, suggesting that different numbers of additional QTL affect protein and oil content in each family. Interestingly, however, the QTL effects for both protein and oil content were rather stable, indicating that the effect of this major QTL is largely independent of the genetic background (Table 1). This is advantageous for breeding, as it suggests that a similar effect can be expected of this QTL in most if not all crosses. By contrast, regarding all other identified QTL, the fact that we only identified them in single families, even though they have to segregate in several families owing to the half-diallel design, indicates their dependency on the genetic background.
Protein content and oil content are two important traits in soybean breeding, with the former being of more relevance in European breeding programs. However, as in other crops, the two quality traits are negatively correlated, and this inverse relationship makes it challenging for breeders to simultaneously increase both traits [51-53]. In this study, three QTL were identified for both traits, including the major QTL on chromosomes 20 and 15. This suggests either pleiotropic action of the QTL or close linkage of QTL acting separately on each of the two traits. The QTL alleles, however, were found to always have opposite effects on protein and oil content. With perhaps one exception, this even held true for the QTL that were only identified for either protein or oil content (Fig. 3). This suggests pleiotropy of the QTL, which consequently hinders the simultaneous improvement of both traits. Moreover, our results corroborate recent work suggesting that the effect on protein content is approximately twice that on oil content, i.e. an increase of, for example, 2% protein content will result in a 1% decrease in oil content [21,22]. This is likely due to the higher energy demand for lipids than for proteins as storage products. Taken together, our results suggest that the strong inverse relationship between the two quality traits has a molecular genetic basis, as it appears more likely that it is caused by pleiotropically acting QTL with opposite effects on protein and oil content. Consequently, selection for one trait will inevitably affect the other, and whether protein content or oil content is the primary target trait in soybean breeding programs should be determined based on the intended end-use.
The QTL on chromosome 20 is a major QTL for both traits, and several of the other identified QTL can also be classified as medium-effect QTL. We therefore investigated the effects of stacking such QTL, i.e. combining the favorable alleles as would result from marker-assisted selection. This revealed a continuous increase in protein content the more protein content-increasing alleles of the three QTL were combined, with a maximum difference of 3.20%. This indicates an additive gene action of these QTL, such that their effects can be combined to increase protein content. However, owing to the negative effects of the same QTL alleles on oil content mentioned above, the increase in protein content has to be paid for with an approximately half as strong decrease in oil content. Obviously, the opposite is achieved when selecting for the oil content-increasing QTL alleles. This illustrates the potential to improve protein and oil content by marker-assisted selection, utilizing previously identified QTL. Notably, however, marker-assisted selection cannot solve the problem of the strong inverse relationship between both traits, as this appears to be mainly based on shared, pleiotropic QTL. Thus, a simultaneous improvement of both traits by marker-assisted selection based on identified QTL appears difficult if not impossible. Potentially though, among the many presumed small-effect QTL are also some that act on only one of the traits without affecting the other. Lee et al. [5] recently reported such trait-specific QTL for protein and oil content, which however were a minority compared to the pleiotropic QTL. Thus, if the target trait is chosen and marker-assisted selection is combined with phenotypic selection for both traits, an increase in the target trait with no or only a minor decrease in the second trait might be achieved if sufficient trait-specific background QTL counteract the effects of the QTL for the primary target trait. This is probably what breeding has achieved in the last decades, resulting in cultivars with a high protein content and a respectable oil content.
In summary, in this study we employed a large panel of soybean lines derived from eight families to dissect the genetic control underlying protein and oil content in early European soybean material. Our results suggest a quantitative inheritance with some moderate- to large-effect QTL, several of which are likely common with other soybean germplasm. Most, if not all, of the identified QTL appear to act pleiotropically, and thus selecting for a certain QTL allele will increase one trait but decrease the other. Consequently, breeding for these two important quality traits might be assisted by marker-based selection, but this should be accompanied by phenotypic selection to find a balance in improving both traits.
Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2020.06.006.
Declaration of competing interest
The authors declare that there are no conflicts of interest.
Acknowledgments
We thank Friederike Sommermann for initial work on the genetic linkage maps. This work was funded by the German Federal Ministry of Food and Agriculture (BMEL, grant numbers 2814500110, 2814EPS011) and by the Federal Ministry of Education and Research of Germany (BMBF, grant numbers 031B0339A, 031B0339B).
Author contributions
Tobias Würschum and Volker Hahn designed the study; Volker Hahn and Willmar L. Leiser collected phenotypic and genotypic data; Xintian Zhu performed the analyses; Xintian Zhu, Volker Hahn, Willmar L. Leiser, and Tobias Würschum wrote the paper.