Parental selection based on molecular information under a population genetics approach

The correct choice of parents to compose optimal segregating populations is key to the success of breeding programs. We hypothesized that this choice can be made from molecular marker information analyzed in the context of population structure. Ten parental populations were simulated, and 45 hybrid combinations were obtained from diallel crosses among them. Each population consisted of 200 individuals genotyped at 50 independent loci. The populations were evaluated for Hardy-Weinberg equilibrium (HWE), coefficient of inbreeding (F), heterozygosity (H), and polymorphic information content (PIC). Genetic diversity between pairs of parental populations was evaluated using five dissimilarity measures. Mantel correlations were estimated between the dissimilarity matrices and the PIC, H, and F values obtained for the hybrid combinations. All parental populations were in HWE, and the combination that departed most from this condition was hybrid 3x5, with only 26% of its loci in HWE. This same hybrid was among those with the lowest F estimates and the highest H values, which indicated greater divergence between its parents. The dissimilarity measures agreed on which hybrid combinations were the most and the least divergent. This agreement matters because variability, together with good average potential, is an important criterion for forming the initial population in any breeding program that involves sexual reproduction.


INTRODUCTION
One of the most important steps in a breeding program is the selection of parents to compose promising segregating populations, since this determines the success of the subsequent stages and the effectiveness of the program (Bertan, Carvalho, & Oliveira, 2007; Pereira, Santos, Abreu, & Couto, 2007). In this context, selecting the most promising segregating populations optimizes the use of the resources spent on a breeding program (Pimentel et al., 2013).
Emphasis has been given to the study of genetic diversity in several crops, such as cotton (Santos et al., 2017), bean (Carović-Stanko et al., 2017), and soybean (Santos et al., 2014), in order to identify promising genotypes for breeding purposes, to quantify genetic variability (Rigon et al., 2012; Hiremath & Nagaraja, 2016), and to guide breeders toward the most appropriate choices for forming superior hybrids (Ferreira et al., 2012), among others. In general, parents for a base population have been selected from phenotypic information on traits of agronomic importance, in test crosses that are modeled and analyzed following the principles of quantitative or biometric genetics. However, different data sets may be used, including pedigree data (Teixeira-Neto, Cruz, Carneiro, Malhado, & Faria, 2013), biochemical data (Signorini, Renesto, Machado, Bespalhok, & Monteiro, 2013), molecular marker data (Silva et al., 2017), and others.
Molecular markers have become important and efficient tools, and their evaluation combined with agronomic traits can increase the accuracy of selection, optimize field work, and ensure greater success in breeding programs (Annicchiarico, Nazzicari, Carelli, Wei, & Brummer, 2016). Using molecular markers in studies of genetic diversity allows different biometrical techniques, based on means and variances, to be applied. However, the choice of a method will always depend on the objective of the study, the level of response required, the technological infrastructure needed, and the time available.
When molecular information is available, there is a possibility of studying genetic diversity at different levels, from the point of view of population genetics that includes individual genotypes, germplasm accessions, and populations. The diversity analysis at the population level is considered the most complex, since it is influenced by the number of individuals sampled, number of loci, genotypic constitution, and effective size (Cruz, Ferreira, & Pessoni, 2011).
The population structure is defined by the frequency of the alleles that compose the different genotypes that constitute it, and understanding it can direct decision-making in breeding programs (Cruz et al., 2011). In the analysis of population structure, it is essential to identify the occurrence of Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD), and to estimate parameters such as polymorphic information content (PIC), coefficient of inbreeding (F), and heterozygosity (H) (Santos et al., 2012). The present work follows this population genetics approach. It is therefore sufficient to consider only the genotypic information of individuals and populations generated through meiosis and the union of gametes, according to the mating system involved, which in this study was random mating to generate the parental populations and crossing to generate the hybrids.
The phenotypic evaluation of the potential of populations and their genetic diversity has been indispensable in the choice and orientation of crossing between potential parents in any breeding program based on sexual reproduction processes. The objective of this study was to evaluate the potential of the population and the degree of differentiation between pairs of populations, in the context of conventional breeding while using a population genetic approach based on molecular data, inbreeding, heterozygosity, Hardy-Weinberg equilibrium, and differentiation aspects.

MATERIALS AND METHODS
For simulation purposes, the only parameters required are the number of loci, the number of alleles per locus, and the dominance relationship between these alleles. An allele frequency is drawn at random for each locus, and from it a gametic pool of possible ancestors is established. The genotype of each individual in a parental population results from the union of two gametes taken at random from a set of 10,000 ancestral gametes. The simulation is validated by the results of this work themselves, which are based on the manifestation of Hardy-Weinberg equilibrium, heterozygosity, and PIC. Additional information on simulation in the Genes program can be found in Cruz (2006).
The Genes program (Cruz, 2016) can simulate genomic genotype data (with parameters related to linkage group size, distance, linkage phase, etc.) as well as genotypic and phenotypic data of individuals and populations (with information on heritability, dominance, epistasis, means, etc.). It can also generate data on individuals and populations derived by random mating, self-fertilization, and hybridization.
Ten parental populations in Hardy-Weinberg equilibrium were simulated. Each population had 50 independent loci with two co-dominant alleles per locus and consisted of 200 individuals. In addition, 45 hybrid combinations were obtained by crossing these ten populations in a diallel scheme.
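The simulation scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm implemented in Genes: the function names, the genotype coding (0, 1, or 2 copies of one allele), the range from which allele frequencies are drawn, and the random seed are all hypothetical choices made for the example.

```python
import random
from itertools import combinations


def simulate_parent(rng, n_ind=200, n_loci=50, pool_size=10_000):
    """Simulate one parental population under random mating.

    Genotypes are coded as the number of copies (0, 1 or 2) of one of the
    two co-dominant alleles at each locus (a coding assumed for this sketch).
    """
    pools = []
    for _ in range(n_loci):
        p = rng.uniform(0.05, 0.95)  # assumed range of allele frequencies
        n_a = round(p * pool_size)
        pools.append([1] * n_a + [0] * (pool_size - n_a))  # ancestral gametes
    # Each individual is the union of two gametes drawn at random per locus.
    return [[rng.choice(g) + rng.choice(g) for g in pools] for _ in range(n_ind)]


def gamete(individual, rng):
    """Meiosis at independent loci: a heterozygote passes either allele."""
    return [g // 2 if g != 1 else rng.randint(0, 1) for g in individual]


def cross(pop_a, pop_b, rng, n_ind=200):
    """Hybrid population: each individual gets one gamete from each parent population."""
    hybrids = []
    for _ in range(n_ind):
        ga = gamete(rng.choice(pop_a), rng)
        gb = gamete(rng.choice(pop_b), rng)
        hybrids.append([a + b for a, b in zip(ga, gb)])
    return hybrids


def diallel(parents, rng, n_ind=200):
    """All crosses i x j with i < j (no selfs, no reciprocals)."""
    return {(i + 1, j + 1): cross(parents[i], parents[j], rng, n_ind)
            for i, j in combinations(range(len(parents)), 2)}


rng = random.Random(42)  # arbitrary seed, for reproducibility only
parents = [simulate_parent(rng) for _ in range(10)]
hybrids = diallel(parents, rng)  # keys are (i, j) pairs such as (3, 5)
```

With ten parents, the half diallel without selfs or reciprocals yields 10 x 9 / 2 = 45 combinations, matching the number of hybrids reported.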

Step 1 - Assessment of population potential
In order to validate the simulation process, the ten parental populations were evaluated for their Hardy-Weinberg equilibrium condition by the chi-square test, which is presented in detail in Cruz et al. (2011). After confirming the HWE condition, data from the hybrid populations were used to estimate the descriptors of population structure, including the coefficient of inbreeding, based on the observed heterozygote frequency compared with the heterozygote frequency expected in the whole population, and the polymorphic information content (PIC), calculated according to Botstein, Skolnick and Davis (1980).
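For a single biallelic locus, the descriptors named above can be computed as in the following sketch. It assumes the textbook forms of each statistic (a chi-square goodness-of-fit test with one degree of freedom, F = 1 - Ho/He, and the PIC formula of Botstein et al., 1980); the exact estimators used in Genes are described in Cruz et al. (2011) and may differ in detail. The function name, the genotype coding, and the 5% significance level are assumptions of the example.

```python
CHI2_CRIT_1DF_5PCT = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def locus_descriptors(genotypes):
    """Population descriptors for one biallelic, co-dominant locus.

    `genotypes` holds one value per individual: the number of copies
    (0, 1 or 2) of allele A (a coding assumed for this sketch).
    """
    n = len(genotypes)
    n_aa, n_het, n_AA = (genotypes.count(k) for k in (0, 1, 2))
    p = (2 * n_AA + n_het) / (2 * n)  # frequency of allele A
    q = 1 - p
    h_obs = n_het / n                 # observed heterozygosity
    h_exp = 2 * p * q                 # heterozygosity expected under HWE
    f = 1 - h_obs / h_exp if h_exp > 0 else 0.0
    # PIC = 1 - sum(p_i^2) - sum_{i<j} 2 p_i^2 p_j^2, here with two alleles.
    pic = 1 - p**2 - q**2 - 2 * p**2 * q**2
    observed = (n_aa, n_het, n_AA)
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return {"p": p, "H": h_obs, "F": f, "PIC": pic, "chi2": chi2,
            "in_hwe": chi2 < CHI2_CRIT_1DF_5PCT}


# A locus with exact HWE proportions at p = 0.5, and one made only of heterozygotes.
balanced = locus_descriptors([0] * 50 + [1] * 100 + [2] * 50)
all_het = locus_descriptors([1] * 200)
```

In the second example every individual is heterozygous, so F = -1 and the locus is flagged as out of equilibrium, which is the extreme case of the heterozygote excess expected in hybrids between divergent parents. The percentage of loci in HWE per population would then be the share of the 50 loci for which `in_hwe` is true.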

Step 2 - Degree of population differentiation
In order to estimate the degree of differentiation among pairs of parental populations, three distance measures were estimated: Euclidean, Angular, and Hedrick's Genotypic distance. Furthermore, two fixation indices were estimated: Nei's fixation index (GST) (Nei, 1973) and Wright's fixation index (FST) (Wright, 1965). These methods are also described in Cruz et al. (2011).

Step 3 - Correlation between matrices
In order to predict the performance of the hybrid populations, the Mantel test (Manly, 1997) was used to correlate the distance matrices generated for the parental populations with the matrices generated for the hybrid populations from the population descriptors, namely the polymorphic information content, heterozygosity, and the coefficient of inbreeding.
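A minimal permutation version of the Mantel test is sketched below to make the procedure concrete. It assumes two symmetric 10 x 10 matrices indexed by parental population: one holding a dissimilarity measure between parents i and j, the other holding a descriptor (H, PIC, or F) of hybrid i x j. The function names, the two-sided comparison, and the number of permutations are assumptions of the example, not settings taken from Genes.

```python
import random
from math import sqrt


def upper_triangle(m):
    """Off-diagonal upper-triangle elements of a square matrix, row by row."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mantel(a, b, n_perm=999, seed=0):
    """Mantel correlation and two-sided permutation p-value.

    Rows and columns of `b` are permuted together, which keeps the
    dependence among the distances that involve the same population.
    """
    rng = random.Random(seed)
    n = len(a)
    xa = upper_triangle(a)
    r_obs = pearson(xa, upper_triangle(b))
    hits = 1  # the observed arrangement counts as one permutation
    idx = list(range(n))
    for _ in range(n_perm):
        rng.shuffle(idx)
        permuted = [[b[idx[i]][idx[j]] for j in range(n)] for i in range(n)]
        if abs(pearson(xa, upper_triangle(permuted))) >= abs(r_obs):
            hits += 1
    return r_obs, hits / (n_perm + 1)


# Toy check: a distance matrix against a rescaled copy of itself.
points = [0, 1, 3, 6, 10]
dist = [[abs(u - v) for v in points] for u in points]
scaled = [[2 * d for d in row] for row in dist]
r, p_value = mantel(dist, scaled)
```

Permuting whole rows and columns, rather than the 45 pairwise values independently, is what distinguishes the Mantel test from an ordinary correlation test: distances sharing a population are not independent observations.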

Step 4 - Computational resources for data analysis
The simulation and data analysis were performed at the Biometrics Laboratory of the Department of General Biology of the Federal University of Viçosa, using the computational resources of the Genes software (Cruz, 2013).

RESULTS AND DISCUSSION
The parental populations were all in Hardy-Weinberg equilibrium (HWE), with at least 86% of the loci manifesting HWE (Table 1). All estimates of the coefficient of inbreeding, in both parental and hybrid populations, were negative or close to zero. A null estimate of F is expected in sufficiently large populations under random mating. Note that the species can be considered allogamous, but not exclusively: the reasoning holds for any species in which random mating occurs naturally (as in allogamous species) or is guided by human action, and it applies to plants and animals alike. Negative F values reflect an excess of heterozygotes in the population, resulting from crosses between divergent parents. Thus, the absence of disturbing factors, including natural and artificial selection, mutation, migration, and inbreeding, can be confirmed. A value of F = 0 means that the genotypic frequencies are those expected under HWE, namely $p^2$, $2pq$, and $q^2$. Under inbreeding, the genotypic frequencies become $p^2 + pqF$, $2pq(1-F)$, and $q^2 + pqF$. Inbreeding does not affect allele frequencies, but it does affect genotype frequencies. Mutation and migration, if present, would alter the allele frequencies and, consequently, the genotype frequencies. Mayo (2008) compares the HWE rule with Newton's first law of motion, which states that a physical body will either remain at rest or continue to move at a constant speed unless forces act upon it. If stability is the rule, it is also the baseline against which effects on the population are identified. These results, obtained by simulation, show that these measures have practical application in conventional breeding, where inbreeding and genetic complementarity cannot be understood by simply examining means and variances of phenotypic values.
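The link between parental divergence and negative F can be made explicit with a short derivation. It is a sketch under one assumption that the text does not spell out: every hybrid individual receives one gamete from each of two parental populations, whose frequencies of allele A at a given locus are $p_1$ and $p_2$ (with $q_i = 1 - p_i$).

```latex
% Genotype frequencies in the hybrid between populations 1 and 2
\begin{align}
f(AA) &= p_1 p_2, \qquad f(Aa) = p_1 q_2 + p_2 q_1, \qquad f(aa) = q_1 q_2 \\
% Allele frequency in the hybrid population
\bar{p} &= f(AA) + \tfrac{1}{2} f(Aa) = \tfrac{1}{2}(p_1 + p_2), \qquad \bar{q} = 1 - \bar{p} \\
% Heterozygote excess relative to the HWE expectation 2\bar{p}\bar{q}
f(Aa) - 2\bar{p}\bar{q} &= \tfrac{1}{2}(p_1 - p_2)^2 \ge 0 \\
% Equating f(Aa) with 2\bar{p}\bar{q}(1 - F)
F &= 1 - \frac{f(Aa)}{2\bar{p}\bar{q}} = -\frac{(p_1 - p_2)^2}{4\bar{p}\bar{q}} \le 0
\end{align}
```

Under this assumption, F is zero only when the two parents share the same allele frequency and becomes more negative as they diverge, which is consistent with hybrids of divergent parents showing low F, high H, and few loci in HWE.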
The hypothesis was built on the fact that information about the potential and the diversity of candidate parents is important for forming a base population for breeding. Conventional breeding has means of inferring genetic diversity from distance measures based on phenotypic or genotypic information, but measures of population structure are neglected. Thus, the hypothesis of this work was that molecular information, analyzed from the perspective of population genetics, can assist conventional breeding. Several study designs would be possible, such as diallels, analysis of generations (P1, P2, F1, and F2), segregating generations advanced by self-fertilization or random mating, or backcross generations, among others. The diallel chosen for this work proved adequate and showed that statistics common in population genetics but not used in breeding, such as F, PIC, and HWE measures, were important and useful. Currently, conventional breeding has successfully incorporated molecular marker information for prediction, classification, and pattern recognition in genome-wide selection approaches, but little credit has been given to the dynamics of the population under study.
In biometric studies based on phenotypic information, gene complementation is inferred either predictively, by means of distance measures, or directly, by quantifying heterosis or specific combining ability. As expected, the equilibrium condition was lost in the hybrid populations. The hybrid combination from the cross between parental populations 3 and 5, which had 98% and 88% of their loci in HWE, respectively, had only 26% of its loci in HWE (Table 1). This finding was expected, since this hybrid population also stood out for its high heterozygosity (0.56; Table 2) and was derived from the most divergent parental populations (Table 5). Thus, it is a satisfactory combination for use in crossbreeding systems that maximize genetic variability (Santos et al., 2017).
Population approaches also provide other information on the potential variability of a population, namely H and PIC. Table 2 shows the heterozygosity and polymorphic information content values observed in the parental populations and their respective hybrid combinations. The highest values of H were observed in the hybrid populations, especially in the combinations 8 x 10 (H = 0.5645), 3 x 10 (H = 0.5620), and 3 x 7 (H = 0.5507). The PIC values ranged from 0.2353 to 0.2890 in the parental populations and from 0.2933 to 0.3477 in the hybrid populations. All these values were lower than the heterozygosity values, which ranged from 0.2942 to 0.5645 across parental and hybrid populations. Judged by their PIC, the loci studied were at most moderately informative: PIC values between 0.25 and 0.50 characterize a moderately informative marker, and values above 0.50 a highly informative one (Botstein et al., 1980).
According to Ott (1992), PIC values should always be lower than those estimated for heterozygosity, so the PIC values in Table 2 agree with this expectation. Several authors have confirmed this expectation and reinforce the importance of markers, given the quality of the information they provide for different studies, such as characterization, genetic diversity, and paternity analyses (Reis et al., 2011; Crispim, Silva, Banari, Seno, & Grisolia, 2012).
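For a locus with two alleles, the relation between PIC and heterozygosity can be checked numerically. The sketch below compares PIC with the heterozygosity expected under HWE (2pq) over a grid of allele frequencies; the function names and the grid are assumptions of the example. Algebraically, PIC = 2pq(1 - pq), so PIC is always below 2pq and cannot exceed 0.375, the value reached at p = 0.5.

```python
def expected_heterozygosity(p):
    """Heterozygosity expected under HWE at a biallelic locus."""
    return 2 * p * (1 - p)


def pic_biallelic(p):
    """PIC of Botstein et al. (1980) for two alleles with frequencies p and 1 - p."""
    q = 1 - p
    return 1 - p**2 - q**2 - 2 * p**2 * q**2


grid = [i / 100 for i in range(1, 100)]  # allele frequencies 0.01 ... 0.99
pic_below_h = all(pic_biallelic(p) < expected_heterozygosity(p) for p in grid)
max_pic = max(pic_biallelic(p) for p in grid)
```

This bound is consistent with the PIC values reported here, all below 0.375, and it means that biallelic loci can never reach the 0.50 threshold for highly informative markers. It also suggests, although the text does not state it, that the H values above 0.50 in the hybrids are observed rather than expected heterozygosities, since 2pq cannot exceed 0.50.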
The estimates of association between the diversity predicted in the parental populations and the performance observed in the hybrids are shown in Table 3. Four of the five diversity measures were strongly correlated with the estimated performance of the hybrids, with high magnitudes and statistical significance at the 1% probability level. Hedrick's Genotypic Distance was the exception, with values significant at 1% for heterozygosity and coefficient of inbreeding, and at 5% for polymorphic information content. In general, correlation estimates of low magnitude were observed between the dissimilarity measures and the PIC values of the hybrids. These low values may be due to the low variation observed in the PIC values. PIC depends on two factors associated with the locus studied. The first is the number of alleles at the locus, which in this study was fixed at two. The second is the allele frequency in the population considered, which in this study is the source of the variation observed.
Table 3. Relationship between the parents' prediction and the hybrids' performance.
*, ** Significant at the 1 and 5% probability levels by the Mantel test, respectively. 1 Polymorphic Information Content (PIC), Heterozygosity (H), and Coefficient of Inbreeding (F).
The populations that deviated least from the HWE condition were those generated from the crosses between populations 4 & 9, 7 & 8, and 7 & 10, with 62% of the loci manifesting HWE. These populations were considered the least divergent. There was strong agreement on the most and the least divergent populations, even across different dissimilarity measures (Tables 4 and 5). This result was surprising, since each dissimilarity measure rests on a different rationale. The Euclidean and Angular distances are based on geometric properties and use allele frequency information to discriminate the more divergent from the less divergent populations. According to Dias (1988), two populations are considered similar only if they occur in the same region of the geometric space, with a small distance between them. Hedrick's genotypic distance offers an alternative way to quantify the dissimilarity among populations, using statistics based on genotype frequencies and not solely on allele frequencies.
[Tables 4 and 5, fragment: rows for parental populations 5 to 10 only; column headings were not recovered]
5    3   3   3   3   3      5    2   2   6   2   2
6   10  10  10  10  10      6    3   3   5   3   3
7    2   2   2   2   2      7    8   8   9   8   8
8   10  10  10  10  10      8    3   3   3   3   3
9    6   1   6   6   6      9    4   4   4   4   4
10   8   8   8   6   8     10    1   1   1   4   4
Hedrick (1971) states that his methodology is advantageous in relation to the others, since populations that are completely distinct genotypically will not be wrongly labeled as identical on the basis of their allelic constitution. Phenotypic information has allowed the use of a range of biometric procedures to infer information about populations for the purpose of forming a base population that manifests high vigor and wide variability to be explored by selection (Mohammadi & Prasanna, 2003; Cruz, Carneiro, & Regazzi, 2014).
Biometric techniques based on measures of dissimilarity (distances) and clustering have been extensively explored (Santos, Carneiro, Silva Junior, Cruz, & Soares, 2019). However, molecular markers can provide equally important information when processed and analyzed using population parameters (Milligan et al., 2018). Recent technological advances have allowed the routine evaluation of genetic diversity at the genome level (Narum, Buerkle, Davey, Miller, & Hohenlohe, 2013; Meirmans, 2015; Garner et al., 2016).

CONCLUSIONS
The relative diversity of the parental populations, based on the five dissimilarity measures, agreed with the population descriptors of their respective hybrid combinations. Thus, estimates of H, PIC, HWE, and F measured in the parental populations may help to predict the genetic diversity of hybrid combinations and to improve the accuracy of parental choice for the formation of base populations.