1 Scatter plot of (q-q^{2}/2,M(q)) and the gamma distribution curve fit

Kasthuri_Kannan_Natural_Selection.bib

Why Mutant Allele Frequencies in Oncogenes Peak Around 0.40 and Rapidly Decrease?

[20pt] Kasthuri Kannan111New York University - School of Medicine222Corresponding author and Adriana Heguy11footnotemark: 1

[20pt]

Abstract

The mutant allele frequencies in oncogenes peak around and rapidly decrease. In this article, we explain why this is the case. Invoking a key result from mathematical analysis in our model, namely, the inverse function theorem, we estimate the selection pressures of the mutations as a function of germline allele frequencies. Under complete dominance of oncogenic mutations, this selection function is expected to be linearly correlated with the distribution of the mutant alleles. We demonstrate that this is the case by investigating the allele frequencies of mutations in oncogenes across various cancer types, validating our model for mean effective selection. Consistent with the population genetics model of fitness, the selection function fits a gamma distribution curve that accurately describes the trend of the mutant allele frequencies. While existing equations for selection explain evolution at low allele frequencies, our equations are general formulas for natural selection under complete dominance operating at all frequencies. We show that selection exhibits linear behavior at all times, favoring dominant alleles with respect to the change in recessive allele frequency. Also, these equations show, selection behaves like power-law against the recessive alleles at low dominant allele frequency.

Keywords: oncogenes, allele frequencies, natural selection, gamma distribution, inverse function theorem

Why Mutant Allele Frequencies in Oncogenes Peak Around 0.40 and Rapidly Decrease?

[20pt] Kasthuri Kannan111New York University - School of Medicine222Corresponding author and Adriana Heguy22footnotemark: 2

[20pt]

1. Introduction

Cancer is an evolutionary disease with mutation acting as a unit of selection and, therefore, quantifying selection is necessary. Since the distribution of the allele frequencies of somatic mutations reflects the nature of selection underlying the mutant clones, modeling this distribution is essential. Recently, Williams et al. [williams] showed neutral tumor evolution results in a power-law distribution of the mutant allele frequencies, and this law fits of cancers of various types. While neutral evolution remains an important aspect in several cancer types, the distribution of the allele frequencies of mutations in genes that undergo positive selection, has not been determined so far. This is difficult to model because the allele frequencies of positively selected mutations are related to their functional consequences and therefore we have to take into account the degree of dominance exhibited by these mutations. However, in the case of oncogenes, which are primarily dominant, determining the distribution of the mutant allele frequencies should be feasible. It is worth noting that in the context of human polymorphisms, the allele frequencies of new deleterious mutations has been studied and fitness distribution is shown to follow a gamma distribution [eyre]. Nonetheless, the fitness function and distribution of allele frequencies of oncogenic mutations in tumors have not been described.

The mutant allele frequencies in oncogenes peak around and rapidly decrease. We built a mathematical model to describe the trend of these frequencies. We assumed complete dominance of oncogenic mutations, although we realize dominance (or partial dominance) can be modeled as a function of the functional impact of these mutations, which can be derived from algorithms such as PolyPhen and SIFT [polyphen] [sift]. The scope of this article is restricted to describe the general tendency of the frequencies rather than considering the impact of the mutations on their frequencies. By taking advantage of a key result in mathematical analysis, namely, the inverse function theorem, we estimate the mean effective selection of the mutations as a function of germline allele frequencies. Under complete dominance of oncogenic mutations, this selection function is expected to be linearly correlated with the distribution of the mutant alleles. We demonstrate that this is the case by investigating the allele frequencies of mutations in oncogenes across various cancer types, validating our model for mean effective selection. Consistent with population genetics model of fitness, the selection function fits a gamma distribution curve that accurately describes the trend of the mutant allele frequencies.

This model infers mean effective selection for oncogenic mutations without considering other alterations in the DNA that could change the dynamics of the tumor micro-environment. Moreover, this measure is an effective selection coefficient in the sense that it is selection coefficient with respect to gain in mutant allele frequencies. Combining this estimate with other modifications, such as copy number changes, and integrating the functional impact of the resulting protein will help in understanding the evolution of the mutant clones in various tumors. Although the equations that we derive are currently applied in the context of oncogenes, these are general formulas for natural selection under complete dominance operating at all frequencies. While being consistent with known formulas that explain evolution at low frequencies, one of these equations show that under low frequencies of the dominant alleles, selection against recessive alleles behave in a power-law like manner, reiterating the powerful role of natural selection. This would also explain the reason some tumors undergo rapid clonal evolution. Further, at high frequencies of the dominant alleles, the linear expansion exhibited by selection with respect to change in recessive germline allele frequencies could partly be the reason behind drug resistance, under dominance.

2. Determining Selection Coefficients

We use the standard model described in [falconer] for selection under complete dominance. An illustration of this model for mutations when the frequencies are not time dependent is shown in Supplementary Figure 1. Under this model, if is the coefficient of selection, and are the allele frequencies with , the new genotype frequencies , in terms of and are given by,

 f(p)=p2+pq1−sq2;f(q)=1−f(p)=q−sq21−sq2 (1)

Using this model for time-dependent allele frequencies, we derive a new equation for the number of somatic mutations in terms of the germline allele frequency.

At time , let be a mutation with and as allele frequencies of mutant homozygous, mutant heterozygous and germline genotypes of , respectively, with . Let the strength of the selection be expressed as a coefficient of selection, , which is proportional to the reduction of the germline genotype, compared to the mutant genotypes which are favored for the tumor growth. In reality, the selection coefficient should be a function of time. However, we can consider as mean selection acting over time and so the time dependence can be omitted. Thus, in our model, we assume the frequencies are time-dependent, but the selection is time-independent. If the fitness of the homozygous and heterozygous mutant genotypes are taken to be as it is likely the case in oncogenes, the fitness of the germline genotype which is selected against, is then . Thus, after one generation, the new mutation frequency, , in terms of the recessive germline allele is given by equation 1, which is,

 Mqt=qt−sq2t1−sq2t. (2)

Hence, change in mutation frequency, , resulting in a small time interval of selection is,

 ΔMqt=[Mqt−qt]Δt=−sq2t(1−qt)1−sq2tΔt.

Let and be the time of tumor initiation and the time of tumor biopsy, respectively. Then,

 M(qT)=∫TT0dMqt=−∫TT0sq2t(1−qt)1−sq2tdt.

Denoting , and applying inverse function theorem,

 dudt=q′t=1[q−1]′(u)=1F′(u)

where for any time point and . We see that, if and are initial and final frequencies, then

 M(uT)=−∫uTu0su2(1−u)1−su2F′(u)du (3)

This formula allows us to express the number of mutations purely in terms of allele frequencies. If is the fraction of reduction of germline alleles due to the selection pressure , we expect to follow the function,

 F(u)=1−λuu−λu⇒F′(u)=−1(1−λ)u2

and therefore, equation 3 is given by,

 M(uT)=s1−λ∫uTu01−u1−su2du≈sλ∫uTu0(1−u)du=sλ(uT−u2T2)+sλC

where can be interpreted as mean effective selection coefficient favoring mutant genotypes. If the coefficient is extremely small at , then , and hence the number of mutations in terms of the observed germline allele frequency at time is given by,

 M(q)≈sλ(q−q22) (4)

and is therefore,

 sλ≈M(q)[q−q22]−1 (5)

Also, from equation 1, since we have,

 M(p)≈p+sλp(1−p)2, (6)

if is very small or when is small, the number of mutations in terms of the mutant allele frequency can be approximated by . Therefore, we expect to be correlated with the distribution of the mutant allele frequencies for small values of . However, we note from equation 5 and equation 6 that extremely small values of or will make unbounded or negative, and hence correlation may not be valid.

3. Results

We determined the selection coefficents for mutations in oncogenes to see if they are correlated with the distribution of the mutant allele frequencies. To do this, we first identified proto-oncogenes from the Uniprot database [uniprot] out of which were exclusive to Homo sapiens. A total of mutations in these genes were queried from the cBio portal [cBio1, cBio2] and mutations for which mutant allele frequencies were available, was retained (Supplementary Table 1 available upon request). The germline allele frequencies were computed by subtracting the mutant allele frequencies from and was normalized with its standard Eucledian norm.

Equation 4 essentially states should be correlated with the random variable under the random variable . Therefore, it is natural to fit the data . Since two fitness distributions, the gamma, and the exponential have been traditionally applied to model selection coefficients [gillespie], we employed both functions. Consistent with the fitness function of deleterious mutations in human polymorphisms [eyre], the gamma distribution curve fitted well with minimal residual error of with fitting function with respect to , given by,

 g(α,β,ρ,δ;x)=ρβαΓ(α)xα−1e−xβ+δ

where shape, scale, amplitude and offset parameters, , , and were identified as , , , , respectively. MATLAB’s nlinfit was used to fit the data with initial conditions for the parameters.

Figure 1 shows the scatter plot of and the gamma distribution curve fit. This function allows us to compute for each mutation through equation 5, i.e., , where is the transformed allele frequency, . Since coefficients are as much as the number of mutations and is restricted by the frequency bin size, to find the correlation, we sub-sampled for iterations and considered mean correlation. Also, for the reasons discussed following equation 6, mutant allele frequencies greater than and were not considered. The mean correlation was determined to be with mean p-value . We also determined the line of best fit (MATLAB’s polyfit routine with degree ) to infer the slope to find the optimal that would fit with the mutant allele frequencies. Figure 2 (a) shows the correlation between coefficients and and figure 2 (b) shows optimal that fits the mutant allele frequencies.

4. Discussion

Good correlation between selection coefficients and the mutant allele frequencies explain why the frequencies are centered around and rapidly decrease. Selection coefficients are maximized around this region and reduce exponentially.

In population genetics it is known that at low frequencies under dominance, selection for/against dominant alleles follow and selection for/against recessive alleles grow at . This is because equation 1 tells us that,

 Δf(p,q)=±spq21−sq2.

While this is true when the unit of measure is generation time, integrating the selection equation with respect to time establishes that at all frequencies selection for the dominant alleles with respect to change in recessive allele frequency is linear. This can be seen by differentiating equation 4:

 dM(q)dq=sλ(1−q)=sλp. (7)

Similar analysis on selection against recessive allele with respect to dominant allele frequency would reveal (See Supplementary Method ),

 dM(p)dp=−sθq2p. (8)

where can be interpreted as effective selection against recessive alleles. These formulas, equations 7 and 8, as relative increase and decrease, define natural selection under complete dominance. Equation 7 is natural after all - for a given dominant allele frequency, the rate of change of dominant alleles in terms of (the loss of) germline alleles, should be proportional to the amount of selection that takes place. While both equations are consistent with known observation under low frequencies, equation 8 suggests that at low frequencies of the dominant alleles, selection against the recessive alleles act in a power-law like manner, demonstrating why positive (natural) selection is a potent force. It is worth noting that this power-law growth of relative fitness has been observed in long term evolution experiments [lenski1, lenski2]. Under complete dominance, selection would maximize the dominant alleles at all costs.

Equations 7 and 8 also allows us to compute the rate of change of gene/mutation frequencies with respect to time. If and are the rate of growth of dominant and recessive alleles, respectively, we can expect an exponential growth. Therefore,

 dpdt=αp;dqdt=βq.

Hence, denoting by and by , we see,

 dMqdt=dMqdq⋅dqdt=sλβpq, (9)

and

 dMpdt=dMpdp⋅dpdt=−sθαq2. (10)

Thus, if the genes/mutations don’t directly depend on time, the total derivative, i.e., selection acting over time, is given by,

 dMdt=sλβpq−sθαq2.

Further, equations 9 and 10, helps us write the general equation for evolution through natural selection for diploid genomes under complete dominance,

 α2sθdqdMq+β2sλdpdMp=0.

In the context of cancer, especially in the evolution of mutant clones in oncogenes when mutant allele frequencies are small, the normal linear growth of the mutant alleles along with the power-law like loss of recessive germline alleles will doubly accelerate the progression of the mutant clones, possibly contributing to heterogeneity in the presence of competing mutations. Similarly, when recessive germline allele frequencies are small, and they grow quadratically, the progression of the mutant alleles will proceed linearly, still dominating and possibly conferring more resistance to therapy and giving rise to metastatic clones. Therefore, incorporating selection measures in evaluating functional impact of the mutations and assessing the aggressiveness of the tumors by taking the degree of selection into account will lead to better understanding of this complex evolutionary disease.

Acknowledgements

The author likes to acknowledge Dr. Friedrich Philipp who suggested the use of inverse function theorem in an online forum.

Notes

The MATLAB routine used to generate the results is available upon request.

\printbibliography

Supplementary Figure 1

[20pt]

Supplementary Method 1

[20pt]

In the case of dominant allele frequencies,

 Mpt=pt1−sq2t.

Hence, change in mutation frequency, , resulting in a small time interval of selection is,

 dMpt=(Mpt−pt)dt=+sptq2t1−sq2tdt. (11)

Denoting , and applying inverse function theorem,

 dvdt=p′t=1[p−1]′(v)=1G′(v)

where for any time point and .

If is the fraction of reduction in the germline alleles due to the selection pressure , we expect to follow the function,

 G(v)=1−λ(1−v)v⇒G′(v)=−1−λv2

and therefore, equation 11 is given by,

 dM(p)dp=−s(1−λ)q2p[1−s(1−p)2]≈−s(1−λ)q2p
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters