Selection Operating on Quasispecies

The previous page described how the high mutation rates of RNA viruses give rise to populations of progeny genomes with enormous sequence diversities. These ensembles of diverse and inter-converting sequences are called quasispecies by Eigen [Eigen and Biebricher, 1988]. We will now look at how selection might operate on these populations.

Substrates for selection

The particular composition of the progeny population depends on the size of the population.

Small populations, e.g., 10^4 genomes, perhaps resulting from an infection of a single cell, can behave quite differently from larger populations, e.g., 10^9 genomes, perhaps resulting from bulk passaging, or genomes produced in an infected individual. Indeed, we need to consider even larger populations. For example, there are 30 million HIV-infected individuals globally, each of whom produces as many as 10^9 to 10^10 viruses per day. This corresponds to a global virus load of about 10^17 or more! Numbers this large are not unusual when one considers the number of people infected by, e.g., dengue viruses, or even measles virus.

The previous page derived the composition and diversity of viral populations with sizes of 10^4, 10^9, and 10^17. The salient results are shown in Table 1.

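Table 1 (a)

| Errors | 10^4 population: Copies | 10^4 population: Diversity | 10^4 population: Copies per sequence | 10^9 population: Copies | 10^9 population: Diversity | 10^9 population: Copies per sequence | 10^17 population: Copies | 10^17 population: Diversity | 10^17 population: Copies per sequence |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3103 | 1 | 3103 | 3.1 x 10^8 | 1 | 3.1 x 10^8 | 3.1 x 10^16 | 1 | 3.1 x 10^16 |
| 1 | 3631 | 3631 | 1 | 3.6 x 10^8 | 35109 | 1025 | 3.6 x 10^16 | 35109 | 1.0 x 10^12 |
| 2 | 2125 | 2125 | 1 | 2.1 x 10^8 | 2.1 x 10^8 | 1 | 2.1 x 10^16 | 6.2 x 10^8 | 3.4 x 10^7 |
| 3 | 829 | 829 | 1 | 8.3 x 10^7 | 8.3 x 10^7 | 1 | 8.3 x 10^15 | 7.2 x 10^12 | 1.1 x 10^3 |
| 4 | 242 | 242 | 1 | 2.4 x 10^7 | 2.4 x 10^7 | 1 | 2.4 x 10^15 | 2.4 x 10^15 | 1 |
| 0 to 4 | 9930 | 6828 | - | 9.9 x 10^8 | 3.2 x 10^8 | - | 9.9 x 10^16 | 2.4 x 10^15 | - |

(a) Assuming genome length = 11703, mutation rate = 10^-4. (You can download the spreadsheet used to obtain the results shown here, to look at other genome lengths, mutation rates, and population sizes.)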

All three populations consist of about 31% non-mutated sequences, and include a large number of mutant sequences, some 69% of the population, that provide sequence diversities almost the size of the population.

The main difference between the populations is the sampling density of mutant sequences. For example, the 10^4 population contains a random subset of about 10% of the 35109 possible 1-error sequences, while the 10^9 population contains all possible 1-error sequences, with over 1000 copies of each. In comparison, the 10^17 population includes all possible mutants with up to 3 changes.
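The error-class composition in Table 1 can be approximated with a simple binomial model. The sketch below is an illustration, not the spreadsheet itself: it assumes that every site mutates independently at the same rate, that each base can change to any of three others, and that the diversity of an error class is capped by whichever is smaller, the number of copies made or the number of sequences possible. The function names are invented for this example.

```python
from math import comb

GENOME_LENGTH = 11703   # nucleotides (Table 1 footnote)
MUTATION_RATE = 1e-4    # per nucleotide per replication (Table 1 footnote)

def class_fraction(errors, length=GENOME_LENGTH, rate=MUTATION_RATE):
    """Binomial probability that a progeny genome carries exactly `errors` mutations."""
    return comb(length, errors) * rate**errors * (1 - rate)**(length - errors)

def possible_sequences(errors, length=GENOME_LENGTH):
    """Number of distinct sequences with `errors` changes (3 alternative bases per site)."""
    return comb(length, errors) * 3**errors

def expected_diversity(population, errors):
    """Rough diversity: limited by the copies made and by the sequences possible."""
    copies = population * class_fraction(errors)
    return min(copies, possible_sequences(errors))

for population in (1e4, 1e9, 1e17):
    for errors in range(5):
        copies = population * class_fraction(errors)
        print(f"N={population:.0e} E={errors} copies={copies:.3g} "
              f"diversity={expected_diversity(population, errors):.3g}")
```

Changing `GENOME_LENGTH`, `MUTATION_RATE`, or the population sizes shows how quickly the saturated region of sequence space grows or shrinks.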

Note that these populations are the direct products of replication, before selection.

The adaptive landscape

The diverse sequences in the population constitute the raw material for selection. It is likely that some of the mutations will be deleterious or lethal. Others, however, may be adaptive under different environmental conditions.

For example, if resistance to an antiviral drug can be achieved by a specific point mutation, there is a 10% chance that the drug-resistant mutant is in the 10^4 population. In contrast, it would be astonishing if it were not present in the 10^9 population, as all possible 1-error sequences are expected to be present in the population, at over 1000 copies each. While all other genomes are inhibited in the presence of the drug, the drug-resistant mutant will be able to grow normally, to become the new master sequence. As it replicates, it will re-generate populations that contain large numbers of mutants, almost all of which will be at least one mutation distant from the original master sequence. The exception is the mutation of the new master sequence that reverts precisely to the sequence of the former master sequence. These revertants will be discussed below.

In general, given some set of defined conditions (host, temperature, etc.), we can assign a fitness value, S_i, to every possible genotype (point) in sequence space. (A natural definition of fitness is the number of progeny produced by a virus [Eigen and Biebricher, 1988].) If we plot S_i versus sequence space, we produce a version of Sewall Wright's adaptive landscape [Wright, 1982].

The topography of the adaptive landscape

Unfortunately, little can be deduced a priori about the topography. It obviously depends on the precise environmental conditions, as they determine the fitness of each genotype. It must have peaks, depths, and intermediate heights, corresponding to points of high, low and intermediate fitnesses. The precise locations of these features must be determined empirically, but some crude estimates may be made of the frequencies of each type of feature. Let us start with a specific master sequence, and consider the sequence space immediately surrounding this genotype.

For an order-of-magnitude overview, the sequence space will be classified into only 3 types of locations: those with only synonymous changes, those with at least one conservative change but no non-conservative changes, and those with one or more non-conservative changes. This classification makes the simplifying assumption that the effect of each of multiple mutations on viral fitness is independent of each other, and that on average, synonymous mutations have the least impact on viral fitness, and non-conservative mutations have the greatest impact on fitness. This might not be generally true. It is possible that the behavior of different combinations of mutations will depend on the specific mutations and perhaps also the specific genetic locus involved. Unfortunately, the frequency of these kinds of non-independence is not known, so it is not yet possible to account for them in the derivation here.

It can be shown that the frequency of locations in error class E that have

synonymous changes only = Y^E

1 or more conservative changes but no non-conservative changes = (Y + C)^E - Y^E

1 or more non-conservative changes = 1 - (Y + C)^E

where

Y = frequency of synonymous changes among all possible point mutations

C = frequency of 'conservative' changes

R = frequency of non-conservative changes = 1 - Y - C
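Under the independence assumption stated above, these frequencies can be tabulated directly. The following sketch assumes that a location with only synonymous changes has frequency Y^E, that a location with no non-conservative change has frequency (Y + C)^E, and that the remainder carry at least one non-conservative change. The function name is invented for illustration.

```python
def location_frequencies(errors, y=0.23, c=0.23):
    """Return (synonymous-only, conservative, non-conservative) frequencies
    for error class `errors`, assuming each mutation acts independently."""
    synonymous_only = y ** errors
    no_nonconservative = (y + c) ** errors
    conservative = no_nonconservative - synonymous_only
    nonconservative = 1.0 - no_nonconservative
    return synonymous_only, conservative, nonconservative

for errors in range(1, 7):
    syn, cons, noncons = location_frequencies(errors)
    print(f"E={errors}: synonymous={syn:.3g} conservative={cons:.3g} "
          f"non-conservative={noncons:.3g}")
```

Calling `location_frequencies(errors, c=0.08)` gives the corresponding values for the lower estimate of C.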

Figure 1 shows the frequency of each type of location in the local sequence space with Y = 0.23, C = 0.23, R = 0.54.

The overlay shows the results when C is decreased to 0.08 (the average amino acid residue can be conservatively replaced by only 2 instead of 6 other residues). (You can download the spreadsheet used to obtain the results shown here, to look at other values of Y, C, and R.)

The number of each type of location in the local sequence space is also plotted, assuming a genome length of 11703 and a mutation rate of 10^-4.

[Figure 1. Sites in the local sequence space]

With either value of C,

As expected, when C is decreased, the frequency of conservative locations decreases, by 3- to 60-fold over the range of E = 1 to 10. Otherwise, the overall picture of the local sequence space remains much the same.

Coverage of local sequence space by real populations

With a crude picture of what the local sequence space and adaptive landscape look like, let us now consider real populations that, because of their finite sizes, might sample only a random subset of the possible sites in the local sequence space. Table 2 shows the number of each type of change for populations of 10^9 and 10^17.

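Table 2. Coverage of local sequence space by different population sizes

| Error class | 10^9 population: Synonymous | 10^9 population: Conservative | 10^9 population: Non-conservative | 10^17 population: Synonymous | 10^17 population: Conservative | 10^17 population: Non-conservative |
|---|---|---|---|---|---|---|
| 1 | 8.2 x 10^3 | 8.1 x 10^3 | 1.9 x 10^4 | 8.2 x 10^3 | 8.1 x 10^3 | 1.9 x 10^4 |
| 2 | 1.2 x 10^7 | 3.4 x 10^7 | 1.7 x 10^8 | 3.4 x 10^7 | 9.9 x 10^7 | 4.8 x 10^8 |
| 3 | 1.1 x 10^6 | 7.2 x 10^6 | 7.5 x 10^7 | 9.3 x 10^10 | 6.3 x 10^11 | 6.5 x 10^12 |
| 4 | 7.3 x 10^4 | 1.1 x 10^6 | 2.3 x 10^7 | 7.3 x 10^12 | 1.1 x 10^14 | 2.3 x 10^15 |
| 5 | 4.0 x 10^3 | 1.2 x 10^5 | 5.6 x 10^6 | 4.0 x 10^11 | 1.2 x 10^13 | 5.6 x 10^14 |
| 6 | 1.8 x 10^2 | 1.1 x 10^4 | 1.1 x 10^6 | 1.8 x 10^10 | 1.1 x 10^12 | 1.1 x 10^14 |

Assuming Y = 0.23, C = 0.23, R = 0.54; genome length = 11703, mutation rate = 10^-4. (You can also download the spreadsheet used to obtain the results shown here, to look at other population sizes and mutation rates.)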

Traversing the adaptive landscape

As the exact topography is unknown, it is only possible to render a qualitative description of how populations might traverse the adaptive landscape.

The master sequence occupies some point in the sequence space, surrounded by a large number of mutants. Tables 1 and 2 show that the sampling of sequence space adjacent to the master sequence can be quite dense if not saturated. Farther away the sampling density depends on the population size. (Pictures of globular star clusters help to visualize this distribution.) This diverse collection of genotypes can be mapped onto the adaptive landscape. Some of the mutants will be at locations where S_i is 0 or low (the valleys and lowlands of the adaptive landscape). They will likely die out. Others will have intermediate, equal or even higher S_i (the hills, slopes, ridges and mountain peaks) when compared to the master sequence. In the next round of replication, all these will in turn be new master sequences. They will each generate progeny genomes, both non-mutated and mutated, that are even farther away from the original master sequence. Thus, starting from a single master sequence, the quasispecies can potentially grow to cover an ever-increasing swath of sequence space. This may be visualized using the globular star cluster analogy: Imagine that some of the 'non-master' stars in the cluster can generate daughter clusters. In turn, some stars in the daughter clusters can go on to create another generation of globular clusters, and so forth.

Selection focuses the spreading by driving the extended population uphill, towards regions of higher fitness. Remarkably, selection-driven peak climbing by RNA virus populations can go through a local region of low fitness, provided that it is not too low. Figure 2 shows an example of this.

Figure 2. Peak climbing via paths of lower fitness

[Figure 2. Traversing low fitness regions to reach a high fitness region]

Suppose the starting genotype has the sequence CC and a relative fitness of 4 (gray bar), and that the sequence UU has a higher fitness of 10 (blue bar).

Two mutations are required for CC to mutate directly to UU. This occurs at some low probability, indicated by the thin red arrow.

UC (yellow bar) and CU (green bar) are 1-error mutants of the CC sequence, with fitness of 6 and 1 respectively. Both 1-error mutants will assuredly be present in a population of 10^9 progeny genomes. They are very likely to be present in populations as small as 10^6. (See calculations that support this.)

Mutation to UC moves up the fitness slope. The UC mutant in turn generates the UU mutant. Thus the virus can move upslope with ease.

Although mutation to the CU sequence moves down the fitness slope, it too will generate the UU sequence, provided that it can produce a large enough pool of progeny genomes. Thus, the virus can move up the fitness slope via intermediate, lower fitness paths.
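A toy calculation can make the routes in Figure 2 concrete. The sketch below is a deterministic mutation-selection iteration over the four genotypes only. The per-site mutation probability (1e-4 for the specific C-to-U or U-to-C change), the number of passages, and the function names are assumptions chosen for illustration, and sampling effects in finite populations are ignored.

```python
FITNESS = {"CC": 4.0, "UC": 6.0, "CU": 1.0, "UU": 10.0}  # relative fitness, Figure 2
SITE_MUTATION = 1e-4  # assumed probability of the specific C<->U change per site

def mutation_probability(parent, child, mu=SITE_MUTATION):
    """Probability that `parent` yields `child`, treating the two sites independently."""
    prob = 1.0
    for a, b in zip(parent, child):
        prob *= mu if a != b else 1.0 - mu
    return prob

def passage(abundance, fitness=FITNESS):
    """One round: replicate in proportion to fitness, mutate, then normalise."""
    progeny = {g: 0.0 for g in fitness}
    for parent, amount in abundance.items():
        for child in fitness:
            progeny[child] += amount * fitness[parent] * mutation_probability(parent, child)
    total = sum(progeny.values())
    return {g: n / total for g, n in progeny.items()}

def run(passages, fitness=FITNESS):
    """Start from pure CC and return relative abundances after `passages` rounds."""
    abundance = {g: 0.0 for g in fitness}
    abundance["CC"] = 1.0
    for _ in range(passages):
        abundance = passage(abundance, fitness)
    return abundance

print(run(60))
# With the uphill intermediate made inviable, UU still takes over in this model.
print(run(60, {**FITNESS, "UC": 0.0}))
```

In this model UU is also reached by the rare direct double mutation, so the second call does not isolate the CU route on its own; it only shows that losing the uphill intermediate does not prevent the population from reaching the peak.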

It should be apparent that RNA viruses have the potential to adapt very rapidly to new environmental conditions. This may be the reason that the majority of viruses known to man are RNA viruses, and correspondingly, the majority of viruses pathogenic to humans, animals and plants are RNA viruses.

As alluded to above, it also demands that we design our antiviral drugs with care. Not any target will do. A particular target might be essential for virus growth, but if it is sufficiently mutable, viruses resistant to the drug will readily arise.

Environmental change

The topography of the adaptive landscape depends on the precise environmental conditions. When the environment changes, the topography will also change. The ability to adapt rapidly is especially advantageous, as the environment can change rapidly for viruses.

The course of selection depends both on the type and the tempo of environmental change. Under constant conditions and for reasonably large population sizes (so that stochastic events can be ignored), a virus with a higher fitness will ultimately out-compete a virus with lower fitness, even if the fitness difference is small. In contrast, as the fitness of the viruses may be different in different environments, the time available for competition and, thus, the extent of selection is limited by how long a particular set of environmental conditions persists before changes occur.

As a first approximation, ignoring mutation and sampling due to finite population sizes, the composition of a population of progeny genomes produced during each passage (round of infection) can be obtained by multiplying the abundance of each parental genotype by its fitness, defined as the number of progeny produced by that genotype under a given set of environmental conditions.

Suppose that a particular mutant has a fitness of S relative to the master sequence, and its initial abundance is A-fold the abundance of the master sequence.

After 1 passage, its abundance is A * S. In parallel, the master sequence has a new abundance of 1 * 1 = 1.

After P passages, the abundance of the mutant will be A * S^P, provided that the environment remains constant so that S stays the same. In the meantime, the abundance of the master sequence becomes 1 * 1^P = 1.

Thus, for the mutant to become F-fold as abundant as the master sequence after P passages

A * S^P = F * 1     [Eqn 1]

i.e., the number of passages needed for a mutant with fitness S to go from an initial abundance of A to a final abundance of F is given by:

P = (log F - log A) / log S
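As a quick numerical check of this relationship, the following sketch evaluates the number of passages for a few fitness values. It assumes base-10 logarithms (any base gives the same ratio) and uses a 300,000-fold target as the default; the function name and the sample fitness values are chosen only for illustration.

```python
from math import log10

def passages_needed(fitness, initial=1.0, final=300_000.0):
    """Passages P satisfying initial * fitness**P = final (Eqn 1)."""
    if fitness <= 0 or fitness == 1:
        raise ValueError("fitness must be positive and different from 1")
    return (log10(final) - log10(initial)) / log10(fitness)

for s in (1.01, 1.1, 2, 10, 100):
    print(f"S={s}: about {passages_needed(s):.1f} passages")
```

For a fitness below 1 and a target above the starting abundance the result is negative; its magnitude is the number of passages for a decline of the same fold.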

This is plotted in Figure 3.
Figure 3. Passages needed for a 300,000-fold change (a) in abundance, for mutant fitness of 0.01 to 100.

[Figure 3. Passages needed for 3E5-fold abundance change]

(a) The choice of F = 300,000-fold change in abundance is because the abundance of any point mutant in a population of 10^9 progeny genomes is 300,000-fold less than that of the master sequence (Table 1).

Eqn 1 may be rearranged to S = 10^((log F - log A)/P) for the mutant; the reciprocal, 10^((log A - log F)/P), is the fitness of the master sequence relative to the mutant. This gives the minimal fitness difference required to achieve the same 300,000-fold abundance change for different durations of environmental constancy (Figure 4):

Figure 4. Fitness required for the master sequence to decrease in abundance by 300,000-fold within a defined number of passages

[Figure 4. Fitness needed for 3E5-fold change in specific number of passages]

For example, for the master sequence to decrease by 300,000-fold in 10 passages, its fitness must be 0.28 relative to that of its competitor. Equivalently, the master sequence will decrease in abundance by 300,000-fold in 10 passages if it has to compete with a mutant that is 3.6-fold more fit.

Similarly, with 100 or 1000 passages available for selection to operate, the master sequence will decrease in abundance by 300,000-fold if its fitness relative to the competitor is 0.88 or 0.987, respectively (i.e., the competitor is 1.14-fold or 1.01-fold more fit).
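These values can be recomputed from Eqn 1. The sketch below assumes that the 300,000-fold change refers to the abundance of the mutant relative to the master sequence, so that the mutant's relative fitness S satisfies S^P = 300,000 and the master's relative fitness is 1/S. The function names are hypothetical.

```python
from math import log10

FOLD_CHANGE = 300_000.0  # fold change in relative abundance used in Figures 3 and 4

def mutant_fitness_required(passages, fold=FOLD_CHANGE):
    """Fitness S of the mutant relative to the master, from S**passages = fold."""
    return 10 ** (log10(fold) / passages)

def master_fitness_required(passages, fold=FOLD_CHANGE):
    """Fitness of the master relative to the mutant: the reciprocal of S."""
    return 1.0 / mutant_fitness_required(passages, fold)

for passages in (10, 100, 1000):
    print(f"{passages} passages: master={master_fitness_required(passages):.3f}, "
          f"competitor={mutant_fitness_required(passages):.2f}")
```

The longer the environment stays constant, the closer to 1 the required fitness ratio becomes, which is why even very small fitness differences matter over many passages.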

It should be remembered that the above assumes infinite populations. With real populations, the population size may decrease to small numbers (e.g., a small infecting inoculum), such that rare sequences may be lost. Alternatively, these kinds of bottlenecks may by chance sample only the non-majority sequence. This needs to be accounted for in considering the competition between the virus variants in the real world.

Maintenance of the 'wildtype' sequence

The forces described above lead to the following picture of adaptation of a virus population exposed to a new environment:

  1. When the environment changes (e.g., during adaptation of a 'wild' virus to tissue culture conditions), genotypes hitherto most fit and most abundant might have decreased fitnesses under the new conditions.

  2. As usual, large numbers of mutant genotypes are created during each round of replication. Some might be more fit than the master sequence, and, with time, might out-compete it.

  3. Any new master sequence that emerges is in turn tested against the mutants that it produces. The set of mutants sampled during this process, though extremely diverse, is finite. Their Hamming distance from the master sequence is relatively small.

  4. If the environment stays constant, ever more fit genotypes might be sampled, and successively come to dominate the population. The successive master sequences are in this sense increasingly optimized.

Thus, if we sample a virus population that has enjoyed a long period of constant environmental conditions, it would not be surprising that the 'wildtype' sequence will be repeatedly isolated, even over long periods of time, despite the high mutation rates.

Bottlenecks, random drift, and Muller's ratchet

See: Chao, 1990; Duarte, et al., 1994

The population dynamics are different when the population goes through a bottleneck, e.g., the sampling of a few individual genomes during plaque purification or during virus transmission from host to host. It is possible that the sample contains only mutants. The probability that a sample of N genomes contains only mutant genomes is simply m^N, where m is the relative abundance of the mutants.

Even if m is as high as 69% (Table 1; assuming genome length = 11703 and mutation rate = 10^-4), the probability that a random sample of 2 genomes will be all mutants is only ca. 50%. The probability that a sample of 12 contains only mutants is 1%.

In fact, m is likely to be < 69%, because some of the mutants will be inviable, and thus are 'invisible' (e.g., they do not form plaques). The fraction of inviables in each error class likely increases as the number of errors increases.

Consider the E=1 class of mutants. About 23% have synonymous changes, and another 23% have "conservative" changes (when the mutation rate = 10^-4). Most or many of these are expected to be viable. The remaining mutants have non-conservative changes (Table 2). We expect some unknown fraction of these to be viable. If we assume that most of them are inviable, then the E=1 class of mutants would have m_1 close to 46%.

Similarly, for the E=2 class, 22% of the mutants have synonymous or conservative changes, and m_2 is probably close to 22%, if we assume that most non-conservative replacements are inviable.

After adjusting for the estimated frequency of viable mutants in each error class, the frequency of viable mutants (of any error class) among all viables (mutant or wildtype) is actually ca. 42% (when the mutation rate = 10^-4). With this estimate of m, a random sample of 2 viable genomes has an 18% probability of being all mutant. Thus, even though enormous numbers of mutants are generated anew with each round of replication, the master sequence is (statistically) rather resistant to replacement by some mutant as long as the bottleneck is greater than a few genomes.

Nevertheless, if m is 42%, the probability of picking a mutant during a series of plaque-to-plaque transfers is quite good: this can be modeled with the geometric distribution, that predicts that a mutant will be first encountered after an average of 1 / 0.42 = 2.4 (standard deviation = 1.8), or 2 to 3 transfers.
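The bottleneck arithmetic in this section is easy to reproduce. The sketch below assumes that genomes are sampled independently, so that an all-mutant sample has probability m^N, and that each plaque-to-plaque transfer picks a single genome, so that the wait for the first mutant follows a geometric distribution. The function names are invented for this example.

```python
from math import sqrt

def all_mutant_probability(mutant_fraction, sample_size):
    """Probability that every genome in a random sample is a mutant (m**N)."""
    return mutant_fraction ** sample_size

def transfers_until_mutant(mutant_fraction):
    """Mean and standard deviation of the geometric waiting time to the first mutant."""
    mean = 1.0 / mutant_fraction
    std = sqrt(1.0 - mutant_fraction) / mutant_fraction
    return mean, std

for m in (0.69, 0.42):
    for n in (1, 2, 12):
        print(f"m={m} N={n}: P(all mutant)={all_mutant_probability(m, n):.3f}")

mean, std = transfers_until_mutant(0.42)
print(f"first mutant after {mean:.1f} +/- {std:.1f} single-genome transfers")
```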

As advantageous mutations are generally assumed to be rare relative to deleterious mutations, mutant viruses obtained from a series of bottleneck events are likely to have decreased fitnesses.

Recovery of fitness and the re-emergence of the 'wildtype' virus

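Suppose that sampling did pick out a 1-error mutant. Among the very many possible kinds of mutant progeny genomes that it might produce, there is only 1 way to revert precisely to the original 'wildtype' sequence. The probability of this specific point mutation is

(μ / 3) * (1 - μ)^(L - 1)

where μ is the mutation rate per nucleotide, L is the genome length, and each of the three alternative bases is assumed to be equally likely.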

Thus, once the wildtype is lost, it is not likely to reappear, unless the population is allowed to grow to large numbers. The need for large numbers is due to the probabilistic nature of mutant sampling: it can be met either by growth to large numbers at any instant, or by low levels of growth over an extended period of time. For populations of 10^9 genomes, the wildtype revertant, generated by a single mutation, is expected to be present at an abundance of about 10^4 copies among a total of 10^9 progeny genomes (when the mutation rate = 10^-4, and the genome size L = 11703 nt). So, if the 1-error mutant grows well enough, then it will be able to expand in numbers when the opportunity arises, to produce relatively large populations of progeny genomes, and to produce the wildtype revertant with high probability.

On the other hand, if the mutant does not grow well (such that the population remains small), there is only a low probability (e.g., a 1-in-10 chance when the population size is 10^4, mutation rate = 10^-4, and L = 11703) that the wildtype sequence will be generated during any replication cycle.

If and when the wildtype virus is re-generated, it will initially be at low abundance. If the wildtype virus is more fit than the 1-error mutant, then, as the wildtype virus increases in relative abundance, the average fitness of the population will increase in parallel.

However, this does not guarantee that the wildtype virus will ultimately dominate the population. It depends on the fitness of the wildtype relative to the fitness of the currently dominant variant.

In addition, as the wildtype revertant is only one of very many possible mutants, there is a chance that some non-wildtype but fitness-improving (second-site suppressor) mutation might be generated. The fate of these mutants is similar to that of the wildtype revertant. Whether their relative abundances will increase depends on their fitness relative to that of the other viruses in the population, and on the constancy of the environment.

Indeed, Burch and Chao [1999] showed that fitness recovery in small populations is usually by small steps, confirming Fisher's prediction that advantageous mutations of small effect should be more common than advantageous mutations of large effect.

The significance of non-wildtype, second-site suppressors is that mutants with 2 or more errors revert precisely to the wildtype sequence with a probability of

(μ / 3)^E * (1 - μ)^(L - E)

where E is the number of mutations, μ is the mutation rate per nucleotide, and L is the genome length. For E=2, the probability that a wildtype sequence is created by mutation and is present in a large population of 10^9 is only about 1-in-3, and it is very unlikely that it would be present in a population of 10^4. Mutants with E>2 are even less likely to generate wildtype revertants.
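The reversion estimates quoted in this section can be reproduced with a short calculation. The sketch below assumes that each site mutates independently at the stated rate, that each of the three alternative bases is equally likely, and that all other sites must be copied faithfully. An expected copy number well below 1 can be read as the approximate probability that the revertant is present at all. The function names are illustrative only.

```python
GENOME_LENGTH = 11703   # nucleotides
MUTATION_RATE = 1e-4    # per nucleotide per replication

def specific_mutant_probability(errors, length=GENOME_LENGTH, rate=MUTATION_RATE):
    """Chance that one progeny genome carries exactly `errors` specified changes
    and no others, assuming each of the 3 alternative bases is equally likely."""
    return (rate / 3) ** errors * (1 - rate) ** (length - errors)

def expected_copies(population, errors):
    """Expected number of copies of that one specific sequence among the progeny."""
    return population * specific_mutant_probability(errors)

for errors in (1, 2, 3):
    for population in (1e4, 1e9):
        print(f"E={errors} N={population:.0e}: "
              f"expected copies = {expected_copies(population, errors):.3g}")
```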

Adjacent and Contingent Sequence Space

The interconvertibility property of a quasispecies is encapsulated in the N-dimensional hypercube discussed by Eigen and Biebricher [1988]. The discussion above shows that at some point, the divergence of a virus lineage away from the original wildtype sequence becomes large enough that mutation at each replication event is by itself very unlikely to re-generate the wildtype sequence. While it is conceivable that a sequence might proceed through an evolutionary pathway involving, e.g., a point change at a time, to change to any other sequence, by a combination of random and/or selective events, this is true only for infinite populations, given infinite time. Clearly, because of finite time and the vastness of sequence space, there are limits to how much interconvertibility there is in reality: e.g., it is quite unlikely for dengue virus to mutate to become yellow fever virus.

It is therefore useful to distinguish between the region of the adjacent sequence space that is accessible to a virus during each and every replication cycle and those locations that are accessible only through some series of replication plus additional, contingent events. The former, 'adjacent' space is the focus of these pages, and underlies much of the discussion here. The latter, 'contingent' space depends on the adjacent space available at each step, but adds a temporal, historical dimension to the problem. The contingent nature of these trajectories makes it much too complicated to contemplate here. Suffice it to say that, again, because of the vastness of sequence space, the several trajectories will likely diverge and never cross paths.

In summary, most of the interconversion that occurs in an RNA virus lineage during evolution will be among genomes within a Hamming distance of 1. However, this level of interconversion may well be sufficient for the wildtype genome to be maintained whenever the founder population is greater than about 10 genomes. Even if the wildtype sequence is lost, it is possible for it to be repeatedly re-created in relatively large populations (or in small populations, given enough time), and for it to re-dominate the population. Only when the bottleneck is 1 or a few genomes is it likely that only mutants are present in the founder population. While some of these mutant populations might have decreased fitness, they may grow well enough to produce progeny with further mutations. Continued mutation and selection will drive the populations to gain in fitness, and to establish distinct lineages in their own right.


