Mutation and RNA Virus Populations
Introduction
It is now widely appreciated that retroviruses and RNA viruses have very high mutation rates. It follows that these viruses should generate many mutants, including drug-resistant mutants. What is the magnitude of the problem? A quantitative treatment is essential for addressing this and other questions about how these viruses evolve.
Mutation During Replication
Suppose we start with a single 'wild-type' sequence, called the 'master sequence' by Eigen [Eigen and Biebricher, 1988].
During replication of this sequence, mutations will occur.
 Assume that the mutation rate, the probability that an incorrect base is inserted at any nucleotide position, is µ.
 For convenience below, we will assume that mutation to each of the 3 mutant bases is equally probable, with a rate of µ/3.
 At any nucleotide position, there is a probability of µ/3 for inserting a particular incorrect base, and a probability of (1 - µ) for inserting the correct base during replication.
 We will use E to designate the number of mutations in a progeny genome. The wild-type or master sequence has E=0, single-hit mutants have E=1, etc.
 E is the Hamming distance, the number of differences between a copy and the original item of information, regardless of the nature of the change, or where it occurs.
 Assume that the genome length is L nt.
 Conveniently, site-specific low mutation rates can be compensated for, e.g., by decreasing L, if necessary. As L is usually large (>>10^{3}), the few unusual sites that might exist can simply be ignored for the following calculations.
 A population of a virus with a long genome and relatively low mutation rates is similar to a population of a virus with a shorter genome but with higher mutation rates. Thus, one can choose some arbitrary L for the examples shown below, and generalize the results for other L's by shifting the graphs left or right along the X-axis. Alternatively, you can download the spreadsheet used to obtain the results shown here, to look at other genome sizes.
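The trade-off between genome length and mutation rate can be checked numerically. The sketch below assumes that the number of errors per genome follows a binomial distribution over L independent positions, and compares two hypothetical genomes that share the same product µ x L. The shorter genome length (5000 nt) and the function name are illustrative choices, not values from the text.

```python
import math

def p_error_class(mu, length, n_errors):
    """Binomial probability of exactly n_errors miscopied positions."""
    return (math.comb(length, n_errors)
            * mu ** n_errors
            * (1.0 - mu) ** (length - n_errors))

# Two hypothetical viruses with the same product mu * L (about 1.17).
long_genome = (11703, 1e-4)
short_genome = (5000, 1e-4 * 11703 / 5000)

for n_errors in range(6):
    p_long = p_error_class(long_genome[1], long_genome[0], n_errors)
    p_short = p_error_class(short_genome[1], short_genome[0], n_errors)
    print(f"E={n_errors}: long={p_long:.4f} short={p_short:.4f}")
```

Under these assumptions the two columns should agree closely, which is why results for one L can be reused for another by rescaling µ.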
Some nomenclature and symbols
 Error class. It is convenient to consider all sequences with the same number of mutations as a group. (The mutations may be anywhere in the genome.) Such a group is called an error class. For example, all mutants with E mutations belong to the E-error class. They are also referred to as mutants that have a Hamming distance of E from the reference, master sequence.
 Sampling density = Fraction of the total possible sequences that are actually present in a population. When all possible sequences are present in the population, the sampling density is 1, and the sampling is said to be saturated.
 Exponents: These pages use superscripts (e.g. 2 x 10^{3}) and subscripts (L_{i}), that are not supported by older browsers. If you see, e.g., 2 x 103, it signifies 2 x 10 to the 3rd power = 2000.
The probability of producing a perfect copy of the master sequence
For a perfect copy, the polymerase must make no errors: it inserts the correct base, with a probability of (1 - µ), at each of the L nt. The probability of this happening, i.e., the relative abundance of the master sequence in the progeny population, is:
p(E = 0) = (1 - µ)^{L}
Many families of RNA viruses have segmented genomes (e.g., Arenaviruses, Bunyaviruses, Orthomyxoviruses, Reoviruses). The probability of replicating a wild-type copy of a virus with a segmented genome is the product of the probabilities of making a wild-type copy of each of the segments:
p(E = 0) = (1 - µ)^{L_{1}} x (1 - µ)^{L_{2}} x ... = (1 - µ)^{L_{1} + L_{2} + ...}
where L_{i} is the length of the i-th genome segment. Thus viruses with segmented genomes produce wild-type progeny genomes with the same probability as an unsegmented virus whose genome length equals the sum of the genome segment lengths of the segmented virus.
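A minimal numerical sketch of the perfect-copy probability, assuming independent errors at each position with probability µ. The values µ = 10^{-4} and L = 11703 are illustrative, and the three segment lengths are hypothetical, chosen only so that they sum to L.

```python
import math

def p_perfect_copy(mu, length):
    """Probability that all `length` positions are copied correctly."""
    return (1.0 - mu) ** length

def p_perfect_copy_segmented(mu, segment_lengths):
    """Product of the per-segment probabilities of a perfect copy."""
    return math.prod(p_perfect_copy(mu, seg) for seg in segment_lengths)

MU = 1e-4
L = 11703
segments = [5000, 4000, 2703]  # hypothetical split that sums to L

print(p_perfect_copy(MU, L))
print(p_perfect_copy_segmented(MU, segments))
```

The two printed values should agree to within floating-point rounding, illustrating that segmentation by itself does not change the probability of a wild-type copy.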
The probability of producing a copy with 1 specific mutation
At nucleotide position i, the polymerase inserts a particular incorrect base, with a probability of µ/3; and it inserts the correct base, with a probability of (1 - µ), at each of the remaining (L - 1) positions:
p(E = 1, at position i) = (µ/3) (1 - µ)^{L-1}
Note that a different mutant, with a single mutation at a different site, has an identical probability.
p(E = 1, at position j) = (µ/3) (1 - µ)^{L-1}
The number of different sequences, all with the same number of mutations, will be considered below.
The probability of producing a copy with E specific mutations
At each of E specific positions, the polymerase inserts a specific incorrect base, each with a probability of µ/3; and it inserts the correct base with a probability of (1 - µ) at each of the remaining (L - E) positions. The probability of this happening, i.e., the relative abundance of this specific sequence, is:
[Eqn 1]
p(sequence with E mutations, to E specific bases, at positions i, j, k,...) = (µ/3)^{E} (1 - µ)^{L-E}
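The probability of one specific sequence can be sketched as a small function. It assumes, as described above, a factor of µ/3 at each of the E mutated positions and a factor of (1 - µ) at each of the other L - E positions. The function name and the example values are illustrative.

```python
def p_specific_sequence(mu, length, n_errors):
    """Probability of one particular sequence carrying n_errors specified mutations."""
    return (mu / 3.0) ** n_errors * (1.0 - mu) ** (length - n_errors)

MU = 1e-4
L = 11703
for n_errors in range(4):
    print(n_errors, p_specific_sequence(MU, L, n_errors))
```

Each additional specified mutation multiplies the probability by (µ/3) / (1 - µ), so specific multi-error sequences become rare very quickly.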
Effect of mutations on the population of progeny genomes
The relative abundance of each sequence in any population is given by the equations above. These relative abundances multiplied by the population size give the actual abundance of each sequence in the population.
We will look at populations of:
 10^{4} progeny genomes, e.g., those produced by a single master sequence during a single infection cycle;
 10^{9} progeny genomes, e.g., those produced in a culture infected by ca. 10^{5} master sequences, each producing 10^{4} progeny genomes.
To reiterate, the graphs shown below plot the probabilities, as calculated above, multiplied by the population size (10^{4} or 10^{9}). To obtain the corresponding values for some other population size, just divide the values in either graph by its population size, and multiply by the population size desired. You can also download the spreadsheet used to obtain the results shown here, to look at other population sizes.
Figure 1. The abundance of individual sequences in the progeny population
The results are based on µ = 10^{-5} to 10^{-3}, and L = 11703 (the genome length of Sindbis virus, an Alphavirus in the Togaviridae family).
Abundance of individual sequences with 0, 1 or 2 errors in a population of 10^{4} progeny genomes. 
Abundance of individual sequences with 0 to 3 errors in a population of 10^{9} progeny genomes. 
For the 10^{4} population:
 The abundance of the master sequence (E=0) decreases with increasing mutation rates. It is the most abundant individual sequence over the range of mutation rates considered here, with some 1 x 10^{4} to 3 x 10^{3} copies present when mutation rates are in the range of 10^{-5} to 10^{-4}.
 Note that the remaining sequences in the population are all mutants.
At higher mutation rates, the number of copies of the master sequence decreases rapidly. At a mutation rate of 10^{-3}, there is only about a 1-in-10 chance that the master sequence is present in the population.
 If misincorporation occurs about once every 1000 nt, the likelihood of making no errors when copying a 10000 nt genome is not very high.
 At a mutation rate of 10^{-3}, essentially all sequences in the population are mutants.
 Any individual 1-error sequence has at most a 1-in-10 chance of being present in the population, i.e., the sampling density of the 1-error class of mutants is 0.1 or less.
 Any individual 2-error sequence is extremely unlikely to be present in the population.
 For all sequences with E>0, the abundance shows a maximum.
 At lower mutation rates, fewer mutant sequences are made.
 At higher mutation rates, sequences with more mutations are made at the expense of those with fewer changes.
For the 10^{9} population:
 As with the smaller population, the abundance of the master sequence (E=0) decreases with increasing µ, and it is the most abundant individual over the range of mutation rates considered here.
 There are up to 10^{4} copies of any arbitrary 1-error sequence.
 Corollary: All possible 1-error sequences are present, each with an abundance of up to 10^{4} copies! Coffin [1995] reached an identical conclusion, though using a different approach.
 Thus, increasing the population size from 10^{4} to 10^{9} dramatically increases the abundance and sampling density of the mutants.
This is an example of the obvious: as more virus genomes are produced, more kinds of mutants are generated, and they will be more abundant. It should be equally obvious that the earlier the intervention, the less likely it is that resistance to antivirals will develop.
 Imagine that a virus can mutate to drug resistance with a point mutation. The drug-resistant variant will be produced and is almost certainly present in any population of more than a million genomes, even before drug treatment. This prediction has been verified for drug-resistant mutants of HIV, in that resistant viruses can be found in populations that have never encountered the drug [e.g., Nájera, et al., 1995; Tucker, et al., 1998].
 The probability for the presence of any specific 2-error sequence in the 10^{9} population is 10^{-3} to 10^{-1}, with a maximum of 0.43.
 However, if the population size had been 10^{10}, there would be several copies of any specific 2-error sequence in the population, for mutation rates from 4 x 10^{-5} to 4 x 10^{-4}.
 The probability for the presence of any individual mutant with 3 or more errors is low (<10^{-4}).
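The abundances discussed for both population sizes can be approximated with a short sketch. It assumes the expected number of copies of a specific sequence is the population size multiplied by (µ/3)^E (1 - µ)^(L - E), with L = 11703. The function name is illustrative. Setting the derivative of this expression with respect to µ to zero places the maximum at µ = E / L.

```python
def expected_copies(mu, length, n_errors, population):
    """Expected copies of one specific n_errors-mutant in a population."""
    p = (mu / 3.0) ** n_errors * (1.0 - mu) ** (length - n_errors)
    return population * p

L = 11703
for population in (1e4, 1e9):
    for mu in (1e-5, 1e-4, 1e-3):
        row = [expected_copies(mu, L, e, population) for e in range(4)]
        cells = " ".join(f"E={e}:{x:.3g}" for e, x in enumerate(row))
        print(f"N={population:.0e} mu={mu:.0e} {cells}")

# For a given E > 0, the abundance peaks at mu = E / L.
mu_peak = 2 / L
print(expected_copies(mu_peak, L, 2, 1e9))
```

The last line evaluates the peak for a specific 2-error sequence in the 10^{9} population, which should come out somewhat below one half, in line with the maximum quoted above.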
The number of mutants in each error class
The derivation above applies to individual, specific sequences (e.g., a mutant with a G to C change at nucleotide position 3456).
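It is important to realize that there are many different sequences, all with the same number of mutations. The E mutated positions can be chosen from the L positions in L! / (E! (L - E)!) ways, and each mutated position can carry any of the 3 incorrect bases, so the number of different sequences in the E-error class is:
[Eqn 2]
N(E) = [L! / (E! (L - E)!)] x 3^{E}
For example, with L = 11703 there are 3 x 11703 = 35109 possible 1-error sequences.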
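The size of an error class can be computed directly. This sketch assumes the count is the number of ways to choose E positions out of L, multiplied by 3 alternative bases at each chosen position. The function name is illustrative.

```python
import math

def n_sequences_in_class(length, n_errors):
    """Number of distinct sequences at Hamming distance n_errors from the master."""
    return math.comb(length, n_errors) * 3 ** n_errors

L = 11703
for n_errors in range(4):
    print(n_errors, n_sequences_in_class(L, n_errors))
```

For L = 11703 this gives 35109 one-error sequences and roughly 620 million two-error sequences, so the number of possible mutants grows very steeply with E.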
The abundance of each error class in the population
The relative abundance of each error class is the product of the abundance of a specific sequence [Eqn 1] and the number of possible sequences [Eqn 2] in that error class:
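[Eqn 3]
p(E) = [L! / (E! (L - E)!)] x 3^{E} x (µ/3)^{E} (1 - µ)^{L-E} = [L! / (E! (L - E)!)] µ^{E} (1 - µ)^{L-E}
i.e., the binomial distribution.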
Thus, even though the probability of a specific sequence, e.g., one with 2 mutations, is very low, there are very many of them. So, they are quite abundant when considered as a group. Illustrative data are shown in Figure 2.
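The abundance of each error class can be tabulated with a few lines of code. The sketch assumes the binomial form, i.e., the number of ways to place E errors among L positions multiplied by µ^E (1 - µ)^(L - E), and uses L = 11703 and µ = 10^{-4} as example values. The function name is illustrative.

```python
import math

def p_error_class(mu, length, n_errors):
    """Binomial probability that a progeny genome carries exactly n_errors mutations."""
    return (math.comb(length, n_errors)
            * mu ** n_errors
            * (1.0 - mu) ** (length - n_errors))

L, MU = 11703, 1e-4
for population in (1e4, 1e9):
    print(f"population {population:.0e}")
    for n_errors in range(11):
        genomes = population * p_error_class(MU, L, n_errors)
        print(f"  E={n_errors:2d}: {genomes:.3g} genomes")
```

With these values the 1-error class as a whole should be slightly more abundant than the master sequence, even though any single 1-error sequence is far rarer.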
(Reminder: the graphs below plot the probabilities multiplied by 10^{4} or 10^{9}. To get data for a different population size, just divide the values in the graphs by 10^{4} or 10^{9}, and multiply by the new population size. You can also download the spreadsheet used to obtain the results shown here, to look at other population sizes.)
Figure 2. Abundance of each error class in a population.
The results were calculated using a genome length of L = 11703 nt, and µ = 10^{-5} to 10^{-3}.
Population of 10^{4} progeny genomes. 
Population of 10^{9} progeny genomes. 
Considering the population of 10^{4}, e.g., the progeny from a single round of infection by a single master sequence:
 The master sequence (E=0) goes from more to less abundant than the mutants as a group.
 The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
For example, when µ = 10^{-4}:
 The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes in the progeny from a single round of infection are mutants!
 About 3600 (ca. 10%) of the 35109 possible 1-error sequences are present. The abundance of the 1-error class as a whole (ca. 3.6 x 10^{3} copies) is actually a little higher than the abundance of the master sequence (ca. 3.1 x 10^{3} copies).
 About 2100 of the 2-error sequences, ca. 3 x 10^{-6} of those possible, are present, at ca. 1 copy each.
 About 830 different 3-error mutants are present.
 Mutants with 4, 5, or more errors are present at progressively lower numbers, down to a few copies of the 7-error class of mutants.
 Mutants with 8 or more errors are unlikely to be present.
Similar, but even more dramatic conclusions apply to the larger population:
 The master sequence (E=0) goes from more to less abundant than the mutants as a group.
 The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
For example, when µ = 10^{-4}:
 The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes are mutants!
 About 10^{4} copies of each of the 35109 possible 1-error sequences are present. The abundance of the 1-error mutant class as a whole, about 3.6 x 10^{8} copies, is a little higher than the abundance of the master sequence (ca. 3.1 x 10^{8} copies).
 About one-third of the total possible 2-error sequences, ca. 2 x 10^{8} different mutants, are present, at ca. 1 copy each. Had the population been 10-fold larger, each of the 620 million different 2-error sequences would be present in the population, for mutation rates from 4 x 10^{-5} to 4 x 10^{-4}.
 Almost 10^{8} different 3-error sequences are present. They represent only 0.001% of the possible 3-error sequences.
 Mutants with 4, 5, or more errors are present at progressively lower numbers.
 Remarkably, there are several hundred 10-error mutants in the population. However, only an astronomically small fraction (ca. 10^{-37}) of the possible 10-error sequences is sampled.
 The typical manual or automated sequencing procedure cannot detect bases whose abundance is less than about 20% of the total. Thus the master sequence and the consensus sequence of the population are the same over much of the range of mutation rates considered here.
The sequence diversity of the population consists of the single master sequence, plus all the different mutants in the population.
Diversity of progeny genomes in a population of 10^{4}
The sequence diversity in a population of 10^{4}, e.g., progeny from a single infection, is shown using a linear (left panel) or a log (right panel) scale, with L = 11703, and µ = 10^{-5} to 10^{-3}. (Again, you can download the spreadsheet used to obtain the results shown here, to look at other population sizes.)
Figure 3. Sequence diversity in a population of 10^{4} progeny genomes
 The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
 At mutation rates between 10^{-5} and 10^{-4}:
 Mutants of the 1-error class contribute the bulk of the diversity, with some 1000 to 3600 sequences. Sampling of the 1-error class is not saturated in this range. Only 1 or very few copies of each sequence are present in the population.
 The 2-error class contributes from 60 to 2100 sequences. Only 1 or very few copies of each of these sequences are expected to be present.
 At higher mutation rates, mutants of the 2-, 3-, and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each sequence, when present, is present at a low copy number.
 The sequence diversity of the population, e.g., almost 7000 at a mutation rate of 10^{-4}, is only slightly less than the population size!
Similar conclusions apply when we consider a population of 10^{9}:
Figure 4. Sequence diversity in a population of 10^{9} progeny genomes
 The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
 Figure 1 shows that any of the 1-error sequences is present at 1 to 10^{4} copies each, over the entire range of mutation rates considered. As there are 3.5 x 10^{4} 1-error sequences, their contribution is a constant 3.5 x 10^{4} over the whole range of mutation rates.
 Other error classes contribute variable amounts of diversity (e.g., about 2 x 10^{8} sequences are contributed by the 2-error class at µ = 10^{-4}), because the sampling of these classes is not saturated.
 At mutation rates below ca. 10^{-4}, mutants of the 2-error class contribute the bulk of the diversity. At these mutation rates, about 1% to one-half of the possible 2-error sequences are randomly sampled (see Figure 1). Those that are present occur at low copy numbers.
 At higher mutation rates, mutants of the 3-, 4-, 5-error and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each mutant is present at a low copy number.
 The sequence diversity of the population is only slightly less than the population size!
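One simple way to approximate the diversity figures is sketched below. It is an assumption, not necessarily the method used for the graphs: each error class is taken to contribute the smaller of its expected number of genomes and its number of possible sequences, so a saturated class contributes its full size and an unsaturated class contributes about one sequence per genome. Repeated sampling of the same sequence within an unsaturated class is ignored. The function names and the cutoff `max_errors` are illustrative.

```python
import math

def class_diversity(mu, length, n_errors, population):
    """Approximate count of distinct sequences contributed by one error class."""
    ways = math.comb(length, n_errors)
    class_size = float(ways * 3 ** n_errors)
    copies = population * ways * mu ** n_errors * (1.0 - mu) ** (length - n_errors)
    return min(class_size, copies)

def total_diversity(mu, length, population, max_errors=40):
    """Sum of the per-class estimates for E = 0 .. max_errors."""
    return sum(class_diversity(mu, length, e, population)
               for e in range(max_errors + 1))

L = 11703
for population in (1e4, 1e9):
    for mu in (1e-5, 1e-4, 1e-3):
        print(f"N={population:.0e} mu={mu:.0e} "
              f"diversity={total_diversity(mu, L, population):.3g}")
```

Under this approximation the master sequence always contributes 1, the 1-error class in the 10^{9} population contributes its full 35109 sequences, and the total at µ = 10^{-4} in the 10^{4} population comes out a little under 7000.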
An analogy may be useful, to help visualize the quasispecies
This is a picture of the globular star cluster called M13 (Messier 13), in the Hercules constellation.

We can imagine that each point in regular 3-dimensional space corresponds to a sequence. The stars represent those sequences that are actually present in the population. At the center of the cluster is the master sequence. Immediately surrounding it are sequences with 1 error. Sequences with 2, 3, and more errors are progressively farther out.

The sampling of sequence space by the virus population is locally dense, with very diverse, random, but low-density sampling at farther distances from the master sequence. For example, with L = 11703, µ = 10^{-4}, and a population size of 10^{9}, the space within a Hamming distance of 0 or 1 is saturated, and it is 33% saturated at a Hamming distance of 2.
What about even larger virus populations?
Take HIV as an example: some 30 million humans are infected globally. Before the onset of AIDS, each produces about 10^{9} to 10^{10} progeny genomes per day. This is equivalent to a daily global HIV production of ca. 10^{17} genomes.
With 10^{17} genomes, all of error classes 0 to 3 are saturated, error class 4 is about 4% saturated, and the saturation of error class 5 is ca. 10^{-6}. The saturation of error classes 6 and higher is less than 10^{-10}.
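These saturation values can be approximated with the sketch below. It assumes the same L = 11703 and µ = 10^{-4} as in the earlier example (the text does not restate them here), and treats the saturation of an error class as the expected number of copies of any one specific sequence in that class, capped at 1. The function name is illustrative.

```python
def saturation(mu, length, n_errors, population):
    """Expected copies of one specific n_errors-mutant, capped at 1 (fully sampled)."""
    expected = population * (mu / 3.0) ** n_errors * (1.0 - mu) ** (length - n_errors)
    return min(1.0, expected)

L, MU = 11703, 1e-4
for n_errors in range(7):
    print(n_errors, saturation(MU, L, n_errors, 1e17))

# Uncapped expected copies of one specific 3-error sequence among 1e17 genomes.
print(1e17 * (MU / 3.0) ** 3 * (1.0 - MU) ** (L - 3))
```

The final line gives the uncapped copy number for a specific triple mutant, which under these assumptions is on the order of a thousand per 10^{17} genomes.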

Summary
We have so far limited ourselves to considering only the number and diversity of mutants produced during viral replication. A priori, it is not possible to predict whether the mutants are viable, or what their fitnesses might be.
Nevertheless, we can hazard some reasonable deductions, based on what is known about specific viral mutants. For example, some HIV mutants with multiple mutations are known to be viable, and are multiply drug-resistant. The fact that error class 3 is saturated with viral populations as small as 10^{14} suggests that any specific triple mutant of HIV that is multidrug-resistant is generated about a thousand times each day in the global population.
How selection might act upon the mutants is considered in more detail in the next page.
References
Those that are not available through PubMed of the National Library of Medicine, USA.
 Eigen, M., and C. K. Biebricher. 1988. Sequence space and quasispecies distribution, p. 211-245. In E. Domingo, J. J. Holland, and P. Ahlquist (ed.), RNA Genetics: Variability of RNA Genomes, vol. 3. CRC Press Inc., Boca Raton, FL.
