Mutation and RNA Virus Populations

Introduction

It is now widely appreciated that retroviruses and RNA viruses have very high mutation rates. These viruses should therefore generate many mutants, including drug-resistant mutants. What is the magnitude of the problem? A quantitative treatment is essential for addressing this and other questions about how these viruses evolve.

Mutation During Replication

Suppose we start with a single 'wildtype' sequence, called the 'master sequence' by Eigen [Eigen and Biebricher, 1988].

During replication of this sequence, mutations will occur.

  • Assume that the mutation rate, the probability that an incorrect base is inserted at any nucleotide position, is μ.

    • For convenience below, we will assume that mutation to each of the 3 mutant bases is equally probable, with a rate of μ/3.

  • At any nucleotide position, there is a probability of μ/3 for inserting a particular incorrect base, and a probability of (1-μ) for inserting the correct base during replication.

  • We will use E to designate the number of mutations in a progeny genome. The wildtype or master sequence has E=0, single-hit mutants have E=1, etc.

    • E is the Hamming distance, the number of differences between a copy and the original item of information, regardless of the nature of the change, or where it occurs.

  • Assume that the genome length is L nt.

    • Conveniently, site-specific low mutation rates can be compensated for, e.g., by decreasing L, if necessary. As L is usually large (>>10^3), the few unusual sites that might exist can simply be ignored for the following calculations.

    • A population of a virus with a long genome and relatively low mutation rates is similar to a population of a virus with a shorter genome but with higher mutation rates. Thus, one can choose some arbitrary L for the examples shown below, and generalize the results for other L's by shifting the graphs left or right along the X-axis. Alternatively, the spreadsheet used to obtain the results shown here can be used to look at other genome sizes.

Some nomenclature and symbols

  • Error class. It is convenient to consider all sequences with the same number of mutations as a group. (The mutations may be anywhere in the genome.) Such a group is called an error class. For example, all mutants with E mutations belong to the E-error class. They are also referred to as mutants that have a Hamming distance of E from the reference master sequence.
  • Sampling density = Fraction of the total possible sequences that are actually present in a population. When all possible sequences are present in the population, the sampling density is 1, and the sampling is said to be saturated.
  • Exponents and subscripts: Powers are written as, e.g., 2 x 10^3 (2 x 10 to the 3rd power = 2000), and subscripts as, e.g., L_i. Where a flattened form such as 2 x 103 appears, it also signifies 2 x 10^3.

The probability of producing a perfect copy of the master sequence

To produce a perfect copy, the polymerase must make no errors: it inserts the correct base, with a probability of (1-μ), at each of the L nt. The probability of this happening, i.e., the relative abundance of the master sequence in the progeny population, is:

$$p(E = 0) = (1 - \mu)^{L}$$

Many families of RNA viruses have segmented genomes (e.g., Arenaviruses, Bunyaviruses, Orthomyxoviruses, Reoviruses). The probability of replicating a wildtype copy of a virus with a segmented genome is the product of the probability of making a wildtype copy of each of the segments:

$$p(E = 0) = \prod_{i} (1 - \mu)^{L_i} = (1 - \mu)^{\sum_{i} L_i}$$

Where L_i is the length of the ith genome segment. Thus viruses with segmented genomes produce wildtype progeny genomes with the same probability as an unsegmented virus whose genome length equals the sum of the genome segment lengths of the segmented virus.
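As a rough numerical check of the two expressions above, the following Python sketch computes the probability of a perfect copy for an unsegmented genome and for a segmented one. The function names and the three segment lengths are hypothetical, chosen only for illustration; the mutation rate of 10^-4 and L = 11703 are values used in the examples on this page.

```python
import math
from typing import Sequence


def p_wildtype(mu: float, length: int) -> float:
    """Probability that every one of `length` nucleotides is copied correctly."""
    return (1.0 - mu) ** length


def p_wildtype_segmented(mu: float, segment_lengths: Sequence[int]) -> float:
    """Product of the perfect-copy probabilities of the individual segments."""
    return math.prod(p_wildtype(mu, seg) for seg in segment_lengths)


mu = 1e-4
print(p_wildtype(mu, 11703))  # about 0.31

# Hypothetical segment lengths (illustration only, not a real virus).
segments = [3000, 2500, 2000]
print(p_wildtype_segmented(mu, segments))
print(p_wildtype(mu, sum(segments)))  # same value, up to rounding
```

The last two printed values agree because multiplying powers of (1 - mu) adds the exponents, which is the point made about segmented genomes.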

The probability of producing a copy with 1 specific mutation

At nucleotide position i, the polymerase inserts a particular incorrect base, with a probability of μ/3; and it inserts the correct base, with a probability of (1-μ), at the remaining (L-1) positions:

$$p(E = 1, \text{at position } i) = \frac{\mu}{3} (1 - \mu)^{L-1}$$

Note that a different mutant, with a single mutation at a different site, has an identical probability.

$$p(E = 1, \text{at position } j) = \frac{\mu}{3} (1 - \mu)^{L-1}$$

The number of different sequences, all with the same number of mutations, will be considered below.

The probability of producing a copy with E specific mutations

At each of E specific positions, the polymerase inserts a specific incorrect base, each with a probability of μ/3; and it inserts the correct base with a probability of (1-μ) at the remaining (L-E) positions. The probability of this happening, i.e., the relative abundance of this specific sequence, is:

[Eqn 1]

$$p(\text{sequence with } E \text{ mutations, to } E \text{ specific bases, at positions } i, j, k, \ldots) = \left(\frac{\mu}{3}\right)^{E} (1 - \mu)^{L-E}$$
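Eqn 1 is simple to evaluate directly. A minimal Python sketch follows; the function name `p_specific` is an assumption made for illustration, and the values of L and the mutation rate are those used in the examples on this page.

```python
def p_specific(mu: float, length: int, errors: int) -> float:
    """Eqn 1: probability of one specific sequence carrying `errors` specific
    base changes, with each particular wrong base inserted at rate mu/3."""
    return (mu / 3.0) ** errors * (1.0 - mu) ** (length - errors)


L = 11703   # genome length used in the examples on this page
mu = 1e-4   # one of the mutation rates considered on this page

for E in range(4):
    print(E, p_specific(mu, L, E))
```

With E = 0 the expression reduces to the probability of a perfect copy, so the same function covers the master sequence.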

Effect of mutations on the population of progeny genomes

The relative abundance of each sequence in any population is given by the equations above. These relative abundances multiplied by the population size give the actual abundance of each sequence in the population.

We will look at populations of:

  1. 10^4 progeny genomes, e.g., those produced by a single master sequence during a single infection cycle;
  2. 10^9 progeny genomes, e.g., those produced in a culture infected by ca. 10^5 master sequences, each producing 10^4 progeny genomes.

To reiterate, the graphs shown below plot the probabilities, as calculated above, multiplied by the population size (10^4 or 10^9). To obtain the corresponding values for some other population size, divide the values in either graph by its population size, and multiply by the population size desired.

Figure 1. The abundance of individual sequences in the progeny population

The results are based on μ = 10^-5 to 10^-3, and L = 11703 (the genome length of Sindbis virus, an Alphavirus in the Togaviridae family).

[Figure 1 panels: abundance of individual sequences with 0, 1 or 2 errors in a population of 10^4 progeny genomes; abundance of individual sequences with 0 to 3 errors in a population of 10^9 progeny genomes.]

For the 10^4 population:

  • The abundance of the master sequence (E=0) decreases with increasing mutation rates. It is the most abundant individual sequence over the range of mutation rates considered here, with some 1 x 10^4 to 3 x 10^3 copies present when mutation rates are in the range of 10^-5 to 10^-4.
    • Note that the remaining sequences in the population are all mutants.

    At higher mutation rates, the number of copies of the master sequence decreases rapidly. At a mutation rate of 10^-3, there is only a 1-in-10 chance that the master sequence is present in the population.

    • If misincorporation occurs about once every 1000 nt, the likelihood of making no errors when copying a 10000 nt genome is not very high.
    • At a mutation rate of 10^-3, essentially all sequences in the population are mutants.

  • Any individual 1-error sequence has at most a 1-in-10 chance of being present in the population, i.e., the sampling density of the 1-error class of mutants is 0.1 or less.
  • Any individual 2-error sequence is extremely unlikely to be present in the population.
  • For all sequences with E>0, the abundance shows a maximum at an intermediate mutation rate.
    • At lower mutation rates, fewer mutant sequences are made.
    • At higher mutation rates, sequences with more mutations are made at the expense of those with fewer changes.

For the 10^9 population:

  • As with the smaller population, the abundance of the master sequence (E=0) decreases with increasing μ, and it is the most abundant individual sequence over the range of mutation rates considered here.
  • There are up to 10^4 copies of any arbitrary 1-error sequence.
    • Corollary: All possible 1-error sequences are present, each with an abundance of up to 10^4 copies! Coffin [1995] reached an identical conclusion, though using a different approach.
    • Thus, increasing the population size from 10^4 to 10^9 dramatically increases the abundance and sampling density of the mutants.

      This is an example of the obvious: as more virus genomes are produced, more kinds of mutants are generated, and they will be more abundant. It should be equally obvious that the earlier the intervention, the less likely resistance to antivirals will develop.

    • Imagine that a virus can mutate to drug resistance with a point mutation. The drug-resistant variant will be produced and is certainly present in any population more than a million in size, even before drug treatment. This prediction has been verified for drug-resistant mutants of HIV, in that resistant viruses can be found in populations that have never encountered the drug [e.g., Nájera, et al., 1995; Tucker, et al., 1998].

  • The probability for the presence of any specific 2-error sequence in the 10^9 population is 10^-3 to 10^-1, with a maximum of 0.43.
    • However, if the population size had been 10^10, then there would be several copies of any specific 2-error sequence in the population, for mutation rates from 4 x 10^-5 to 4 x 10^-4.

  • The probability for the presence of any individual mutant with 3 or more errors is low (<10^-4).
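The values read off Figure 1 can be approximated by multiplying Eqn 1 by the population size. The Python sketch below does this for the two population sizes; the helper names are hypothetical. It also evaluates each mutant at a mutation rate of E/L, which is where the expression in Eqn 1 peaks for E > 0 (set its derivative with respect to the mutation rate to zero).

```python
def p_specific(mu, length, errors):
    """Eqn 1: probability of one specific sequence with `errors` mutations."""
    return (mu / 3.0) ** errors * (1.0 - mu) ** (length - errors)


def expected_copies(population, mu, length, errors):
    """Expected number of copies of one specific sequence in the population."""
    return population * p_specific(mu, length, errors)


L = 11703
for population in (1e4, 1e9):
    print(f"population = {population:.0e}")
    for E in range(4):
        row = [expected_copies(population, mu, L, E) for mu in (1e-5, 1e-4, 1e-3)]
        line = "  E=%d: " % E + ", ".join(f"{x:.2e}" for x in row)
        if E > 0:
            # Eqn 1 is maximal at mu = E/L for E > 0.
            peak = expected_copies(population, E / L, L, E)
            line += f"  (peak {peak:.2e} at mu = {E / L:.1e})"
        print(line)
```

Values below 1 are best read as the chance that the particular sequence is present at all, which is how the text uses them.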

The number of mutants in each error class

The derivation above applies to individual, specific sequences (e.g., a mutant with a G to C change at nucleotide position 3456).

It is important to realize that there are many different sequences all with the same number of mutations.

  • For 1-error mutants,
    • the mutation can occur at each of the L nucleotide positions in the genome
    • At any nucleotide position, any of 3 possible mutant bases may be inserted
    • thus, there are 3L different 1-error sequences

    • For example, if L=11703 nt, there are 3 x 11703 = 35109 different 1-error sequences
    • (Remember that up to 10^4 copies of each of these are present in the population of 10^9 progeny genomes.)

  • For 2-error or double mutants:
    • There are L possible positions where the first mutation can occur.
    • Given that the first mutation has occurred at some site, there are (L-1) possible positions at which the second mutation can occur.
    • The number of distinct choices is therefore L(L-1). However, this assumes that the order in which the sites are chosen is relevant. That is, the outcome with mutation at sites 123 and 456 is considered different from the outcome with mutation at sites 456 and 123. For our purposes, these two orderings are considered equivalent, as all we care about is how many different pairs of sites there are. When the ordering effect is corrected for (remember that there are n! ways to permute n objects), there are L(L-1) / 2! ways to choose the 2 mutant sites, disregarding the order in which they are chosen.
    • At each of these sites, any of 3 mutant bases may be inserted.
    • The number of different 2-error sequences is therefore 3 x 3 x L(L-1) / 2.
      • For example, if L=11703 nt, there are 3 x 3 x 11703 x 11702 / 2 = 6.2 x 10^8 possible 2-error sequences.
      • In preparation for the general case, note that L(L-1) = L! / (L - 2)!

  • In general, the number of different mutants each with exactly E errors (error class E) is

    [Eqn 2]

    $$\text{number of sequences in error class } E = 3^{E} \binom{L}{E} = 3^{E} \, \frac{L!}{E! \, (L - E)!}$$

    Table 1. The number of possible sequences in each error class, when L=11703 nt

    Error class    Number of possible sequences
    0              1
    1              3.5 x 10^4
    2              6.2 x 10^8
    3              7.2 x 10^12
    4              6.3 x 10^16
    5              4.4 x 10^20
    6              2.6 x 10^24
    7              1.3 x 10^28
    8              5.7 x 10^31
    9              2.2 x 10^35
    10             7.8 x 10^38

  • The Table paints a picture of the size of the local sequence space. Just a few Hamming steps away from the starting sequence lies an astronomically large number of possible choices.

    While suitably impressive, these numbers pale in comparison with the vastness of the global sequence space: there are 4^11703 or ca. 10^7046 possible sequences each 11703 nt long. (There are ca. 10^78 atoms in the known universe.)
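Table 1 can be regenerated from Eqn 2 with a few lines of Python, using the standard-library function `math.comb` for the binomial coefficient. The function name `n_sequences` is an assumption made for illustration. The last line estimates the size of the global sequence space as a power of ten.

```python
import math


def n_sequences(length: int, errors: int) -> int:
    """Eqn 2: number of distinct sequences with exactly `errors` mutations."""
    return 3 ** errors * math.comb(length, errors)


L = 11703
for E in range(11):
    print(f"{E:2d}  {n_sequences(L, E):.1e}")

# Size of the global sequence space, 4^L, expressed as a power of ten.
print(f"4^{L} is about 10^{L * math.log10(4):.0f}")
```

Because Python integers have arbitrary precision, the counts are exact; they are only rounded when printed.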

The abundance of each error class in the population

The relative abundance of each error class is the product of the abundance of a specific sequence [Eqn 1] and the number of possible sequences [Eqn 2] in that error class:

[Eqn 3]

$$p(E) = 3^{E} \binom{L}{E} \left(\frac{\mu}{3}\right)^{E} (1 - \mu)^{L-E} = \binom{L}{E} \, \mu^{E} (1 - \mu)^{L-E}$$

i.e., the binomial distribution

Thus, even though the probability of a specific sequence, e.g., one with 2 mutations, is very low, there are very many of them. So, they are quite abundant when considered as a group. Illustrative data are shown in Figure 2.
(Reminder: the graphs below plot the probabilities multiplied by 10^4 or 10^9. To get data for a different population size, divide the values in the graphs by 10^4 or 10^9, and multiply by that population size.)
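Eqn 3 is the binomial probability of exactly E errors in L positions. As a rough check of the numbers quoted for Figure 2, the Python sketch below multiplies it by the population size; the function name `p_class` is hypothetical, and the mutation rate of 10^-4 is the one used in the worked examples.

```python
import math


def p_class(mu: float, length: int, errors: int) -> float:
    """Eqn 3: probability that a progeny genome has exactly `errors` mutations
    (binomial distribution with `length` trials and success rate `mu`)."""
    return math.comb(length, errors) * mu ** errors * (1.0 - mu) ** (length - errors)


L, mu = 11703, 1e-4
for population in (1e4, 1e9):
    print(f"population = {population:.0e}")
    for E in range(11):
        print(f"  E={E:2d}: {population * p_class(mu, L, E):.3g} genomes")
```

Dividing a class total by the number of possible sequences in that class (Eqn 2) gives back the per-sequence abundance of Eqn 1.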

Figure 2. Abundance of each error class in a population.

The results were calculated using a genome length of L = 11703 nt, and μ = 10^-5 to 10^-3.

[Figure 2 panels: abundance of each error class in a population of 10^4 progeny genomes; abundance of each error class in a population of 10^9 progeny genomes.]

Considering the population of 10^4, e.g., the progeny from a single round of infection by a single master sequence:

  • As the mutation rate increases, the master sequence (E=0) goes from more to less abundant than the mutants as a group.
  • The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
      For example, when μ = 10^-4:
    • The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes in the progeny from a single round of infection are mutants!
    • About 3600 (ca. 10%) of the 35109 possible 1-error sequences are present. The abundance of the 1-error class as a whole (ca. 3.6 x 10^3 copies) is actually a little higher than the abundance of the master sequence (ca. 3.1 x 10^3 copies).
    • About 2100 of the 2-error sequences, ca. 3 x 10^-6 of the possible, are present, at ca. 1 copy each.
    • About 830 different 3-error mutants are present.
    • Mutants with 4, 5, or more errors are present at progressively lower numbers, down to a few copies of the 7-error class of mutants.
    • Mutants with 8 or more errors are unlikely to be present.

Similar, but even more dramatic, conclusions apply to the larger population:

  • As the mutation rate increases, the master sequence (E=0) goes from more to less abundant than the mutants as a group.
  • The population contains large numbers of mutants; even mutants with large numbers of errors are present in the population.
      For example, when μ = 10^-4:
    • The master sequence constitutes 31% of the genomes in the population, i.e., 69% of the genomes are mutants!
    • About 10^4 copies of each of all 35109 possible 1-error sequences are present. The abundance of the 1-error mutant class as a whole, about 3.6 x 10^8 copies, is a little higher than the abundance of the master sequence (ca. 3.1 x 10^8 copies).
    • About one-third of the total possible 2-error sequences, ca. 2 x 10^8 different mutants, are present, at ca. 1 copy each. Had the population been 10-fold larger, each of the 620 million different 2-error sequences would be present in the population, for mutation rates from 4 x 10^-5 to 4 x 10^-4.
    • Almost 10^8 different 3-error sequences are present. They represent only 0.001% of the possible 3-error sequences.
    • Mutants with 4, 5, or more errors are present at progressively lower numbers.
    • Remarkably, there are several hundred 10-error mutants in the population. However, only an astronomically small fraction (ca. 10^-37) of the possible 10-error sequences is sampled.

  • The typical manual or automated sequencing procedure cannot detect bases whose abundance is less than about 20% of the total. Thus the master sequence and the consensus sequence of the population are the same over much of the range of mutation rates considered here.

The sequence diversity of the population

The sequence diversity of the population consists of the single master sequence, plus all the different mutants in the population.

Diversity of progeny genomes in a population of 10^4

The sequence diversity in a population size of 10^4, e.g., progeny from a single infection, is shown using a linear (left panel) or a log (right panel) scale, with L = 11703, and μ = 10^-5 to 10^-3:

Figure 3. Sequence diversity in a population of 10^4 progeny genomes

[Plot: diversity of each error class, 10^4 population]

  • The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
  • At mutation rates between 10^-5 and 10^-4:
    • Mutants of the 1-error class contribute the bulk of the diversity, with some 1000 to 3600 sequences. Sampling of the 1-error class is not saturated in this range. Only 1 or very few copies of each sequence are present in the population.
    • The 2-error class contributes from 60 to 2100 sequences. Only 1 or very few copies of each of these sequences are expected to be present.
  • At higher mutation rates, mutants of the 2-, 3-, and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each sequence, when present, is present at a low copy number.
  • The sequence diversity of the population (e.g., almost 7000 at a mutation rate of 10^-4) approaches the population size as the mutation rate increases!

Similar conclusions apply when we consider a population of 10^9:

Figure 4. Sequence diversity in a population of 10^9 progeny genomes

[Plot: diversity of each error class, 10^9 population]

  • The single master sequence (E=0) contributes a diversity of 1 over the entire range of mutation rates considered.
  • Figure 1 shows that any of the 1-error sequences is present at 1 to 10^4 copies each, over the entire range of mutation rates considered. As there are 3.5 x 10^4 possible 1-error sequences, their contribution is a constant 3.5 x 10^4 over the whole range of mutation rates.
  • Other error classes contribute variable amounts of diversity (e.g., about 2 x 10^8 sequences are contributed by the 2-error class at μ = 10^-4), because the sampling of these classes is not saturated.
  • At mutation rates below ca. 10^-4, mutants of the 2-error class contribute the bulk of the diversity. At these mutation rates, about 1% to one-half of the possible 2-error sequences are randomly sampled (see Figure 1). Those that are present occur at low copy numbers.
  • At higher mutation rates, mutants of the 3-, 4-, 5-error and higher error classes in turn contribute the bulk of the diversity. Again, each of the error classes is sampled at very low densities, and each mutant is present at a low copy number.
  • At the higher mutation rates, the sequence diversity of the population is only slightly less than the population size!
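The diversity curves in Figures 3 and 4 can be approximated under one simplifying rule, which is an assumption of this sketch rather than a stated method of this page: an error class contributes either all of its possible sequences (when saturated) or one distinct sequence per genome in that class (when sparsely sampled), whichever is smaller. The Python sketch below applies this rule; the function names and the cutoff of 40 error classes are hypothetical choices.

```python
import math


def n_sequences(length, errors):
    """Eqn 2: number of possible sequences in an error class."""
    return 3 ** errors * math.comb(length, errors)


def p_class(mu, length, errors):
    """Eqn 3: probability that a genome belongs to an error class."""
    return math.comb(length, errors) * mu ** errors * (1.0 - mu) ** (length - errors)


def class_diversity(population, mu, length, errors):
    """Approximate number of distinct sequences contributed by an error class.

    Assumption: a class cannot contribute more distinct sequences than exist
    in it, nor more than the number of genomes that fall into it.
    """
    return min(n_sequences(length, errors), population * p_class(mu, length, errors))


def total_diversity(population, mu, length, max_errors=40):
    """Sum of the class diversities for E = 0 .. max_errors."""
    return sum(class_diversity(population, mu, length, E) for E in range(max_errors + 1))


L = 11703
for population in (1e4, 1e9):
    for mu in (1e-5, 1e-4, 1e-3):
        print(f"N={population:.0e}, mu={mu:.0e}: {total_diversity(population, mu, L):.3g}")
```

The rule ignores repeated draws of the same sequence in partly sampled classes, so it slightly overestimates diversity where a class is close to saturation.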

The total diversity available for selection is enormous!

An analogy may be useful, to help visualize the quasispecies

This is a picture of the globular star cluster called M13 (Messier 13), in the Hercules constellation.

[Image: globular star cluster analogy]

We can imagine that each point in regular 3-dimensional space corresponds to a sequence. The stars represent those sequences that are actually present in the population. At the center of the cluster is the master sequence. Immediately surrounding it are sequences with 1 error. Sequences with 2, 3, and more errors are progressively farther out.

The sampling of sequence space by the virus population is locally dense, with very diverse, random, but low density sampling at farther distances from the master sequence. For example, with L = 11703, μ = 10^-4, and a population size of 10^9, the space within a Hamming distance of 0 or 1 is saturated, and it is 33% saturated at a Hamming distance of 2.

What about even larger virus populations?

Take HIV as an example: some 30 million humans are infected globally. Before the onset of AIDS, each infected person produces about 10^9 to 10^10 progeny genomes per day. This is equivalent to a daily global HIV production of ca. 10^17 genomes.

With 10^17 genomes, all of error classes 0 to 3 are saturated, error class 4 is about 4% saturated, and the saturation of error class 5 is ca. 10^-6. The saturation of error classes 6 and higher is less than 10^-10.
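These saturation values can be approximated as the expected copy number of one specific sequence (Eqn 1 times the population size), capped at 1. The Python sketch below assumes that the same genome length (11703 nt) and mutation rate (10^-4) used in the earlier examples also apply here; the text does not state which values were used for this estimate, and the function names are hypothetical.

```python
def p_specific(mu, length, errors):
    """Eqn 1: probability of one specific sequence with `errors` mutations."""
    return (mu / 3.0) ** errors * (1.0 - mu) ** (length - errors)


def saturation(population, mu, length, errors):
    """Approximate fraction of an error class that is present: the expected
    copy number of one specific sequence, capped at 1 (fully sampled)."""
    return min(1.0, population * p_specific(mu, length, errors))


# Assumed parameters: L and mu are carried over from the earlier examples.
L, mu = 11703, 1e-4
for population in (1e9, 1e14, 1e17):
    values = ", ".join(f"E={E}: {saturation(population, mu, L, E):.1e}" for E in range(7))
    print(f"N={population:.0e}  {values}")
```

Under these assumptions the uncapped value for a specific 3-error sequence in 10^17 genomes is on the order of a thousand copies.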

Summary

We have so far limited ourselves to considering only the number and diversity of mutants produced during viral replication. A priori, it is not possible to predict whether the mutants are viable, or what their fitnesses might be.

Nevertheless, we can hazard some reasonable deductions, based on what is known about specific viral mutants. For example, some HIV mutants with multiple mutations are known to be viable, and are multiply drug-resistant. The fact that error class 3 is saturated with viral populations as small as 10^14 suggests that any specific triple mutant of HIV that is multi-drug resistant is generated a thousand times each day in the global population.

How selection might act upon the mutants is considered in more detail on the next page.

 


References

Listed here are those references that are not available through PubMed of the National Library of Medicine, USA.

  • Eigen, M., and C. K. Biebricher. 1988. Sequence space and quasispecies distribution, p. 211-245. In E. Domingo, J. J. Holland, and P. Ahlquist (ed.), RNA Genetics: Variability of RNA Genomes, vol. 3. CRC Press Inc., Boca Raton, FL.