Improved Name Generation

In a previous post, we coded a fairly basic Markov name generator. While it works quite well for the purpose of randomly generating fantasy names, actual Markov generators can be significantly more complicated. What we’ve been doing – selecting letters based on a preceding sequence of letters – is essentially building a language model based on n-grams. An n-gram is a sequence of n words (or, in our application, letters), but in natural language processing, computational linguistics and especially corpus linguistics, “n-gram” also refers to the probabilistic language model built over such sequences. In this post we’ll explore some methods for generating realistic n-grams, ending with some improvements to our previous Markov name generator.

N-grams

So far what we’ve been trying to do is come up with a way to predict the letter that follows a sequence of letters (based on an initial name list). Let’s switch up the terminology a little: we now want to predict the word that follows a sequence of words (i.e. the word history), given the information we can obtain from corpora. A corpus is a sample of real-world text of a language (such as this very blog post, a fragment of an ancient papyrus scroll or an RSS feed). Suppose we have a sentence that begins “It’s a fine day with you…”. Players of Skyrim intending to woo Ysolda will no doubt fill in the blank with “around”. What we want to do is determine the probability \( P(\text{around} | \text{It’s a fine day with you}) \) or, in general, \[ P(w|h) \] where \( w \) is a word and \( h \) is its history. Intuitively, this is exactly what our brains do when filling out a cloze test (i.e. filling the blank with the correct word). For a second-language exam, our knowledge of appropriate words comes from textbook readings and prior study. Similarly, our model needs a corpus in order to estimate the probabilities of choosing appropriate words. It should come as no surprise, given the code in the previous post, that these probabilities are obtained by counting occurrences of the word (together with its history) in the corpus.

The history here is problematic; things can get messy with very long sequences. This is where our notion of order comes in: instead of predicting the next word based on the entire history, we only use the most recent few words. A bigram uses just the preceding word, \( P(w_n | w_{n-1}) \); a trigram uses the two preceding words, \( P(w_n | w_{n-2}w_{n-1}) \); and so on. An n-gram hence uses \( n-1 \) preceding words. For a general n-gram of order \( N \), we have \[ P(w_n | w^{n-1}_{n-N+1}) = \frac{C(w^n_{n-N+1})}{C(w^{n-1}_{n-N+1})} \] where \( C \) denotes the count and \( w^{n-1}_{n-N+1} \) is the sequence \( w_{n-N+1}w_{n-N+2}\ldots w_{n-1} \). Believe it or not, this is a form of maximum likelihood estimation (MLE): these relative-frequency estimates are the ones that maximise the likelihood of the training corpus under the model. This is why the corpus is often referred to as a training set, although it’s important to differentiate this from the paired input/output training sets often used with supervised neural networks.
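
To make the counting concrete, here is a minimal sketch of this maximum likelihood estimate in Python; the toy corpus and the function name mle_probabilities are purely illustrative.

from collections import defaultdict

def mle_probabilities(corpus, n=2):
    """Estimate P(w | h) for every n-gram in `corpus` by relative frequency."""
    ngram_counts = defaultdict(int)    # C(history + word)
    history_counts = defaultdict(int)  # C(history)
    for i in range(len(corpus) - n + 1):
        history = tuple(corpus[i:i + n - 1])
        word = corpus[i + n - 1]
        ngram_counts[(history, word)] += 1
        history_counts[history] += 1
    # P(w | h) = C(h, w) / C(h)
    return {(history, word): count / history_counts[history]
            for (history, word), count in ngram_counts.items()}

# bigram example: how likely is "day" to follow "fine"?
tokens = "it s a fine day with you around it s a fine morning".split()
probs = mle_probabilities(tokens, n=2)
print(probs[(("fine",), "day")])  # 0.5 -- "fine" is followed by "day" once out of twice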

Even with a modest training corpus, chances are any random generator will run into a sequence of words that it has not encountered in the training set. Now we could just halt and call it a day, but in practice we can’t simply assign a probability of 0 to these words. There are ways to account for these situations by modifying the overall probability distribution of our n-gram model, and they can be grouped into three main strategies: smoothing, backoff and interpolation.

Smoothing

Simply put, smoothing involves applying some uniform increase to all counts (including zero counts). In the case of add-one smoothing (a.k.a. Laplace smoothing), we simply add one to every word count, resulting in \[ P(w_i) = \frac{c_i + 1}{N_T+V} \] where \( N_T \) is the number of tokens (the total number of words in the corpus regardless of repeats) and \( V \) is the vocabulary size (the number of unique words).
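
As a quick sketch (with an illustrative corpus and function name), add-one smoothing for unigrams only takes a couple of lines:

from collections import Counter

def laplace_unigram_probability(word, corpus):
    """Add-one smoothed unigram probability: (c_i + 1) / (N_T + V)."""
    counts = Counter(corpus)
    n_tokens = len(corpus)    # N_T: total number of tokens
    vocab = len(counts)       # V: number of unique words
    return (counts[word] + 1) / (n_tokens + vocab)

tokens = "the cat sat on the mat".split()
print(laplace_unigram_probability("the", tokens))  # (2 + 1) / (6 + 5)
print(laplace_unigram_probability("dog", tokens))  # an unseen word still gets (0 + 1) / (6 + 5)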

Backoff

Sometimes we run into sequences for which there are no examples in the training set. Suppose this sequence was a trigram (a two-word history). Instead of smoothing (and introducing problematic zero counts into the equation), we can estimate the trigram using a bigram. This is the essence of backoff: when we encounter a history we have not seen, we simply decrement the order by 1 (and keep doing so until we find a known history). This is inherently a recursive formula; use order \( n \) if possible, else \( n-1 \), else \( n-2 \), and so on. However, because we are continually stepping down, the probability distribution is no longer normalised: by adding probability mass from the lower order estimates, the total probabilities no longer sum to 1. So we must discount the higher order n-grams to leave some probability mass for the lower order estimates.

This kind of backoff is called Katz backoff, and has the general formula \[ P(w_n | w^{n-1}_{n-N+1}) = \begin{cases} P^* (w_n | w^{n-1}_{n-N+1}) & \text{ if } C(w^n_{n-N+1}) > k \\ \alpha(w^{n-1}_{n-N+1}) P(w_n | w^{n-1}_{n-N+2}) & \text{ otherwise } \end{cases} \] for some order \( N \). Here \( \alpha \) is the backoff weight, \( k \) is a constant usually set to 0 (although it is generally chosen based on empirical performance) and the modified probability \( P^* \) includes a discount factor \( d \): \[ P^* = d \frac{C(w^n_{n-N+1})}{C(w^{n-1}_{n-N+1})} \] where \( d \) is obtained using the Good-Turing estimation (described below).

The calculation of \( \alpha \) is a little more complicated, as we must first consider the “probability mass” left over when we ignore the current n-gram and step down to the \( (n-1) \)-gram. This leftover mass is usually given the symbol \( \beta \), and is calculated as follows: \[ \beta(w^{n-1}_{n-N+1}) = 1-\sum_{w_n:C(w^n_{n-N+1})>k} d \frac{C(w^n_{n-N+1})}{C(w^{n-1}_{n-N+1})} \] where the summation is over all words \( w_n \) with counts above \( k \). Finally, we can calculate the backoff weight \( \alpha \): \[ \alpha(w^{n-1}_{n-N+1}) = \frac{\beta(w^{n-1}_{n-N+1})}{\sum_{w_n:C(w^n_{n-N+1})\leq k} P(w_n | w^{n-1}_{n-N+2})} \]
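
To illustrate the structure of these formulas, here is a simplified sketch of Katz-style backoff for a bigram model that backs off to unigrams. It uses \( k = 0 \) and a fixed discount \( d = 0.75 \) in place of the Good-Turing estimate (described next), and the toy corpus and function names are illustrative rather than a full implementation.

from collections import Counter

def katz_bigram_model(corpus, d=0.75):
    """Simplified Katz-style backoff: bigrams backing off to unigrams,
    with k = 0 and a fixed discount d instead of a Good-Turing estimate."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    n_tokens = len(corpus)

    def p_unigram(w):
        return unigrams[w] / n_tokens

    def p_katz(w, prev):
        if bigrams[(prev, w)] > 0:
            # discounted estimate P* for a seen bigram
            return d * bigrams[(prev, w)] / unigrams[prev]
        # beta: probability mass left over after discounting the seen bigrams
        seen = [v for (h, v) in bigrams if h == prev]
        beta = 1 - sum(d * bigrams[(prev, v)] / unigrams[prev] for v in seen)
        # alpha: beta spread over the unigram probabilities of unseen continuations
        denom = sum(p_unigram(v) for v in unigrams if v not in seen)
        alpha = beta / denom if denom > 0 else 0
        return alpha * p_unigram(w)

    return p_katz

p = katz_bigram_model("a fine day a fine morning a good day".split())
print(p("day", "fine"))   # seen bigram: discounted estimate 0.375
print(p("good", "fine"))  # unseen bigram: alpha times the unigram probability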

The Good-Turing estimation has its roots in Alan Turing and I. J. Good’s cryptanalysis work at Bletchley Park during World War II. In this approach, we consider the estimated probability \( \theta(r) \) of a word that occurs exactly \( r \) times in the training corpus. We compute this adjusted probability as \[ \theta(r) \approx \frac{1}{N_T}(r+1)\frac{N_{r+1}}{N_r} \] where \( N_T \) is the total number of tokens (as before) and \( N_r \) is the number of distinct words that occur exactly \( r \) times. For a given count \( C = C(w_{n-1}w_n) \), we can use the Good-Turing estimation to compute the adjusted count \( C^*(w_{n-1}w_n) \): \[ C^* = (C+1)\frac{N_{C + 1}}{N_C} \] and hence calculate the discount factor \( d \) to be used with the Katz method: \[ d = \frac{C^*}{C} \]
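
A sketch of the count-of-counts bookkeeping behind this (again with an illustrative corpus) might look as follows. Note that for the largest observed count \( N_{r+1} \) is zero, which is why practical implementations smooth the \( N_r \) values themselves (e.g. Simple Good-Turing).

from collections import Counter

def good_turing_discounts(corpus):
    """Compute adjusted counts C* = (C + 1) * N_{C+1} / N_C and the
    discount d = C* / C for each count observed in the corpus."""
    counts = Counter(corpus)                    # C for each word
    count_of_counts = Counter(counts.values())  # N_r: how many words occur exactly r times
    discounts = {}
    for r, n_r in sorted(count_of_counts.items()):
        c_star = (r + 1) * count_of_counts.get(r + 1, 0) / n_r
        discounts[r] = c_star / r
    return discounts

# e.g. {1: 1.33.., 2: 0.75, 3: 0.0}; on a tiny corpus these values are noisy,
# whereas realistic corpora typically have N_{r+1} < N_r and hence d < 1
print(good_turing_discounts("a a a fine fine day day morning good you".split()))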

Interpolation

So far we’ve seen how smoothing modifies the probability distribution and how backoff can step back to a lower order when it gets stuck. Interpolation is a method of combining the probability estimates from all the different n-gram models. The simplest method is linear interpolation, where the probability becomes \[ \hat{P}(w_n | w^{n-1}_{n-N+1}) = \lambda_1 P(w_n | w^{n-1}_{n-N+1}) + \lambda_2 P(w_n | w^{n-1}_{n-N+2}) + \ldots + \lambda_{N-1} P(w_n | w_{n-1}) + \lambda_N P(w_n) \] where the weights \( \lambda_i \) are such that \[ \sum_{i=1}^N \lambda_i = 1 \] Alternatively, we can make the weights \( \lambda_i \) functions of the current prefix, i.e. \( \lambda_i(w^{n-1}_{n-N+1}) \), so that the weighting changes with the context. To populate the values of such weight functions, we need some handy machine learning, essentially performing a hyperparameter search on a held-out corpus (i.e. a corpus, separate from the training corpus, used to tune hyperparameters) to set the optimal values of \( \lambda \). Another method of determining the \( \lambda_i \) is cross-validation (a.k.a. deleted interpolation).
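
As a minimal sketch, linear interpolation is just a weighted sum; the stand-in estimator functions below are placeholders for real trigram, bigram and unigram models.

def interpolated_probability(word, history, models, lambdas):
    """Linear interpolation: P_hat(w | h) = sum_i lambda_i * P_i(w | h).
    `models` are probability functions of decreasing order (each uses only as
    much of the history as it needs); `lambdas` should sum to 1."""
    return sum(lam * p(word, history) for lam, p in zip(lambdas, models))

# stand-ins for trigram, bigram and unigram estimators
p_trigram = lambda w, h: 0.0    # the full trigram was never seen
p_bigram = lambda w, h: 0.2
p_unigram = lambda w, h: 0.05

print(interpolated_probability("around", ("with", "you"),
                               [p_trigram, p_bigram, p_unigram],
                               [0.5, 0.3, 0.2]))  # 0.5*0.0 + 0.3*0.2 + 0.2*0.05 = 0.07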

Improving the name generator

Let’s return to our original name generator and compare some of its outputs when implementing smoothing, backoff and interpolation. Here is a sample of 20 names returned from our original Markov name generator with add-one smoothing (initialised with an order of 3).

['Aelcecvvb', 'Ahadj', 'Alconhf', 'Amveynnn', 'Edixuddis', 'Eldty', 'Galdrq', 'Icvtton', 'Keneukied', 'Landon', 'Ldonttxau', 'Llobj', 'Lobigrrlm', 'Nkbturonw', 'Olbitberv', 'Otssxyura', 'Rikss', 'Scockerdy', 'Tontthaxg', 'Wegdy']

Because add-one smoothing also modifies zero-count words, sequences not seen in the training set suddenly become a possibility. This also means that some names are complete gibberish, with unrealistic consonant combinations (at least to an English speaker). One way to restore some realism is to explicitly exclude zero counts and apply the smoothing only over nonzero counts (i.e. the letter combinations we have actually seen). Suppose ‘ken’ is observed twice and ‘kev’ is observed once. If we adjust the counts to something like 'ken': 3, 'kev': 2, then ‘ken’ is no longer twice as likely to be picked; the probabilities have been smoothed.

['Adwonahad', 'Alenalen', 'Asterold', 'Darint', 'Dewiant', 'Domentera', 'Enton', 'Fareiasen', 'Ienttenn', 'Illcifonk', 'Intart', 'Lkeikerl', 'Maxwollia', 'Novac', 'Olmarittt', 'Reinwilev', 'Ugugugol', 'Unovas', 'Venenanto', 'Wiugus']

Here we (mostly) keep existing letter combinations, but introduce more randomness.
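
A minimal sketch of this restricted smoothing, mirroring the ‘ken’/‘kev’ example above (the dictionary of counts is illustrative):

def smooth_nonzero_counts(counts, add=1):
    """Add-one smoothing restricted to sequences we have actually seen,
    so unseen (zero-count) letter combinations stay impossible."""
    return {seq: count + add for seq, count in counts.items()}

print(smooth_nonzero_counts({'ken': 2, 'kev': 1}))  # {'ken': 3, 'kev': 2}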

Now let’s consider backoff and interpolation. Below are three sets of names: the first from the original generator, the second with backoff, and the third with interpolation. The generator uses up to order 3; hence backoff drops back to 2nd order and then 1st order. Interpolation combines all three orders equally.

# base generator
['Addeus', 'Alker', 'Alton', 'Arlan', 'Ckinley', 'Ddeus', 'Enard', 'Enneth', 'Eston', 'Germanden', 'Haddeus', 'Ichal', 'Incoln', 'Inton', 'Kendon', 'Nnett', 'Obias', 'Onovan', 'Rendrick', 'Rrillian']

# with Katz backoff
['Adwit', 'Aiastonwt', 'Averrl', 'Aydicick', 'Dalendeln', 'Dorin', 'Enden', 'Entus', 'Eston', 'Melierel', 'Nadellave', 'Narien', 'Ncelifova', 'Nderal', 'Orsalat', 'Rlico', 'Rronddete', 'Schaton', 'Stoorit', 'Trenfotte']

# with interpolation
['Antucian', 'Atein', 'Atonittel', 'Cherk', 'Denind', 'Echaphart', 'Enide', 'Gratonoll', 'Hastonori', 'Lasory', 'Letan', 'Ltonn', 'Luityaphe', 'Riciuceto', 'Shart', 'Tetyal', 'Uguinfooy', 'Verdolici', 'Vitty', 'Xweick']

It’s clear that both backoff and interpolation have introduced a considerable amount of variety compared to the base names, which mostly contain truncated chunks taken directly from the training set. This is because backoff and interpolation take multiple orders into account, leading to a more authentic generation process. It is important to note that backoff and interpolation are fundamentally different: backoff only ever uses one order at a time (if it can’t use order 3, it uses order 2, and so on), whilst interpolation combines the probabilities of all orders. Combined, these strategies can ultimately produce much more realistic names.