Procedural Name Generation

If you’ve ever dabbled with fantasy cartography, short story challenges or naming planets in Stellaris, you’ve most likely come to a point where you can’t think of any appropriate names. While there are many, many thousands of online fantasy-themed or even realistic name generators, only a select few allow you to create random names from a custom namelist. In this post, we’ll discuss how to procedurally generate your own names using Markov chains.

Context: Markov Chains

No doubt the word “Markov” triggers flashbacks of probability lectures and awfully vague assignments, but the concept itself is fairly intuitive. A Markov chain is any random (or, to use the fancy term, stochastic) process of events (also called states) where the conditional probability of future states depends only on the current state, not on any past states (i.e. the future is independent of the past). Such a process is said to be “memoryless”.
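Formally, for a sequence of random states \( X_0, X_1, X_2, \ldots \), the Markov property reads
\[ \Pr(X_{n+1} = x \mid X_n = x_n, X_{n-1} = x_{n-1}, \ldots, X_0 = x_0) = \Pr(X_{n+1} = x \mid X_n = x_n) \]
that is, conditioning on the entire history tells you nothing more than conditioning on the current state alone.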

The canonical example of a Markov chain, given by textbooks and blogs alike, is that of simple weather forecasting. I have no intention of breaking that tradition here. Let’s suppose we have the following model for how the weather changes each day.

  • If it is sunny, then there’s an 80% chance tomorrow will be sunny
  • If it is rainy, then there’s a 40% chance tomorrow will be sunny

We can represent this as a matrix
\[ \begin{matrix} \ & \text{Sunny} & \text{Rainy} \\ \text{Sunny} & 0.8 & 0.2 \\ \text{Rainy} & 0.4 & 0.6 \end{matrix} \]
where each row gives the transition probabilities out of the corresponding state (and so each row sums to 1).
Now suppose the weather on Monday is rainy. With what probability will it rain on Wednesday?

We can solve this by using the recursive matrix equation
\[ x_{n+1} = x_n P \] where \( P \) is a transition matrix (a.k.a. stochastic matrix) encoding the probability of transitioning between all the various states (a Markov chain is so-named because of this recursion). It should be immediately clear that the above equation satisfies the Markov property (memorylessness) since \( x_{n+1} \) depends only on \( x_n \) (i.e. the future is independent of the past). The problem of finding the probability that it rains on Wednesday (given that it rains on Monday) can be neatly abstracted as “given some initial state \( x_0 \), compute the vector of probabilities \( x_n \)”. This is as simple as
\[ x_n = x_0 P^n \] where \( P^n \) is the transition matrix \( P \) raised to the \( n \)th power. So our weather conundrum is just
\[ x_n = [0 \quad 1] \begin{bmatrix}0.8 & 0.2\\0.4 & 0.6\end{bmatrix}^2 = [0.56 \quad 0.44] \] so given that it rains on Monday, there’s a 44% chance it will rain on Wednesday.
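As a quick sanity check, this calculation is a few lines of numpy:

import numpy as np

# Transition matrix, rows/columns ordered [Sunny, Rainy]
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

x0 = np.array([0, 1])  # Monday is rainy

# Two steps of the chain: Monday -> Tuesday -> Wednesday
x2 = x0 @ np.linalg.matrix_power(P, 2)
print(x2)  # [0.56 0.44], i.e. a 44% chance of rain on Wednesday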

A Markov chain for name generation

Perhaps you can get a sense of where we are going. Conceptually, we are going to construct our name by choosing letters one at a time, building it up from some initial sequence of letters. This is just a Markov chain: each time we select a letter, it is combined with the old sequence to create a new sequence from which the next letter is selected, repeating until we reach a desired maximum word length or encounter a terminating sequence.

In order to actually decide what letters to choose given some sequence of letters, we need a training set of words. We will use this training set to come up with a transition matrix \( P \), although (as we will see) this is not necessarily represented as a matrix, and is a little different to the typical square transition matrices we see in probability textbooks.

As a running example, I will be using the default namelist on donjon’s Markov Name Generator as our training list. donjon is an excellent go-to resource with a suite of fantasy and sci-fi themed random generators. I’ll also be providing some code examples in Python.

namelist=['Alden', 'Alec', 'Anton', 'Arden', 'Arlen', 'Armand', 'Arron', 'Augustus', 'Avery', 'Benedict', 'Bennett', 'Branden', 'Brendon', 'Britt', 'Broderick', 'Carter', 'Chadwick', 'Chas', 'Chet', 'Colby', 'Cole', 'Cordell', 'Dalton', 'Damien', 'Dante', 'Darell', 'Darius', 'Darron', 'Darwin', 'Dewitt', 'Diego', 'Dillon', 'Dirk', 'Domenic', 'Donovan', 'Dorian', 'Dorsey', 'Edison', 'Elden', 'Elvin', 'Erich', 'Galen', 'Garret', 'Gaston', 'Gavin', 'German', 'Graham', 'Hal', 'Hank', 'Harlan', 'Hayden', 'Herschel', 'Hoyt', 'Hunter', 'Isaias', 'Issac', 'Jacinto', 'Jarred', 'Jonas', 'Kendrick', 'Keneth', 'Kennith', 'Keven', 'Leif', 'Lenard', 'Lincoln', 'Linwood', 'Lucius', 'Lynwood', 'Malcolm', 'Malik', 'Maxwell', 'McKinley', 'Merlin', 'Merrill', 'Michal', 'Monty', 'Newton', 'Nolan', 'Porter', 'Quinton', 'Raphael', 'Reid', 'Rory', 'Scotty', 'Shad', 'Stanton', 'Stefan', 'Thaddeus', 'Tobias', 'Trenton', 'Vance', 'Walker', 'Walton', 'Weldon', 'Wes', 'Weston', 'Willian', 'Winford', 'Wyatt']

It’s worth noting that I’ll be treating capital letters as their own prefix. Hence the ‘Win’ in ‘Winford’ will be treated differently to the ‘win’ in ‘Darwin’. This helps improve realism since one prefix occurs at the onset of a word, and the other at the coda.

Let’s look at the first name on the list, Alden. Now, you could say “the l comes after the a”, “the d comes after the l”, and so on. This is an example of a first-order chain (as we are choosing a letter based on only one letter). To second order, we note that “the d comes after the al”, “the e comes after the ld”, and so on. We will define order more formally in a later post, but for now it suffices to consider order as the length of the sequence. The trick is to track all of this information for every single word. We can easily do so using a dictionary.

#after Alden (an order-2 chain)
{'Al':['d'], 'ld':['e'], 'de':['n'], 'en':['\n']}

#after Alec
{'Al':['d','e'], 'ld':['e'], 'de':['n'], 'en':['\n'], 'le':['c'], 'ec':['\n']}

To those who’ve done a bit of graph theory or graph algorithms, we are actually constructing an adjacency list to represent the allowed transitions. You can visualise this as a giant matrix of the non-standard form
\[\begin{matrix} \ & \text{d} & \text{e} & \text{n} & \setminus\text{n} & \text{c} \\ \text{Al} & 0.5 & 0.5 & 0 & 0 & 0 \\ \text{ld} & 0 & 1 & 0 & 0 & 0 \\ \text{de} & 0 & 0 & 1 & 0 & 0 \\ \text{en} & 0 & 0 & 0 & 1 & 0 \\ \text{le} & 0 & 0 & 0 & 0 & 1 \\ \text{ec} & 0 & 0 & 0 & 1 & 0 \\ \end{matrix} \] where the \( ij \)th entry is the probability of picking letter \( j \) given the initial sequence \( i \). Clearly such a matrix will be incredibly sparse (i.e. most entries will be zero), and so it is much more efficient to represent it as an adjacency list. It’s worth noting that the above matrix is not necessarily square because we are transitioning from a sequence of letters to a single letter. It is, however, still a Markov chain.
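Since a suffix that occurs twice in a list is twice as likely to be drawn, the adjacency list encodes the transition probabilities implicitly through repetition. A minimal sketch to recover the explicit distribution for a given prefix (transition_probs is a hypothetical helper, not part of the generator itself):

from collections import Counter

def transition_probs(suffixes):
    # Turn a list of observed suffixes into an explicit probability distribution
    counts = Counter(suffixes)
    return {letter: n / len(suffixes) for letter, n in counts.items()}

print(transition_probs(['d', 'e']))  # {'d': 0.5, 'e': 0.5}, matching the 'Al' row above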

We can code this fairly easily in Python by using a dictionary such that each key (i.e. prefix) is associated with a list of the letters that follow it (suffixes). NB: Make sure the order used to populate the dictionary is strictly less than the length of the shortest word in the training list, otherwise some of your sequences will just be entire words.

order=3 #prefix length; the examples that follow use an order-3 chain
d={}
for i in namelist:
    n=0
    while n < (len(i)-order):
        #map each length-`order` window to the letter that follows it
        d.setdefault(i[n:n+order],[]).append(i[n+order])
        n+=1
    #the final window maps to the end-of-name marker
    d.setdefault(i[n:n+order],[]).append('\n')
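To peek at what we’ve built, we can print a few randomly chosen entries (a throwaway snippet; random.sample simply picks prefixes without replacement):

import random

# Print a handful of prefix -> suffix-list pairs
for prefix in random.sample(sorted(d), 8):
    print(prefix, d[prefix])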

Here are some random prefixes, and their list of suffixes, for our example namelist:

woo ['d', 'd']
erm ['a']
ena ['r']
win ['\n', 'f']
lec ['\n']
ers ['c']
irk ['\n']
ken ['d', 'e', 'n']
wto ['n']
ndr ['i']
nit ['h']
int ['o', 'o']

Thus we have several prefixes, such as ‘ken’, that may be followed by either ‘d’, ‘e’ or ‘n’, and ‘win’ that may either continue with ‘f’ or terminate (‘\n’). Clearly the more words in the training list, the better. However, it is also important to ensure that there is enough variety in your wordlist, otherwise you’ll encounter the same sequences with alarming regularity. Thus don’t be surprised to see ‘shire’ in every second word should you wish to generate names based on English counties.

Generating the words themselves

Let’s visualise the process using the randomly chosen initial prefixes ‘ayd’ and ‘tha’ (i.e. a Markov chain of order 3). For ‘ayd’, we must randomly choose a suffix, so let’s look up ‘ayd’ in our dictionary and pick from its list.

ayd ['e']

as there is only ‘e’ to choose from, we have ayde. Now we shift the prefix one over, look up ‘yde’ and choose a random suffix.

yde ['n']

like before, there is only one choice, thus we have ayden. Now we shift the prefix one over again, look up ‘den’ and choose a random suffix.

den ['\n', '\n', '\n', '\n', '\n']

at which point all the suffixes are end-of-line, so we terminate and arrive at the name Ayden.
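In code, this walk is nothing more than repeated dictionary lookups. A sketch, assuming the dictionary d (built with order 3) and numpy as np from before; for the prefix ‘ayd’ every step happens to offer a single choice, so the result is always Ayden:

name = 'ayd'
while True:
    s = np.random.choice(d[name[-3:]])  # look up the current 3-letter prefix
    if s == '\n':
        break
    name += s
print(name.capitalize())  # Ayden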

Now let’s repeat this with ‘tha’:

tha ['d']
-> thad
had ['w', '\n', 'd']
-> thadw
adw ['i']
-> thadwi
dwi ['c']
-> thadwic
wic ['k']
-> thadwick
ick ['\n', '\n', '\n']

hence we’ve arrived at Thadwick. Alternatively we could have followed another branch by choosing ‘d’ instead of ‘w’.

tha ['d']
-> thad
had ['w', '\n', 'd']
-> thadd
add ['e']
-> thadde
dde ['u']
-> thaddeu
deu ['s']
-> thaddeus
eus ['\n']

and arrive at Thaddeus. Or we could’ve chosen the ‘\n’ and ended up with Thad.

This is the gist of our procedural approach. We can code it up quite simply as follows, with some extra provisions to ensure that the generated names are new (i.e. different from the provided names in the namelist) and that there are no duplicates.

import numpy as np

num_names=16 #how many names to generate
max_len=12
rnames=set() #a set, to avoid duplicate names

while len(rnames) < num_names:
    name=np.random.choice(list(d.keys())) #random initial prefix
    while len(name) < max_len:
        s=np.random.choice(d[name[-order:]])
        if s=='\n':
            break
        name+=s
    #avoid repeating names from the training set (may have to comment out with higher orders)
    if name.capitalize() in namelist:
        continue
    rnames.add(name.capitalize())

for name in sorted(rnames):
    print(name)

Different orders

It is important to consider the order used to populate the prefixes. As you increase the order, and hence look at longer prefixes, you’ll find that larger and larger chunks of the training words appear in your generated words. To keep the words more random (but not so random that they make no sense), a lower order is generally preferred. Again, this depends on the namelist. Names based on languages with compact syllables (such as Japanese) could largely do with an order of 2, while languages with more elaborate syllables like English or German would likely require an order of around 3-4. There is no concrete rule for this; simply experiment with whatever works best. Let’s look at how the Markov generator performs at creating new names from the donjon namelist for orders from 1 to 6.
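Below is a sketch of how such a comparison might be produced, wrapping the earlier snippets into functions (build_dict and generate_names are hypothetical helpers; being random, the output will differ from run to run):

import numpy as np

def build_dict(namelist, order):
    # Prefix -> suffix-list dictionary, exactly as before
    d = {}
    for i in namelist:
        n = 0
        while n < (len(i) - order):
            d.setdefault(i[n:n+order], []).append(i[n+order])
            n += 1
        d.setdefault(i[n:n+order], []).append('\n')
    return d

def generate_names(d, order, namelist, num_names=16, max_len=12):
    rnames = set()
    while len(rnames) < num_names:
        name = np.random.choice(list(d.keys()))
        while len(name) < max_len:
            s = np.random.choice(d[name[-order:]])
            if s == '\n':
                break
            name += s
        if name.capitalize() in namelist:  # may need relaxing at higher orders
            continue
        rnames.add(name.capitalize())
    return sorted(rnames)

for order in range(1, 7):
    print(order, ', '.join(generate_names(build_dict(namelist, order), order, namelist)))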

Order 1: Arin, Beias, Binddel, C, Chand, Ditynk, Eren, Eresaided, Foregugom, Jan, Ldend, Lellverarro, Malbynderanf, Olvick, Qusoonis, Witord
Order 2: Aiastuston, Att, Ever, Kerschas, Lkerick, Llincotty, Lynwoodenedi, Micheth, Ovandon, Oyt, Raphalcoln, Ren, Schel, Shael, Tand, Wel
Order 3: Adwick, Alton, Ayden, Beneth, Dison, Hel, Ien, Ison, Jacinton, Lden, Ley, Llian, Nley, Oln, Rschel, Witt
Order 4: Alik, Ance, Cinto, Erschel, Eston, Even, Ldon, Nett, Nwood, Obias, Onovan, Rlen, Saias, Wick, Wood, Ynwood
Order 5: Alton, Anden, Arell, Cinto, Ddeus, Erick, Erman, Erschel, Eston, Ewitt, Inford, Novan, Onovan, Orter, Phael, Saias
Order 6: Derick, Ennett, Erschel, Hadwick, Illian, Inford, Inwood, Kinley, Ndrick, Nedict, Ordell, Randen, Renton, Rschel, Ugustus, Ynwood

Output of the name generator with different orders

Thus lower orders (especially 1) tend to be significantly more random (one may even consider them ‘Witord’-ded), while higher orders are more reflective of the words in the training set (all the order-6 names are essentially truncated versions of names in the namelist).

Where to from here?

There are some cosmetic issues with the above implementation, such as order-1 “names” being a single letter. While it is tempting to think that such issues can be solved by specifying a minimum length, some sequences of letters (or perhaps even single letters) may only ever be followed by the terminating character ‘\n’. This is why a large training set featuring many letter combinations is often required in order to ensure that as many sequences as possible are followed by something other than ‘\n’. Of course, you can bypass this issue altogether by simply adding a random letter to the end of a terminating sequence for a word that is less than the minimum length, but in doing so you lose out on authenticity.
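One alternative is rejection sampling: rather than padding short outputs, simply discard any name that terminates below the threshold and draw again. A sketch, reusing the earlier generation loop with an assumed min_len (this trades extra sampling time for authenticity):

min_len = 4  # assumed minimum acceptable name length

while len(rnames) < num_names:
    name = np.random.choice(list(d.keys()))
    while len(name) < max_len:
        s = np.random.choice(d[name[-order:]])
        if s == '\n':
            break
        name += s
    # Reject names that are too short or copied from the training set, then resample
    if len(name) < min_len or name.capitalize() in namelist:
        continue
    rnames.add(name.capitalize())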

The easiest way to improve the performance of a Markov name generator is to improve the algorithm itself; so far we’ve only explored a very basic implementation. Indeed, should you play around with donjon’s Markov generator you’ll find that its names are more aesthetically pleasing. In a future post, we’ll explore the more general concept of n-grams (sequences of words, e.g. “That is not dead which can eternal lie”), and some more advanced techniques that can improve on the letter/word selection. These techniques all involve modifying the probability distribution of picking certain letters (given some sequence), and some involve combining multiple orders of sequences.