Here’s why I have been looking at the popularity of names. Yesterday, I woke up with this puzzle in my head, after an odd dream.
A Strange New Job
The scenario is that you work in the mailroom of a large company. Internal mail is addressed with a typed label bearing the employee’s first name only, in capital letters. It’s okay – it isn’t ambiguous, because the company has very odd HR policies that ensure no-one is hired if they have the same first name as someone who already works there. They do not permit diminutives (e.g. TOM, DICK or HARRY). They also insist that everyone has a relatively common name.
Definition of Common Names
What does common mean? I think 1 in 10,000 names for that gender is about right. (That’s really 1 in 20,000 of the population.) That corresponds to the 842nd most common boy’s name, and the 1000-and-somethingth girl’s name. I am willing to settle for the top 1000 boys’ names and girls’ names if that makes it easier. I can negotiate.
Choose a list appropriate for your region, culture and period. (The USA Social Security database of baby names is helpful, but not perfect, because it doesn’t give overall population data, only data by year of birth.)
Typos
People sometimes make typos on the address label. The errors are limited to a few categories:
- deletion of a single letter: e.g. JULIAN becomes JLIAN
- addition of a single letter: e.g. JULIAN becomes JKULIAN
- substitution of a single letter for another: e.g. JULIAN becomes JKLIAN
- transposition of two letters: e.g. JULIAN becomes JNLIAU
Strangely, people only ever make at most one of these errors per label.
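In code, those categories might look something like this sketch of a generator that yields every label one typo away from a given name (with transposition allowed between any two letters, as in the JNLIAU example):

import string

def typos(name):
    # Yield every label that is exactly one typo away from name.
    letters = string.ascii_uppercase
    for i in range(len(name)):                     # deletion of a single letter
        yield name[:i] + name[i + 1:]
    for i in range(len(name) + 1):                 # addition of a single letter
        for c in letters:
            yield name[:i] + c + name[i:]
    for i in range(len(name)):                     # substitution of a single letter
        for c in letters:
            if c != name[i]:
                yield name[:i] + c + name[i + 1:]
    for i in range(len(name)):                     # transposition of any two letters
        for j in range(i + 1, len(name)):
            if name[i] != name[j]:
                swapped = list(name)
                swapped[i], swapped[j] = swapped[j], swapped[i]
                yield "".join(swapped)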
Impact of Typos
If the typo doesn’t cause the name to become ambiguous, you should be able to work it out and deliver anyway: JNLIAU should be successfully delivered to JULIAN.
If a typo turns one name into another, the letter will, of course, be delivered wrongly. e.g. Deletion of the letter N would turn JULIAN into JULIA.
The interesting part is when the typo makes it impossible to resolve the name: Does DAREN resolve to DARREN or KAREN?
The Puzzle
The puzzle is to come up with an example of the most ambiguous single typo of a name.
I haven’t been able to come up with any inspiring answers to this puzzle. The best I have come up with is actually based on my own name: JULIN maps to JULIA, JULIE, JULIAN and JULIEN.
The Meta-Puzzle
If you know my attitude to these sorts of puzzles, you will have already guessed that there is a meta-puzzle here: Given a database of names, come up with an efficient algorithm to solve the original puzzle. Bonus points for the first actual software implementation!
Comment by Casey on September 21, 2005
You have piqued my interest! It turns out that I have a database of names (male and female), together with some very noisy popularity statistics, thanks to a web-game-slash-data-collector called Nom D’Amour. (In hindsight, people are much more interested in the boring question – how compatible are our names? – than in the much more technically interesting question I also answered for them: what to change their name to for maximum compatibility.)
Your definition of allowable typos corresponds to the usual definition of Levenshtein (edit) distance allowing transposition as an atomic operation, but I’m sure you knew that already. The question can then be asked in terms of edit distance:
For a set of allowable strings S = {S1, S2, S3, …, SN}, find the string t (with t not in S) such that the number of strings Si with distance(t, Si) == 1 is maximised.
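To pin the objective down, the scoring half might look something like this in Python (a plain dynamic-programming edit distance, ignoring transposition for the moment):

def edit_distance(a, b):
    # Standard Levenshtein distance (insert, delete, substitute), row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (or match)
        prev = curr
    return prev[-1]

def score(t, names):
    # How many allowable names are exactly one edit away from candidate t?
    return sum(1 for s in names if edit_distance(t, s) == 1)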
Two implementations occur to me, neither efficient. I’ll see how they go 🙂
Comment by Casey on September 21, 2005
I’d like to submit a provisional answer. I have ignored transposition for now – it complicates matters a bit too much given my current approach. I am using the top 1000 names of each gender from the year of my birth, 1978, from the link you provided. Some names were duplicated; I removed these from the list to leave a total of 1904 names.
The best answer I have is Ara, which could be any of 13 names but is not a name itself. The candidate names are: ada, aja, ana, ari, asa, cara, dara, ira, kara, lara, mara, sara, tara
The runner-up is Mari, which has 12 candidates: cari, ari, kari, mara, marc, marci, maria, marie, mario, mark, mary, omari
I don’t vouch for the common-ness of the names, but if you give me any list I’ll give you some answers!
My method is efficient but not perfect. I’m not sure how to scale it to include transpositions, but I think it’s good enough for starters.
Comment by Alastair on September 21, 2005
I think it’s a little too early to play the name game, don’t you?
Comment by Julian on September 21, 2005
Casey,
Once again, I bow to your 733t algorithmic skills. 5 hours 40 minutes from posting to solution. I haven’t even come up with an algorithm that could finish the computation in that time, let alone finish coding it! You don’t have access to a massive number of parallel machines, do you?
Actually, I did! I had to write a paper on diff algorithms a long time ago, and I saw that this puzzle had a lot to do with finding minimum differences – and also with spell-checking algorithms.
What I found daunting here is the “string t (with t not in S)”, which sounds like a universe too large to filter and too big to generate. I want to know your trick! You must publish your algorithm!

I see ignoring transposition as a perfectly acceptable bending of the rules, especially as I think that people attempting this puzzle by hand will have great difficulty in thinking of “single transposition” answers. Certainly, none of the attempts I made had any transpositions in them. I am happy to lose this rule entirely.
Yeah, the candidates combine names I am familiar with (ada, ari, ira, kara, lara, sara, tara) with names I have never heard of (aja, ana, asa, cara, dara, mara). Unfortunately, I doubt there is a clear position on the popularity list where I could meaningfully draw the line, and say names above this are acceptable, and names below aren’t.
Hmmm… given you have a set of (noisy) “couple” information, you could answer another question! What is the correlation between name popularity and attraction? If your name is John, will you end up with a Mary or a Mara? I strongly suspect that the high incidence of intra-cultural pairings will lead to a positive correlation between the popularity of the first names of the two members of a couple.
Comment by Julian on September 21, 2005
Let me re-think that.
Suppose the average name is 5 letters long.
There are 5 variants involving a deletion.
There are 10 variants involving a transposition (if allowed).
There are 25 * 5 variants involving a substitution.
There are 26 * 6 variants involving an addition.
That’s only 296 variations on a 5 letter name. Multiply by 1000 names, and we are still under 300,000 variations. A big number to be pushing around, sure, but not the billions I had first conceived.
I’ve still got the job of comparing them to each of the real names (that’s 300,000,000 computations of Levenshtein distance – there goes that 5-hour time limit!) or to each other – i.e. how many times has this “name” been generated before? That’s a big list! Well, maybe not… averaging one byte for the count, 6 bytes for the name and 4 bytes of data-structure (tree? hash table?) overhead, that’s still less than 3.5 megabytes. I guess that’s feasible.
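Spelling that arithmetic out, just to check myself:

n = 5                                          # letters in a typical name
variants = n + n * (n - 1) // 2 + 25 * n + 26 * (n + 1)
print(variants)                                # 5 + 10 + 125 + 156 = 296
print(variants * 1000)                         # about 296,000 candidate typos
print(variants * 1000 * (1 + 6 + 4))           # about 3.3 million bytes at 11 bytes per entry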
Am I getting close?
Comment by Casey on September 22, 2005
Thanks for your kind comments, Julian. You are getting close!
First, I admit that I do have access to a large collection of computers, but I didn’t use it for this 🙂
Second, concerning transpositions and edit distance. Your definition of transposition (allowing swaps between any two characters) is much more difficult than the usual version, which only swaps two neighbouring characters (as in Damerau-Levenshtein distance). I’m not even clear whether there’s an efficient way to work arbitrary swaps into the usual dynamic-programming solution to edit distance, so I’m glad I can leave it out!
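For reference, the neighbouring-swap version does slot into the usual dynamic-programming table with one extra case – a sketch of that variant (often called optimal string alignment):

def osa_distance(a, b):
    # Edit distance with adjacent transposition as a single operation.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + cost)    # substitution (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # adjacent transposition
    return d[len(a)][len(b)]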
Now, for the algorithm. The naive version, which you have spelled out above, would look something like this:
from collections import defaultdict

typos = defaultdict(list)              # dictionary/hash-map thingo: typo -> names it could be
for name in names:
    for typo in make_typos(name):      # assume a generator function for the typos of one name
        typos[typo].append(name)
Then you just find the typo with the biggest list of names. But as you say, the space of typos is quite large, and you might not want to generate them all, or have space to hold them all. Also, almost all of these will be typos for only one or two names, and so they’re useless.
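Picking the winner at the end is then just a max over that map, something like:

best_typo, matching_names = max(typos.items(), key=lambda item: len(item[1]))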
Now, you mention that you still have to compute Levenshtein distance, but that’s not true. This is all the computation you have to do!
Writing out this pseudocode, I just realised that this solution is actually not that bad. My version actually does the same thing, but with a prior filtering step.
Levenshtein distance is a “real” distance metric. It fulfils the triangle inequality, i.e. d(a,b) + d(b,c) >= d(a,c). This comes in handy: if we are looking for typos at distance 1 from two strings S1 and S2, then the distance from S1 to S2 must be no more than 2. (It can be less; consider “Jade” and “Jane”, both of which are one edit from “Jame” but are also only one edit from each other.)
So I pre-filtered my list of names into clusters of potentially-typo-related names, by finding for each name those other names with a distance of 1 or 2. It’s not so bad, comparing 2000 names to each other. You can short-circuit edit distance to stop once a maximum allowable distance is no longer possible, which speeds this up even more. This first step takes maybe five minutes.
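The short-circuit is the usual trick of abandoning the computation as soon as every entry in the current row of the table exceeds the cap. A sketch of that first step (names is assumed to be the combined list):

def within_distance(a, b, max_dist=2):
    # Levenshtein distance, abandoned as soon as max_dist is out of reach.
    if abs(len(a) - len(b)) > max_dist:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(curr) > max_dist:       # row minima never decrease, so give up now
            return False
        prev = curr
    return prev[-1] <= max_dist

# For each name, the other names close enough to possibly share a typo with it.
neighbours = {a: [b for b in names if b != a and within_distance(a, b)]
              for a in names}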
Then what you get is quite short lists of similar names. They don’t necessarily share typos in common, though, and this is where I fall back to the same approach as I put in pseudocode up there. But the advantage is that, with a list of no more than say 40 names in it, the number of typos you have to produce and test simultaneously is much smaller. This step takes under a minute, I think. You could make it faster by not even considering those candidate-groups of smaller size than your current largest known solution, but I didn’t do that.
Writing this explanation has shown me the more general solution: not using edit distance at all, but just generating and hashing typos. This would allow arbitrary transpositions, too. I’ll try it out tonight and see how slow it is.
Comment by Casey on September 22, 2005
Sigh. So my suspicion was belatedly correct – you can do this just fine with a typo-yielding iterable (much better to be a generator than an actual list) and a couple of hundred megs of RAM, and a couple of minutes. Adding transpositions is a cinch. And, following up your own intuition, it doesn’t seem to matter much! The first name-group that uses transposition is one with ten options:
maron: laron, marlon, jaron, myron, mason, aaron, marion, ramon, daron, aron
Now I’ve managed to disappoint myself with the simplicity of the final answer 🙁 Here’s the code – I leave the typos() generator as an exercise for the reader.
[Note: Slight modification of code was performed to remove characters that were being chewed up by the WordPress format engine. – Editor]
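In outline it is just the earlier pseudocode made concrete, with the typo groups collected as sets and the t-not-in-S rule applied – something like:

from collections import defaultdict

# names: the allowable first names, all normalised to the same case.
name_set = set(names)
groups = defaultdict(set)                 # typo -> set of names it could have come from
for name in names:
    for typo in typos(name):              # typos() yields every single-typo variant of name
        if typo not in name_set:          # a "typo" that is itself a name isn't ambiguous
            groups[typo].add(name)

best, candidates = max(groups.items(), key=lambda item: len(item[1]))
print(best, sorted(candidates))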
Comment by Julian on September 22, 2005
This is all going too fast! I am mentally constructing an answer to your previous post, and in the meantime, you have read my mind and implemented it. Dammit!
Let me go back a bit!
When I said:

I’ve still got the job of comparing them to each of the real names … or to each other – i.e. how many times has this “name” been generated before?

I wasn’t very clear. You won’t believe me now, but this was predicting much of your final answer. I wish I had been faster, but I was working out how to express the pseudocode at the same time as you were implementing it!
I was saying that either you could work out the Levenshtein distance – which was too expensive – or you could generate a big list, mapping the typo to a count of the number of words that would match.
Your final solution went further. It didn’t merely store a count for each word, but the whole set of matching words.
That is, where you say:

typos[typo].append(name)

I would have settled for:
counts[typo] += 1
Once I had identified the best candidate, I would have had to go through the name list again to find the matching names, whereas you do it in one pass – at the cost of a lot more RAM.
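In other words, something like this (reusing a typos() generator), trading a second pass over the names for a much smaller table:

from collections import Counter

name_set = set(names)                     # names: the allowable first names, as before
counts = Counter()
for name in names:
    for typo in set(typos(name)):         # set() so one name can't be counted twice
        if typo not in name_set:
            counts[typo] += 1

best, _ = counts.most_common(1)[0]
# Second pass: recover which names the winning typo actually matches.
matches = [name for name in names if best in set(typos(name))]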
Also, I would not have expected Python’s dictionary implementation to be efficient enough for arbitrary strings. Where you used a plain dictionary for typos, I was expecting to need a customised hash table (with a list of unhashed names stored in the table somehow) – but the off-the-shelf solution seems to have worked. Optimising too early is a bad move!

Yep. Exactly. Sorry for the lack of clarity.
I didn’t even get a chance to argue this point! 🙁
So having defended what’s left of my honour, I can now say: Clustering? Cool! It sounds like it should have been a better solution.