The genesis of this idea was a couple weeks ago when my cofounder said: “Would it be possible to see what percent of our email list was female or male based on their names alone?” Thus Drillbit was born.
In the last couple weeks I have been pouring over data sets and trying different formulas to find the best way to break down a list of seemingly random name data into digestible information. The resulting app allows anyone to upload their mailing lists and see who’s in them, and in perhaps the coolest feature, they can segment their list as well.
Drillbit uses publicly available datasets to create a likely demographic profile of mailing lists based on first and last names. Upload your mailing, customer or user list with first and last names, and based on that information we will create an age, gender and demographic profile of your list.
Listed here are the foundational datasets of this project, including for analysis tools that haven’t yet been released.
The essential principle behind Drillbit is that an individual’s first and last names betray a lot of information about his or her background, origins, language, gender, and even income and ideology. Names can be both varied in their originality and popularity as well as conservative in their staying power. A surname can be passed down for generations, whereas first names have a tendancy to be cyclical.
As an example, take the name “Max.” It is a common name, or common enough it would seem, that one could find out very little information from the name alone. But as it turns out, “Max” only may seem common to us given its surge in popularity in the late 80’s and early 90’s–the birth years of the rapidly matriculating Generation Y. In 1974, only 400 Max’s were born nationwide!
Of course, baby name popularity is not a new idea. But the variance is astounding, and not just in terms of popularity. In 2012, the two most popular baby names for boys and girls were “Jacob” and “Sophia.” Unlike “Max,” both of these popular names seem to have spent the last 100 years on the up-and-coming list.
With this amount of unique variance in names–some names jump and others sink, some names are like fads and others never really take off–it isn’t surprising that, in the aggregate, it is possible to take a list of people and determine how old they are likely to be.
So that’s what I did. Using the above datasets on name popularity, I was able to come up with some pretty convincing initial results, benchmarking against existing lists I knew well.
The first step is to condense the data I had into a table which compared year of birth, and gender, with the % likelihood that any random “Michael” born in the last century was actually born in that year. For example, if 10,000 Michaels were born between 1900 and 2000, and 1000 Michaels were born in 1950, then 1950-M-MICHAEL has a 10% likelihood; i.e., given a random Michael, there is a 10% chance he was born in 1950.
With the charts above, you can see how this would play out. If you were to use Drillbit to upload a list of 5000 Jacobs, you would see the age match pattern roughly cohere to the above chart. The more Jacobs there are, the higher confidence we would have in the result.
There are some obvious complications with this model. The first is that although 10% of all Michaels might have been born in 1950, they would be over 60 now, and their chance of being around is much smaller than that of a Michael born in 2000. That’s where actuarial data comes in. Using the above actuarial table divided by gender, I was able to normalize the distribution based on likelihood of survival in each age cohort. No matter how many Max’s were born in the 1910’s, there aren’t a lot left today.
The second problem is that names are not unisex; in fact, most names in the database aren’t 100% unisex, Michael included. It became clear that age data had to be done on the basis of gender, and not on totals. Names that are popular with one gender are not necessarily popular with the other at the same time.
To compensate for this, age data was tabulated separately, all the way down to the actuarial normalization. Female names were rated and graded against each other, male names were separately, and only at the end were they normalized against each other.
Compared to Age, Gender and Race were quite easier. Gender analysis was a simpler form of the age analysis–likely names were divided by gender and then normalized by age. Race/ethnicity data was also quite simple based on surnames–the data was already organized by the Census, albeit 13 years ago, so getting it into a searchable database wasn’t tricky.
There are some obvious limitations to my method. The first is in the nature of large numbers, or small numbers as the case may be. If you were to put a list of 2 names into Drillbit, it would spit out a similar looking demographic profile running the gammut of all ages and perhaps some different races as well. There are few names that are reliably “Black” names or “Over 65” names (although, there are a few names with a 100% incidence within one to five years–challenge you to find them). Like with any aggregate data project, the larger the list, the more reliable Drillbit will be.
The other limitation is in any sort of list that comes with existing biases. Say, a list of NBA players (heavily 25-35 and black) or a list of sitting US Congresspeople (heavily male, white, and 35-55). These inherent biases will be reflected in the anlaysis, but probably not to the extent that they could be. This is the House of Representatives, according to Drillbit:
Obviously 18-24 year old congresspeople would be impossible. And yet, even with a small list of 435 names, the trends in age in reality poke through.
In short, you shouldn’t use Drillbit to analyze a list whose composition is already known to you to skew heavily in favor of one or two demographics. However, it’s worth nothing that Congress is 18.3% female, and Drillbit predicted 20.5% based on names alone. Not shabby.
The final inherent bias that’s worth mentioning is in skewing toward younger ages. Since younger people are overwhelmingly more likely to be alive, post-normalized numbers skew younger. In addition, in development, I had a category be “Under25” but it became apparent that although my database could detect age variability all the way to Age 0, babies aren’t going to be on mailing lists, and they were throwing off all the results. So to compensate for the younger skew, I made a judgment call to make a cutoff at 18, and not track any younger cohorts, even though some websites may have 13-18 year olds as users.
Now that you know more about how I did it, upload a list and try it out!