The April Birthday Paradox
With Spring just around the corner, I have a friend who is expecting a baby girl and is using the season as inspiration for names. She is thinking about names based on flowers like Blossom or Iris and other names like Chloe (which means “young green shoot”) that you may not even identify as Spring names. And then of course there are the spring months themselves, in particular April and May. When I had my daughter in the summertime, we nearly named her June – my wife and I both really liked the name and we were going back and forth between that and a family name. When she arrived a couple weeks late in July, we asked ourselves – can you be named June if you are born in July?
Well, of course you can – right? The only catch is that in her case everyone would know she was late. Normally, this wouldn’t be an issue – but if both parents happen to have chronic issues with tardiness, they may not want to build that into their child’s name when that child is already genetically predisposed to lateness. Anyway, all of this got me thinking – how much do parents actually consider the season when picking their kids’ names?
It turns out there is a huge correlation between names and dates of birth when it comes to these types of seasonal names. To illustrate, if you are named April, you are three times as likely to have been born in April as someone with another name. That’s pretty significant.
To explore the impact of this, I started thinking about the classic schoolroom “Birthday Paradox,” which goes like this: Suppose you’re a student in a classroom with 23 kids. What are the chances that two kids share a birthdate?
Almost everyone thinks it’s a real long shot. You might start thinking about the question by saying, “Well, there is a 1 in 365 chance that the kid next to me has my birthdate, and there are 22 other kids so maybe 22 in 365” – or just a 6% chance. But the teacher knows better. So she makes a bet with the students and gives them great odds – 4 to 1, let’s say. (The betting part of this problem may be frowned on these days, but that’s how my 5th grade teacher did it and I never forgot.) Then she goes around the room and writes down the birthdate of every kid – and lo and behold, 50% of the time she wins, making a tidy profit with her 4 to 1 odds (which she can later trade for extended quiet time or help cleaning the classroom). It’s a super surprising result, which you can learn all about just by googling “Birthday Paradox.”
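For anyone who would rather check the teacher's math than google it, here is a minimal Python sketch of the standard calculation. It assumes 365 equally likely birthdates (no leap day), and the function name is just something I made up for illustration.

```python
def shared_birthday_probability(class_size, days_in_year=365):
    """Chance that at least two of `class_size` kids share a birthdate,
    assuming every day of the year is equally likely."""
    p_all_different = 1.0
    for k in range(class_size):
        # The (k+1)-th kid must avoid the k birthdates already taken.
        p_all_different *= (days_in_year - k) / days_in_year
    return 1.0 - p_all_different


if __name__ == "__main__":
    for n in (10, 22, 23, 30):
        print(n, round(shared_birthday_probability(n), 3))
```

The trick is that the question is about any pair of kids, not just you and your neighbor. A class of 23 kids contains 253 different pairs, which is why the chance climbs so much faster than the 22-in-365 guess suggests.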
As a thought experiment, I pictured a school exclusively for girls named April and wondered how the Birthday Paradox would play out there. Would the result be the same? I did my own googling to figure out the math (which I certainly didn’t remember) and then built some spreadsheets, incorporating the key fact that so many of the “Aprils” have birthdates clustered in the month of April.
Now, remember how a class of 23 students gives you a 50% chance of finding a shared birthday? Well, in the School of April, a classroom with 23 kids has a whopping 63% chance that two kids share a birthdate. Further, to reach the point where there is a 50% chance of a shared birthdate, as the Birthday Paradox is often posed, you need just 18 “Aprils” in the class.
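If you want to play with the School of April yourself, here is a rough Python sketch of the same idea with uneven birthdates. It is not the spreadsheet behind the numbers above. It assumes only the "three times as likely" figure, spread evenly across April's 30 days, with everything else spread evenly across the other 335 days. Because that is a simplification of the real birthdate data, its output may not match the 63% and 18 quoted here. All function names are illustrative.

```python
def april_skewed_days(april_multiplier=3.0, days_in_year=365, april_days=30):
    """Per-day birth probabilities for a hypothetical all-April school.

    Assumption: an April is `april_multiplier` times as likely as average
    to be born in April, spread evenly over April's 30 days; the remaining
    probability is spread evenly over the other days of the year.
    """
    p_april = april_multiplier * april_days / days_in_year
    other_days = days_in_year - april_days
    return ([p_april / april_days] * april_days
            + [(1.0 - p_april) / other_days] * other_days)


def shared_birthday_probability(class_size, day_probs):
    """Chance of at least one shared birthdate for arbitrary day probabilities."""
    # f[k] = probability that k kids all have different birthdates,
    # counting only the days processed so far.
    f = [1.0] + [0.0] * class_size
    for p in day_probs:
        # Walk k downward so each day is assigned to at most one kid.
        for k in range(class_size, 0, -1):
            f[k] += k * p * f[k - 1]
    return 1.0 - f[class_size]


def smallest_class_for(target, day_probs):
    """Smallest class size whose shared-birthdate chance reaches `target`."""
    n = 1
    while shared_birthday_probability(n, day_probs) < target:
        n += 1
    return n


if __name__ == "__main__":
    uniform = [1.0 / 365] * 365
    skewed = april_skewed_days()
    print("uniform, 23 kids:", round(shared_birthday_probability(23, uniform), 3))
    print("skewed,  23 kids:", round(shared_birthday_probability(23, skewed), 3))
    print("class size for 50%, uniform:", smallest_class_for(0.5, uniform))
    print("class size for 50%, skewed: ", smallest_class_for(0.5, skewed))
```

Whatever multiplier you plug in, the direction is the same: the more the birthdates bunch up, the fewer kids you need before a match becomes likely.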
This is not purely an academic observation. In the business of patient/person data matching, this type of correlation can be very meaningful.
Health information management (HIM) professionals and data stewards know that you need to take into account how unique a person’s name is when you are trying to resolve a potential duplicate record. For example, if you see two records for Albert Einstein you are more likely to expect them to represent the same person than if you see two records for John Smith – because you instinctively know that Albert Einstein is a much more unique name than John Smith in the US.
Some probabilistic matching algorithms found in EHRs and EMPIs even attempt to account for this by weighting names differently in their scoring algorithms, depending on how unique these names are in your dataset.
But if you made these “uniqueness” assumptions for April you’d be making a big mistake – unless you also took into account her birthday. That’s because, while April is in the middle of the pack when it comes to the name’s uniqueness – with a frequency of about 1 in 1,200 US names – among people born in the month of April it is three times more common, representing about 1 in 400 US names.
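To put a number on how much that matters, here is a small Python sketch. It uses a simplified frequency-based weight, log2(1 / frequency), in the spirit of the frequency-adjusted scoring mentioned above. That formula is my assumption for illustration only. It is not the scoring used by any particular EHR or EMPI, and it is not how Verato's algorithm works. The two frequencies are the ones quoted in this post.

```python
import math


def name_agreement_weight(name_frequency):
    """Illustrative weight, in bits, for two records agreeing on a name.

    Assumption: weight = log2(1 / frequency), so rarer names count for more.
    """
    return math.log2(1.0 / name_frequency)


# Frequencies quoted in the post.
FREQ_APRIL_OVERALL = 1 / 1200        # April among all US names
FREQ_APRIL_BORN_IN_APRIL = 1 / 400   # April among people born in April

if __name__ == "__main__":
    overall = name_agreement_weight(FREQ_APRIL_OVERALL)
    in_april = name_agreement_weight(FREQ_APRIL_BORN_IN_APRIL)
    print(f"weight ignoring birth month: {overall:.2f} bits")
    print(f"weight for an April birthdate: {in_april:.2f} bits")
    print(f"evidence overstated by: {overall - in_april:.2f} bits")
```

Under this toy weighting, ignoring the birth month overstates the evidence by log2(3) bits every time, because the name is three times more common among April birthdates than the overall frequency suggests.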
So if you are a data steward or health information management (HIM) professional trying to figure out whether April Jones is the same person as April Jones Hawkins, it turns out you actually need to think about it differently depending on whether these records have birthdates in April or not. And if you don’t have a data science team at your disposal to help with this problem – and more importantly, if you don’t have a reference database that contains these sorts of data insights – you might make a mistake and overestimate the uniqueness of April Jones’ name and incorrectly merge two different patients’ records.
Troublingly, no conventional patient matching technology accounts for these sorts of interesting data anomalies and insights – such as the abundance of Aprils born in April.
But Verato’s Referential Matching technology does take these data insights into account, by its very nature. Verato matches two records by comparing them to our reference database, allowing our algorithm to implicitly understand that an April born in April isn’t as unique as an April born in September – because our algorithm is matching against every April in the country and has insight into each April’s birthdate. Because of this, if our algorithm is matching two Aprils born in April, it will be less certain that both Aprils represent the same person.
This is a subtle but powerful characteristic of our Referential Matching technology, and it makes all the difference between us and the conventional patient matching technologies on the market.