A while back, I discovered the predictive analytics tool RapidMiner and wanted to try it out. Given my recently loaded US baby names data set, I thought I’d ask some silly questions… more to see how RapidMiner models work than to accurately predict anything.
I wondered if one could predict a US person’s most likely birth state given their first name, gender, and year of birth. First off, I created a RapidMiner process which:
SELECT "name", "state", "gender","count", "year" FROM "public"."statenames" ORDER BY RANDOM() LIMIT 1000
SELECT "name", "gender", "year", "state" as "actual_state" FROM "public"."statenames" ORDER BY RANDOM() LIMIT 20
So this use of a decision tree was clearly a misapplication for this data set and question.
So let’s narrow the solution space by seeing if we can predict gender, a binomial, based on name and birth year using the national names data set. As an aside, there are a number of names that have both male and female records for a given year. For example, my own name ‘Guy’ has a handful of female counts for 59 different years. I leave it to the reader to decide if these are accurate records (actual baby girls named ‘Guy’) or simply a data quality problem.
select year, count from nationalnames where name = 'Guy' and gender = 'F' order by year desc
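The query above checks a single name. To find every name with both male and female records in the same year, the same idea can be sketched in Python. This is only an illustration: the rows are made-up toy data shaped like the national names table, and the function name `mixed_gender_years` is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows shaped like the national names table:
# (name, year, gender, count). All values are made up for illustration.
rows = [
    ("Guy", 1950, "M", 1200),
    ("Guy", 1950, "F", 6),
    ("Guy", 1951, "M", 1100),
    ("Mary", 1950, "F", 65000),
    ("Mary", 1950, "M", 200),
    ("John", 1950, "M", 79000),
]

def mixed_gender_years(rows):
    """Map each name to the sorted years in which it has both M and F records."""
    genders_seen = defaultdict(set)  # (name, year) -> set of genders recorded
    for name, year, gender, _count in rows:
        genders_seen[(name, year)].add(gender)

    result = defaultdict(list)
    for (name, year), genders in genders_seen.items():
        if {"M", "F"} <= genders:  # both genders present for this name and year
            result[name].append(year)
    return {name: sorted(years) for name, years in result.items()}

mixed = mixed_gender_years(rows)
print(mixed)
```

Comparing the minority count with the majority count for each of these name and year pairs would be one way to judge whether a record looks like a real birth or a data entry slip.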
So, I created a RapidMiner process with a Naive Bayes model, increased the training set size to 10,000 records, and added a Performance (Binomial Classification) operator as shown:
Repeated runs generated results approximating a coin flip, so not great. Here’s an example:
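For anyone who wants to poke at the same idea outside RapidMiner, here is a minimal Python sketch of a Naive Bayes classifier that predicts gender from name and birth year. It is not the RapidMiner process itself. It assumes both attributes are treated as nominal values, uses Laplace smoothing, and trains on a handful of made-up rows. The function names `fit` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training rows: (name, year, gender). Made up for illustration.
train = [
    ("Guy", 1950, "M"), ("Guy", 1951, "M"), ("Guy", 1950, "F"),
    ("Mary", 1950, "F"), ("Mary", 1951, "F"), ("Mary", 1952, "F"),
    ("John", 1950, "M"), ("John", 1952, "M"),
    ("Anna", 1951, "F"), ("Anna", 1952, "F"),
]

def fit(rows):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (feature_index, label) -> value counts
    vocab = defaultdict(set)               # feature_index -> distinct values seen
    for *features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, feature_counts, vocab

def predict(model, features, alpha=1.0):
    """Return the label with the highest Laplace-smoothed log posterior."""
    class_counts, feature_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in sorted(class_counts.items()):
        score = math.log(n / total)  # log prior for this class
        for i, value in enumerate(features):
            seen = feature_counts[(i, label)][value]
            # The extra +1 on the vocabulary size reserves mass for unseen values.
            score += math.log((seen + alpha) / (n + alpha * (len(vocab[i]) + 1)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Score a few held-out rows and compute plain accuracy.
test = [("Mary", 1951, "F"), ("John", 1950, "M"), ("Guy", 1950, "F")]
model = fit(train)
predictions = [predict(model, (name, year)) for name, year, _ in test]
accuracy = sum(p == actual for p, (_, _, actual) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

In this sketch, a name that never appears in the training rows adds almost no evidence, so the prediction leans on the class prior and the year. Whether something similar was happening in the RapidMiner runs is only a guess.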
While the results above are not particularly predictive, I did learn a good deal about RapidMiner and its algorithms. For rapid prototyping, it is a very effective tool. For a better result from the name data, check out this example which predicts age from name and gender.