
NFL Statistical Analysis

As I mentioned in my previous post, both the NFL and college football suffer from a lack of raw data and of in-depth research and analysis, compared to the other three major professional sports. While I’d like to look at all kinds of sports data, for the time being I’m going to focus on football data.

The first obstacle is the lack of raw data available for recent NFL games. When I say raw data, I primarily mean weekly box scores in an easily parsable format (like XML, CSV, or even plain text). If you’re interested, I have found some promising sources, but they just don’t have the exact stats that I’d like. Erik Berg has a wonderful site, full of regularly updated XML files for sports, but it would seem that he focuses primarily on the NBA and college basketball. His NFL beta box scores from the 2006 season also don’t have the individual player stats that I’d like. But it’s a good resource for information about the SportsML schema for storing box scores.

The football stats juggernaut, pro-football-reference.com, is also a great source for data, and its new beta site, rbref.com, offers a .csv format for lots of the stats on the site, but it lacks enough stats for the individual games. Not to put down the work that they are doing at all…it is definitely amazing and useful, it just doesn’t have exactly what I’m looking for.

So, it would seem that my own pickiness means I need to take matters into my own hands and come up with a solution myself. The best remedy I could think of involved finding a source that provides box scores in .html format, and then writing a “scraping” program that turns the .html file into usable data. I checked out a few different sources (ESPN, CNN/Sports Illustrated, USA Today, Yahoo! Sports), but I finally settled on nfl.com. In terms of ease of parsing the html and depth of the stats listed, it was far and away the best. Also, it has a good play-by-play for each game, in case that’s data that I’m ever interested in using.
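To give a rough idea of what such a scraping program does, here is a minimal sketch in Python using only the standard library's `html.parser`. The markup in `SAMPLE_HTML`, the "Home"/"Away" labels, the numbers, and the names `TableRowParser` and `parse_team_stats` are all made up for illustration; the real nfl.com pages are laid out differently and would need their own handling.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_team_stats(html):
    """Turn a two-team stats table into {team: {stat name: value}}."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    teams = header[1:]  # header[0] only labels the stat-name column
    stats = {team: {} for team in teams}
    for row in body:
        name, values = row[0], row[1:]
        for team, value in zip(teams, values):
            stats[team][name] = value  # values stay strings at this stage
    return stats


# Hypothetical markup and invented numbers, NOT the real nfl.com layout.
SAMPLE_HTML = """
<table>
  <tr><th>Stat</th><th>Home</th><th>Away</th></tr>
  <tr><td>First Downs</td><td>21</td><td>17</td></tr>
  <tr><td>Total Yards</td><td>356</td><td>298</td></tr>
</table>
"""

team_stats = parse_team_stats(SAMPLE_HTML)
```

Here `team_stats["Home"]["Total Yards"]` comes back as the string `"356"`. Converting values to numbers is left as a separate step, since real box scores mix formats like `3-12` or `31:24` that need their own parsing.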

I’m currently in the process of writing this program (so far it correctly grabs the team stats, but it hasn’t dealt with the individual stats yet), and my question is: should I put any effort into writing the data back out into either an xml file or a csv file, so that others can use it? Would anyone find this type of data useful? If so, which format would you prefer, xml or csv? Also, the nfl.com box scores only go back to the 2002 season, but that’s still five years’ worth of data with a lot of statistics, which I prefer over 20 years of minimal data. If you have any advice or input on this process, or any specific requests or ideas, feel free to let me know, and I’ll keep you updated on the progress.
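For anyone weighing the two formats, here is a small sketch of what each output might look like, using Python's standard `csv` and `xml.etree.ElementTree` modules. The rows, field names, `game_id` value, and the XML element names are all invented for illustration; in particular, the XML layout is an ad hoc one and not the SportsML schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical parsed output: one dict per team per game (values invented).
ROWS = [
    {"game_id": "example-001", "team": "Home", "first_downs": 21, "total_yards": 356},
    {"game_id": "example-001", "team": "Away", "first_downs": 17, "total_yards": 298},
]


def to_csv(rows):
    """Render the rows as CSV text: a header line plus one line per team."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Render the rows as XML text, with one <team> element per row."""
    root = ET.Element("boxscores")
    for row in rows:
        team = ET.SubElement(
            root, "team", {"game_id": row["game_id"], "name": row["team"]}
        )
        for key, value in row.items():
            if key not in ("game_id", "team"):
                ET.SubElement(team, "stat", {"name": key}).text = str(value)
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(ROWS)
xml_text = to_xml(ROWS)
```

The CSV version is flat and opens directly in a spreadsheet, while the XML version nests stats under each team and would extend more naturally to individual player stats.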