Posts Tagged 'sports'

NFL Statistical Analysis

As I mentioned in my previous post, the NFL and college football suffer from a lack of both raw data and in-depth research and analysis compared to the other three major professional sports. While I’d like to look at all kinds of sports data, for the time being I’m going to focus on football data.

The first obstacle is the lack of raw data available for recent NFL games. When I say raw data, I primarily mean weekly box scores in an easily parsable format (like XML, CSV, or even plain text). I have found some promising sources, but they just don’t have the exact stats that I’d like. Erik Berg has a wonderful site, full of regularly updated XML files for sports, but he seems to focus primarily on the NBA and college basketball. His NFL beta box scores from the 2006 season also don’t have the individual player stats that I’d like. But it’s a good resource for information about the SportsML schema for storing box scores.

The football stats juggernaut, pro-football-reference.com, is also a great source for data, and its new beta site, rbref.com, offers a .csv format for lots of the stats on the site, but it lacks enough stats for individual games. That’s not to put down the work they are doing at all; it is definitely amazing and useful, it just doesn’t have exactly what I’m looking for.

So, it would seem that my own pickiness means I need to take matters into my own hands. The best remedy I could think of was to find a source that provides box scores in .html format, and then write a “scraping” program that turns the .html file into usable data. I checked out a few different sources (ESPN, CNN/Sports Illustrated, USA Today, Yahoo! Sports), but I finally settled on nfl.com. In terms of ease of parsing the html and depth of the stats listed, it was far and away the best. Also, it has a good play-by-play for each game, in case that’s data that I’m ever interested in using.
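For anyone curious what such a scraper might look like, here is a minimal sketch in Python using only the standard library. It is not my actual program, and it does not reflect the real nfl.com markup: the sample table, the "AWAY"/"HOME" labels, the numbers, and the names `TableRowParser` and `parse_team_stats` are all made up for illustration. It assumes the team stats sit in a plain table with one header row and one row per stat.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


def parse_team_stats(html):
    """Turn a three-column stats table into {team: {stat name: value}}."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *stat_rows = parser.rows
    away, home = header[1], header[2]
    stats = {away: {}, home: {}}
    for name, away_value, home_value in stat_rows:
        stats[away][name] = away_value
        stats[home][name] = home_value
    return stats


# Hypothetical box score fragment; placeholder teams and numbers.
SAMPLE_HTML = """
<table>
  <tr><th>Stat</th><th>AWAY</th><th>HOME</th></tr>
  <tr><td>First Downs</td><td>18</td><td>21</td></tr>
  <tr><td>Total Net Yards</td><td>312</td><td>345</td></tr>
</table>
"""

team_stats = parse_team_stats(SAMPLE_HTML)
```

A real box score page would have many tables and messier markup, so the parser would need some way to pick out the right table, but the basic idea of walking the rows and cells stays the same.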

I’m currently in the process of writing this program (so far it correctly grabs the team stats, but doesn’t handle the individual stats yet), and my question is: should I put any effort into writing the data back out into either an xml file or a csv file, so that others can use it? Would anyone find this type of data useful? If so, which format would you prefer, xml or csv? Also, the nfl.com box scores only go back to the 2002 season, but that’s still five years’ worth of data with a lot of statistics, which I prefer over 20 years of minimal data. If you have any advice or input on this process, or any specific requests or ideas, feel free to let me know, and I’ll keep you updated on the progress.
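To make the csv versus xml question a bit more concrete, here is a rough sketch of what writing the same team stats out in each format could look like in Python. The element names (`boxscore`, `team`, `stat`) and the column layout are my own placeholders, not the SportsML schema, and the team labels and numbers are made up.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder data in the shape {team: {stat name: value}}.
team_stats = {
    "AWAY": {"First Downs": "18", "Total Net Yards": "312"},
    "HOME": {"First Downs": "21", "Total Net Yards": "345"},
}


def to_csv(stats):
    """One row per team, one column per stat (columns sorted by name)."""
    stat_names = sorted({name for team in stats.values() for name in team})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["team"] + stat_names)
    for team, values in stats.items():
        writer.writerow([team] + [values.get(name, "") for name in stat_names])
    return buffer.getvalue()


def to_xml(stats):
    """One <team> element per team, with one <stat> child per stat."""
    root = ET.Element("boxscore")
    for team, values in stats.items():
        team_el = ET.SubElement(root, "team", name=team)
        for name, value in values.items():
            stat_el = ET.SubElement(team_el, "stat", name=name)
            stat_el.text = value
    return ET.tostring(root, encoding="unicode")


csv_text = to_csv(team_stats)
xml_text = to_xml(team_stats)
```

The csv version is flat and opens directly in a spreadsheet, while the xml version nests more naturally once individual player stats are added under each team, which is part of why I’m asking which one people would prefer.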

Not All Sports Are Created Equal

As I begin my journey into the realm of sports statistical analysis, I am finding that far more research and data exist for MLB and the NBA than for the NFL and NHL, especially in terms of sabermetrics and APBRmetrics analyzing individual players. If I had to guess, I’d put the split at around 80% baseball and basketball to 20% football and hockey. These are just ballpark figures and may be way off, as they’re based only on my meager hours of research into the subject on the Internet.

So, the logical question to ask is: how is it that a sport as popular in the US as football, with a league as lucrative as the NFL, has such a relatively small amount of this type of analysis? The NHL can be written off because it has never measured up to the rest of the big four in popularity, but the NFL? There has got to be something holding it back that lies within the game itself. So let’s compare the sports of baseball and football.

Each has a clear, common goal: obtain more points (runs) than your opponent to win a game. Win more games than any of your opponents to be successful in the season. The difference is in how the two sports achieve this goal, and in the number of ways they can do it. In baseball, essentially the only way to score runs is through hitting. In football, points can be scored by running, passing, special teams, or even on defense, which complicates things greatly. Another factor that makes football more complex is that the current situation in the game greatly affects how your team will try to score (running the ball with a lead vs. passing when you’re down), while in baseball you still try to get a hit whether the score is tied 2-2 in the top of the ninth or you’re up 13 in a blowout. (I realize that there are certain instances in baseball where certain types of offense are more valuable, e.g. sacrifice bunts, but at the fundamental level, you don’t let the score of the game change the way you play, like you do in football.)

It is also harder in football than in baseball to assign value to a player and determine how great a role he played in achieving a particular outcome. Football players are broken up into different positions, which affect what they are doing 100% of the time. Baseball positions only affect what players are doing 50% of the time, because on offense everyone is trying to get hits and score runs (unless you’re a pitcher in the AL). In football, by contrast, a left guard isn’t going to be trying to rack up receiving touchdowns or YAC, so there is no easy way to compare him to a wide receiver.

The bottom line is that football doesn’t lend itself to being dissected in the way that baseball, basketball, and even hockey do, and for that to change, there needs to be some sort of revolution in the way stats are kept at football games.