XML Box Scores Now Available

It’s been a little while, due to waiting on the arrival of a new laptop, but my stats parser for generating NFL box scores in an XML format is finally ready to be “beta-tested.” What I mean by this is, I have XML box scores for the entire 2006 season, which I’m going to make available for download. Due to the sheer volume of statistics available, there’s no way that I can test the accuracy of each file, but I will say that I have run a few tests that look at all of the games for the season and I came up with the same season leaders in multiple categories as nfl.com has on their web site.

Unfortunately, WordPress doesn’t allow me to upload either zip files or xml files, so I’ve uploaded them to a third-party site, mediafire.com, where they can be found at this link: http://www.mediafire.com/?23mtrjxwfmy . That file unzips to a folder that contains a folder for each of the 17 weeks of the 2006 season, each of which contains all of the games played in that week. The format is very similar to the one mentioned here, NFL Box Score XML Format, with a few additions found in the gamebook for the game.
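If you want to run the same kind of season-leader sanity check yourself, here is a rough Python sketch that walks the unzipped folder and totals one stat per player across every game. The element and attribute names used here (a "passing" section holding "player" elements with "name" and "yards" attributes) are placeholders for illustration and may not match the files exactly, so check them against a real file first.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def season_totals(root_dir, category="passing", stat="yards"):
    """Sum one stat per player over every .xml file found under root_dir."""
    totals = Counter()
    # rglob descends into each week folder, whatever the folders are called.
    for xml_path in sorted(Path(root_dir).rglob("*.xml")):
        tree = ET.parse(xml_path)
        # Assumed layout: <team><passing><player name="..." yards="..."/></passing></team>
        for player in tree.iterfind(f".//{category}/player"):
            totals[player.get("name")] += int(player.get(stat, 0))
    return totals
```

Calling season_totals on the unzipped folder and then most_common(5) on the result gives a top-five list that can be compared by eye against the leaders on nfl.com.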

If you download the files and look at the xml files, please leave a comment to let me know what you think of the format, or if there are any improvements you’d like to see. I’m planning on using the 2006 season as a “test-run” sort of thing while I work on a gamebook parser that will be included in the 2007 season (and later used to improve the 2006 files). Thanks, and enjoy!


NFL “Official” Analysis

No, no, I’m not making any claims to be the best or official analyzer for the NFL. Rather, I was wondering whether anyone has ever attempted to perform some sort of analysis on the officiating crews for NFL games. With all the hoopla and conspiracy theories regarding whether the NFL wants the Patriots to continue their undefeated season, I thought it would be interesting to see if there is any correlation between certain officials working certain games and how those games are called. Then I realized there would probably not be enough data to see how a crew performs on a team-by-team basis, but there would be enough to compare crews against one another over a season in terms of how many penalties they call. Once I finish my gamebook parser with the play-by-play, I’ll also know how many of each type of penalty was called, so you can find the crews with the most holding calls, pass interference calls, etc.

I’m still working on gathering all the data for this (I have the parsing program working for certain sections, but now I have the tedious task of downloading all the pdfs and converting them to html for all the games) but once I have it, can anyone suggest the types of regressions I should perform to try to discover something useful? I have some background in statistics, but not as much as I’d like, though I’m definitely willing to learn. I’m planning on including the officials data in my xml file of the box score for the game, so everyone else can use that data as well.
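Before getting to any regressions, the crew-versus-crew comparison could start as simply as the Python sketch below. It is not a regression, just each crew's average penalties per game set against the league-wide average, with a rough z-score. The crew names and penalty counts are made up purely to show the shape of the input, and the function name is only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def crew_penalty_rates(games):
    """games: iterable of (crew_name, penalties_called) pairs, one per game."""
    by_crew = defaultdict(list)
    for crew, penalties in games:
        by_crew[crew].append(penalties)

    all_counts = [p for counts in by_crew.values() for p in counts]
    league_mean = mean(all_counts)
    league_sd = pstdev(all_counts)

    report = {}
    for crew, counts in by_crew.items():
        crew_mean = mean(counts)
        # Standard error of a crew's average if its games were ordinary league games.
        se = league_sd / len(counts) ** 0.5 if league_sd else 0.0
        z = (crew_mean - league_mean) / se if se else 0.0
        report[crew] = {"games": len(counts), "mean": crew_mean, "z": z}
    return league_mean, report


# Invented numbers, not real officiating data.
sample = [("Crew A", 10), ("Crew A", 14), ("Crew B", 6),
          ("Crew B", 8), ("Crew C", 12), ("Crew C", 10)]
league_mean, report = crew_penalty_rates(sample)
```

With only 16 or so games per crew, a z-score like this is a screening tool at best. It also ignores which teams were playing, which is exactly the kind of thing a proper regression would need to account for.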

Bills’ Playoff Forecast

So it’s time that I let you all in on the disappointing secret that I am a Buffalo Bills fan. In case you were wondering, from childhood, I selected sports franchises that were doomed to let me down – the Bills, Buffalo Sabres, and Chicago Cubs – and even my college team, NC State. Now that I have divulged that embarrassing piece of information, I’ll let you in on a little pastime of mine that comes around the second week of December. I begin the long, drawn-out process of figuring out the exact scenarios necessary for my dreams of the Bills making the playoffs to come true.

Here we are again – Buffalo teetering on mediocrity at 6-6 with 4 games to go: Miami, Cleveland, New York Giants, and Philly. I consider the first two, Miami and Cleveland, to be “must-win” for all intents and purposes. Miami, because it is a conference game, and an important game for the Bills to gain some momentum, and Cleveland because they are currently one game ahead of the Bills in the AFC Wild-Card race. A victory there would not only tie them up in the race, but also give them the head-to-head tiebreaker in a two-way tie for the last wildcard spot. At 8-6, the Bills would almost control their own destiny. I’ll wait one week to start my exact scenarios but I wanted to mention the chances that the nfl-forecast.com blog, powered by Brian Burke’s prediction model, gives the Bills. After Week 12 the Bills had just a 3% chance of obtaining one of the two AFC wild cards. Granted, this number probably went up after the Bills’ victory and a loss by the Browns, but I still think it will be a little low.

I have a lot of respect for Brian’s prediction methodology and think it’s the best model that I’ve come across, but it can’t take into account things like the return of Marshawn Lynch, or Trent Edwards taking over the starting job. For now, I’ll just hold onto the glimmer of hope my teams always give me, before I’m brutally crushed.


NFL-forecast.com has released its post-Week 13 playoff predictions and the Bills bring home a whopping 17.88% chance of making the playoffs. Now my heart will only be broken a little more than 4 out of 5 times!

Availability of Free NFL Statistics

As I’ve been trying to work on my program which parses an NFL.com box score, I have to wonder why there is such a paucity of usable statistics, in not only the NFL but other professional sports leagues. Sure there are stats out there, but it seems to be an either/or choice between comprehensiveness and timeliness (e.g. nfl.com’s stats) and ease of use and downloadability (e.g. the csv files at pro-football-reference.com). Now I can understand why a site like ESPN or SI.com wouldn’t be able to make the stats it shows available for download, since it gets them from the stats conglomerate stats.com. However, the NFL shouldn’t have any legal qualms about making their stats available for download. So you have to wonder what is stopping the NFL from doing this.

Is it an issue of time spent to put them in a format available to download, or a question of technical server demands? I highly doubt it. Is it pressure from companies like Stats Inc. wanting to maintain a quasi-monopoly on the market? Probably not. The most likely reason is money. The NFL is probably trying to work on a way that they can make the stats available, but for a price. So, if the stats are easily accessible for free now, but at some point down the line they decide to make them available for a fee, they have just lost some money.

So if money is an issue, why not start making these stats available for a small fee for personal (non-commercial) use? Or attach a disclaimer that they can only be used for non-commercial use but sell a commercial license? On the surface there seems to be very little reason for the stats to be available on the site on separate pages for free, but with no database dump available for download.

NFL Box Score XML Format

So, the work on the scraping program for putting NFL data and statistics into a parsable, useful format is coming along well. I’ve got a sample box score in html format (from nfl.com) and the corresponding xml file that my program creates. I’ll post the links here at the bottom, and what I’m looking for right now is some input about the format. I tried to keep the xml file schema as close to the logical layout found on the nfl.com box score page as possible.

The initial format contains a “game-metadata” section, and two “team” sections. The game-metadata section consists of the names of the two teams playing and is also a placeholder for a bunch of information that I’d like to include in the future, such as the date, day of the week, weather, surface of the playing field, whether or not it’s a dome, etc.

Each team has its own section with “team-metadata”, like team name, win or loss, current record, etc. It also houses all of the team stats from the box score, and they’re labeled intuitively, like passingTouchdowns, or fumblesLost. Also, they’re available in a format that can easily be parsed from a String to an integer. For example, what is read in the box score as 17-26 (passing completions and attempts) is listed in the xml file as two separate fields, passingComp="17" and passingAtt="26", so that you don’t need to worry about problems with converting to an int.
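To make the completions-and-attempts split concrete, here is a small Python sketch of the idea. The helper name split_pair is made up for this example, and putting the two values as attributes directly on a "team" element is an assumption about the layout rather than a description of my actual program.

```python
import xml.etree.ElementTree as ET


def split_pair(text):
    """Turn a box score string like '17-26' into two ints."""
    first, second = text.strip().split("-", 1)
    return int(first), int(second)


comp, att = split_pair("17-26")

team = ET.Element("team")
# XML attribute values are always strings; the point is that each one
# now holds a single number that converts to an int with no extra work.
team.set("passingComp", str(comp))
team.set("passingAtt", str(att))

print(ET.tostring(team, encoding="unicode"))
```

Anyone reading the file back can then call int() on each attribute directly instead of splitting on the dash themselves.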

Also under the “team” tag is a section for individual player stats. These are broken down into different categories: passing, rushing, receiving, fumbles, kicking, punting, kickoff returns, punt returns, and defense. Within each category is a list of all the players who recorded a stat for that particular category. So Trent Edwards is listed under the passing and rushing categories but not kickoff returns or defense. The other option that I was considering was just listing each player for a team with all of his stats together, rather than separating them by category. If you leave input, please keep this in mind.
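The two layouts hold the same information, so picking one doesn't lock anyone out of the other. As a rough Python sketch, here is how a category-grouped team section could be regrouped by player after the fact. The tag and attribute names are placeholders and the numbers are invented, not taken from a real box score.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: placeholder tag/attribute names, invented numbers.
SAMPLE = """
<team name="Bills">
  <passing><player name="T. Edwards" comp="17" att="26"/></passing>
  <rushing>
    <player name="T. Edwards" att="2" yards="5"/>
    <player name="Some Running Back" att="20" yards="80"/>
  </rushing>
</team>
"""


def group_by_player(team):
    """Regroup category -> players into player -> category -> stats."""
    players = defaultdict(dict)
    for category in team:
        for player in category.findall("player"):
            stats = {k: int(v) for k, v in player.attrib.items() if k != "name"}
            players[player.get("name")][category.tag] = stats
    return dict(players)


by_player = group_by_player(ET.fromstring(SAMPLE))
```

Here by_player["T. Edwards"] ends up with both a "passing" and a "rushing" entry, while the running back has only "rushing". Going the other direction, from per-player back to per-category, is just as short.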

Without further ado, here is the link to the original box score at nfl.com: Bills’ defense stifles Jets in victory

And here is the link for the xml file that my program generated: buf-nyj.xml

The xml file is currently hosted on my personal school web space, since WordPress has restrictions on uploading xml files. Any advice on a place to permanently store the full set of stats would be appreciated as well. Please keep in mind this is only a first trial and sample format and that in order to get the most use out of it, your input is needed!

NFL Statistical Analysis

As I mentioned in my previous post, there is a lack of both raw data and intense research and analysis in both the NFL and college football, compared to the other big three professional sports. While I’d like to look at all kinds of sports data, I think that for the time being I’m going to be focusing on football data.

The first obstacle is the lack of raw data available for recent NFL games. When I say raw data, I primarily mean weekly box scores in an easily parsable format (like XML, CSV, or even plain text). If you’re interested, I have found some sources, which have been promising, but just don’t have the exact stats that I’d like. Erik Berg has a wonderful site, full of regularly updated XML files for sports, but he focuses primarily on the NBA and college basketball, it would seem. His NFL beta box scores from the 2006 season also don’t have the individual player stats that I’d like to have. But it’s a good resource for information about the SportsML schema for storing box scores.

The football stats juggernaut, pro-football-reference.com, is also a great source for data, and its new beta site, rbref.com, has a .csv format for lots of the stats on the site, but lacks enough stats for the individual games. Not to put down the work that they are doing at all…it is definitely amazing and useful, it just doesn’t have exactly what I’m looking for.

So, it would seem that my own pickiness means that I need to take matters into my own hands and come up with some sort of solution myself. The best remedy I could think of involved finding a source that provides box scores in .html format, and then writing a “scraping” program that turns the .html file into usable data. I checked out a few different sources, ESPN, CNN/Sports Illustrated, USA Today, Yahoo! Sports, but I finally settled on nfl.com. In terms of ease of parsing the html and depth of the stats listed, it was far and away the best. Also, it has a good play-by-play for each game, in case that’s data that I’m ever interested in using.
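For anyone curious what a scraping program boils down to, here is a minimal Python sketch using only the standard library's html.parser. The HTML snippet is a stand-in I made up, NOT nfl.com's real page structure, and the numbers in it are invented. A real box score page would need its own tag and class handling.

```python
from html.parser import HTMLParser

# Stand-in markup, not the real nfl.com layout: a two-column stats table.
SAMPLE_HTML = """
<table class="team-stats">
  <tr><td>First Downs</td><td>18</td></tr>
  <tr><td>Passing (Comp-Att)</td><td>17-26</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
team_stats = dict(scraper.rows)  # label -> raw text, e.g. "17-26"
```

The values come out as raw strings, so a second pass is still needed to turn things like a comp-att pair into numbers.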

I’m currently in the process of writing this program (so far it correctly grabs the team stats, but hasn’t dealt with the individual stats yet), and my question is, should I put any effort into writing the data back out into either an xml file or csv file, so that others can use it? Would anyone find this type of data useful? If so, which format would you prefer, xml or csv? Also, the nfl.com box scores only go back to the 2002 season, but that’s still five years’ worth of data with a lot of statistics, which I prefer over 20 years of minimal data. If you have any advice or input on this process, or any specific requests or ideas, feel free to let me know and I’ll keep you updated with the progress.
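To give a feel for the xml-versus-csv choice, here is a small Python sketch that writes the same team stats both ways. The field names and values are invented for the comparison and aren't a committed format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented values, just to compare the two output shapes.
team_stats = {"team": "BUF", "passingComp": 17, "passingAtt": 26}


def to_xml(stats):
    """One element per team, one attribute per stat."""
    elem = ET.Element("team", {k: str(v) for k, v in stats.items()})
    return ET.tostring(elem, encoding="unicode")


def to_csv(rows):
    """One header line, then one line per team per game."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


xml_text = to_xml(team_stats)
csv_text = to_csv([team_stats])
```

The csv version opens straight into a spreadsheet but is flat, so individual player stats would need their own files. The xml version can nest players under teams under a game, at the cost of needing a parser to read it.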

Not All Sports Are Created Equal

As I am beginning my journey into the realm of sports statistical analysis, I am finding that the research and data available for MLB and the NBA dwarf what can be found for the NFL and NHL, especially in terms of sabermetrics- and APBRmetrics-style analysis of individual players. If I had to guess, I’d put the disparity at around 80% baseball & basketball to 20% football and hockey. These are just ballpark figures and may be way off, as they’re based only on my meager hours of research into the subject on the Internet.

So, the logical question to ask is, how is it that a sport as popular in the US as football, with a league as lucrative as the NFL, has such a relatively small amount of this type of analysis? The NHL can be written off because it never has measured up to the rest of the big four in popularity, but the NFL? There has got to be something holding it back that lies within the game itself. So let’s compare the sports of baseball and football.

Each has a clear, common goal: obtain more points (runs) than your opponent to win a game. Win more games than any of your opponents to be successful in the season. The difference is in how the two sports achieve this goal, and in the number of ways. In baseball, essentially the only way to score runs is through hitting. In football, points can be scored by running, passing, special teams, or even on defense, which complicates things greatly. Another factor that makes football more complex is that the current situation in the game greatly affects how your team will try to score (running the ball with a lead vs. passing when you’re down), while in baseball you still try to get a hit whether the score is tied 2-2 in the top of the ninth, or whether you’re up 13 in a blowout (I realize that there are certain instances in baseball where certain types of offense are more valuable, e.g. sacrifice bunts, but at the fundamental level, you don’t let the score of the game change the way you play, like in football).

It is also harder in football than in baseball to assign value to a player and determine how great a role they played in achieving a particular outcome. Football players are broken up into different positions, which affect what they are doing 100% of the time. Baseball positions only affect what players are doing 50% of the time, because everyone is trying to get hits and score runs on offense (unless you’re a pitcher in the AL). In football, a left guard isn’t going to be trying to rack up receiving touchdowns or YAC, so there is no easy way to compare him to a wide receiver.

The bottom line is football doesn’t lend itself to being dissected in the way that baseball, basketball, and even hockey do, and in order for this to happen, there needs to be some sort of revolution in the way stats are kept at football games.