Archive for November, 2007

NFL Box Score XML Format

So, the work on the scraping program for putting NFL data and statistics into a parsable, useful format is coming along well. I’ve got a sample box score in html format and the corresponding xml file that my program creates. I’ll post the links at the bottom, and what I’m looking for right now is some input about the format. I tried to keep the xml file schema as close as possible to the logical layout found on the box score page.

The initial format contains a “game-metadata” section, and two “team” sections. The game-metadata section consists of the names of the two teams playing and is also a placeholder for a bunch of information that I’d like to include in the future, such as the date, day of the week, weather, surface of the playing field, whether or not it’s a dome, etc.
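As a rough illustration of that top-level layout, here is a minimal Python sketch. The “game-metadata” and “team” section names come from the description above; the root element name `game`, the `team-name` child elements, and the `build_skeleton` helper are assumptions made up for this example and may not match the actual file.

```python
import xml.etree.ElementTree as ET


def build_skeleton(team_names):
    """Build an empty box-score document: one game-metadata section
    followed by one team section per team."""
    root = ET.Element("game")  # root element name is an assumption
    metadata = ET.SubElement(root, "game-metadata")
    for name in team_names:
        # The metadata section lists the names of the teams playing.
        ET.SubElement(metadata, "team-name").text = name
    for name in team_names:
        # Each team gets its own section; stats would be added here later.
        team = ET.SubElement(root, "team")
        ET.SubElement(team, "team-metadata").set("name", name)
    return root


root = build_skeleton(["Bills", "Jets"])
print(ET.tostring(root, encoding="unicode"))
```

Future fields such as the date, weather, or playing surface could be added as further children of the game-metadata element without touching the team sections.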

Each team has its own section with “team-metadata”, like team name, win or loss, current record, etc. It also houses all of the team stats from the box score, and they’re labeled intuitively, like passingTouchdowns or fumblesLost. Also, they’re available in a format that can easily be parsed from a String to an integer. For example, what is read in the box score as 13-25 (passing completions and attempts) is listed in the xml file as two separate fields, passingComp="13" and passingAtt="25", so that you don’t need to worry about problems with converting to an int.
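Here is a small Python sketch of that splitting step, just to show the idea. The passingComp, passingAtt, passingTouchdowns, and fumblesLost labels are the ones mentioned above; the `team-stats` tag name, the helper functions, and the touchdown and fumble values are placeholders invented for the example, not taken from the real game or from my program.

```python
import xml.etree.ElementTree as ET


def split_pair(text):
    """Split a combined box-score value such as '13-25' into two ints."""
    first, second = text.split("-")
    return int(first), int(second)


def team_stats_element(raw_stats):
    """Turn raw box-score strings into an element with one attribute
    per stat, so each value converts cleanly with int()."""
    comp, att = split_pair(raw_stats["passing"])
    elem = ET.Element("team-stats")  # tag name is an assumption
    elem.set("passingComp", str(comp))
    elem.set("passingAtt", str(att))
    elem.set("passingTouchdowns", str(int(raw_stats["passingTouchdowns"])))
    elem.set("fumblesLost", str(int(raw_stats["fumblesLost"])))
    return elem


# Placeholder values for illustration only.
sample = {"passing": "13-25", "passingTouchdowns": "1", "fumblesLost": "0"}
xml_text = ET.tostring(team_stats_element(sample), encoding="unicode")
print(xml_text)
```

Anyone reading the file can then call `int()` on each attribute directly instead of re-parsing the hyphenated string.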

Also under the “team” tag is a section for individual player stats. These are broken down into different categories: passing, rushing, receiving, fumbles, kicking, punting, kickoff returns, punt returns, and defense. Within each category is a list of all the players who recorded a stat for that particular category. So Trent Edwards is listed under the passing and rushing categories but not kickoff returns or defense. The other option that I was considering was just listing each player for a team with all of his stats together, rather than separating them by category. If you leave input, please keep this in mind.
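The two layouts carry the same information, so one can be derived from the other. The Python sketch below regroups a category-first layout into a per-player view. The element and attribute names (`player-stats`, `player`, `name`, `rushingAtt`) and all of the numbers are assumptions for illustration; they are not copied from buf-nyj.xml.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical category-first layout with placeholder numbers.
SAMPLE = """
<team>
  <player-stats>
    <passing>
      <player name="Trent Edwards" passingComp="13" passingAtt="25"/>
    </passing>
    <rushing>
      <player name="Trent Edwards" rushingAtt="2"/>
    </rushing>
  </player-stats>
</team>
"""


def stats_by_player(team_elem):
    """Merge every category's entries into one dict of stats per player."""
    players = defaultdict(dict)
    for category in team_elem.find("player-stats"):
        for player in category.findall("player"):
            for key, value in player.attrib.items():
                if key != "name":
                    players[player.get("name")][key] = int(value)
    return dict(players)


team = ET.fromstring(SAMPLE.strip())
by_player = stats_by_player(team)
print(by_player)
```

Going the other way (per-player to per-category) takes about the same amount of code, so the choice mostly comes down to which view people expect to use more often.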

Without further ado, here is the link to the original box score: Bills’ defense stifles Jets in victory

And here is the link for the xml file that my program generated: buf-nyj.xml

The xml file is currently hosted on my personal school web space, since WordPress has restrictions on uploading xml files. Any advice on a place to permanently store the full set of stats would be appreciated as well. Please keep in mind that this is only a first trial and sample format, and that in order to get the most use out of it, your input is needed!


NFL Statistical Analysis

As I mentioned in my previous post, there is a lack of both raw data and intense research and analysis in both the NFL and college football, compared to the other big three professional sports. While I’d like to look at all kinds of sports data, I think that for the time being I’m going to be focusing on football data.

The first obstacle is the lack of raw data available for recent NFL games. When I say raw data, I primarily mean weekly box scores in an easily parsable format (like XML, CSV, or even plain text). If you’re interested, I have found some sources that are promising but just don’t have the exact stats that I’d like. Erik Berg has a wonderful site, full of regularly updated XML files for sports, but he focuses primarily on the NBA and college basketball, it would seem. His NFL beta box scores from the 2006 season also don’t have the individual player stats that I’d like to have. But it’s still a good resource for information about the SportsML schema for storing box scores.

The football stats juggernaut, is also a great source for data, and its new beta site,, has a .csv format for lots of the stats on the site, but lacks enough stats for the individual games. Not to put down the work that they are doing at all; it is definitely amazing and useful, it just doesn’t have exactly what I’m looking for.

So, it would seem that my own pickiness means I need to take matters into my own hands and come up with a solution myself. The best remedy I could think of involved finding a source that provides box scores in .html format, and then writing a “scraping” program that turns the .html file into usable data. I checked out a few different sources (ESPN, CNN/Sports Illustrated, USA Today, Yahoo! Sports), but I finally settled on In terms of ease of parsing the html and depth of the stats listed, it was far and away the best. Also, it has a good play-by-play for each game, in case that’s data that I’m ever interested in using.
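For anyone curious what a scraping program looks like, here is a minimal Python sketch using only the standard library. It is not my actual program: the `TableRows` class, the sample markup, and the numbers in it are invented for illustration, and a real box score page will have a messier layout than this.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up markup and placeholder numbers, not a real box score.
SAMPLE_HTML = """
<table>
  <tr><th>Stat</th><th>BUF</th><th>NYJ</th></tr>
  <tr><td>Comp-Att</td><td>13-25</td><td>20-31</td></tr>
</table>
"""

parser = TableRows()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

Once the cells are in plain lists like this, mapping each row label to a stat name is ordinary string handling.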

I’m in the process of writing this program (so far it correctly grabs the team stats, but hasn’t dealt with the individual stats yet), and my question is: should I put any effort into writing the data back out into either an xml file or a csv file, so that others can use it? Would anyone find this type of data useful? If so, which format would you prefer, xml or csv? Also, the box scores only go back to the 2002 season, but that’s still five years’ worth of data with a lot of statistics, which I prefer over 20 years of minimal data. If you have any advice or input on this process, or any specific requests or ideas, feel free to let me know and I’ll keep you updated with the progress.
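To give a feel for the csv option, here is a minimal Python sketch that writes one row of team stats per team. The column names, the `team_rows_to_csv` helper, and the values are all hypothetical placeholders chosen for the example.

```python
import csv
import io


def team_rows_to_csv(rows):
    """Write a list of per-team stat dicts as CSV text, one line per team."""
    fieldnames = ["team", "passingComp", "passingAtt", "fumblesLost"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder values for illustration only.
rows = [
    {"team": "BUF", "passingComp": 13, "passingAtt": 25, "fumblesLost": 0},
]
csv_text = team_rows_to_csv(rows)
print(csv_text)
```

A flat file like this is easy to open in a spreadsheet, while xml is a more natural fit for nested data such as per-category player stats.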

Not All Sports Are Created Equal

As I am beginning my journey into the realm of sports statistical analysis, I am finding that MLB and NBA research and data dwarf the amount of information that can be found for the NFL and NHL, especially in terms of sabermetric and APBRmetric analysis of individual players. If I had to guess, I’d put the disparity at around 80% baseball and basketball to 20% football and hockey. These are just ballpark figures and may be way off, as they’re based only on my meager hours of research into the subject on the Internet.

So, the logical question to ask is: how is it that for a sport as popular in the US as football, with a league as lucrative as the NFL, there is such a relatively small amount of this type of analysis? The NHL can be written off because it has never measured up to the rest of the big four in popularity, but the NFL? There has got to be something holding it back that lies within the game itself. So let’s compare the sports of baseball and football.

Each has a clear, common goal: obtain more points (runs) than your opponent to win a game. Win more games than any of your opponents to be successful in the season. The difference is in how the two sports achieve this goal, and in the number of ways. In baseball, essentially the only way to score runs is through hitting. In football, points can be scored by running, passing, special teams, or even on defense, which complicates things greatly. Another factor that makes football more complex is that the current situation in the game greatly affects how your team will try to score (running the ball with a lead vs. passing when you’re down), while in baseball you still try to get a hit whether the score is tied 2-2 in the top of the ninth, or whether you’re up 13 in a blowout. (I realize that there are certain instances in baseball where certain types of offense are more valuable, e.g. sacrifice bunts, but at the fundamental level, you don’t let the score of the game change the way you play, like in football.)

There is also greater difficulty in assigning value to a player and determining how great a role he played in achieving a particular outcome in football than there is in baseball. Football players are broken up into different positions, which affect what they are doing 100% of the time. Baseball positions only affect what players are doing 50% of the time, because everyone is trying to get hits and score runs on offense (unless you’re a pitcher in the AL). In football, by contrast, a left guard isn’t going to be trying to rack up receiving touchdowns or YAC, so there is no easy way to compare him to a wide receiver.

The bottom line is that football doesn’t lend itself to being dissected in the way that baseball, basketball, and even hockey do, and in order for this to happen, there needs to be some sort of revolution in the way stats are kept at football games.

Sports + Math = Perfect

Sports fans rally behind the unpredictable, cherish the suspenseful, and celebrate the emotional. These are age-old tenets that are proven time and again by classic sports stories that will live in the annals of sports history. With those tenets acknowledged, what is there to be said of bringing method to the madness? Does analyzing sports under a statistical microscope detract from their underlying beauty and majesty?

There is certainly an argument to be made against a movement of limiting sports to numbers and data, especially for the casual fan. An 8-year-old boy watching his baseball hero on a Saturday afternoon doesn’t want to hear that his favorite player was traded because his GM is adopting a “Moneyball” mentality. At the same time, I’d be willing to bet that same young boy could rattle off that player’s batting average and HR totals for the last few seasons.

So, to answer the question of whether or not intense sports statistical analysis detracts from sports, I’d have to go with a resounding “Not in the least.” If I’m watching a regular season MLB game in which I have no stake in either of the teams playing, I think that if the announcers discussed sabermetric stats such as Runs Created or Win Shares, I’d be interested in a game I might normally find boring. Also, seeing statistics about George Mason’s run to the Final Four and all its unlikelihood only serves to make the feat even more special. To me, statistical analysis in sports serves to make the mundane interesting, and the arcane all the more memorable.