
NFL Box Score XML Format

So, the work on the scraping program for putting NFL data and statistics into a parsable, useful format is coming along well. I’ve got a sample box score in html format (from nfl.com) and the corresponding xml file that my program creates. I’ll post the links here at the bottom, and what I’m looking for right now is some input about the format. I tried to keep the xml file schema as close to the logical layout found on the nfl.com box score page as possible.

The initial format contains a “game-metadata” section, and two “team” sections. The game-metadata section consists of the names of the two teams playing and is also a placeholder for a bunch of information that I’d like to include in the future, such as the date, day of the week, weather, surface of the playing field, whether or not it’s a dome, etc.
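As a rough illustration of that top-level layout (a sketch, not the actual file my program writes), here is how the skeleton could be built with Python's standard xml.etree.ElementTree module. The root element name "game" and the "away"/"home" and "name" attributes are assumptions made for the example; only "game-metadata", "team", and "team-metadata" come from the description above.

```python
import xml.etree.ElementTree as ET


def build_skeleton(away, home):
    """Build one game element: a game-metadata section plus two team sections."""
    game = ET.Element("game")  # root element name is an assumption
    # Placeholder for date, weather, surface, dome, etc. in the future.
    ET.SubElement(game, "game-metadata", {"away": away, "home": home})
    for name in (away, home):
        team = ET.SubElement(game, "team")
        ET.SubElement(team, "team-metadata", {"name": name})
    return game


game = build_skeleton("Team A", "Team B")
print(ET.tostring(game, encoding="unicode"))
```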

Each team has its own section with “team-metadata”, like team name, win or loss, current record, etc. It also houses all of the team stats from the box score, and they’re labeled intuitively, like passingTouchdowns or fumblesLost. They’re also available in a format that can easily be parsed from a String to an integer. For example, what reads in the box score as 17-26 (passing completions and attempts) is listed in the xml file as two separate fields, passingComp="17" and passingAtt="26", so that you don’t need to worry about problems with converting to an int.
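To make the idea of splitting a combined box score value concrete, here is a minimal Python sketch. The helper name split_pair is hypothetical and this is not the code from my program; it only shows how a "completions-attempts" string could become two integer fields.

```python
def split_pair(text):
    """Split a combined box score value like '17-26' into two integers."""
    first, second = text.strip().split("-")
    return int(first), int(second)


comp, att = split_pair("17-26")
# Field names follow the xml attributes described above.
team_stats = {"passingComp": comp, "passingAtt": att}
print(team_stats)
```

Anyone reading the xml file then gets two plain integers and never has to deal with the hyphen.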

Also under the “team” tag is a section for individual player stats. These are broken down into different categories: passing, rushing, receiving, fumbles, kicking, punting, kickoff returns, punt returns, and defense. Within each category is a list of all the players who recorded a stat for that particular category. So Trent Edwards is listed under the passing and rushing categories but not under kickoff returns or defense. The other option that I was considering was just listing each player for a team with all of his stats together, rather than separating them by category. If you leave input, please keep this in mind.
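The two layouts are not far apart, since one can be derived from the other. The sketch below reads a category-first team section and regroups it by player. Everything in the sample is an assumption for illustration: the "player-stats", "passing", "rushing", and "player" element names, the attribute names, the placeholder player names, and the numbers.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical category-first sample; names and numbers are placeholders.
SAMPLE = """
<team>
  <player-stats>
    <passing>
      <player name="Player A" comp="1" att="2"/>
    </passing>
    <rushing>
      <player name="Player A" att="1" yards="3"/>
      <player name="Player B" att="4" yards="5"/>
    </rushing>
  </player-stats>
</team>
"""


def stats_by_player(team_xml):
    """Regroup category-first stats into {player: {category: {stat: int}}}."""
    team = ET.fromstring(team_xml.strip())
    grouped = defaultdict(dict)
    for category in team.find("player-stats"):
        for player in category.findall("player"):
            # Every attribute except the name is treated as an integer stat.
            stats = {k: int(v) for k, v in player.attrib.items() if k != "name"}
            grouped[player.get("name")][category.tag] = stats
    return dict(grouped)


print(stats_by_player(SAMPLE))
```

Keeping the file organized by category matches the box score page, and a reader who prefers the per-player view can build it in a few lines like this.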

Without further ado, here is the link to the original box score at nfl.com: Bills’ defense stifles Jets in victory

And here is the link for the xml file that my program generated: buf-nyj.xml

The xml file is currently hosted on my personal school web space, since WordPress has restrictions on uploading xml files. Any advice on a place to permanently store the full set of stats would be appreciated as well. Please keep in mind that this is only a first trial and sample format, and that in order to get the most use out of it, your input is needed!

NFL Statistical Analysis

As I mentioned in my previous post, there is a lack of both raw data and intense research and analysis in both the NFL and college football, compared to the other big three professional sports. While I’d like to look at all kinds of sports data, I think that for the time being I’m going to be focusing on football data.

The first obstacle is the lack of raw data available for recent NFL games. When I say raw data, I primarily mean weekly box scores in an easily parsable format (like XML, CSV, or even plain text). If you’re interested, I have found some promising sources, but they just don’t have the exact stats that I’d like. Erik Berg has a wonderful site, full of regularly updated XML files for sports, but he focuses primarily on the NBA and college basketball, it would seem. His NFL beta box scores from the 2006 season also don’t have the individual player stats that I’d like to have. But it’s still a good resource for information about the SportsML schema for storing box scores.

The football stats juggernaut, pro-football-reference.com, is also a great source for data, and its new beta site, rbref.com, has a .csv format for lots of the stats on the site, but it lacks enough stats for the individual games. This is not to put down the work that they are doing at all; it is definitely amazing and useful, it just doesn’t have exactly what I’m looking for.

So it would seem that my own pickiness means I need to take matters into my own hands and come up with a solution myself. The best remedy I could think of involved finding a source that provides box scores in .html format, and then writing a “scraping” program that turns the .html file into usable data. I checked out a few different sources (ESPN, CNN/Sports Illustrated, USA Today, Yahoo! Sports), but I finally settled on nfl.com. In terms of ease of parsing the html and depth of the stats listed, it was far and away the best. Also, it has a good play-by-play for each game, in case that’s data that I’m ever interested in using.
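For anyone curious what the scraping step looks like in principle, here is a minimal sketch using Python's standard html.parser module. It is not my actual program, and the sample table is made up: the real nfl.com markup is more involved, and the row labels and numbers here are placeholders.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._row[-1] = self._row[-1].strip()
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data


# Hypothetical box score fragment; labels and values are placeholders.
SAMPLE_HTML = """
<table>
  <tr><td>Total First Downs</td><td>10</td><td>12</td></tr>
  <tr><td>Fumbles Lost</td><td>1</td><td>0</td></tr>
</table>
"""


def scrape_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


for row in scrape_rows(SAMPLE_HTML):
    print(row)
```

Each row comes back as a label followed by one value per team, which is the raw material for the team stats.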

I’m currently in the process of writing this program (so far it correctly grabs the team stats, but it doesn’t handle the individual stats yet), and my question is: should I put any effort into writing the data back out into either an xml file or a csv file so that others can use it? Would anyone find this type of data useful? If so, which format would you prefer, xml or csv? Also, the nfl.com box scores only go back to the 2002 season, but that’s still five years’ worth of data with a lot of statistics, which I prefer over 20 years of minimal data. If you have any advice or input on this process, or any specific requests or ideas, feel free to let me know and I’ll keep you updated on the progress.
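To give a feel for the two output options, here is a small sketch that writes the same team stats both ways using Python's standard csv and xml.etree.ElementTree modules. The team names and numbers are placeholders, the "game" root element is an assumption, and storing each stat as an attribute is just one possible choice.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Placeholder data; field names follow the labels mentioned in the posts above.
ROWS = [
    {"team": "Team A", "passingTouchdowns": 1, "fumblesLost": 0},
    {"team": "Team B", "passingTouchdowns": 2, "fumblesLost": 1},
]


def to_csv(rows):
    """Write one line per team, with a header row taken from the first dict."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    """Write one <team> element per team, with each stat as an attribute."""
    root = ET.Element("game")  # root element name is an assumption
    for row in rows:
        ET.SubElement(root, "team", {key: str(value) for key, value in row.items()})
    return ET.tostring(root, encoding="unicode")


print(to_csv(ROWS))
print(to_xml(ROWS))
```

The csv version is flat and easy to open in a spreadsheet, while the xml version can nest things like per-category player stats, so the choice mostly depends on how much structure you want to keep.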