Friday, May 14, 2010

Reading for my European Travel

I will be in Europe on business travel the next two weeks. The majority of my time will be spent at DAF's engineering facility in Eindhoven, The Netherlands with a trip to see friends in Cologne, Germany over the weekend. I've loaded the following items onto my Kindle for my flight to and from Europe, some being soccer-related and others not.

I've also thought about picking up Inverting the Pyramid (sadly not available on the Kindle) at a local bookstore before I depart on Sunday, as well as Soccer Against the Enemy. In fact, one dedicated follower has recommended that I read Soccer Against the Enemy as it partly inspired How Soccer Explains the World. That leads me to a request of my readers:

  • What soccer-related books would you recommend to someone who has been following the game for about a year or so?

This post also serves as my official notice of a lower-than-normal post count on my blog over the next two weeks. I think all of you will forgive me if I decide to take my precious free time in Europe to enjoy all that it has to offer while I am there for two (too?) brief weeks.

I hope all of you have a wonderful weekend, and go out and enjoy some soccer!

Thursday, May 13, 2010

The politics of the international soccer market

As an Arsenal fan, I am always worried about Cesc Fabregas leaving the team on a transfer to what many consider his home club - Barcelona. I read an interesting story today on this topic which makes me thankful that US sports teams don't elect their leadership. Apparently Barcelona regularly have candidates seeking votes via the promised signing of international soccer stars.

Barcelona are due to hold presidential elections this summer, where prospective candidates promise stellar signings to promote their bid for power. Local boy Fabregas would be a hugely popular arrival at the Nou Camp due to his links with the club, meaning a huge offer could soon arrive on Wenger's doorstep.
I hope that, like most politicians' promises, this one goes unfulfilled. However, I am sure there are Liverpool fans who wish teams had such a voting arrangement for ownership!

Wednesday, May 12, 2010

Quantifying Edson Buddle's Hot Start


On pace for one of the greatest goal-scoring seasons ever

Everyone's been amazed by Edson Buddle's fast start to the 2010 MLS season - 9 goals in 8 matches. It's one of the greatest starts to a season, and likely was a key factor in his selection to the US Men's National Team preliminary squad for the World Cup. There is a statistical way to quantify how good his pace is compared to the Top 5 goal scorers in each MLS season. That method is discussed below.

Statistical Methodology

Given that each player plays a different number of matches due to injuries, service on the national team, and other reasons, I will look to a normalized metric rather than the raw statistic of total number of goals scored. The key to the analysis will be to select a normalized metric that produces a normal distribution - this will allow for the most straightforward analysis to understand how well Buddle is doing.

I tried two statistics. The first was the traditional goals per game. That produced a non-normal data set according to a normality test. I then performed a Box-Cox transform to see if any transformation of this data set would yield a normal data set. The result of the Box-Cox analysis was a value of -1.0, which corresponds to taking the inverse of the data. This means the distribution closest to normal is produced by the second statistic, games per goal.
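To make the "inverse" reading of the Box-Cox result concrete, here is a minimal sketch of the standard one-parameter Box-Cox transform. The goals-per-game values are made up for illustration and are not taken from the MLS data set; the function name is my own.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# Hypothetical goals-per-game values, for illustration only.
goals_per_game = [0.4, 0.5, 0.625, 0.8]
transformed = [box_cox(g, -1.0) for g in goals_per_game]
games_per_goal = [1.0 / g for g in goals_per_game]

# With a value of -1 the transform reduces to 1 - 1/x,
# a straight-line function of games per goal.
for t, inv in zip(transformed, games_per_goal):
    print(f"games per goal = {inv:.2f}, Box-Cox value = {t:.2f}")
```

Because the transformed value is just one minus games per goal, testing games per goal for normality is equivalent to testing the transformed data.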

Sure enough, the data set is normal based upon the p-value of 0.562 in Figure 1.

Figure 1: Graphical summary of games per goal

Now that we have a normal data set, we can generate the same data (games per goal) for Buddle and the four other top goal scorers so far in 2010 and compare them to the historical data via a Z-statistic. The Z-statistic takes the player's value (X), the historical mean (mu), and the standard deviation (sigma), and creates a single-point Z-score that can be converted to a percentile via tools like this website. Figure 2 shows the equation for Z.

Figure 2: Z statistic equation

Based upon Figure 1, mu = 1.7919 and sigma = 0.3810 for games per goal.
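As a rough check on the numbers, the Z-score and percentile for Buddle can be sketched in a few lines. This assumes the usual definition Z = (X - mu) / sigma and uses the standard normal distribution from Python's standard library in place of the website lookup; the function names are my own.

```python
from statistics import NormalDist

MU = 1.7919     # historical mean of games per goal (Figure 1)
SIGMA = 0.3810  # historical standard deviation of games per goal (Figure 1)

def z_score(x, mu=MU, sigma=SIGMA):
    """Standard score of a single observation."""
    return (x - mu) / sigma

def percentile(z):
    """Percent of a standard normal distribution that falls below z."""
    return NormalDist().cdf(z) * 100.0

# Buddle: 9 goals in 8 matches, so 8/9 games per goal.
buddle_games_per_goal = 8 / 9
buddle_z = z_score(buddle_games_per_goal)
buddle_pct = percentile(buddle_z)
print(f"Z = {buddle_z:.2f}, percentile = {buddle_pct:.2f}")
```

The Z-score comes out at roughly -2.37, which lands below the first percentile of the games-per-goal distribution.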

Where Buddle sits vs. MLS history

Figure 3 shows the 2010 Top 5 goal scorers' Z statistics and resultant percentiles vs. the 1996-2009 historical data. As one is trying to minimize the games played per goal, percentiles will be the reverse of what's normally expected. Edson Buddle's performance to date has him in the first percentile - that means he scores goals more frequently than 99%+ of the players in MLS history. What's interesting is that three players (Buddle, Dwayne De Rosario, and Chris Wondolowski) are currently on pace to score more frequently than 95% of the players who have ever played in MLS.

Figure 3: 2010 Top 5 goal scorers

I will keep this table up to date on a weekly basis, and track Buddle's progress towards becoming the most frequent goal scorer in MLS history. I will also spend some time in the coming week or two creating similar analyses for assists, shots, and shots on goal.

Monday, May 10, 2010

The Impact of MLS expansion on playoff qualification

In an earlier post I commented on the points and goals it would take to secure at least 8th place in the 2010 MLS table - the cutoff for the playoffs. I then wondered how those numbers would be affected with the addition of Vancouver and Portland in 2011 and Montreal in 2012. Luckily the response variable I used in the regression was generic:

-ln[p/(number of teams + 1 - p)]

Therefore, I can simply plug in the number of teams for the next couple of seasons, determine the response value that correlates to p=8 for each number of teams, and create the same 95% PI and CI values for each season. Figure 1 shows the results of that analysis up to the 20 teams everyone suspects is MLS's ultimate goal. I skipped the scenario for 17 teams given that the league will jump from 16 to 18 teams in 2011.
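A minimal sketch of that plug-in step, assuming the normalization from the original regression post, -ln[p/(number of teams + 1 - p)], which becomes -ln[p/(17 - p)] for the 16-team league of 2010. The function name is my own, and the regression coefficients needed to turn these response values into points or goal differential are not reproduced here.

```python
import math

def normalized_position(p, teams):
    """Response variable for finishing position p in a league of `teams` clubs."""
    return -math.log(p / (teams + 1 - p))

# Response value that corresponds to eighth place for each league size studied.
for teams in (16, 18, 19, 20):
    print(teams, round(normalized_position(8, teams), 3))
```

The response value for eighth place rises with every team added, which is why the points and goal differential required to qualify creep upward as the league expands.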

Figure 1: Goal differential and points required to make MLS playoffs based upon 16-20 teams (click to enlarge)

The net effect of adding each new team is to add one more goal and one more point to the minimum required to qualify for the playoffs.

There are a few assumptions made in this analysis. First, I assume that team performance holds to the historical average from 2005 through 2009. I will attempt to mitigate the impact of this assumption by updating the regression data set at the end of each season with the previous season's data. Second, it doesn't account fully for the additional points that would be available to an 18- to 20-team league running a fully balanced schedule. The most games any team played in the 2005-2009 time frame was 32 in the 2005 and 2006 seasons. If MLS were to keep the balanced schedule from 2010 in 2011, the league would jump to 34 games, which means 6 extra points available to each team beyond the maximum available in my regression model to date. One might expect a shift upwards in the points required to finish eighth once teams are added.

It will be interesting to see how well the regression prediction plays out this season with the addition of the 16th team, and into the next two seasons as three more teams are added.

Sunday, May 9, 2010

MLS Predicted Finish: May 9th, 2010

Predicted League Finish (click to enlarge)

This week saw many teams play two games as MLS attempts to bank matches ahead of the World Cup break in the middle of the season. This action also helps any prediction methods like the one I use, as a greater number of matches helps minimize errors associated with isolated hot streaks. Herewith is my summary of the impact of last week's action after a brief reminder of the background on my prediction methods. The background section will be a standard feature in each weekly post on MLS predicted finish, so regular readers of the blog can skip over it.

Background

In a previous post I noted three regression equations that I developed that have very good correlation to the last 5 years of play in MLS: finishing position vs. points, finishing position vs. goal differential, and points vs. goal differential. I use each team's current points and goal differential to project how they will finish the season in those categories, use those projections within the regression equations to predict finishing position, and then take the average of the three predictions to develop an average finish used to rank the teams.

Subsequently, I made a post explaining the range of values in goal differential and points that provided greater confidence in finishing position. This post set minimum values for goal differential and points required to achieve the 95% confidence (CI) and prediction (PI) intervals. These minimum values are used to color code each team's projected season totals in goal differential and points, indicating the likelihood of the teams' finish in the Top-8 of the league.

Now, on to the impact of this week's action.

Galaxy continue to roll

I was fortunate (?) enough to witness the buzzsaw that are the LA Galaxy as they dismantled my Seattle Sounders for a 4-0 win. The team seems to have used last year's MLS Cup loss as a driving force in this year's campaign. Landon Donovan ran all over the pitch Saturday, keying three of the goals and then scoring his first of the year for the fourth. Seattle certainly contributed with their own bad plays on defense and in goal, but it took LA's killer instinct to take full advantage of them. This match followed LA's 1-0 win over Colorado mid-week, meaning they picked up 6 points and an average of a 2.5 goal differential per match. I didn't think they could get much better in season-ending finishing statistics, but they moved their projected points up 3 (to 83) and their projected goal differential up 9 (to 49) with last week's matches.

To give an indication of how well they continue to play, the only other two teams that are "all green" in the three key attributes are the Columbus Crew and San Jose Earthquakes. Neither team has played more than six games, which means they are greatly benefiting from a fast early season start with a high multiplier for total games (30) divided by games played (5 and 6, respectively). Even then, both teams are nearly 20 points behind and have less than half the goal differential of LA. Simply amazing!

Seattle continues to struggle

Of personal interest to me is the poor play of the Seattle Sounders. I mentioned before that their three ties from stoppage time goals would come back to haunt them. This weekend's loss to the LA Galaxy only made things worse. Seattle is now mired in last place in the hyper-competitive Western Conference (see discussion below). This week they stayed "all red" - their goal differential (-4) finally caught up with their low point total. Things won't get easier for the Sounders either - they face the East-leading NY Red Bulls this week at Red Bull Arena. It may be a long, hot summer for one of the league's most visible franchises. Seattle fans are a fickle bunch, and it will be interesting to see how many more sellouts the team records if their fortunes continue to deteriorate.

The West is dominating the early season

Six of the Top-8 teams are from the Western Conference. The Red Bulls' loss this week lowered them to the sixth position, and two of their metrics (goal differential and points based off goal differential) turned red. The only other Eastern Conference team in the Top-8 is Columbus, who have only played 5 matches and are benefiting massively from a huge multiplier for their projected season-ending statistics.

This week also witnessed increased definition around the red-yellow-green statistics for each of the predicted attributes. There is now clear delineation around the eighth place position - those below it are now all red. The picture of possible league finish position is becoming clearer with the greater number of games played, yet we're only a third of the way through the season.

Biggest movers

Top three teams moving up in average predicted finish are:
  • Real Salt Lake (+5): Real's 3-0 win over Philadelphia moved them up four spots based upon points, three spots based upon goal differential, and four spots based upon projected points based upon goal differential.
  • San Jose Earthquakes (+4): San Jose's 4-0 win over the Red Bulls moved them up one spot based upon points, six spots based upon goal differential, and seven spots based upon projected points based upon goal differential.
  • Chivas USA (+3): Chivas' 4-0 win against New England helped mitigate the effects of a 2-0 loss to Houston on Saturday. More importantly, they benefited from the poor performance of others at the bottom of the table. They're now just two positions out of eighth, and are yet another Western Conference team beating out the bulk of the Eastern Conference.
Bottom three teams moving down in average predicted finish are:
  • Chicago Fire (-5): Chicago's 4-1 loss to Toronto moved them out of the sixth position and down to eleventh. It cost them three spots based upon points, six spots based upon goal differential, and six spots based upon projected points based upon goal differential.
  • New England Revolution (-4): The Revolution's losses to Chivas and Columbus cost them three spots on points, eight spots on goal differential, and five spots based upon points from goal differential.
  • New York Red Bulls (-3) and Colorado Rapids (-3): Losses by each team this week moved them towards the bottom end of the Top-8 spots in the league table.

Another interesting view of the table

Finally, I'd like to make readers aware of another awesome resource for ongoing table predictions throughout the season (HT: Boco_T). The guys at Sports Club Stats have a great prediction tool for any league you want to follow, including MLS. They use Monte Carlo statistical methods to project likely finish position, as well as the impact of a win, draw, or loss in the next match on each team's playoff chances. I'd suggest readers check out the website if they're interested in such data. I've added their EPL and MLS tables to my "Favorite Blogs and Websites" section to the right of this post.

Thursday, May 6, 2010

What does it take to make the 2010 MLS playoffs?

In the last few posts I have explained how regression equations based upon 2005 through 2009 MLS data can be used to judge how well teams are performing in the 2010 season. Those posts culminated in this table, which I will update on a weekly basis with commentary throughout the season. I have done some further studies, mainly of the goal differential and points required to finish in the eighth spot in the table and make the 2010 playoffs.

Statistical Background

There are three ways to answer the question of what goal differential or points are required to finish 8th in the table. They revolve around the three statistical concepts below.
  • Linear regression: A best fit line that represents the mean value of y for a given value of x.
  • Confidence interval (CI): A range of values, based upon the statistical spread within the data set and regression, that is likely to contain the true mean of y for a given value of x.
  • Prediction interval (PI): A prediction of the range of future, individual observations based upon the data observed to date that is used to construct the regression.
In general, one can think of the regression as the single point mean of y for a given x value, the CI as the likely range of means for that same x, and the PI as the range of individual observations one would expect to see for the specific value of x. Those ranges, the CI and PI, are represented by the narrow and wider dashed lines, respectively, in Figure 1.
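The difference between the two intervals can be sketched with the textbook formulas for a one-predictor regression. This is only an illustration: the data pairs below are made up, the function name is my own, and a normal quantile stands in for the Student's t quantile that a statistics package would use, so the intervals come out slightly too narrow for small samples.

```python
from statistics import NormalDist, mean

def regression_intervals(xs, ys, x0, level=0.95):
    """Fitted value, CI of the mean, and PI of a new observation at x0."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error of the fit.
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = (rss / (n - 2)) ** 0.5
    # Assumption: normal quantile used in place of the t quantile.
    q = NormalDist().inv_cdf(0.5 + level / 2)
    fit = intercept + slope * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = q * s * leverage ** 0.5        # uncertainty in the mean only
    pi_half = q * s * (1 + leverage) ** 0.5  # adds scatter of single observations
    return fit, (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)

# Made-up (points, normalized finish position) pairs, for illustration only.
points = [30, 35, 40, 45, 50, 55, 60]
finish = [-1.4, -0.9, -0.2, 0.1, 0.7, 1.1, 1.8]
fit, ci, pred = regression_intervals(points, finish, x0=48)
print(fit, ci, pred)
```

The extra "1 +" term in the prediction interval is what makes it wider than the confidence interval at every value of x.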

Figure 1: Relationship of finish position and points from this earlier post.

Knowing the ranges of values expected for any value of x allows us to determine which values of x give us greater certainty of finishing in the 8th position. In the case of my regression data, I have used the common 95% level for constructing CIs and PIs. That means that for any value of x that I study, I will account for 95% of the expected values in the CI and PI when setting the bounds of any test.

Because the CI and PI involve distributions and not nominal values, the x-value that ensures the 8th place finishing position is below the range of potential finishing positions will be higher than that predicted by the regression equation.

Applying statistics to the League Table

In the case of my 2010 league table, I have constructed the following rules:
  • Values are red if they fall lower than the critical x-value from the regression. This means that they have less than a 50% chance of making the playoffs.
  • Values are yellow if they fall between the critical x-value from the regression and the lower value of the 95% PI. This means that their chances of making the playoffs are between 50% and 95%.
  • Values are green when they are greater than or equal to the lower end of the 95% PI. This means a team has less than a 5% chance of missing the playoffs.
Values for the goal differential and points that correspond to the CI and PI are shown in Figure 2 below. It shows that a team needs between a 3 and 16 goal differential and between 44 and 51 points to make sure they qualify for the playoffs.
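The three color rules can be written as a small function. This is a sketch only: the function name is my own, the 51-point cutoff assumes that the upper figure quoted above is the PI bound, and the regression cutoff of 42 points is a placeholder rather than a value taken from my regression.

```python
def playoff_color(value, regression_cutoff, pi_cutoff):
    """Color code a projected season total against the two playoff cutoffs."""
    if value < regression_cutoff:
        return "red"     # below the regression's critical x-value
    if value < pi_cutoff:
        return "yellow"  # between the regression value and the 95% PI bound
    return "green"       # at or above the 95% PI bound

REGRESSION_CUTOFF = 42  # hypothetical placeholder, not from the regression
PI_CUTOFF = 51          # assumed to be the PI bound for points quoted above

for projected_points in (30, 45, 51):
    print(projected_points, playoff_color(projected_points, REGRESSION_CUTOFF, PI_CUTOFF))
```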

Figure 2: x-values where the lowest values in the 95% CI and PI are greater than the y-value that corresponds to 8th place in the table

The Modified League Table

Figure 3 shows the updated league table with the associated color codings. It gives a more complete picture of where teams are consistently performing at playoff form (LA, Columbus, and NY), the bulk in the middle that are showing mixed results, and those at the bottom who are already in danger of not making the playoffs. I have kept the average predicted finish from the three regressions shown in my earlier post, while collapsing the three constituent columns for those regressions to make the table easier to read.

Figure 3: League table as of May 3, 2010.

I will continue to update this table on a weekly basis, along with each of the columns and their colors. Hopefully it will shed some light on shifts in table position that will occur on a weekly basis - whether it's goals or points.

Wednesday, May 5, 2010

How good are the LA Galaxy playing right now?

It's been almost as much fun to watch, Edson.


Prior to tonight's match, the LA Galaxy are leading the league in points (16) and goal differential (8), with the goal differential being an especially impressive two times that of the next squad (NY Red Bulls). They are clearly playing some of the best soccer we will see at any point this season. The question is, how well are they playing right now? The regression equations I developed in a recent post, along with the recently discussed topic of confidence and prediction intervals, can help answer that question.

Points vs. Goal Differential

If the Galaxy were to maintain their current pace of scoring goals, they would end up with a 40 goal differential at the end of the season. Based upon the historical maximum goal differential from the 2005 through 2009 seasons (22), this is an unsustainable rate. However, the projected 40 goal differential can provide some insight into how many points they might expect to accrue by season's end.
Using the regression equation I have developed for points vs. goal differential and a 95% confidence level (i.e. accounting for 95% of the likely outcomes), the predicted range of points LA might expect to accrue with a 40 goal differential is:
  • Confidence Interval (of mean): 67 to 73
  • Prediction Interval (of all observations): 62 to 78

At a minimum LA would earn more points than all teams between 2005 and 2009 except for the 2005 San Jose Earthquakes (now Houston Dynamo). Keep in mind that the 2005 San Jose team earned those points over 32 matches, so if LA were to keep up its pace it could truly be called the best regular season team of the modern era.

The results from this calculation also highlight another fact about this early run by the Galaxy. Their projected point total - 80 points at season end - is way above the predicted range from the unsustainable 40 goal differential. The Galaxy are converting far more goals into wins than the average league leading team did from 2005-2009. Something will give before season's end, but this is just further evidence of how well this team is playing right now.
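The season-end projections quoted in this post are simple pace calculations, sketched below. The 30-game season length is the figure I use elsewhere on the blog; the six games played is an assumption, inferred from 16 points projecting to 80, and the function name is my own.

```python
SEASON_GAMES = 30  # regular season length used for the 2010 projections

def project_total(current, games_played, season_games=SEASON_GAMES):
    """Scale a to-date total up to a full season at the current pace."""
    return current * season_games / games_played

games_played = 6  # assumption: inferred from the quoted projections
projected_points = project_total(16, games_played)
projected_goal_diff = project_total(8, games_played)
print(projected_points, projected_goal_diff)  # 80.0 40.0
```

The multiplier (30 divided by games played) is large this early in the season, which is why a hot start produces such extreme projections.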

Finish Position vs. Goal Differential and Points

A similar prediction of the likely range of finishing positions can be made based upon the regression equations I have developed as a function of goal differential and points. Figure 1 shows the results of the analysis for the CI and PI of finishing position. The LA Galaxy are guaranteed to finish no lower than third in the league, and likely will finish close to or in the first position if they keep anywhere close to this level of play.

Figure 1: CI and PI for LA Galaxy finishing position

Conclusion
The LA Galaxy are, by the numbers, playing at the best rate we have seen any team play in recent memory. With 20% of the season completed, it would take a huge failure within their team for them to miss the playoffs. More likely than not, they will finish in one of the top few spots. If Columbus and New York can maintain form (their points and goal differential are currently at the historical maximum from 2005-2009), they may press LA when the Galaxy inevitably regress to the mean.
I will be at the Sounders/Galaxy match this Saturday. As a Sounders supporter, I definitely want them to win. A win would not only break a tough streak in the season for our team, but would also help start closing the gap to LA, who have opened up a good lead in the West. If we have to lose though, I hope it is because LA keeps form in displaying some of the best MLS soccer we have seen recently. Buddle, Donovan, and the whole squad are simply playing lights out right now. A team this good is a rare thing. A soccer fan might as well enjoy such beauty while it lasts.

Tuesday, May 4, 2010

How important are goals and points in the 2010 MLS season?

In my last post I introduced the following three relationships via linear regression from MLS league data taken from the 2005-2009 seasons.
  • Finish position vs. points
  • Finish position vs. goal differential
  • Points vs. goal differential
Given those relationships, it might be logical to ask,
  • How much does an incremental point improve a team's finish position?
  • How much does an incremental goal improve a team's finish position or points?
The answer, it turns out, depends on where that team is starting from when adding that incremental point or goal differential.

Finish position vs. goal differential and points

In my last post I mentioned that finish position was normalized using the following equation

-ln[p/(17 - p)]

This means that the resulting linear regression produces a nonlinear relationship between p and the regressors - either goal differential or points. This relationship produces a bell shaped curve when looking at incremental benefits, suggesting that there are huge payoffs in the middle and diminishing returns at either end of the data. See Figure 1, which shows a plot of the incremental finishing position dependent upon the starting goal differential or points.

Figure 1: Improvement in table position vs. incremental goal or point dependent upon starting goal or point total

Figure 1 shows the distribution of data across the historical goal differential and point ranges, bounded by the minimum and maximum totals for the 2005 through 2009 seasons. This phenomenon makes sense - a team only needs to earn so many goals or points to secure their spot towards the top of the table. It's the middle where individual goals or points separate teams' table positions.
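The bell shape follows directly from inverting the normalization. A minimal sketch, assuming the 16-team form -ln[p/(17 - p)]: since the regression slopes are not repeated in this post, the example steps the response value itself by an arbitrary 0.1 rather than stepping goals or points, and the function names are my own.

```python
import math

TEAMS = 16

def position_from_response(y, teams=TEAMS):
    """Invert y = -ln[p / (teams + 1 - p)] to recover table position p."""
    return (teams + 1) / (1.0 + math.exp(y))

def places_gained(y, step=0.1):
    """Table places gained when the response rises from y to y + step."""
    return position_from_response(y) - position_from_response(y + step)

# The same step in the response is worth the most in mid-table (y near 0)
# and very little at either extreme.
for y in (-3.0, -1.5, 0.0, 1.5, 3.0):
    print(f"y = {y:+.1f}: position {position_from_response(y):5.2f}, "
          f"places gained {places_gained(y):.3f}")
```

Because the response is linear in goal differential or points, an equal step in either regressor translates to an equal step in y, and so to the same bell-shaped payoff.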

If you're interested in averages in this non-linear behavior, they are:

  • 0.25 table position improvement for each incremental goal
  • 0.35 table position improvement for each incremental point

Points vs. goal differential

The relationship between incremental goals and incremental points is straightforward as no transformations were required on the data. In this case, each incremental goal leads to a 0.7 point increase for the team.

Conclusions

Going back to the original inspiration for all this work - the Sounders' three ties due to stoppage-time goals - leads us to some general conclusions:

  • Each tie match costs a team a potential average table improvement of 0.7 positions due to the loss of the 2 points associated with a win.
  • Each tie match costs a team a minimum potential average table improvement of 0.25 positions due to the loss of at least one goal in the goal differential attribute.
  • Each tie match costs a team a minimum potential average loss of 0.7 points from their season point total.

Keep in mind that these relationships are NOT additive. They are simply different ways of measuring the impact of different metrics on the same variable data set with varying accuracy. They do, however, give us an idea of how costly ties can be in MLS.
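The arithmetic behind those three figures is short enough to spell out, using the average rates quoted earlier in this post. The variable names are my own.

```python
import math

POSITIONS_PER_POINT = 0.35  # average table improvement per incremental point
POSITIONS_PER_GOAL = 0.25   # average table improvement per incremental goal
POINTS_PER_GOAL = 0.7       # points gained per incremental goal

points_lost = 3 - 1  # a win earns 3 points, a tie earns 1
goals_lost = 1       # at least one goal of differential separates a tie from a win

cost_via_points = points_lost * POSITIONS_PER_POINT  # table positions
cost_via_goals = goals_lost * POSITIONS_PER_GOAL     # table positions
points_via_goals = goals_lost * POINTS_PER_GOAL      # season points
print(cost_via_points, cost_via_goals, points_via_goals)
```

Each line measures the same tie through a different regression, which is why the results should not be summed.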

Monday, May 3, 2010

Predicting MLS table finishing position through regression


This is why the Sounders are already in jeopardy of missing the playoffs.

As a Seattle Sounders fan, I have been spoiled. The team made the playoffs in their first year after some wildly inconsistent play, and we supporters have come to expect much more out of our second year team. We simply will not accept poor play from our second year franchise, which has made the start of this season a bit frustrating. Matches that have ended in ties with Real Salt Lake, FC Dallas, and the Columbus Crew due to stoppage time goals have taken 6 points away from the Sounders' total in a still-young season. In my frustration, I thought of how I might understand this rough start statistically, and have come up with a composite method for understanding how teams are doing at any point in the season.

Background

In attempting to understand how the Sounders' start compares to the rest of the league's performance, one immediately runs into the challenge that teams have played anywhere between four and seven games to date. This will affect the maximum points available to each team, and may skew any perceptions taken from the raw data. See Figure 1 for the current team standings, points, games played, and goal differential.


Figure 1: MLS Standings as of May 3, 2010


In understanding how a team might be doing so far, I have built statistical models that project a team's finish based upon play-to-date using historical data as the inputs to the models. In this case, the response variable is finish position while the predictors are the teams' goal differential and points. I have also studied a third relationship - points vs. goal differential - which can be used as a check against the assumptions regarding projected goal differential.


There are some simplifications to these models - namely, one would expect teams like the LA Galaxy and DC United to regress a bit towards the mean. Nonetheless, one can account for such gross over- and under-performance by placing bounds on the projections that correspond to historical limits.


I have also rationalized such an approach by planning a regular update to the projections on a monthly basis. I have made this initial projection based upon the majority of the teams completing nearly 20% of their season. Such projections are likely going on in these clubs right now, with the management teams trying to get their first read of the adjustments they must make to improve their teams.


The Input Data


Like the previous analysis involving team payroll and finish position, I used a normalized value of the teams' finishing positions. However, in this analysis I did not use an average finish position, but rather each team's individual finish position from each season. This was done to facilitate regression analyses that were used to study the relationship between goals, points, and finishing position.


Another wrinkle in this study was that I did not use a single normalization value. Instead, I used the general normalization equation below to account for the changing number of teams in the league.


-ln[p/(number of teams + 1 - p)]

Data from the 2005 through the 2009 seasons were used in the analysis, with the following number of teams.
  • 2005: 12
  • 2006: 12
  • 2007: 13
  • 2008: 14
  • 2009: 15
The 66 data points, when normalized per the formula above, provide a normally distributed data set. See Figure 2 below detailing the results of a graphical summary of the data set.
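A minimal sketch of how the normalized data set is assembled from the team counts listed above, with one finishing position per team per season. The names are my own, and the sketch only builds the response values; it does not reproduce the normality test.

```python
import math

TEAMS_BY_SEASON = {2005: 12, 2006: 12, 2007: 13, 2008: 14, 2009: 15}

def normalize(p, teams):
    """Normalized finishing position: -ln[p / (teams + 1 - p)]."""
    return -math.log(p / (teams + 1 - p))

# One response value for every finishing position in every season.
responses = [
    normalize(p, teams)
    for teams in TEAMS_BY_SEASON.values()
    for p in range(1, teams + 1)
]
print(len(responses))  # 66
```

The "+ 1" keeps the last-place team from producing a division by zero, and it makes first and last place mirror images of each other around zero.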



Figure 2: Graphical summary of finishing position transformation


Regression of Finishing Position and Points


The first study involved the correlation between finishing position and points. The first check produced a Pearson correlation coefficient of 0.915, which indicates a very strong linear relationship between finishing position and points. See Figure 3 for the results of the correlation study. This is an intuitive relationship - it would not be good for the game if there wasn't such a correlation.

Figure 3: Correlation study results - finishing position vs. points



Once a statistically significant correlation was established, a regression analysis could take place. Figure 4 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 4: Fitted line plot - finish position vs. points


The fitted line plot shows that the R-squared value is 83.7% - that means 83.7% of the variation in the data is explained by the regression model. This model is a good fit, and will be used later in this post to project teams' finishing positions based upon their performance to date in the 2010 season.


Regression of Finishing Position and Goal Differential


The second study involved the correlation between finishing position and goal differential. The first check produced a Pearson correlation coefficient of 0.824, which indicates a strong linear relationship between finishing position and goal differential. See Figure 5 for the results of the correlation study. This is an intuitive relationship, but apparently is less direct than points. This makes sense though - points more clearly translate to finishing position, while teams can have a match where they rack up a 3 or 4 goal differential in their favor while earning the same 3 points for the win as a team who wins by a single goal.


Figure 5: Correlation study results - finishing position vs. goal differential



Once a statistically significant correlation was established, a regression analysis could take place. Figure 6 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 6: Fitted line plot - finish position vs. goal differential


The fitted line plot shows that the R-squared value is 67.9% - that means 67.9% of the variation in the data is explained by the regression model. This model is a decent fit, but not as good as the regression of finishing position vs. points. Nonetheless, it will be used later in this post to project teams' finishing positions based upon their performance to date in the 2010 season. This regression will be used in conjunction with others to produce an average finishing position from multiple calculation methods.

Regression of Points and Goal Differential

The third study involved the correlation between points and goal differential. The first check produced a Pearson correlation coefficient of 0.912, which indicates a very strong linear relationship between points and goal differential. See Figure 7 for the results of the correlation study. Apparently the relationship between points and goals is nearly as strong as that between finishing position and points.


Figure 7: Correlation study results - points vs. goal differential


Once a statistically significant correlation was established, a regression analysis could take place. Figure 8 shows the regression analysis fitted line plot, confidence intervals, and prediction intervals.



Figure 8: Fitted line plot - points vs. goal differential


The fitted line plot shows that the R-squared value is 83.2% - that means 83.2% of the variation in the data is explained by the regression model. This model is a good fit, and will be used in conjunction with the previous two models to project teams' finishing positions based upon their performance to date in the 2010 season.


Projected League Finish


Using the three regression models developed above, I projected each team's finishing position based upon its play to date. Each team's points and goal differential at season's end are first projected from its performance to date and the number of games it has played so far. Those projected season-ending goal differentials and points are then used as inputs to each of the three regression equations. See Figure 9 for a summary of the results.




Figure 9: Projected team finish position based upon play through May 3rd, 2010 (click to enlarge)
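The projection step can be sketched roughly as follows. This is my shorthand for the idea rather than the exact spreadsheet behind Figure 9. It assumes a simple pace-based scaling of the to-date totals and a 30-game regular season, and the regression coefficients are placeholders only (the real ones come from the fitted line plots). The team record and all function names are likewise hypothetical.

```python
SEASON_GAMES = 30  # assumption: regular-season matches per team


def project_season_total(total_to_date, games_played, season_games=SEASON_GAMES):
    """Scale a to-date total (points or goal differential) to a full season."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return total_to_date * season_games / games_played


def predict(intercept, slope, x):
    """Evaluate a fitted line y = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical team record to date (not a real team's numbers).
points_to_date, goal_diff_to_date, games_played = 8, -1, 6

proj_points = project_season_total(points_to_date, games_played)
proj_goal_diff = project_season_total(goal_diff_to_date, games_played)

# PLACEHOLDER coefficients for illustration only; substitute the fitted values.
POSITION_VS_POINTS = (18.0, -0.25)     # (intercept, slope)
POSITION_VS_GOAL_DIFF = (8.5, -0.5)    # (intercept, slope)

pos_from_points = predict(*POSITION_VS_POINTS, proj_points)
pos_from_goal_diff = predict(*POSITION_VS_GOAL_DIFF, proj_goal_diff)

print(f"projected points: {proj_points:.1f}, projected goal differential: {proj_goal_diff:.1f}")
print(f"position from points model: {pos_from_points:.1f}")
print(f"position from goal differential model: {pos_from_goal_diff:.1f}")
```

Early in the season this kind of scaling is very sensitive to a single result, since each match is a large share of the games played so far.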

Each model's results are tabulated to project a team's finish position. The average of the three models' projected finish positions is then used to rank the teams in order - the projected average is less important than the order itself. Teams marked in green are the Top 8, representing the teams most likely to qualify for the playoffs. The teams in red are the bottom 8, representing the teams most likely to miss the playoffs.
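The averaging and ranking step is simple enough to show in a few lines. The sketch below uses hypothetical teams and hypothetical per-model positions, not the values in Figure 9, and the function name is my own.

```python
from statistics import mean

# Hypothetical projected finish positions from the three models, per team.
projections = {
    "Team A": (3.1, 4.0, 3.6),
    "Team B": (9.8, 11.2, 10.4),
    "Team C": (6.5, 5.9, 6.1),
    "Team D": (12.0, 13.5, 12.9),
}


def rank_by_average(projections):
    """Average each team's model projections, then rank best (lowest) first."""
    averages = {team: mean(values) for team, values in projections.items()}
    ordered = sorted(averages, key=averages.get)
    return [(rank, team, averages[team]) for rank, team in enumerate(ordered, start=1)]


for rank, team, average in rank_by_average(projections):
    print(f"{rank}. {team} (average projected position {average:.2f})")
```

Only the resulting order is used; the averaged number itself is not a position a team can actually finish in.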

Going back to my original inspiration - the Sounders' three ties - yields some interesting observations. The Sounders are currently projected to finish 11th in the league. The season is still early enough that converting one of those ties to a win - i.e. 2 additional points and one additional goal in the goal differential category - moves the Sounders into the Top 8. Converting all three ties to wins of one goal each moves them into the Top 4. The effect is large because it is assumed that they would perform in a similar manner for the rest of the season - perhaps not a safe assumption given their erratic play to date.
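The what-if arithmetic works like this: a tie is worth 1 point and a win is worth 3, so each converted tie adds 2 points and, for a one-goal win, 1 goal of differential before the totals are scaled up to a full season. Here is a small sketch with a hypothetical record to date (not the Sounders' actual numbers), assuming a 30-game season and simple pace-based scaling.

```python
SEASON_GAMES = 30  # assumption: regular-season matches per team


def convert_ties_to_wins(points, goal_diff, n_ties, margin=1):
    """Return (points, goal_diff) after turning n_ties ties into wins by `margin` goals."""
    return points + 2 * n_ties, goal_diff + margin * n_ties


def scale_to_season(total_to_date, games_played, season_games=SEASON_GAMES):
    """Pace-based projection of a to-date total over a full season."""
    return total_to_date * season_games / games_played


# Hypothetical record to date, for illustration only.
points, goal_diff, games_played = 7, -2, 7

for n_ties in (0, 1, 3):
    new_points, new_goal_diff = convert_ties_to_wins(points, goal_diff, n_ties)
    print(
        f"{n_ties} tie(s) converted: "
        f"projected points {scale_to_season(new_points, games_played):.1f}, "
        f"projected goal differential {scale_to_season(new_goal_diff, games_played):.1f}"
    )
```

With so few games played, two extra points get multiplied several times over in the season-end projection, which is why one result moves the projected finish so much.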

This analysis does show how far behind the eight ball the Sounders are. Teams like Real Salt Lake, Toronto FC, and the Philadelphia Union are expected to be in the bottom 8. The Sounders are not. They have their work cut out for them.

Further Studies

The regression models I have created can be used in a variety of studies. Subsequent blog posts will explore the following:
  • The statistical minimum number of points and goal differential required to finish in the Top 8 in 2010.
  • The number of points and goal differential required in 2011 to make the playoffs when the league expands to 18 teams.
  • How much the LA Galaxy are outperforming the historical norm, and how much DC United are under-performing it.
Additionally, I will keep track of how the projections change on a monthly basis through the end of the season.

Sunday, May 2, 2010

About Me

Me showing the Portland Timbers stadium some Seattle Sounders FC love.

It's been nearly a month since I started this blog, and I haven't given much detail about myself except for the "Introduction" post I made. As I believe social media is about being authentic and knowing a good bit about the author's background and biases, I will attempt to give readers some insight into who I am and why I blog about soccer and statistics.
  • My father was in the military, so I moved six times before I graduated from high school. I have kept up the behavior and have made three moves since then. I have lived in eight different cities, and can confidently say that the Pacific Northwest is the best place to live in the United States. This impacts my outlook on sports fandom.
  • I currently live in the Lake City neighborhood of Seattle, and fancy myself as someone more inclined to live in a city than in the suburbs.
  • I have degrees in mechanical engineering from Carnegie Mellon University (undergraduate) and Purdue University (Masters), and I am a certified Six Sigma black belt. I also have a double major in Engineering and Public Policy from Carnegie Mellon. So, yes, saying I am a nerd is an understatement.
  • As to my Six Sigma background, I believe good statistical analysis at the right time can yield breakthrough insights that one might normally miss. That's why I used Arsene Wenger's name for my email address. Conversely, I believe bad statistics are like bad acid - they should neither be produced nor consumed. And yes, that was a rip-off of a quote in the liner notes for Spin the Black Circle on Pearl Jam's Vitalogy.
  • I have spent my entire professional career designing diesel engines. I completed my first six years of work at Ford designing components for these two engines, and have spent the last three years working on the design of this engine for PACCAR.
  • I love taking public transit whenever I don't have to be at work so early that I am forced to drive my car. I average 3 days a week on the bus, and the 45 minute commute each morning and afternoon is where I get most of my reading - books, blogs, etc. - done for the day.
  • I am a divorced father of two daughters who spend the weekends with me. This explains why my detailed analysis comes in fits and spurts over a weekend - having two young daughters means I spend many Friday and Saturday nights at home.
  • I am a lover of live sports except baseball - it is too slow. Over my lifetime I have grown to become a fan/supporter of the following teams: San Francisco 49ers, Seattle Seahawks, Jordan's Chicago Bulls teams, anywhere Shaquille O'Neal has played, Detroit Red Wings, Seattle Sounders FC, and Arsenal.
  • I have come to love soccer because of its beautiful simplicity, the fact that matches only last two hours (most pro sports events in the US last 3+ hours), and that the professional game in the US is so financially limited that the players are not like the spoiled brats in our other pro leagues. You can see this post for my explanation of why I am an Arsenal supporter.
  • While I am a Libertarian politically (see my other blog if you're really that interested), I am more of a socialist when it comes to sports. I want league competition to be interesting, and that means teams cannot be allowed to spend whatever money they want on players. I like the NFL's revenue sharing agreement, and I like the NBA's salary cap model with the Larry Bird rule that allows successful teams to stay together. Consequently, I dislike Chelsea and the NY Yankees. That being said, I like the European "association" model for soccer rather than the US "franchise" model for professional sports - the role relegation and promotion play encourages franchises not to stand pat and instead to put their best foot forward every year.
  • This blog, and the analysis behind it, is done in my free time. Someday, I hope to make enough money from such analysis to support my family. Until then, I can only post when time allows and I am not doing one of the several other activities I appreciate in my free time. Hence, I am not the most prolific blogger, but I always intend to put out above-average analysis to compensate for the lack of frequent posts.
I hope this gives you some insight into "the man behind the blog" and clarifies the mindset I use in approaching the topics I cover. As the authors of Soccernomics have shown, it's high time emotion and conjecture were taken out of the world's most popular and beautiful game, and statistics and numbers were used instead to explain what's happening on and off the pitch.