I am building a statistical model for basketball games. I have a data.frame like this:
    winner  loser   date
    teamA   teamB   2003-01-01
    teamB   teamC   2003-01-03
    ...     ...     ...
Two important things to note about this data set: each team has at least 5 wins and at least 5 losses, and all the teams are in the same league (so, although some teams haven't played each other, there are no pairs of teams that haven't played any of the same teams).
My first thought is to use logistic regression to capture this sort of logic:
$p(\text{team A wins}) = \operatorname{logit}^{-1}(\text{skill}_A - \text{skill}_B)$
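Written out per game, this is the design-matrix row I think I need. The symbols are my own notation, not anything from R: $\beta_t$ is the skill of team $t$, and $x$ is the row for a game in which team $w$ beat team $l$.

$$
\operatorname{logit} P(w \text{ beats } l) = \beta_w - \beta_l = x^\top \beta,
\qquad
x_t =
\begin{cases}
+1 & \text{if } t = w \\
-1 & \text{if } t = l \\
0 & \text{otherwise}
\end{cases}
$$

Only differences of skills enter the model, so the $\beta_t$ are determined only up to an additive constant. One team therefore has to serve as the reference (skill 0), and, assuming no home-court term, there is no intercept.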
But I can't figure out how to get the appropriate design matrix with R's formula notation.
I ended up sort of quasi-manually computing it in this way:
    winner_mat <- model.matrix(object = ~ winner, data = data_for_model1)
    loser_mat <- model.matrix(object = ~ loser, data = data_for_model1)
    X <- winner_mat - loser_mat
X, constructed this way, can be passed to glm.fit and gives sensible results:
    fit1 <- glm.fit(x = X, y = rep(1, nrow(X)), family = binomial())
    tail(sort(coef(fit1)))
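For anyone who wants to reproduce this, here is a minimal toy version of the same construction. The `toy` data frame, the `teams` vector and the game results are made up for illustration and are not my real data. It differs from my code above in two ways, both assumptions on my part: it forces `winner` and `loser` to share the same factor levels so the indicator columns line up, and it drops the intercept column, which is all zeros after the subtraction.

```r
# Hypothetical toy data: three made-up teams, six made-up games.
teams <- c("teamA", "teamB", "teamC")
toy <- data.frame(
  winner = c("teamA", "teamB", "teamC", "teamA", "teamC", "teamB"),
  loser  = c("teamB", "teamC", "teamA", "teamC", "teamB", "teamA")
)

# Use identical levels for both columns so the indicator columns match up.
toy$winner <- factor(toy$winner, levels = teams)
toy$loser  <- factor(toy$loser,  levels = teams)

winner_mat <- model.matrix(~ winner, data = toy)
loser_mat  <- model.matrix(~ loser,  data = toy)

# The two intercept columns cancel to zero, so drop that column.
# teamA (the first level) becomes the reference team with skill 0.
X <- (winner_mat - loser_mat)[, -1, drop = FALSE]
colnames(X) <- teams[-1]
X

# Same fitting call as above: every row is a game from the winner's side.
fit_toy <- glm.fit(x = X, y = rep(1, nrow(X)), family = binomial())
coef(fit_toy)
```

Each row of `X` has +1 in the winner's column and -1 in the loser's column, except that the reference team has no column of its own.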
So my question is this: I'm happy with the design matrix I've constructed, but is there a better way to construct it, e.g. using R's formula notation?