by Akaichan
Last updated March 12, 2017, 19:19

So I have a data set that looks like this

https://docs.google.com/spreadsheets/d/1QV_kCj_fN4VEeyO75wgoR1TO_B0p1ARjm78XCbjbxt8/pubhtml

I want to write a full and a restricted model to test whether latitude, controlling for continent and sex, has a significant relationship with wing size. I am currently using R.

Is it okay to do this for my full model?

```
fit <- lm(wingsize ~ continent + sex + latitude, data = wing)
summary(fit)  # show results
```

This is my output:

WINGSIZE = 836.1648 + (-4.1289)*CONTINENT + (-98.8571)*SEX + 1.7926*LATITUDE

And for restricted model, I did this:

```
res <- lm(wingsize ~ latitude, data = wing)
summary(res)
```

and have a model

WINGSIZE = 780.532 + 1.883*LATITUDE

Is the restricted model supposed to look like a plain simple linear regression? Is that what "restricted" means?

Also, how many degrees of freedom would exist for the full and restricted models? Are the degrees of freedom always the same? Is this question supposed to refer to the residual degrees of freedom?

Thank you for any input.

Nice project! As you guessed correctly, in the context of multiple linear regression, with predictors $X_1,\dots,X_p$ and response $Y$, the full (or unrestricted) model is the usual OLS estimate, where we put no restrictions on the regression coefficients of the various predictors. A restricted model is one for which we impose a set of constraints on the regression coefficients $\beta_i$. In the simplest case, we set one or more $\beta_i$ to 0: in general, we can consider a set of linear constraints given in matrix form by $\mathbf{R}\beta=\mathbf{r}$. In your case, you considered the two simple constraints $\beta_{sex}=\beta_{continent}=0$.
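For reference, the formal comparison of nested models sketched here is the partial F-test. With $n$ observations, $p$ predictors in the full model, and $q$ constraints of the form $\mathbf{R}\beta=\mathbf{r}$ (here $q=2$), the test statistic is

$$F=\frac{(\mathrm{RSS}_{\mathrm{restricted}}-\mathrm{RSS}_{\mathrm{full}})/q}{\mathrm{RSS}_{\mathrm{full}}/(n-p-1)},$$

which under the null hypothesis follows an $F_{q,\,n-p-1}$ distribution. This is exactly the statistic that R's `anova()` computes when given two nested `lm` fits, as we'll do later.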

Before formally comparing the two models, let's see what the relationship between `latitude` and `wingsize` looks like by making a simple plot:

```
with(wing, plot(latitude, wingsize, pch=16))
```

It seems that data are separated into two groups, and inside each group there's a clear increasing trend, which is what one would expect according to Bergmann's rule. The separation is nearly too good to be true, but at least for some raptors it's well-known that females are bigger than males. We can easily check that the two groups correspond to the two sexes:

```
with(wing, plot(latitude, wingsize, col = c("red", "blue")[sex+1], pch=16))
```

Thus we expect `latitude` to be highly significant for `wingsize` **if** we control for `sex`, but not necessarily otherwise. What about `continent`? It doesn't seem that including it helps explain any variance:

```
with(wing, plot(latitude, wingsize, col = c("red", "blue")[continent+1], pch=16))
```

Let's verify this formally:

```
> res_model <- lm(wingsize ~ latitude, data = wing)
> summary(res_model)

Call:
lm(formula = wingsize ~ latitude, data = wing)

Residuals:
    Min      1Q  Median      3Q     Max
-72.433 -50.335   5.956  49.651  72.132

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  780.532     64.526  12.096 6.08e-15 ***
latitude       1.883      1.436   1.312    0.197
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 51.82 on 40 degrees of freedom
Multiple R-squared:  0.04125,  Adjusted R-squared:  0.01728
F-statistic: 1.721 on 1 and 40 DF,  p-value: 0.1971
```

Not even significant at the 0.05 level. BTW, since you asked about degrees of freedom, note that R reports `DF` = 40. For linear regression, `DF = n - p - 1`, where `n` is the sample size (42 in your case) and `p` is the number of predictors in the model (1, since we're considering the model with only `latitude`); thus 42 - 1 - 1 = 40.
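The `df = n - p - 1` rule can be checked directly in R with `df.residual()`. A minimal sketch, using simulated data with 42 rows (like the wing data set) since I don't have the actual `wing` data frame here:

```r
# Simulate a toy data set with the same sample size as the wing data
set.seed(1)
fake <- data.frame(wingsize = rnorm(42), latitude = rnorm(42))

# One predictor plus an intercept: df = 42 - 1 - 1 = 40
m <- lm(wingsize ~ latitude, data = fake)
df.residual(m)     # residual degrees of freedom reported by R
nrow(fake) - 1 - 1 # the same value, computed by hand
```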

Let's now consider the full model:

```
> full_model <- lm(wingsize ~ ., data = wing)
> summary(full_model)

Call:
lm(formula = wingsize ~ ., data = wing)

Residuals:
     Min       1Q   Median       3Q      Max
-22.7285  -6.5128   0.6035   5.0069  30.7508

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 836.1648    14.8448  56.327  < 2e-16 ***
continent    -4.1289     3.5224  -1.172    0.248
latitude      1.7926     0.3158   5.676 1.59e-06 ***
sex         -98.8571     3.4114 -28.978  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.05 on 38 degrees of freedom
Multiple R-squared:  0.9586,  Adjusted R-squared:  0.9553
F-statistic: 293 on 3 and 38 DF,  p-value: < 2.2e-16
```

As expected, `latitude` is now very significant in a model which already includes `sex` (and `continent`), and similarly `sex` is highly significant in a model which already includes `latitude` and `continent`. Instead, once we control for `sex` and `latitude`, it looks like (the difference between zero and the regression coefficient of) `continent` is not statistically significant. We can also check that the difference between the two models is statistically significant by using ANOVA; since the restricted model is always nested in the unrestricted model, ANOVA is applicable.

```
> anova(res_model, full_model)
Analysis of Variance Table

Model 1: wingsize ~ latitude
Model 2: wingsize ~ continent + latitude + sex
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1     40 107425
2     38   4643  2    102782 420.56 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

As expected, the difference is highly significant!
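We can recompute this F statistic by hand from the printed ANOVA table, using the partial F-test formula: the drop in RSS divided by the number of dropped predictors, over the full model's RSS divided by its residual degrees of freedom. (The printed RSS values are rounded, so the result only matches approximately.)

```r
rss_restricted <- 107425  # RSS of wingsize ~ latitude, 40 residual df
rss_full       <- 4643    # RSS of the full model, 38 residual df
q              <- 2       # number of dropped predictors (continent, sex)

F_stat <- ((rss_restricted - rss_full) / q) / (rss_full / 38)
F_stat  # about 420.6, close to the 420.56 that anova() reports
```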

Finally, we noted that `continent` didn't seem to explain any residual variance once `latitude` and `sex` were already included in the model. Be warned! Normally this "fishing for significance" **is a terrible idea**. If you start by removing the variable with the largest p-value, refit the model, and iterate the process, removing at each step the variable with the highest p-value (backward stepwise regression), then the p-values you obtain in the final model are not valid (because they're computed without taking this selection process into account), and inference becomes unreliable. However, just in this special case we may turn a blind eye, since we are going to remove only one predictor, whose p-value is not just a bit higher than the others but several orders of magnitude larger. Let's build the restricted model which contains only `sex` and `latitude`, and see how it compares to the other two.

```
> res_model_2 <- lm(wingsize ~ latitude + sex, data = wing)
> summary(res_model_2)

Call:
lm(formula = wingsize ~ latitude + sex, data = wing)

Residuals:
    Min      1Q  Median      3Q     Max
-23.005  -5.681  -0.494   5.197  32.560

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 829.9603    13.9354  59.558  < 2e-16 ***
latitude      1.8832     0.3077   6.121 3.52e-07 ***
sex         -98.8571     3.4277 -28.841  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.11 on 39 degrees of freedom
Multiple R-squared:  0.9571,  Adjusted R-squared:  0.9549
F-statistic: 434.6 on 2 and 39 DF,  p-value: < 2.2e-16
```

Very good! This model is very similar to the full model in terms of predictive performance (as estimated by the adjusted R-squared). As a matter of fact, if we now repeat the ANOVA test, we see that the difference between the first restricted model and this new restricted model is significant, but the difference between the new restricted model and the full one is not:

```
> anova(res_model, res_model_2, full_model)
Analysis of Variance Table

Model 1: wingsize ~ latitude
Model 2: wingsize ~ latitude + sex
Model 3: wingsize ~ continent + latitude + sex
  Res.Df    RSS Df Sum of Sq       F Pr(>F)
1     40 107425
2     39   4811  1    102614 839.752 <2e-16 ***
3     38   4643  1       168   1.374 0.2484
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Let me stress again that, in general, variable selection based on stepwise regression is BAD! If you need to do variable selection in a reliable way, use the LASSO instead.
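To give an idea of what that would look like, here is a minimal LASSO sketch using the `glmnet` package (this assumes `glmnet` is installed; since I don't have the `wing` data here, it is replaced by simulated data with a similar structure, with coefficients loosely based on the full-model fit above):

```r
library(glmnet)

# Simulated stand-in for the wing data: 42 rows, three predictors
set.seed(1)
n <- 42
X <- cbind(continent = rbinom(n, 1, 0.5),
           sex       = rbinom(n, 1, 0.5),
           latitude  = runif(n, 35, 55))
y <- 836 - 4 * X[, "continent"] - 99 * X[, "sex"] +
     1.8 * X[, "latitude"] + rnorm(n, sd = 11)

# Cross-validated LASSO: alpha = 1 gives the pure L1 penalty
cv <- cv.glmnet(X, y, alpha = 1)

# Coefficients at a sparse choice of lambda; predictors that don't
# help are shrunk exactly to zero rather than "tested out" stepwise
coef(cv, s = "lambda.1se")
```

Unlike backward stepwise selection, the LASSO performs shrinkage and selection in one step, so it doesn't suffer from the invalid-p-value problem described above (though post-selection inference still requires care).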
