You can use the **subset** argument of the lm() function to fit a regression model in R using only a subset of the rows in a data frame:

    fit <- lm(points ~ fouls + minutes, data=df, subset=(minutes>10))

This particular example fits a regression model using **points** as the response variable and **fouls** and **minutes** as the predictor variables.

The **subset** argument specifies that only the rows in the data frame where the **minutes** variable is greater than 10 should be used when fitting the regression model.
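As a quick illustration of how the **subset** expression is evaluated, here is a minimal sketch that uses R's built-in mtcars dataset, which is an assumption for illustration and not part of this tutorial's data. The subset expression is looked up in the data frame passed to **data**, in the same way as the variables in the formula, so the column can be written as cyl rather than mtcars$cyl:

```r
#fit model using only the rows of the built-in mtcars data where cyl equals 4
fit_cyl4 <- lm(mpg ~ wt, data=mtcars, subset=(cyl == 4))

#number of rows used to fit the model
nobs(fit_cyl4)

#number of rows in mtcars where cyl equals 4 (should match the value above)
sum(mtcars$cyl == 4)
```

Both calls should return the same count, since only the rows that satisfy the condition are passed to the model.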

The following example shows how to use this syntax in practice.

**Example: How to Use Subset of Data Frame with lm() in R**

Suppose we have the following data frame in R that contains information about the minutes played, total fouls, and total points scored by 10 basketball players:

    #create data frame
    df <- data.frame(minutes=c(5, 10, 13, 14, 20, 22, 26, 34, 38, 40),
                     fouls=c(5, 5, 3, 4, 2, 1, 3, 2, 1, 1),
                     points=c(6, 8, 8, 7, 14, 10, 22, 24, 28, 30))

    #view data frame
    df

       minutes fouls points
    1        5     5      6
    2       10     5      8
    3       13     3      8
    4       14     4      7
    5       20     2     14
    6       22     1     10
    7       26     3     22
    8       34     2     24
    9       38     1     28
    10      40     1     30

Suppose we would like to fit the following multiple linear regression model:

**points = β₀ + β₁(minutes) + β₂(fouls)**

However, suppose we only want to use the rows in the data frame where the **minutes** variable is greater than 10.

We can use the **lm()** function with the **subset** argument to fit this regression model:

    #fit multiple linear regression model (only for rows where minutes>10)
    fit <- lm(points ~ fouls + minutes, data=df, subset=(minutes>10))

    #view model summary
    summary(fit)

    Call:
    lm(formula = points ~ fouls + minutes, data = df, subset = (minutes > 10))

    Residuals:
          3       4       5       6       7       8       9      10 
     1.2824 -2.5882  2.2000 -1.9118  2.3588 -1.7176  0.1824  0.1941 

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) -11.8353     4.9696  -2.382 0.063046 .  
    fouls         1.8765     1.0791   1.739 0.142536    
    minutes       0.9941     0.1159   8.575 0.000356 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 2.255 on 5 degrees of freedom
    Multiple R-squared:  0.9574,  Adjusted R-squared:  0.9404 
    F-statistic: 56.19 on 2 and 5 DF,  p-value: 0.0003744

We can use the **nobs()** function to see how many observations from the data frame were actually used to fit the regression model:

    #view number of observations used to fit model
    nobs(fit)

    [1] 8

We can see that **8** rows from the data frame were used to fit the model.

If we view the original data frame, we can see that exactly **8** rows (rows 3 through 10) had a value greater than 10 for the **minutes** variable, which means only those rows were used when fitting the regression model.
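One way to double-check this is to compare the model against one fit on a data frame that was filtered ahead of time. The following is a minimal sketch of that check; the object names fit_subset, df_filtered, and fit_filtered are chosen here for illustration:

```r
#recreate the data frame from this example
df <- data.frame(minutes=c(5, 10, 13, 14, 20, 22, 26, 34, 38, 40),
                 fouls=c(5, 5, 3, 4, 2, 1, 3, 2, 1, 1),
                 points=c(6, 8, 8, 7, 14, 10, 22, 24, 28, 30))

#fit model using the subset argument
fit_subset <- lm(points ~ fouls + minutes, data=df, subset=(minutes>10))

#fit the same model on a data frame that was filtered ahead of time
df_filtered <- df[df$minutes > 10, ]
fit_filtered <- lm(points ~ fouls + minutes, data=df_filtered)

#view row names of the observations kept by the subset argument
rownames(model.frame(fit_subset))

#check that both approaches give the same coefficient estimates
all.equal(coef(fit_subset), coef(fit_filtered))
```

The row names should be 3 through 10, and the coefficient estimates from the two approaches should match. The **subset** argument is simply a more compact way to apply the filter without creating a second data frame.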

We can also use the **&** operator in the **subset** argument to subset the data frame by multiple conditions.

For example, we could use the following syntax to fit a regression model using only the rows in the data frame where **minutes** is greater than 10 and **fouls** is less than 4:

    #fit multiple linear regression model (only where minutes>10 & fouls<4)
    fit <- lm(points ~ fouls + minutes, data=df, subset=(minutes>10 & fouls<4))

    #view number of observations used to fit model
    nobs(fit)

    [1] 7

From the output we can see that **7** rows from the data frame were used to fit this particular model.
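To see which rows satisfy both conditions, we can evaluate the same logical expression directly on the data frame. This is a minimal sketch; the name keep is chosen here for illustration:

```r
#recreate the data frame from this example
df <- data.frame(minutes=c(5, 10, 13, 14, 20, 22, 26, 34, 38, 40),
                 fouls=c(5, 5, 3, 4, 2, 1, 3, 2, 1, 1),
                 points=c(6, 8, 8, 7, 14, 10, 22, 24, 28, 30))

#logical vector marking the rows that satisfy both conditions
keep <- with(df, minutes > 10 & fouls < 4)

#view the rows that satisfy both conditions
df[keep, ]

#count the rows that satisfy both conditions
sum(keep)
```

Rows 1 and 2 are dropped because **minutes** is not greater than 10, and row 4 is dropped because **fouls** is equal to 4, which leaves the 7 rows used to fit the model.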

**Additional Resources**

The following tutorials explain how to perform other common tasks in R:

How to Perform Simple Linear Regression in R

How to Perform Multiple Linear Regression in R

How to Create a Residual Plot in R
