**Linear regression** is a method we can use to quantify the relationship between one or more predictor variables and a response variable.

The following step-by-step example shows how to fit a linear regression model to a dataset in PySpark.

**Step 1: Create the Data**

First, let’s create the following PySpark DataFrame that contains information about hours spent studying, number of prep exams taken, and final exam score for various students at some university:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[1, 1, 76],
        [2, 3, 78],
        [2, 3, 85],
        [4, 5, 88],
        [2, 2, 72],
        [1, 2, 69],
        [5, 1, 94],
        [4, 1, 94],
        [2, 0, 88],
        [4, 3, 92],
        [4, 4, 90],
        [3, 3, 75],
        [6, 2, 96],
        [5, 4, 90],
        [3, 4, 82],
        [4, 4, 85],
        [6, 5, 99],
        [2, 1, 83],
        [1, 0, 62],
        [2, 1, 76]]

#define column names
columns = ['hours', 'prep_exams', 'score']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view first five rows of dataframe
df.limit(5).show()

+-----+----------+-----+
|hours|prep_exams|score|
+-----+----------+-----+
|    1|         1|   76|
|    2|         3|   78|
|    2|         3|   85|
|    4|         5|   88|
|    2|         2|   72|
+-----+----------+-----+
```

We will fit the following multiple linear regression model to this dataset:

**Exam Score = β₀ + β₁(hours) + β₂(prep exams)**

**Step 2: Format the Data**

Next, we’ll use the **VectorAssembler** function to convert the columns of the DataFrame to vectors, which is needed to fit a linear regression model in PySpark:

```python
from pyspark.ml.feature import VectorAssembler

#specify predictor variables
assembler = VectorAssembler(inputCols=['hours', 'prep_exams'], outputCol='features')

#format DataFrame for linear regression
data = assembler.transform(df)
data = data.select('features', 'score')
```

Note that we specified **hours** and **prep_exams** should be used as the predictor variables in the model and **score** should be used as the response variable.
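Conceptually, **VectorAssembler** simply packs the listed input columns into a single feature vector per row. A plain-Python sketch of that same transformation (illustrative only, not Spark code):

```python
#each row holds [hours, prep_exams, score]
rows = [[1, 1, 76], [2, 3, 78], [2, 3, 85]]

#pack the predictor columns into one 'features' list, keep 'score' as the label
assembled = [{'features': [hours, prep], 'score': score} for hours, prep, score in rows]

print(assembled[0])  #{'features': [1, 1], 'score': 76}
```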

**Step 3: Fit the Linear Regression Model**

Next, we’ll use the **LinearRegression** function to fit the linear regression model to the data:

```python
from pyspark.ml.regression import LinearRegression

#specify linear regression model to use
lin_reg = LinearRegression(featuresCol='features', labelCol='score', predictionCol='pred_score')

#fit linear regression model to data
fit = lin_reg.fit(data)
```

**Step 4: View Model Summary**

Lastly, we can use the following syntax to view the intercept and regression coefficients of the model along with the p-values for each coefficient and the R-squared value of the model:

```python
#view model intercept and coefficients
print(fit.intercept, fit.coefficients)

67.67352554133275 [5.555748295250626,-0.6016868046417222]

#view p-values of coefficients and intercept
print(fit.summary.pValues)

[1.0106866433545747e-05, 0.5193352264909648, 1.4654943925052066e-14]

#view R-squared of model
print(fit.summary.r2)

0.7340272170388176
```

From the output we can see:

- Intercept: **67.674**
- Coefficient of Hours: **5.556**
- Coefficient of Prep Exams: **-0.602**

We can use these values to write the fitted regression equation:

**Exam Score = 67.674 + 5.556(hours) – 0.602(prep_exams)**
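As a sanity check outside of Spark, the same coefficients can be recovered by solving the ordinary least squares normal equations directly in plain Python (a sketch on the same 20 rows; no PySpark required):

```python
#same dataset: [hours, prep_exams, score]
data = [[1, 1, 76], [2, 3, 78], [2, 3, 85], [4, 5, 88], [2, 2, 72],
        [1, 2, 69], [5, 1, 94], [4, 1, 94], [2, 0, 88], [4, 3, 92],
        [4, 4, 90], [3, 3, 75], [6, 2, 96], [5, 4, 90], [3, 4, 82],
        [4, 4, 85], [6, 5, 99], [2, 1, 83], [1, 0, 62], [2, 1, 76]]

#design matrix with an intercept column, and the response vector
X = [[1.0, h, p] for h, p, _ in data]
y = [float(s) for _, _, s in data]

#normal equations: (X'X) b = X'y
A = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
v = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]

def solve(A, v):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    b = [0.0] * n
    for r in range(n - 1, -1, -1):
        b[r] = (M[r][n] - sum(M[r][c] * b[c] for c in range(r + 1, n))) / M[r][r]
    return b

b0, b1, b2 = solve(A, v)
print(round(b0, 3), round(b1, 3), round(b2, 3))  #matches the PySpark output: roughly 67.674, 5.556, -0.602
```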

We can use this equation to find the estimated exam score for a student, based on the number of hours they studied and the number of prep exams they took.

For example, a student who studies for 3 hours and takes 2 prep exams is expected to receive an exam score of 83.1:

Estimated exam score = 67.674 + 5.556*(3) − 0.602*(2) = **83.1**
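The arithmetic above is easy to script; a small helper (the function name is our own, not part of PySpark) built on the rounded fitted coefficients:

```python
def predict_score(hours, prep_exams):
    """Estimated exam score from the fitted regression equation."""
    return 67.674 + 5.556 * hours - 0.602 * prep_exams

print(round(predict_score(3, 2), 1))  #83.1
```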

From the output we can also see the p-values for the intercept and coefficients in the model:

- P-value of Intercept: **<.0001**
- P-value of Hours: **<.0001**
- P-value of Prep Exams: **0.519**

(Note that in PySpark's **summary.pValues** list the coefficients come first and the intercept comes last.)

Since the p-value for hours is less than .05, this indicates that it has a statistically significant association with exam score.

However, the p-value for prep exams is not less than .05, which indicates that it *does not* have a statistically significant association with exam score.

We may decide to remove prep exams from the model since it isn’t statistically significant and instead perform simple linear regression using hours studied as the only predictor variable.
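For reference, that simple linear regression of score on hours alone can be computed by hand with the classic closed-form formulas (a plain-Python sketch; the resulting coefficients are not part of the PySpark output above):

```python
hours = [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4, 3, 6, 5, 3, 4, 6, 2, 1, 2]
score = [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90, 75, 96, 90, 82, 85, 99, 83, 62, 76]

n = len(hours)
mean_h = sum(hours) / n  #3.15
mean_s = sum(score) / n  #83.7

#slope = Sxy / Sxx, intercept = mean_s - slope * mean_h
sxy = sum((h - mean_h) * (s - mean_s) for h, s in zip(hours, score))
sxx = sum((h - mean_h) ** 2 for h in hours)
slope = sxy / sxx
intercept = mean_s - slope * mean_h

print(round(intercept, 2), round(slope, 2))  #about 67.16 5.25
```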

Lastly, we can see the R-squared value for the model:

- R-squared of model: **0.734**

This tells us that **73.4%** of the variation in exam scores can be explained by the number of hours studied and number of prep exams taken.
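R-squared can also be reproduced by hand as 1 − SSE/SST (a plain-Python sketch using the rounded fitted coefficients, so the result only approximates PySpark's 0.7340):

```python
#same dataset: [hours, prep_exams, score]
data = [[1, 1, 76], [2, 3, 78], [2, 3, 85], [4, 5, 88], [2, 2, 72],
        [1, 2, 69], [5, 1, 94], [4, 1, 94], [2, 0, 88], [4, 3, 92],
        [4, 4, 90], [3, 3, 75], [6, 2, 96], [5, 4, 90], [3, 4, 82],
        [4, 4, 85], [6, 5, 99], [2, 1, 83], [1, 0, 62], [2, 1, 76]]

#predictions from the fitted regression equation
preds = [67.674 + 5.556 * h - 0.602 * p for h, p, _ in data]
scores = [s for _, _, s in data]
mean_score = sum(scores) / len(scores)

sse = sum((s - yhat) ** 2 for s, yhat in zip(scores, preds))  #residual sum of squares
sst = sum((s - mean_score) ** 2 for s in scores)              #total sum of squares
r2 = 1 - sse / sst

print(round(r2, 3))  #about 0.734
```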

**Note:** Refer to the PySpark documentation for a complete list of regression summary metrics you can view.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame