Linear Regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. It is widely used in data science, machine learning, and business analytics.
What is Linear Regression?
Linear regression models the relationship between a target variable y and one or more predictors X by fitting a linear equation:
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ
- ŷ — Predicted value
- β₀ — Intercept
- β₁…βₚ — Coefficients (slopes)
- x₁…xₚ — Features
Objective
The goal is to estimate the coefficients that minimize the Residual Sum of Squares (RSS):
RSS = Σ(yᵢ – ŷᵢ)²
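To make this concrete, here is a quick sketch that computes the RSS for a candidate line on a toy dataset (the intercept and slope below are illustrative guesses, not fitted values):
import numpy as np
# Toy data: hours studied vs. exam score
x = np.array([1, 2, 3, 4, 5])
y = np.array([50, 55, 65, 70, 75])
# Candidate line: guessed intercept and slope (not the least-squares fit)
beta0, beta1 = 44.0, 6.5
y_hat = beta0 + beta1 * x
# Residual Sum of Squares: sum of squared prediction errors
rss = np.sum((y - y_hat) ** 2)
print("RSS:", rss)
Fitting the model means finding the β values that make this number as small as possible.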
Types of Linear Regression
- Simple Linear Regression — 1 feature
- Multiple Linear Regression — 2 or more features
Interpreting Coefficients
A coefficient βᵢ tells us that a one-unit increase in xᵢ changes the predicted y by βᵢ units, holding all other variables constant. For example, if the coefficient on Hours_Studied were 6.5, each additional hour of study would raise the predicted score by 6.5 points.
Implementation in Python
1️⃣ Using Statsmodels OLS
import statsmodels.api as sm
import pandas as pd
# Example dataset
df = pd.DataFrame({
'Hours_Studied': [1, 2, 3, 4, 5],
'Score': [50, 55, 65, 70, 75]
})
X = df[['Hours_Studied']]
y = df['Score']
# Add constant (for intercept)
X_sm = sm.add_constant(X)
# Fit model
model = sm.OLS(y, X_sm).fit()
# Summary
print(model.summary())
Key outputs:
- coef — the estimated β coefficients
- p-value — is each coefficient statistically significant?
- R-squared — the proportion of variance in y explained by X
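The same quantities can also be pulled from the fitted results object directly; params, pvalues, and rsquared are standard attributes of the statsmodels results:
# Access key quantities programmatically
print(model.params)    # β coefficients (const and Hours_Studied)
print(model.pvalues)   # p-value for each coefficient
print(model.rsquared)  # R-squared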
2️⃣ Using Scikit-learn LinearRegression
from sklearn.linear_model import LinearRegression
import numpy as np
# Model
lr = LinearRegression()
# Fit
lr.fit(X, y)
# Coefficients
print("Intercept:", lr.intercept_)
print("Coefficient:", lr.coef_[0])
# Predictions
y_pred = lr.predict(X)
# Evaluate model
from sklearn.metrics import mean_squared_error, r2_score
rmse = np.sqrt(mean_squared_error(y, y_pred))
r2 = r2_score(y, y_pred)
print("RMSE:", rmse)
print("R²:", r2)
Handling Categorical Variables
Categorical variables must be converted using dummy variables:
df = pd.DataFrame({
'Gender': ['Male', 'Female', 'Female', 'Male'],
'Score': [70, 65, 75, 80]
})
df_encoded = pd.get_dummies(df, columns=['Gender'], drop_first=True, dtype=int)  # dtype=int yields 0/1 rather than booleans
print(df_encoded)
This creates a binary column Gender_Male (1 for Male, 0 for Female); the dropped category, Female, becomes the baseline.
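The encoded column can then be used like any numeric feature. A minimal sketch with scikit-learn, where Gender_Male is the single predictor (lr_cat is just an illustrative name):
# Regress Score on the encoded Gender_Male column
lr_cat = LinearRegression()
lr_cat.fit(df_encoded[['Gender_Male']], df_encoded['Score'])
# Intercept = baseline (Female) mean; coefficient = Male vs. Female difference
print("Intercept:", lr_cat.intercept_)
print("Gender_Male coefficient:", lr_cat.coef_[0])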
Multiple Regression Example
# Multiple variables example
df = pd.DataFrame({
'Hours_Studied': [1, 2, 3, 4, 5],
'Sleep_Hours': [8, 7, 6, 6, 5],
'Score': [50, 55, 65, 70, 75]
})
X = df[['Hours_Studied', 'Sleep_Hours']]
y = df['Score']
# Using sklearn
lr = LinearRegression()
lr.fit(X, y)
print("Intercept:", lr.intercept_)
print("Coefficients:", lr.coef_)
Checking Assumptions
- Linearity → residuals vs. fitted plot (no visible pattern)
- Normality of residuals → QQ plot
- Homoscedasticity (constant error variance) → residuals vs. fitted plot
- Multicollinearity → VIF
Residual Plot
import matplotlib.pyplot as plt
# Recompute predictions from the current (multiple-regression) fit
y_pred = lr.predict(X)
residuals = y - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
QQ Plot
sm.qqplot(residuals, line='45')
plt.show()
VIF (Variance Inflation Factor)
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_sm = sm.add_constant(X)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X_sm.values, i+1) for i in range(len(X.columns))]
print(vif_data)
Evaluating Model Performance
- R² — proportion of variance in y explained by the model
- Adjusted R² — R² penalized for the number of predictors (see the sketch below)
- RMSE — typical magnitude of prediction errors, in the units of y
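statsmodels reports Adjusted R² directly via model.rsquared_adj; with scikit-learn you can compute it by hand from R². A sketch using the multiple-regression fit above:
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
n, p = X.shape  # n observations, p predictors
r2 = r2_score(y, lr.predict(X))
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Adjusted R²:", adj_r2)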
Summary
- Linear Regression models linear relationships.
- Use sklearn.LinearRegression for simple modeling.
- Use statsmodels.OLS for statistical analysis.
- Always check assumptions before interpreting results.
- Handle categorical variables via dummy variables.
Linear Regression remains one of the most interpretable and useful tools in the data scientist’s toolbox. Mastering both the theory and practical implementation allows you to build strong, explainable models.