This example performs simple linear regression with a train-test split. The process is as follows:
1. Generate a synthetic dataset:
- We create 100 data points X uniformly distributed between 0 and 2.
- The target values y are generated using the linear relationship y = 4 + 3X, with added Gaussian noise for realism.
2. Split the dataset:
We use train_test_split to divide the data into training and testing sets, with 80% for training and 20% for testing.
3. Create and train the model:
We create a LinearRegression model instance and train it using the training data.
4. Make predictions and evaluate the model:
The model makes predictions on the test set, and we compute the Mean Squared Error (MSE) and Mean Absolute Error (MAE) to evaluate the model’s performance.
5. Plot the results:
We visualize the actual vs. predicted values to see how well the model has learned the relationship.
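To make the two metrics in step 4 concrete, here is a minimal sketch that computes them by hand on three made-up values. The helper names mse_by_hand and mae_by_hand and the sample numbers are illustrative assumptions, not part of the example itself. MSE averages the squared residuals, so large errors are penalized more heavily, while MAE averages their absolute values.

```python
def mse_by_hand(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_by_hand(y_true, y_pred):
    """Mean Absolute Error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only; the residuals are 0.5, 0.0 and -1.0
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

print(f"MSE: {mse_by_hand(y_true, y_pred):.4f}")  # (0.25 + 0 + 1) / 3 -> 0.4167
print(f"MAE: {mae_by_hand(y_true, y_pred):.4f}")  # (0.5 + 0 + 1) / 3 -> 0.5000
```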
Code in Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Generate a synthetic dataset
# Let's create a simple linear dataset with some noise
np.random.seed(0)
X = 2 * np.random.rand(100, 1) # 100 data points
y = 4 + 3 * X + np.random.randn(100, 1) # y = 4 + 3X + noise
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
# Plot the results
plt.scatter(X_test, y_test, color='blue', label='Actual values')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted values')
plt.title('Simple Linear Regression')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.show()
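Because the data were generated from y = 4 + 3X plus noise, one optional sanity check is to compare the fitted intercept and slope with the true values 4 and 3. The sketch below recreates the same data and split as the Python example and reads the estimates from the fitted model. The variable names intercept and slope are introduced here for illustration, and the estimates should only be expected to land near the true values, since they depend on the noise and on which points fall in the training set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the same synthetic data and split as in the Python example
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# y is 2-D with shape (n, 1), so intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = float(model.intercept_[0])
slope = float(model.coef_[0, 0])

print(f"Estimated intercept: {intercept:.3f} (true value: 4)")
print(f"Estimated slope: {slope:.3f} (true value: 3)")
```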
Code in R
First, install the necessary packages if you do not already have them installed:
# Install caTools
install.packages("caTools")
# Install ggplot2
install.packages("ggplot2")
Then, the code is as follows:
# Load necessary libraries
library(caTools)
library(ggplot2)
# Generate a synthetic dataset
set.seed(0)
X <- 2 * runif(100) # 100 data points uniformly distributed between 0 and 2
y <- 4 + 3 * X + rnorm(100) # y = 4 + 3X + noise
# Combine X and y into a data frame
data <- data.frame(X, y)
# Split the dataset into training and testing sets
set.seed(42)
split <- sample.split(data$y, SplitRatio = 0.8)
train_data <- subset(data, split == TRUE)
test_data <- subset(data, split == FALSE)
# Create the linear regression model
model <- lm(y ~ X, data = train_data)
# Make predictions on the test set
y_pred <- predict(model, newdata = test_data)
# Evaluate the model
mse <- mean((test_data$y - y_pred)^2)
mae <- mean(abs(test_data$y - y_pred))
cat('Mean Squared Error:', mse, '\n')
cat('Mean Absolute Error:', mae, '\n')
# Plot the results
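# 'label' is not a parameter of geom_point()/geom_line(); mapping color inside aes()
# and naming the colors in scale_color_manual() is what produces the legend
ggplot(test_data, aes(x = X)) +
  geom_point(aes(y = y, color = 'Actual values')) +
  # linewidth requires ggplot2 >= 3.4.0; use size = 1 on older versions
  geom_line(aes(y = y_pred, color = 'Predicted values'), linewidth = 1) +
  scale_color_manual(name = NULL,
                     values = c('Actual values' = 'blue', 'Predicted values' = 'red')) +
  ggtitle('Simple Linear Regression') +
  xlab('X') +
  ylab('y') +
  theme_minimal()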