Simple Linear Regression with Python

4 min read · Feb 12, 2021

Today we are going to see if there is a relationship between the daily minimum and maximum temperatures recorded at various weather stations around the world during World War II. The dataset also contains other information, including precipitation, snowfall, wind speed and whether the day included thunderstorms or other poor weather conditions. We are just going to look at minimum and maximum temperatures. Can we predict the maximum temperature given the minimum temperature?

First let’s get our imports into our notebook.
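A minimal sketch of what that import cell might look like. The choice of pandas, NumPy and scikit-learn is an assumption here, based on the steps that follow.

```python
# Assumed imports for this walkthrough: pandas and NumPy for data handling,
# scikit-learn for splitting the data and fitting the regression.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```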

That is the easiest part. The next thing we need to do is read in the dataset we want to analyze.

We downloaded this dataset from https://www.kaggle.com/smid80/weatherww2/data
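Reading the file could look roughly like the sketch below. The file name `Summary of Weather.csv` and the helper `load_weather` are assumptions for illustration; point the path at wherever your copy of the download lives.

```python
import pandas as pd

# Assumed file name for the downloaded Kaggle CSV; adjust to your local copy.
CSV_PATH = "Summary of Weather.csv"


def load_weather(source):
    """Read the weather CSV from a path or file-like object into a DataFrame."""
    # low_memory=False lets pandas infer each column's dtype from the whole
    # file, which avoids mixed-type warnings on columns with irregular values.
    return pd.read_csv(source, low_memory=False)


# In the notebook this would be called as: df = load_weather(CSV_PATH)
```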

In statistics, linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. We need to create those variables.

X is what we will be using to try to predict y.

I always like to check along the way to see if I’m doing what I think I’m doing. Is “X” a pandas DataFrame of the minimum temperatures? Is “y” a pandas Series of the maximum temperatures?
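Here is a small sketch of creating the two variables and checking their types. The five rows are made-up stand-in values, and the column names `MinTemp` and `MaxTemp` are an assumption about the Kaggle CSV.

```python
import pandas as pd

# Tiny made-up stand-in for the real dataset; the column names MinTemp and
# MaxTemp are an assumption about the Kaggle CSV.
df = pd.DataFrame({
    "MinTemp": [22.2, 21.7, 22.2, 18.3, 15.0],
    "MaxTemp": [25.6, 28.9, 26.1, 26.7, 24.4],
})

X = df[["MinTemp"]]  # double brackets keep X two-dimensional (a DataFrame)
y = df["MaxTemp"]    # single brackets give a one-dimensional Series

# Quick sanity check of the types and shapes.
print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

scikit-learn expects the features as a two-dimensional table even when there is only one column, which is why X uses double brackets.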

Yes they are. Now we have to do what is called a train-test split. A train-test split is a technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and for any supervised learning algorithm. We will be using it for our regression problem.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its inputs are given to the model, and the model’s predictions are compared to the expected values. This second subset is referred to as the test dataset.

The objective is to estimate the performance of the model on new data, that is, data not used to train it. This is how we expect to use the model in practice: we fit it on available data with known inputs and outputs, then make predictions on new examples where we do not have the expected output or target values.
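The split described above can be sketched with scikit-learn’s `train_test_split`. The data below is synthetic stand-in data, not the Kaggle dataset, and the 25% test size and `random_state=42` are illustrative choices rather than the values used in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Kaggle dataset): max is roughly min plus ten.
rng = np.random.default_rng(0)
min_temp = rng.uniform(-10, 30, size=200)
df = pd.DataFrame({
    "MinTemp": min_temp,
    "MaxTemp": min_temp + 10 + rng.normal(0, 4, size=200),
})
X, y = df[["MinTemp"]], df["MaxTemp"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```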

Done.

Now we’ll fit the model on the training data.
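A minimal sketch of the fitting step, assuming scikit-learn’s `LinearRegression`. The training data is synthetic and generated from a known line, so we can see that the fitted slope and intercept land close to the values we built in.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data from a known line: max = 1.0 * min + 10, plus noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(-10, 30, size=(150, 1))
y_train = 1.0 * X_train[:, 0] + 10 + rng.normal(0, 4, size=150)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted line is: predicted max = intercept_ + coef_[0] * min
print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
```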

Now let’s try to predict the test data.

Let’s take a look at some of the predictions.
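Predicting on the test set and eyeballing a few rows might look like the sketch below. It again uses synthetic stand-in data, and the `comparison` table is a hypothetical helper for lining up actual and predicted values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Kaggle dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({"MinTemp": rng.uniform(-10, 30, size=200)})
y = (X["MinTemp"] + 10 + rng.normal(0, 4, size=200)).rename("MaxTemp")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)  # one predicted max per test row

# Put actual and predicted values side by side, with the residual for each row.
comparison = pd.DataFrame({"actual": y_test.to_numpy(), "predicted": y_pred})
comparison["error"] = comparison["actual"] - comparison["predicted"]
print(comparison.head())
```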

How do we know if our model is any good?

The RMSE (root mean squared error) is the square root of the mean of the squared residuals. It indicates the absolute fit of the model to the data: how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. It can be interpreted as the standard deviation of the unexplained variation, and it has the useful property of being in the same units as the response variable, here degrees. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction. The notebook reported an RMSE of 1.4171411314269178e+19. Since RMSE is in degrees and lower is better, a value that large cannot describe a good fit; it most likely points to a problem in how the metric was computed, so it should be rechecked rather than read as a result.
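To make the definition concrete, here is a small sketch that computes RMSE by hand with NumPy. The helper name `rmse` and the three temperature pairs are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((actual - predicted) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return float(np.sqrt(np.mean(residuals ** 2)))


# Residuals are 1, 0 and -2, so the mean squared error is 5/3
# and the RMSE is about 1.29 degrees.
print(rmse([25.0, 30.0, 28.0], [24.0, 30.0, 30.0]))
```

Because the result is in the same units as the temperatures, a sensible RMSE for this kind of model should be on the scale of degrees.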

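The Train score is how well the model fits the training data. Assuming the score is R-squared (the relative measure of fit mentioned above, and what scikit-learn regressors report by default), a Train score of ~0.77 means the model explains about 77% of the variance in maximum temperature for the minimum temperatures it was already shown. It does not mean the prediction is right 77% of the time.

The Test score is the same measure on the testing data, which is data the model hasn’t seen yet, so it tells us how well the model generalizes. Our model also explains ~77% of the variance in maximum temperature when given new minimum temperatures.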
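The two scores could be computed as in the sketch below, assuming they come from scikit-learn’s `score` method, which returns R-squared for regressors. The data is synthetic, so the numbers it prints will not match the ~77% from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Kaggle dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({"MinTemp": rng.uniform(-10, 30, size=200)})
y = (X["MinTemp"] + 10 + rng.normal(0, 4, size=200)).rename("MaxTemp")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared: the share of variance in y explained by the model.
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print("Train R-squared:", train_score)
print("Test R-squared:", test_score)
```

Train and Test scores that sit close together, as they do in the article, suggest the model is not overfitting the training data.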

A score of ~77% is definitely not the best we can do when predicting the maximum temperature. We can probably do better if we give our model more “X” variables to look at. For example, if we gave the model precipitation, wind speed, air pressure, humidity, etc., we would likely see our Train and Test scores go up.

Written by Mathewkatz