# Regression with Keras

In this tutorial, you will learn how to perform regression using Keras and deep learning. You will learn how to train a Keras neural network for regression and continuous value prediction, specifically in the context of house price prediction.

Today's post kicks off a three-part series on deep learning, regression, and continuous value prediction.

We will study Keras regression in the context of house price prediction:

Part 1: Today we will train a Keras neural network to predict house prices based on categorical and numerical attributes (e.g. the number of bedrooms/bathrooms, square footage, zip code, etc.).

Part 2: Next week we will train a Keras convolutional neural network to predict house prices based on input images of the house itself (i.e. the front view of the house, bedroom, bathroom, and kitchen).

Part 3: In two weeks, we will define and train a neural network that combines our categorical/numerical attributes with our images, leading to better, more accurate house price predictions than attributes or images alone.

Unlike classification, which predicts labels, regression enables us to predict continuous values.

For example, classification may be able to predict one of the following values: {cheap, affordable, expensive}.

Regression, on the other hand, can predict an exact dollar amount, such as "the estimated price of this house is \$489,121".

In many real-world situations, such as house price prediction or stock market forecasting, applying regression rather than classification is critical to obtaining good predictions.

To learn how to perform regression using Keras, read on!

In the first part of this tutorial, we will briefly discuss the differences between classification and regression.

We will then explore the house price data set used in this series of Keras regression tutorials. From there, we will configure our development environment and review our project structure.

Along the way, we will learn how to use pandas to load our house price data set and define a neural network for Keras regression prediction.

Finally, we will train our Keras network and then evaluate the regression results.

# Classification and regression

Figure 1: Classification networks predict labels (top). In contrast, regression networks can predict numerical values (bottom). In this blog post, we will use Keras to perform regression on a housing data set.

Typically, we discuss Keras and deep learning in the context of classification: predicting a label to characterize the contents of an image or an input set of data. Regression, on the other hand, enables us to predict continuous values. Let's again consider the task of house price prediction. As we know, classification is used to predict a class label. For house price prediction, we could define our categorical labels as:

```
labels = {very cheap, cheap, affordable, expensive, very expensive}
```

If we performed classification, our model could learn to predict one of these five values based on a set of input features.

However, those labels only represent a potential price range for the house; they do not represent its actual cost.
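To make the contrast concrete, here is a minimal sketch that turns a dollar amount into one of the five labels above. The price thresholds are invented purely for illustration; they are not part of the data set or of the code used in this series.

```python
from bisect import bisect_right

# Hypothetical price boundaries (in dollars) separating the five labels.
# These thresholds are assumptions made for this example only.
THRESHOLDS = [200_000, 400_000, 700_000, 1_000_000]
LABELS = ["very cheap", "cheap", "affordable", "expensive", "very expensive"]


def price_to_label(price):
    # bisect_right counts how many thresholds are <= price, which is
    # the index of the bucket the price falls into
    return LABELS[bisect_right(THRESHOLDS, price)]


# a regression model outputs the dollar amount itself, while a
# classifier can only tell us which bucket that amount falls into
for price in (781_993, 869_500, 910_000):
    print(price, "->", price_to_label(price))
```

With these made-up thresholds, all three prices land in the same bucket, which is exactly the detail a classifier discards.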

In order to predict the actual cost of a house, we need to perform regression.

Using regression, we can train a model to predict a continuous value. For example, while classification may only be able to predict a label, regression could say: "Based on my input data, I estimate the cost of this house to be \$781,993."

Figure 1 above provides a visualization of performing both classification and regression. In the rest of this tutorial, you will learn how to train a neural network for regression using Keras.

# House price data set

The data set we will use today comes from the 2016 paper "House Price Estimation from Visual and Textual Features" by Ahmed and Moustafa. It includes both numerical/categorical attributes and images for 535 data points, making it an excellent data set for studying regression and mixed data prediction.

The house data set includes four numerical and categorical attributes:

• Number of bedrooms
• Number of bathrooms
• Area (i.e. square feet)
• Zip code

These attributes are stored on disk in CSV format. Later in this tutorial, we will load them from disk using pandas, a popular Python package for data analysis.

A total of four images are also provided for each house:

• Bedroom
• Bathroom
• Kitchen
• Front view of the house

The end goal when working with the houses data set is to predict the price of the house itself.

# Environment configuration

For this 3-part series of blog posts, you need to install the following packages:

• NumPy
• scikit-learn
• pandas
• Keras
• OpenCV (for the next two posts in this series)

If you have git installed, you can download the data set by cloning its repository instead of opening the web link directly:

```
$ git clone https://github.com/emanhamed/Houses-dataset
```

# Project structure

```
$ tree --dirsfirst --filelimit 10
.
├── Houses-dataset
│   ├── Houses Dataset [2141 entries]
├── pyimagesearch
│   ├── __init__.py
│   ├── datasets.py
│   └── models.py
└── mlp_regression.py
```

datasets.py: our script for loading the numerical/categorical data from the data set.

models.py: our neural network model definitions.

We will review both of these scripts today. In addition, we will reuse datasets.py and models.py (with modifications) in the next two tutorials to keep our code organized and reusable.

mlp_regression.py: our Keras regression script, which we will also walk through.

# Load house price dataset

Before we can train our Keras regression model, we first need to load the numerical and categorical data for the houses data set. Open the datasets.py file and insert the following code:

```python
# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
import glob
import cv2
import os

def load_house_attributes(inputPath):
    # initialize the list of column names in the CSV file and then
    # load it using Pandas
    cols = ["bedrooms", "bathrooms", "area", "zipcode", "price"]
    df = pd.read_csv(inputPath, sep=" ", header=None, names=cols)
```

We begin by importing libraries and modules from scikit-learn, pandas, NumPy, and OpenCV. OpenCV will be used in the next post, when we add the ability to load images to this script.

We then define the load_house_attributes function, which accepts the path to the input data set. Inside the function, we first define the names of the columns in the CSV file. From there, we use pandas' read_csv function to load the CSV file into memory as a data frame (df).

Below you can see an example of our input data, including the number of bedrooms, the number of bathrooms, the area (i.e. square footage), the zip code, and finally the target price that our model should be trained to predict:

```
   bedrooms  bathrooms  area  zipcode     price
0         4        4.0  4053    85255  869500.0
1         4        3.0  3343    36372  865200.0
2         3        4.0  3923    85266  889000.0
3         5        5.0  4022    85262  910000.0
4         3        4.0  4116    85266  971226.0
```
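As a quick check of how read_csv behaves with these arguments, the following sketch parses the same five rows from an in-memory string instead of the file on disk. The use of io.StringIO and the exact text layout of the rows are assumptions made for illustration.

```python
import io

import pandas as pd

# the five sample rows shown above, written as space-separated text
# (the in-memory buffer stands in for the file on disk)
raw = io.StringIO(
    "4 4.0 4053 85255 869500.0\n"
    "4 3.0 3343 36372 865200.0\n"
    "3 4.0 3923 85266 889000.0\n"
    "5 5.0 4022 85262 910000.0\n"
    "3 4.0 4116 85266 971226.0\n"
)

# the data has no header row, so we supply the column names ourselves
cols = ["bedrooms", "bathrooms", "area", "zipcode", "price"]
df = pd.read_csv(raw, sep=" ", header=None, names=cols)

print(df.shape)
print(df["price"].max())
```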

Let's finish the rest of the load_house_attributes function:

```python
    # determine (1) the unique zip codes and (2) the number of data
    # points with each zip code
    zipcodes = df["zipcode"].value_counts().keys().tolist()
    counts = df["zipcode"].value_counts().tolist()

    # loop over each of the unique zip codes and their corresponding
    # count
    for (zipcode, count) in zip(zipcodes, counts):
        # the zip code counts for our housing dataset is *extremely*
        # unbalanced (some only having 1 or 2 houses per zip code)
        # so let's sanitize our data by removing any houses with less
        # than 25 houses per zip code
        if count < 25:
            idxs = df[df["zipcode"] == zipcode].index
            df.drop(idxs, inplace=True)

    # return the data frame
    return df
```

In the remaining lines, we:

• Determine the set of unique zip codes and then count the number of data points for each unique zip code.
• Filter out zip codes with low counts. For some zip codes we have only one or two data points, which makes it extremely challenging, if not impossible, to obtain accurate house price estimates.
• Return the data frame to the calling function.
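The same filter can also be written without an explicit loop. This is a sketch of an alternative, not the code used in this series: it relies on pandas' groupby and transform, and it uses a tiny made-up data frame with a threshold of 2 instead of 25 so the effect is easy to see.

```python
import pandas as pd

# toy data: zip code 85255 appears three times, the others only once
df = pd.DataFrame({
    "zipcode": [85255, 85255, 85255, 36372, 85266],
    "price": [869500.0, 865200.0, 889000.0, 910000.0, 971226.0],
})

# hypothetical threshold for this toy example (the tutorial uses 25)
min_houses = 2

# for every row, count how many rows share its zip code
houses_per_zip = df.groupby("zipcode")["zipcode"].transform("count")

# keep only the rows whose zip code is common enough
filtered = df[houses_per_zip >= min_houses]

print(filtered["zipcode"].unique().tolist())
```

Unlike the loop above, this version builds a new data frame rather than dropping rows in place.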

Now let's create the process_house_attributes function used to preprocess our data:

```python
def process_house_attributes(df, train, test):
    # initialize the column names of the continuous data
    continuous = ["bedrooms", "bathrooms", "area"]

    # perform min-max scaling on each continuous feature column to
    # the range [0, 1]
    cs = MinMaxScaler()
    trainContinuous = cs.fit_transform(train[continuous])
    testContinuous = cs.transform(test[continuous])
```

The process_house_attributes function accepts three parameters:

• df: our data frame generated by pandas (the previous function removes some records from it)
• train: our training data for the house price data set
• test: our testing data

We then define the columns of our continuous data, namely the number of bedrooms, the number of bathrooms, and the size of the house.

We take these values and use scikit-learn's MinMaxScaler to scale the continuous features to the range [0, 1]. Now we need to preprocess our categorical feature, namely the zip code:

```python
    # one-hot encode the zip code categorical data (by definition of
    # one-hot encoding, all output features are now in the range [0, 1])
    zipBinarizer = LabelBinarizer().fit(df["zipcode"])
    trainCategorical = zipBinarizer.transform(train["zipcode"])
    testCategorical = zipBinarizer.transform(test["zipcode"])

    # construct our training and testing data points by concatenating
    # the categorical features with the continuous features
    trainX = np.hstack([trainCategorical, trainContinuous])
    testX = np.hstack([testCategorical, testContinuous])

    # return the concatenated training and testing data
    return (trainX, testX)
```

First, we one-hot encode the zip codes.

Then, we use NumPy's hstack function to concatenate the categorical features with the continuous features, returning the resulting training and testing sets as a tuple. Keep in mind that both our categorical features and our continuous features have now been scaled to the range [0, 1].
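To see what these two steps produce, here is a small sketch on made-up data. The feature values and the three zip codes are invented; only MinMaxScaler, LabelBinarizer, and np.hstack are taken from the code above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MinMaxScaler

# made-up continuous features: bedrooms, bathrooms, area
train_cont = np.array([[2, 1.0, 1000], [4, 3.0, 3000], [3, 2.0, 2000]])
test_cont = np.array([[3, 2.0, 4000]])

# fit the scaler on the training rows only, then reuse it for testing
cs = MinMaxScaler()
train_scaled = cs.fit_transform(train_cont)
test_scaled = cs.transform(test_cont)

# one-hot encode three made-up zip codes (one column per zip code)
zips_train = np.array([85255, 36372, 85266])
zips_test = np.array([85266])
zb = LabelBinarizer().fit(zips_train)
train_cat = zb.transform(zips_train)
test_cat = zb.transform(zips_test)

# final feature vectors: categorical columns followed by continuous ones
trainX = np.hstack([train_cat, train_scaled])
testX = np.hstack([test_cat, test_scaled])

print(trainX.shape, testX.shape)
print(test_scaled)
```

Because the scaler is fit on the training rows only, a test value larger than the training maximum (the 4000 square foot house here) is mapped to a value above 1. The [0, 1] range is therefore guaranteed for the training data, and only approximately true for the test data.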

# Implementing a neural network for regression

Figure 5: Our Keras regression architecture. The input to the network is a data point including a house's number of bedrooms, number of bathrooms, area/square footage, and zip code. The output of the network is a single neuron with a linear activation function. Linear activation allows the neuron to output the predicted price of the house.

Before we train the Keras network for regression, we first need to define the architecture itself. Today we will use a simple multilayer perceptron (MLP), as shown in Figure 5. Open the models.py file and insert the following code:

```python
# import the necessary packages
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

def create_mlp(dim, regress=False):
    # define our MLP network (a dim-8-4 architecture)
    model = Sequential()
    model.add(Dense(8, input_dim=dim, activation="relu"))
    model.add(Dense(4, activation="relu"))

    # check to see if the regression node should be added
    if regress:
        model.add(Dense(1, activation="linear"))

    # return our model
    return model
```

First, we import all of the necessary modules from Keras. We then define the MLP architecture by writing a function named create_mlp. The function accepts two parameters:

• dim: defines our input dimensions
• regress: a Boolean defining whether or not our regression neuron should be added

We start constructing our MLP with a dim-8-4 architecture. If we are performing regression, we add a Dense layer containing a single neuron with a linear activation function. Typically we use ReLU-based activations, but since we are performing regression we need a linear activation. Finally, the model is returned.
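To make the role of the final linear neuron concrete, the sketch below runs one forward pass through a dim-8-4-1 network written in plain NumPy. It is not Keras code: the weights are random and untrained, the input size of 10 is arbitrary, and the helper names relu and forward are hypothetical.

```python
import numpy as np


def relu(x):
    # ReLU clips negative values to zero
    return np.maximum(x, 0.0)


def forward(x, params):
    # two hidden layers with ReLU, mirroring the 8- and 4-unit layers
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(x @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    # linear output: no activation, so the result is an unbounded real number
    return h2 @ w3 + b3


rng = np.random.default_rng(42)
dim = 10  # arbitrary input dimension for this example

# random (untrained) weights with shapes dim -> 8 -> 4 -> 1
sizes = [dim, 8, 4, 1]
params = [
    (rng.normal(size=(n_in, n_out)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

batch = rng.normal(size=(5, dim))
preds = forward(batch, params)
print(preds.shape)
```

Each input row yields a single continuous value. A ReLU on the last layer would clamp negative outputs to zero, and a softmax or sigmoid would squash them into a fixed range, which is why a linear output is used for regression.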

# Implement Keras regression script

Now it's time to put all the parts together!

Open the mlp_regression.py file and insert the following code:

```python
# import the necessary packages
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from pyimagesearch import datasets
from pyimagesearch import models
import numpy as np
import argparse
import locale
import os

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", type=str, required=True,
    help="path to input dataset of house images")
args = vars(ap.parse_args())
```

We first import the necessary packages, modules, and libraries.

Our script requires just a single command line argument, --dataset. When you run the training script in your terminal, you need to provide the --dataset switch along with the actual path to the data set.
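A quick way to see what args contains is to hand parse_args an explicit list instead of reading the real command line. The path below is only an example value.

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", type=str, required=True,
                help="path to input dataset of house images")

# passing a list makes the parser ignore sys.argv, which is handy
# for experimenting in a notebook or an interactive session
args = vars(ap.parse_args(["--dataset", "Houses-dataset/Houses Dataset"]))

print(args["dataset"])
```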

Let's load the house data set attributes and construct our training and testing splits:

```python
# construct the path to the input .txt file that contains information
# on each house in the dataset and then load the dataset
inputPath = os.path.sep.join([args["dataset"], "HousesInfo.txt"])
df = datasets.load_house_attributes(inputPath)

# construct a training and testing split with 75% of the data used
# for training and the remaining 25% for evaluation
print("[INFO] constructing training/testing split...")
(train, test) = train_test_split(df, test_size=0.25, random_state=42)
```

Using our handy load_house_attributes function, and by passing it the inputPath to the data set itself, our data is loaded into memory.

Our training (75%) and testing (25%) splits are then constructed with train_test_split. Let's scale our house price data:

```python
# find the largest house price in the training set and use it to
# scale our house prices to the range [0, 1] (this will lead to
# better training and convergence)
maxPrice = train["price"].max()
trainY = train["price"] / maxPrice
testY = test["price"] / maxPrice
```

As stated in the code comment, scaling our house prices to the range [0, 1] allows our model to train and converge more easily. Scaling the output targets to [0, 1] reduces the range of our output predictions (versus [0, maxPrice]), which not only makes our network easier and faster to train but also enables our model to obtain better results. Thus, we grab the maximum price in the training set and scale our training and testing data accordingly. Now let's process the house attributes:

```python
# process the house attributes data by performing min-max scaling
# on continuous features, one-hot encoding on categorical features,
# and then finally concatenating them together
print("[INFO] processing data...")
(trainX, testX) = datasets.process_house_attributes(df, train, test)
```

Recall from the datasets.py script that the process_house_attributes function:

• Preprocesses our categorical and continuous features.
• Scales our continuous features to the range [0, 1] via min-max scaling.
• One-hot encodes our categorical features.
• Concatenates the categorical and continuous features to form our final feature vector.

Now let's continue to train the MLP model:

```python
# create our MLP and then compile the model using mean absolute
# percentage error as our loss, implying that we seek to minimize
# the absolute percentage difference between our price *predictions*
# and the *actual prices*
model = models.create_mlp(trainX.shape[1], regress=True)
# NOTE: older Keras releases spell the first argument `lr`; the `decay`
# argument is only accepted by older optimizers and may need to be
# dropped on recent TensorFlow/Keras versions
opt = Adam(learning_rate=1e-3, decay=1e-3 / 200)
model.compile(loss="mean_absolute_percentage_error", optimizer=opt)

# train the model
print("[INFO] training model...")
model.fit(x=trainX, y=trainY,
    validation_data=(testX, testY),
    epochs=200, batch_size=8)
```

Our model is initialized with the Adam optimizer and then compiled. Notice that we use mean absolute percentage error as our loss function, indicating that we seek to minimize the mean percentage difference between the predicted price and the actual price.
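For reference, the mean absolute percentage error over $n$ houses, with actual prices $y_i$ and predicted prices $\hat{y}_i$, is commonly written as follows (the symbols are introduced here for illustration and do not appear in the code):

$$
\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$

Because the numerator and the denominator are both prices, dividing every price by maxPrice leaves this value unchanged, so the loss reported during training can be read directly as a percentage of the real dollar amounts.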

The call to model.fit then starts the actual training process.

After the training, we can evaluate our model and summarize our results:

```python
# make predictions on the testing data
print("[INFO] predicting house prices...")
preds = model.predict(testX)

# compute the difference between the *predicted* house prices and the
# *actual* house prices, then compute the percentage difference and
# the absolute percentage difference
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

# compute the mean and standard deviation of the absolute percentage
# difference
mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)

# finally, show some statistics on our model
locale.setlocale(locale.LC_ALL, "en_US.UTF-8")
print("[INFO] avg. house price: {}, std house price: {}".format(
    locale.currency(df["price"].mean(), grouping=True),
    locale.currency(df["price"].std(), grouping=True)))
print("[INFO] mean: {:.2f}%, std: {:.2f}%".format(mean, std))
```

The call to model.predict instructs Keras to make predictions on our testing set.

Using the predictions, we:

• Compute the difference between the predicted house prices and the actual house prices.
• Compute the percentage difference.
• Compute the absolute percentage difference.
• Compute the mean and standard deviation of the absolute percentage difference.
• Print the results.
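The sketch below reproduces these steps on three made-up houses so the numbers can be checked by hand. The prices, the predictions, and the value of maxPrice are all invented for illustration.

```python
import numpy as np

# invented values: the largest training price and three test houses
maxPrice = 1_000_000.0
testY = np.array([500_000.0, 800_000.0, 250_000.0]) / maxPrice
# model.predict returns a column vector, hence the (3, 1) shape
preds = np.array([[0.55], [0.72], [0.25]])

# difference, percentage difference, and absolute percentage difference
diff = preds.flatten() - testY
percentDiff = (diff / testY) * 100
absPercentDiff = np.abs(percentDiff)

mean = np.mean(absPercentDiff)
std = np.std(absPercentDiff)
print("mean: {:.2f}%, std: {:.2f}%".format(mean, std))

# multiplying by maxPrice converts scaled predictions back to dollars
predicted_dollars = preds.flatten() * maxPrice
print(predicted_dollars)
```

Here the first prediction is 10% too high, the second is 10% too low, and the third is exact. The absolute value keeps the over- and under-estimates from cancelling each other out in the mean.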

Regression using Keras is not that difficult, is it? Let's train the model and analyze the results!

# Keras regression results

Figure 6: the Keras regression model uses four numerical inputs to produce a numerical output: the predicted value of the house.

Open a terminal and run the following command (making sure the --dataset command line argument points to where you downloaded the house price data set):

```
$ python mlp_regression.py --dataset Houses-dataset/Houses\ Dataset/
[INFO] constructing training/testing split...
[INFO] processing data...
[INFO] training model...
Epoch 1/200
34/34 [==============================] - 0s 4ms/step - loss: 73.0898 - val_loss: 63.0478
Epoch 2/200
34/34 [==============================] - 0s 2ms/step - loss: 58.0629 - val_loss: 56.4558
Epoch 3/200
34/34 [==============================] - 0s 1ms/step - loss: 51.0134 - val_loss: 50.1950
Epoch 4/200
34/34 [==============================] - 0s 1ms/step - loss: 47.3431 - val_loss: 47.6673
Epoch 5/200
34/34 [==============================] - 0s 1ms/step - loss: 45.5581 - val_loss: 44.9802
Epoch 6/200
34/34 [==============================] - 0s 1ms/step - loss: 42.4403 - val_loss: 41.0660
Epoch 7/200
34/34 [==============================] - 0s 1ms/step - loss: 39.5451 - val_loss: 34.4310
Epoch 8/200
34/34 [==============================] - 0s 2ms/step - loss: 34.5027 - val_loss: 27.2138
Epoch 9/200
34/34 [==============================] - 0s 2ms/step - loss: 28.4326 - val_loss: 25.1955
Epoch 10/200
34/34 [==============================] - 0s 2ms/step - loss: 28.3634 - val_loss: 25.7194
...
Epoch 195/200
34/34 [==============================] - 0s 2ms/step - loss: 20.3496 - val_loss: 22.2558
Epoch 196/200
34/34 [==============================] - 0s 2ms/step - loss: 20.4404 - val_loss: 22.3071
Epoch 197/200
34/34 [==============================] - 0s 2ms/step - loss: 20.0506 - val_loss: 21.8648
Epoch 198/200
34/34 [==============================] - 0s 2ms/step - loss: 20.6169 - val_loss: 21.5130
Epoch 199/200
34/34 [==============================] - 0s 2ms/step - loss: 19.9067 - val_loss: 21.5018
Epoch 200/200
34/34 [==============================] - 0s 2ms/step - loss: 19.9570 - val_loss: 22.7063
[INFO] predicting house prices...
[INFO] avg. house price: $533,388.27, std house price: $493,403.08
[INFO] mean: 22.71%, std: 18.26%
```

As you can see from our output, our initial mean absolute percentage error starts off as high as 73% and then quickly drops to under 30%.

By the time we finish training, we can see that our network is starting to overfit a bit. Our training loss is as low as ~20%; however, our validation loss is at ~23%.

Computing our final mean absolute percentage error, we obtain a final value of 22.71%. What does this value mean?

Our final mean absolute percentage error implies that, on average, our network will be about 23% off in its house price predictions, with a standard deviation of about 18%.
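To put that percentage in dollar terms, here is a back-of-the-envelope calculation using the average house price printed in the log above. Applying the mean error to the average price is a simplification made for this example, since the error is really a per-house percentage.

```python
# values copied from the training output above
avg_price = 533_388.27
mean_abs_pct_error = 22.71

# typical size of the miss for an average-priced house, in dollars
typical_error = avg_price * mean_abs_pct_error / 100
print("${:,.2f}".format(typical_error))
```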

# Limitations of house price data sets

Being 22% off on a house price prediction is a good start, but it is certainly not the level of accuracy we are looking for. That said, this prediction accuracy can also be seen as a limitation of the house price data set itself. Keep in mind that the data set only includes four attributes:

• Number of bedrooms
• Number of bathrooms
• Area (i.e. square feet)
• Zip code

Most other house price datasets contain more attributes.

For example, the Boston house price data set includes a total of 14 attributes that can be leveraged for house price prediction (although that data set does encode some racial discrimination). The Ames House data set includes more than 79 different attributes that can be used to train regression models.

In the next two posts in this series, I will show you how to combine our numerical/categorical data with house images, leading to a model that outperforms all of our previous Keras regression experiments.

# Summary

In this tutorial, you learned how to use the Keras deep learning library for regression. Specifically, we used Keras and regression to predict house prices based on four numerical and categorical attributes:

• Number of bedrooms
• Number of bathrooms
• Area (i.e. square feet)
• Zip code

Overall, our neural network obtained a mean absolute percentage error of 22.71%, implying that, on average, our house price predictions will be off by 22.71%.