Datasets Deep Learning Client App Tutorial
Video Tutorial
The purpose of this sample project is to show you how to use datasets and data science to predict if a customer loan will be approved or denied.
The video tutorial below summarizes the steps you need to take to build the deep-learning model using the datasets.
In the next sections you will find the detailed transcript of the video and the corresponding code.
This sample uses copies of two datasets from FusionCreator: Customer Loans and Demographics. You download the copies from your Azure Blob Storage.
Get it from GitHub
If you have a GitHub account, here’s the link to the sample app repository: https://github.com/FusionFabric/ffdc-sample-deeplearning
Clone it and follow the instructions from the README.md file.
Prerequisites
To start building this client application:
- Register an application on FusionCreator that includes the two previously mentioned datasets from the Dataset Catalog. After registration, download the datasets from Azure. See step 6 of the Application Wizard and Datasets for more information.
- You need a Python installation on your machine. However, due to the dependency on the keras library, you must install a compatible Python version, such as 2.7 or 3.6. See the library repository README.md for details.
- Install Jupyter Notebook, which allows you to run the deep-learning program found in the GitHub repository:
pip install notebook
- Install the Python libraries that you will use for this sample application (an optional version check follows this list):
pip install pandas numpy scipy matplotlib gmaps scikit-learn keras tensorflow
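Optionally, you can verify that the environment is set up correctly by printing the installed versions. A minimal check (the exact version numbers on your machine will differ):
# Optional sanity check: print the versions of the main libraries
import sys
import pandas, numpy, sklearn, keras, tensorflow
print('Python:', sys.version)
print('pandas:', pandas.__version__)
print('numpy:', numpy.__version__)
print('scikit-learn:', sklearn.__version__)
print('keras:', keras.__version__)
print('tensorflow:', tensorflow.__version__)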
Create a Python model file and name it datasets-neural-network.py. You will add all the required code for your client application to this file.
Copy the following code to your model file to import the libraries used to handle and manipulate the datasets.
#Data Handling and Manipulation
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

#Statistical Analysis
from scipy import stats

#Plotting and Visualization
import matplotlib.pyplot as plt
import matplotlib.pylab as plt
plt.rcParams['figure.dpi'] = 120
import gmaps
!jupyter nbextension enable --py gmaps

#Machine Learning
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
Load Data
In this section you load the Customer Loans and Demographics datasets you downloaded from your Azure Data Share. Create a folder named Data in your working directory. Copy the datasets to this folder.
Load the customer_loans.csv dataset to your model and see its top 5 entries:
df_loans = pd.read_csv('Data/customer_loans.csv', sep=',')
df_loans.head()
Load the customer_demographics.csv dataset to your model and see its top 5 entries:
df_customer = pd.read_csv('Data/customer_demographics.csv', sep=',')
df_customer.head()
Examine Missing Data
The next code lines check if there are any columns with null values in the datasets. A zero value means that there are no null values in that column.
df_loans.isna().sum()
df_customer.isna().sum()
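The sample datasets are expected to be clean. If your copies did contain missing values, you would need to handle them before modeling; a minimal sketch, assuming you simply drop incomplete rows:
# Only needed if the checks above report missing values;
# drop incomplete rows, or use fillna() to impute values instead
df_loans = df_loans.dropna()
df_customer = df_customer.dropna()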
Examine Outliers
In this section you check whether the numerical columns of the loaded data contain any outliers beyond three standard deviations from the mean.
num_cols_loans = ['Income','CreditScore','Debt','LoanTerm','InterestRate','CreditIncidents','HomeValue','LoanAmount']
df_loans[(np.abs(stats.zscore(df_loans[num_cols_loans])) > 3).all(axis=1)]

num_cols_customer = ['Age','Income','CreditScore','HouseholdSize','MedianHomeValue','Debt']
df_customer[(np.abs(stats.zscore(df_customer[num_cols_customer])) > 3).all(axis=1)]
If these expressions return no rows, the data contains no such outliers and can be considered clean.
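For reference, the z-score used above measures how many standard deviations a value lies from its column mean; values with an absolute z-score greater than 3 are flagged. A tiny illustration with made-up numbers:
# Illustration only: 14 values near 10 and one extreme value of 100
sample = np.array([9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10, 10, 9, 11, 100])
print(np.abs(stats.zscore(sample)) > 3)  # only the last entry is flagged as an outlier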
Descriptive Statistics
In this section you see the descriptive statistics for the numerical columns of the data loaded into your model, such as count, mean, standard deviation, minimum value, maximum value, and the 25%, 50%, and 75% percentiles.
df_loans[num_cols_loans].describe()
df_customer[num_cols_customer].describe()
Data Histograms
Use the code lines below to plot histograms of the distribution of the numerical columns:
df_loans[num_cols_loans].hist()
plt.tight_layout()

df_customer[num_cols_customer].hist()
plt.tight_layout()
Customer Location Heatmap
You generate a heatmap that shows the location of the customers, based on the latitude and longitude columns from the Demographics dataset. The red color on the heatmap indicates a higher density of customers.
= df_customer["Lat"]
latitudes = df_customer["Long"]
longitudes
= np.array(list(zip(latitudes,longitudes)))
locations = gmaps.figure()
fig
fig.add_layer(gmaps.heatmap_layer(locations)) fig
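Note that the gmaps library needs a Google Maps API key to render the map. If the heatmap does not display, configure the key before creating the figure; a minimal sketch with a placeholder key:
# Assumption: you have your own Google Maps API key; the value below is a placeholder
gmaps.configure(api_key='YOUR_GOOGLE_MAPS_API_KEY')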
Join Datasets Together
Until now, you analyzed the datasets independently. The following code lines join the Customer Loans and Demographics datasets on the customer ID and display the top 5 entries of the joined dataset.
cols_to_use = df_customer.columns.difference(df_loans.columns).tolist()
cols_to_use.append('custid')
df = df_loans.merge(df_customer[cols_to_use], on='custid')
df.head()
Create Features
This section demonstrates how to create a base feature set for your deep learning model. You use some of the columns from the previously joined data.
You build a neural network classifier which predicts “Approved” or “Denied” based on the input features.
The target variable is the LoanStatus column, which takes the values “Approved” or “Denied”.
num_features = ['Income','CreditScore','Debt','LoanTerm','InterestRate','CreditIncidents','HomeValue','LoanAmount',
                'HouseholdSize','Lat','Long','MedianHomeValue','MedianHouseholdIncome']
df_features = df[num_features]
df_product_type = pd.get_dummies(df.ProductType, prefix='ProductType')
df_features = pd.concat([df_features, df_product_type], axis=1)
features = df_features.values
targets = np.argmax(pd.get_dummies(df.LoanStatus).values, axis=1)
Scale Data
Scale your data using MinMaxScaler from the sklearn library.
scaler = MinMaxScaler()
X = scaler.fit_transform(features)
Neural Network Model Creation
Start building your neural network model by defining two functions.
- create_model - builds a feed-forward neural network for binary classification
def create_model(input_shape):
    model = Sequential()
    model.add(Dense(128, input_dim=input_shape, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model
- train_and_evaluate_model - trains the network and returns the validation accuracy for the K-Fold cross validation
def train_and_evaluate_model(model, data_train, labels_train, data_test, labels_test):
    history = model.fit(data_train, labels_train, validation_data=(data_test, labels_test), epochs=30, batch_size=128)
    val_acc = history.history['val_accuracy'][-1]
    return val_acc, history
If you run this model on a macOS machine, replace
val_acc = history.history['val_accuracy'][-1]
with
val_acc = history.history['val_acc'][-1]
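Alternatively, you can make train_and_evaluate_model robust to either key name, so the same line works in both cases; a minimal sketch:
# Falls back to 'val_acc' when 'val_accuracy' is not present in the training history
val_acc = history.history.get('val_accuracy', history.history.get('val_acc'))[-1]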
K-Fold Cross Validation
In this section you set up K-Fold cross validation, which trains and tests the model so that every data point is used for both training and testing across the folds. You use 3 different splits for training the model. After the run, the closer the estimated accuracy is to 1, the better the model.
scores = []
models = []
historys = []
num_splits = 3
kf = KFold(n_splits=num_splits)
kf.get_n_splits(X)
input_shape = X.shape[1]

fold = 0
for train_index, test_index in kf.split(X):
    print("Running fold {}".format(fold))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = targets[train_index], targets[test_index]
    model = create_model(input_shape)
    score, history = train_and_evaluate_model(model, X_train, y_train, X_test, y_test)
    scores.append(score)
    models.append(model)
    historys.append(history)
    fold += 1

print('\n\nEstimated Accuracy ', (np.round(np.mean(scores), 2)), ' %')
Model Creation After K-Fold Cross Validation
If the K-Fold cross validation accuracy is high, you can use your model for prediction.
Retrain your model on a larger training split to fit the final model.
X_train, X_test, y_train, y_test = train_test_split(X, targets, test_size=0.20, random_state=42)
model = create_model(input_shape)
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=128)
Model Performance
The code lines below plot the loss and the accuracy for the train and test set.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model acc')
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
If you run this model on a macOS machine, replace:
plt.plot(history.history['accuracy'])
with
plt.plot(history.history['acc'])
and
plt.plot(history.history['val_accuracy'])
with
plt.plot(history.history['val_acc'])
Final Code Review
Here is the complete code discussed on this page.
#Data Handling and Manipulation
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

#Statistical Analysis
from scipy import stats

#Plotting and Visualization
import matplotlib.pyplot as plt
import matplotlib.pylab as plt
plt.rcParams['figure.dpi'] = 120
import gmaps
!jupyter nbextension enable --py gmaps

#Machine Learning
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier

df_loans = pd.read_csv('Data/customer_loans.csv', sep=',')
df_loans.head()

df_customer = pd.read_csv('Data/customer_demographics.csv', sep=',')
df_customer.head()

df_loans.isna().sum()
df_customer.isna().sum()

num_cols_loans = ['Income','CreditScore','Debt','LoanTerm','InterestRate','CreditIncidents','HomeValue','LoanAmount']
df_loans[(np.abs(stats.zscore(df_loans[num_cols_loans])) > 3).all(axis=1)]

num_cols_customer = ['Age','Income','CreditScore','HouseholdSize','MedianHomeValue','Debt']
df_customer[(np.abs(stats.zscore(df_customer[num_cols_customer])) > 3).all(axis=1)]

df_loans[num_cols_loans].describe()
df_customer[num_cols_customer].describe()

df_loans[num_cols_loans].hist()
plt.tight_layout()

df_customer[num_cols_customer].hist()
plt.tight_layout()

latitudes = df_customer["Lat"]
longitudes = df_customer["Long"]

locations = np.array(list(zip(latitudes, longitudes)))
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locations))
fig

cols_to_use = df_customer.columns.difference(df_loans.columns).tolist()
cols_to_use.append('custid')
df = df_loans.merge(df_customer[cols_to_use], on='custid')
df.head()

num_features = ['Income','CreditScore','Debt','LoanTerm','InterestRate','CreditIncidents','HomeValue','LoanAmount',
                'HouseholdSize','Lat','Long','MedianHomeValue','MedianHouseholdIncome']
df_features = df[num_features]
df_product_type = pd.get_dummies(df.ProductType, prefix='ProductType')
df_features = pd.concat([df_features, df_product_type], axis=1)
features = df_features.values
targets = np.argmax(pd.get_dummies(df.LoanStatus).values, axis=1)

scaler = MinMaxScaler()
X = scaler.fit_transform(features)

def create_model(input_shape):
    model = Sequential()
    model.add(Dense(128, input_dim=input_shape, activation='relu'))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
    return model

def train_and_evaluate_model(model, data_train, labels_train, data_test, labels_test):
    history = model.fit(data_train, labels_train, validation_data=(data_test, labels_test), epochs=30, batch_size=128)
    val_acc = history.history['val_accuracy'][-1]
    return val_acc, history

scores = []
models = []
historys = []
num_splits = 3
kf = KFold(n_splits=num_splits)
kf.get_n_splits(X)
input_shape = X.shape[1]

fold = 0
for train_index, test_index in kf.split(X):
    print("Running fold {}".format(fold))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = targets[train_index], targets[test_index]
    model = create_model(input_shape)
    score, history = train_and_evaluate_model(model, X_train, y_train, X_test, y_test)
    scores.append(score)
    models.append(model)
    historys.append(history)
    fold += 1

print('\n\nEstimated Accuracy ', (np.round(np.mean(scores), 2)), ' %')

X_train, X_test, y_train, y_test = train_test_split(X, targets, test_size=0.20, random_state=42)
model = create_model(input_shape)
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=128)

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model acc')
plt.ylabel('acc')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()