Predicting Credit Risk In Germany Using Machine Learning

I. Introduction

In the finance industry, assessing the credit risk of each customer helps keep both the lender and the wider financial market stable. People come to a bank for loans in many forms: personal loans, car loans, credit cards, mortgages, and so on. Before issuing a loan, the bank has to assess the potential risk posed by the applicant.

● If an applicant has good credit (pays loans on time, takes responsibility for their debt, and so on), the bank will likely profit from the loan with little risk.
● Conversely, an applicant with bad credit is more likely to default and not repay the money, resulting in a loss for the bank.

The second case is the greater risk, because the bank (or any other entity that lends money to an untrustworthy party) has a higher chance of not being repaid the amount borrowed. It is therefore up to the bank or other lending authority to determine the risks associated with lending money to a customer. This research attempts to tackle the issue by using the demographic and socio-economic profiles of borrowers to estimate the credit risk of extending a loan to a client. In business terms, we are trying to mitigate risk and optimize benefits for the bank.

With the help of machine learning, the pre-screening process can be faster and more consistent than manual review, with the aim of high accuracy and less bias. Moreover, an automated model scales well, which suits the volume of applications that banks handle today.

II. Proposed Work


We use the German Credit Data from the UCI Machine Learning Repository, provided by Professor Dr. Hans Hofmann. The data is a record of information about loan applicants in Germany.

The dataset also comes in a numeric version, but we focus on the original data, which has 1000 rows and 21 columns (one of which is the target for prediction). There are 7 numeric columns and 14 categorical columns.

Cost Matrix

According to the dataset author’s note, it is worse to classify a customer as good when they are bad (cost 5) than to classify a customer as bad when they are good (cost 1). We can express this as the cost matrix below:

                 Predicted good   Predicted bad
  Actual good          0                1
  Actual bad           5                0

To handle this problem, we use an Execute R Script module to replicate each high-risk example five times, while keeping the low-risk examples unchanged.
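As a rough illustration of the cost matrix and the replication step, the following Python sketch totals the misclassification cost of a set of predictions and oversamples high-risk rows. The label strings "good" and "bad", the row layout, and the function names are assumptions made for this example; the experiment itself performs the replication in an Execute R Script module.

```python
# Cost of each (actual, predicted) pair, following the dataset author's note.
# The label strings "good" and "bad" are illustrative assumptions.
COST = {
    ("good", "good"): 0,
    ("good", "bad"): 1,   # a good customer is rejected
    ("bad", "good"): 5,   # a bad customer is accepted
    ("bad", "bad"): 0,
}


def total_cost(actual, predicted):
    """Sum the cost of every (actual, predicted) pair."""
    return sum(COST[(a, p)] for a, p in zip(actual, predicted))


def replicate_high_risk(rows, is_high_risk, copies=5):
    """Return rows with each high-risk row present `copies` times."""
    out = []
    for row in rows:
        out.extend([row] * (copies if is_high_risk(row) else 1))
    return out


actual = ["good", "bad", "bad", "good"]
predicted = ["good", "good", "bad", "bad"]
print(total_cost(actual, predicted))  # 0 + 5 + 0 + 1 = 6

rows = [("applicant_a", "good"), ("applicant_b", "bad")]
print(len(replicate_high_risk(rows, lambda r: r[1] == "bad")))  # 1 + 5 = 6
```

Replicating the high-risk rows means that a learner which simply minimizes the number of errors sees a missed high-risk customer as roughly five ordinary mistakes.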

III. Experimental Setup

Interpreting the confusion matrix

In this experiment, we mark high-risk customers as Positive and low-risk customers as Negative, so the confusion matrix is read as follows:

  • True Positive (TP): We predict that person has high credit risk, and they are actually high
  • True Negative (TN): We predict that person has low credit risk, and they are actually low
  • False Positive (FP): We predict that person has high credit risk, and they are actually low
  • False Negative (FN): We predict that person has low credit risk, and they are actually high 
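The four cells above can be counted directly from paired labels. This is a minimal Python sketch; the label strings "high" and "low" and the function name are assumptions for illustration, not names used in the experiment.

```python
def confusion_counts(actual, predicted, positive="high"):
    """Count TP, TN, FP and FN, treating `positive` as the high-risk class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # predicted high, actually high
        elif p == positive:
            fp += 1      # predicted high, actually low
        elif a == positive:
            fn += 1      # predicted low, actually high
        else:
            tn += 1      # predicted low, actually low
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


actual = ["high", "high", "low", "low", "high"]
predicted = ["high", "low", "low", "high", "high"]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 1, 'FP': 1, 'FN': 1}
```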

Target indicator

The critical case is when we predict that a person has low credit risk but their risk is actually high (a false negative). This misclassification allows applicants with bad credit to receive a loan and can cause losses for the bank. Consequently, our performance metrics in this paper are accuracy (as high as possible) and the number of false negatives (as low as possible).

Metric of Performance

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the share of correct predictions among all instances. This indicator represents the model’s overall ability to predict correctly.

Precision = TP / (TP + FP)

The ratio of true positives to all positive predictions. This indicator is important when we need the cases flagged as positive to be truly positive.

Recall = TP / (TP + FN)

The ratio of true positives to all actual positive cases that should have been predicted.
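The three metrics can be computed from the confusion-matrix counts as in this short Python sketch. The counts in the example are made-up numbers chosen only to show the arithmetic; they are not results from the experiment.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 40, 45, 10, 5
print(accuracy(tp, tn, fp, fn))  # 85 / 100
print(precision(tp, fp))         # 40 / 50
print(recall(tp, fn))            # 40 / 45
```

Because recall has FN in its denominator, lowering the number of false negatives raises recall, which is why recall tracks the critical case in this problem.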

IV. Algorithmic Implementation

Two-Class Support Vector Machine

A “Support Vector Machine” (SVM) can be used for both classification and regression problems, but it is most often used for classification. The algorithm plots each data item as a point in n-dimensional space (where n is the number of features in the dataset), with the value of each feature giving the coordinate in that dimension. The algorithm then finds a hyperplane that separates the dataset into classes. The optimal hyperplane is determined by the points of each class that lie closest to it, known as the support vectors.
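To make the geometry concrete, here is a minimal Python sketch of how a linear hyperplane classifies points and which points lie closest to it. The weights `w`, the bias `b`, and the sample points are hypothetical values chosen by hand; the sketch does not train an SVM and is not the implementation behind the Two-Class Support Vector Machine module.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign tells which side of the hyperplane x is on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def distance_to_hyperplane(w, b, x):
    """Geometric distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm


def closest_points(w, b, points):
    """Points at minimum distance from the hyperplane (the support vectors)."""
    dists = [distance_to_hyperplane(w, b, x) for x in points]
    d_min = min(dists)
    return [x for x, d in zip(points, dists) if math.isclose(d, d_min)]


# Hypothetical hyperplane x1 + x2 - 3 = 0 in a two-feature space.
w, b = [1.0, 1.0], -3.0
points = [(0, 0), (1, 1), (2, 2), (4, 3)]
print([predict(w, b, p) for p in points])  # [-1, -1, 1, 1]
print(closest_points(w, b, points))        # [(1, 1), (2, 2)]
```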

Two-Class Decision Forest

This algorithm is designed for classification problems. Instead of relying on a single decision tree, it builds multiple randomized trees and then combines their predictions (for example, by voting) into a final result. Each decision tree in the forest considers a random subset of features when forming questions and only has access to a random set of the training data points. This random selection helps overcome a single tree’s tendency to overfit, trading a little bias for good speed and accuracy.
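The two sources of randomness described above (a random sample of rows and a random subset of features per tree) can be sketched in Python as follows. To keep the example short, each "tree" is only a one-split decision stump, and the toy data, function names, and parameter defaults are all assumptions; this is a simplified illustration, not the Two-Class Decision Forest module's algorithm.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, features):
    """Pick the (feature, threshold) among `features` with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no split possible: always predict the majority label
        label = majority(y)
        return (features[0], float("inf"), label, label)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def train_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample of rows
        feats = rng.sample(range(len(X[0])), n_features)      # random feature subset
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_stump(s, row) for s in forest])


# Toy data: class 0 clusters near the origin, class 1 near (5, 5).
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(X, y)
print(predict_forest(forest, [0.5, 0.5]), predict_forest(forest, [5.5, 5.5]))
```

Any single stump trained on an unlucky sample can be wrong, but the vote across many stumps is much harder to mislead, which is the intuition behind the reduced overfitting.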


Two-Class Neural Network

Because the dataset is labeled, we can use this powerful algorithm to solve a classification problem. The inputs are fully connected to one or more hidden layers, which then feed an output layer that predicts the outcome. Often, one hidden layer is enough to solve the problem. For complex tasks such as voice recognition or weather prediction, however, a deep neural network with many hidden layers is used.
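A forward pass through a network with one fully connected hidden layer can be sketched in a few lines of Python. The layer sizes, the tanh and sigmoid activations, and the weight values below are assumptions chosen for illustration; the configuration used by the Two-Class Neural Network module is not specified in this text.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """One hidden layer (tanh) followed by a single sigmoid output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4]]
b1 = [0.0, -0.1]
W2 = [1.2, -0.7]
b2 = 0.05

p_high_risk = forward([0.3, -0.8], W1, b1, W2, b2)
print(p_high_risk)  # a value between 0 and 1, read as a class probability
```

Training consists of adjusting `W1`, `b1`, `W2`, and `b2` so that this output matches the labels; a deep network simply stacks more hidden layers between input and output.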


Two-Class Boosted Tree

Microsoft has a very simple and easy explanation for this: “A boosted decision tree is an ensemble learning method in which the second tree corrects for the errors of the first tree, the third tree corrects for the errors of the first and second trees, and so forth. Predictions are based on the entire ensemble of trees together that makes the prediction.”
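The idea of each tree correcting the errors of the trees before it can be sketched in Python as follows. This example fits one-split stumps to squared-error residuals on a single feature, which is an assumption made to keep the code short; it is not the algorithm used inside the Two-Class Boosted Decision Tree module, and the data, function names, and learning rate are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single split on a 1-D feature (needs at least two distinct x values).

    Returns (threshold, left_mean, right_mean) minimizing squared error.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump is fitted to what the current ensemble still gets wrong."""
    preds = [0.0] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        tree = (t, learning_rate * lm, learning_rate * rm)
        trees.append(tree)
        preds = [p + (tree[1] if x <= t else tree[2]) for x, p in zip(xs, preds)]
    return trees


def predict(trees, x):
    """The ensemble prediction is the sum of all stump outputs."""
    return sum(left if x <= t else right for t, left, right in trees)


xs = [1, 2, 3, 4]
ys = [0, 0, 1, 1]   # 1 = high risk in this toy example
trees = boost(xs, ys)
print(predict(trees, 1), predict(trees, 4))  # close to 0 and close to 1
```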

V. Setup

Process data

The data comes with generic column headings (Col1, Col2, Col3, etc.), so we add an Edit Metadata module to set meaningful column names. In the properties of the Edit Metadata module:

  • All column headings: From Col1 to Col21
  • Data Type: Unchanged
  • Categorical: Unchanged
  • Fields: Unchanged
  • New Column Names: Status of checking account, Duration (in months), Credit History, Purpose, Credit amount, Savings account/bond, Present employment since, Installment rate in percentage of Disposable Income, Personal status and sex, Other debtors/guarantors, Present residence since, Property, Age in years, Other installment plans, Housing, Number of existing credits at this bank, Job, Number of people being liable to provide maintenance for, Telephone, Foreign worker, Credit risk

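Now we handle the cost matrix by connecting an “Execute R Script” module to Edit Metadata. In the Properties pane, enter the following R script:

# Read the dataset from the first input port
dataset1 <- maml.mapInputPort(1)
# Column 21 is "Credit risk": 1 = low risk (good), 2 = high risk (bad), per the UCI coding
data.set <- dataset1[dataset1[, 21] == 1, ]
pos <- dataset1[dataset1[, 21] == 2, ]
# Append the high-risk rows five times; the low-risk rows stay unchanged
for (i in 1:5) data.set <- rbind(data.set, pos)
# Send the result to the output port
maml.mapOutputPort("data.set")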


Some numeric features differ greatly in scale, so we use the Normalize Data module to rescale all numeric features with the hyperbolic tangent (TanH) transformation. We exclude “Credit risk”: although it is stored as a number, it is the column to be predicted.

Normalization is a technique often applied as part of data preparation for machine learning. Its goal is to bring the values of numeric columns onto a common scale without distorting differences in the ranges of values or losing information.

We use TanH instead of Z-score because tanh estimators are considered to be an efficient and robust normalization technique. TanH is less sensitive to outliers and is also reported to converge faster than Z-score normalization. It yields values between -1 and 1.

  • Transformation method: TanH
  • Column type: All Numeric, Exclude: Credit risk
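The difference between the two normalization methods can be illustrated with a small Python sketch. The exact formula used by the Normalize Data module is not given in this text, so the sketch assumes one common variant: standardize each value, then apply tanh. The sample credit amounts are made-up numbers.

```python
import math
import statistics


def zscore(values):
    """Standardize: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def tanh_scale(values):
    """Assumed variant: tanh applied to the standardized values, giving (-1, 1)."""
    return [math.tanh(z) for z in zscore(values)]


# Hypothetical credit amounts with one large outlier.
amounts = [250, 1200, 1800, 2400, 18000]
print([round(z, 3) for z in zscore(amounts)])
print([round(t, 3) for t in tanh_scale(amounts)])
```

With Z-score the outlier stays far from the other values, while tanh compresses it toward 1, so a single extreme amount has less influence on the scaled feature.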

Next, we add the Summarize Data and Split Data modules to the experiment, setting “Fraction of rows in the first output dataset” to 0.7. With this setting, 70% of the rows go to the left output for training and the remaining 30% are used for testing.
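Outside of Studio, the same 70/30 split can be sketched in Python as below. The sketch assumes a randomized split with a fixed seed; the function name, the seed value, and the placeholder rows are hypothetical.

```python
import random


def split_rows(rows, fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut it into two parts at `fraction`."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * fraction))
    return shuffled[:cut], shuffled[cut:]


rows = list(range(1000))          # placeholder for the dataset rows
train, test = split_rows(rows)
print(len(train), len(test))      # 700 300
```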

Train model and evaluate result

Then we add the Train Model module, which must have “Credit risk” set as the label column to predict, and connect the learner we want to train. In our project, we start with the Two-Class Support Vector Machine (SVM). The remaining modules are set up as in the proposed diagram.

To compare with the other algorithms, we keep the same setup and replace the SVM with each of the others in turn.

Optimize result (Optional)

To find the best parameters for the algorithm, we use Tune Model Hyperparameters to run a grid search over combinations of settings.

Because of the nature of risk prediction, we focus on F-score as the classification metric when optimizing the parameters. The regression metric is set to Coefficient of determination, which should only matter for regression models.
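A grid search scored by F-score can be sketched in Python as follows. The parameter names `learning_rate` and `n_trees`, the grid values, and the toy scoring function are all hypothetical; in a real run, `evaluate` would train a model with the given parameters and return its F-score on validation data.

```python
from itertools import product


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def grid_search(param_grid, evaluate):
    """Try every combination in the grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for 'train, validate, return F-score'; peaks at (0.2, 100)."""
    return -(params["learning_rate"] - 0.2) ** 2 - (params["n_trees"] - 100) ** 2 / 1e4


grid = {"learning_rate": [0.1, 0.2, 0.4], "n_trees": [20, 100, 500]}
print(grid_search(grid, toy_evaluate)[0])  # {'learning_rate': 0.2, 'n_trees': 100}
print(f_score(tp=40, fp=10, fn=5))         # 80 / 95
```

The cost of a full grid grows with the product of the list lengths (here 3 x 3 = 9 runs), which is why the ranges are usually kept small.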

VI. Result

By adjusting the threshold, our aim is to keep the number of false negatives unchanged while increasing accuracy as much as possible. This is the best result we have achieved so far. With only 13 false positive predictions, we will deploy the Boosted Decision Tree model to production, for example in a mobile or web app.
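The threshold adjustment can be sketched in Python as a search for the most accurate threshold that does not exceed a given number of false negatives. The scores, labels, candidate thresholds, and function names are made up for illustration and are not the experiment's results.

```python
def evaluate_threshold(scores, actual_high, threshold):
    """Return (accuracy, false_negatives) when score >= threshold means high risk."""
    tp = tn = fp = fn = 0
    for score, is_high in zip(scores, actual_high):
        predicted_high = score >= threshold
        if predicted_high and is_high:
            tp += 1
        elif predicted_high:
            fp += 1
        elif is_high:
            fn += 1
        else:
            tn += 1
    return (tp + tn) / len(scores), fn


def best_threshold(scores, actual_high, thresholds, max_fn):
    """Most accurate threshold whose false-negative count stays within max_fn."""
    best = None
    for t in thresholds:
        acc, fn = evaluate_threshold(scores, actual_high, t)
        if fn <= max_fn and (best is None or acc > best[1]):
            best = (t, acc, fn)
    return best


# Hypothetical scored probabilities of high risk and the true labels.
scores = [0.95, 0.80, 0.65, 0.40, 0.35, 0.20, 0.10]
actual_high = [True, True, False, True, False, False, False]

print(best_threshold(scores, actual_high, [0.3, 0.5, 0.7], max_fn=0))  # threshold 0.3
print(best_threshold(scores, actual_high, [0.3, 0.5, 0.7], max_fn=1))  # threshold 0.7
```

Raising the threshold removes false positives and so improves accuracy here, but only at the price of an extra false negative, which is the trade-off the result above tries to control.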