

Research
Title: The impact on sustainability risk-weighted asset calculations of using a set of machine learning concepts to estimate companies' probability of default.
Abstract
Data analytics combines numerous tools, algorithms, and machine learning concepts in order to analyse patterns in raw and unstructured data and to make predictions from the dataset provided.
The main goal for the given dataset is to generate values for the target variable, probability of default (PD), using both regression and classification. The regression models used are Linear Regression, Decision Tree and Support Vector Regression (SVR). The classification models used are Random Forest Classifier, Decision Tree Classifier and Logistic Regression. For the regression models, the root mean squared error (RMSE) is calculated to check which model fits best and predicts most accurately. The RMSE is 0.047 for Linear Regression, 0.097 for SVR and 0.223 for the Decision Tree. The lower the RMSE, the better the model, so Linear Regression and SVR fit the data better than the Decision Tree.
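To make the RMSE comparison concrete, the following is a minimal sketch of how the metric can be computed and how the lowest value identifies the preferred model. The function name rmse and the sample PD values are illustrative assumptions, not taken from the study; only the three reported RMSE figures come from the text.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical PD values and predictions, for illustration only.
actual_pd = [0.10, 0.20, 0.30, 0.40]
predicted_pd = [0.10, 0.25, 0.30, 0.35]
score = rmse(actual_pd, predicted_pd)

# RMSE values as reported in the abstract; a lower value is better.
reported_rmse = {
    "Linear Regression": 0.047,
    "Support Vector Regression": 0.097,
    "Decision Tree": 0.223,
}
best_regressor = min(reported_rmse, key=reported_rmse.get)
print(best_regressor)
```

Because RMSE is expressed in the same units as PD, the reported values can be read directly as a typical prediction error on the PD scale.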
PD is a continuous numeric variable, which is not directly suitable for classification, because classification here requires a discrete target taking the values 0 or 1. The mean of the PD column was therefore calculated; values below the mean were assigned 0 and values above the mean were assigned 1. In this way, the numeric column was converted into a discrete column with the values 0 or 1.
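The mean-threshold conversion described above can be sketched as follows. The function name binarize_at_mean and the sample values are hypothetical. The text does not say how a value exactly equal to the mean is treated; this sketch assumes it is labelled 0.

```python
from statistics import mean


def binarize_at_mean(values):
    """Label each value 1 if it lies above the column mean, otherwise 0."""
    threshold = mean(values)
    # Assumption: values exactly equal to the mean receive label 0.
    labels = [1 if v > threshold else 0 for v in values]
    return labels, threshold


# Hypothetical PD values, for illustration only.
pd_values = [0.02, 0.05, 0.10, 0.43]
labels, threshold = binarize_at_mean(pd_values)
print(labels, threshold)
```

Since PD distributions are often skewed, a mean threshold can yield unbalanced classes, which is worth keeping in mind when reading accuracy figures.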
Next, the classification models used were Random Forest Classifier, Decision Tree Classifier and Logistic Regression. For classification, accuracy is the metric used to identify the best model: the models achieved 96% (Decision Tree Classifier), 93% (Random Forest Classifier) and 93% (Logistic Regression). The higher the accuracy, the better the model, so the Decision Tree Classifier performs best, although all three models give approximately the same result, fit the data well and predict accurately.
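As a brief illustration of the accuracy metric, the sketch below computes the fraction of correctly predicted labels and selects the model with the highest reported accuracy. The function name accuracy and the sample labels are assumptions for illustration; only the three reported percentages come from the text.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical binary labels, for illustration only.
true_labels = [0, 1, 0, 0, 1]
predicted_labels = [0, 1, 1, 0, 1]
acc = accuracy(true_labels, predicted_labels)

# Accuracies as reported in the abstract; a higher value is better.
reported_accuracy = {
    "Decision Tree Classifier": 0.96,
    "Random Forest Classifier": 0.93,
    "Logistic Regression": 0.93,
}
best_classifier = max(reported_accuracy, key=reported_accuracy.get)
print(best_classifier)
```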
Features such as Period, Exchange name, and Ticker contained categorical values, i.e., values that are not numeric and hence cannot be used directly in regression problems, so these columns were converted using the label encoder.
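A minimal sketch of label encoding is given below. It assumes behaviour similar to scikit-learn's LabelEncoder, which assigns integer codes to the distinct values in sorted order; the text does not state which implementation was used. The function name label_encode and the ticker symbols are hypothetical.

```python
def label_encode(column):
    """Map each distinct category to an integer code, in sorted order."""
    classes = sorted(set(column))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[value] for value in column], mapping


# Hypothetical ticker symbols, for illustration only.
tickers = ["MSFT", "AAPL", "MSFT", "GOOG"]
codes, mapping = label_encode(tickers)
print(codes, mapping)
```

The resulting integers carry an ordering that the original categories do not have, so models that are sensitive to numeric magnitude may read structure into the codes that is not really there.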
Keywords: Sustainability, ESG, Financial performance, Probability of default, Credit risk, Review