In this blog and the next few that follow, we will analyze a banking-domain dataset, which contains several files with details of the bank's customers. The database was prepared by Petr Berka and Marta Sochorova.
The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.
By the time you finish reading this blog, you will have learned:
- How to analyze a bank’s data to predict a customer’s quality
- Using this analysis we can categorize a customer into three categories:
- Excellent: customers with a good record with the bank
- Good: customers with average earnings and a good record so far
- Risky: customers who are in debt to the bank or have not repaid their loans on time
- How to write a Pig UDF
- How to connect Hadoop with R
- How to load data from Hadoop into R
How to analyze a bank’s data to predict the customer’s quality
Prerequisites
Software:
- Java installed
- Hadoop installed
- Pig installed
- R-base
- RStudio
- Ubuntu OS
Concepts:
- Hadoop concepts
- Java concepts
- Pig concepts
1. Understanding the data
We will refer to the dataset as the Berka dataset throughout this analysis. As described above, it is a collection of financial information from a Czech bank, and it consists of the following eight files:
- account.asc
- card.asc
- client.asc
- disp.asc
- district.asc
- loan.asc
- order.asc
- trans.asc
2. Data cleaning
Before we proceed to the analysis phase, we must understand the raw data. If the raw data needs formatting, we can pre-process it to clean it up.
Here, I have cleaned the data and re-saved it as a CSV file.
sed 's/"//g' account.asc | sed -e '1d' | sed 's/;/,/g' > account.csv
Similarly, we can clean the rest of the data.
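Rather than repeating the command for each of the eight files, the same cleanup can be applied in one pass. A minimal sketch (assuming all the `.asc` exports sit in the current directory):

```shell
# Clean every Berka .asc export in the current directory:
# drop the header row, strip the double quotes, and convert
# the ';' delimiter to ',' so the result is a plain CSV.
for f in *.asc; do
  [ -e "$f" ] || continue            # skip if no .asc files are present
  sed -e '1d' -e 's/"//g' -e 's/;/,/g' "$f" > "${f%.asc}.csv"
done
```

Each `file.asc` then has a matching `file.csv` ready to be loaded into HDFS.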
3. Moving the data to HDFS
hadoop fs -put bank_project /user/abhay
4. Writing PIG script
Loan data:
Loading and picking only required fields
loan = load '/user/itsupport/data_berka/loan.csv' using PigStorage(',');
loan_fields = foreach loan generate $1 as ac_id, $0 as loan_id, $3 as amount, $6 as status;
Grouping on account Id
grp_loan_ac_id = group loan_fields by $0;
Breaking nested bag to store the fields as tuple
grp_loan_ac_id_flatten = foreach grp_loan_ac_id generate FLATTEN(loan_fields);
Removing the header
filtered_grp = filter grp_loan_ac_id_flatten by $3 != 'status';
Storing the short file back to HDFS
store filtered_grp into '/bank_project/loan_required_out' using PigStorage(',');
Client data:
Note: the date of birth (DOB) is encoded as yymmdd for male clients; for female clients, 50 is added to the month, giving yy(mm+50)dd.
To calculate the age (and sex) from this format, a UDF is written and its jar is registered with the Pig execution engine.
REGISTER alljars/pig_age_calculator.jar
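The parsing logic inside that jar might look like the following plain-Java sketch. The class and method names are illustrative, not the blog's actual code; the real UDF would wrap this logic in a Pig `EvalFunc`, and the reference year 1999 (the last year covered by the dataset) is an assumption:

```java
// Sketch of the birth_number parsing logic behind the hypothetical
// bank.Age_calculator UDF. Berka encodes dates as yymmdd, with 50
// added to the month for female clients.
public class AgeCalculator {
    // Ages are computed relative to 1999, the last year in the dataset.
    static final int DATASET_YEAR = 1999;

    /** Returns "age,sex" for a 6-digit birth_number such as "525711". */
    public static String ageAndSex(String birthNumber) {
        int year  = 1900 + Integer.parseInt(birthNumber.substring(0, 2));
        int month = Integer.parseInt(birthNumber.substring(2, 4));
        String sex = month > 50 ? "F" : "M"; // months 51-62 encode female
        return (DATASET_YEAR - year) + "," + sex;
    }
}
```

So a client with birth_number 525711 is a woman born in July 1952, and the UDF would emit her age and sex as one field pair.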
Loading and picking only required fields
client = load '/user/itsupport/data_berka/client.csv' using PigStorage(',') AS (Client_id:int,dob:chararray,dist_id:int);
client_fields = foreach client generate $0 as client_id, $2 as district_id, $1 as birthday_n_sex;
Grouping on client_id
grp_client_id = group client_fields by $0;
Breaking nested bag to store the fields as tuple
grp_client_flat = FOREACH grp_client_id GENERATE FLATTEN(client_fields);
Removing the header
B = filter grp_client_flat by $2 != 'birth_number';
Calling Java function to parse the date and finding age and sex of the customer
age = foreach B generate $0, $1, bank.Age_calculator(birthday_n_sex);
store age into '/bank_project/age_required_out' using PigStorage(',');
Transaction data:
Loading and picking only required fields
transaction = load '/user/itsupport/data_berka/transaction.csv' using PigStorage(',') as (trans_id:int,ac_id:int,date:chararray,type:chararray,operation:chararray,amount:int,bal:int,k_sym:chararray,bank:int,account:int);
transaction_fields = foreach transaction generate $1 as ac_id, $2 as date_of_transaction, $3 as transaction_type, $5 as amount, $6 as bal_post_trnsaction;
Picking only the transactions of the last one year (dates with year greater than 97)
filtered_trans = filter transaction_fields by (int)SUBSTRING($1,0,2) > 97;
grp_ac = group filtered_trans by $0;
Summing up the transactions carried out by each user in the last one year
MAX_grp_ac = FOREACH grp_ac GENERATE group, SUM(filtered_trans.$3), SUM(filtered_trans.$4);
store MAX_grp_ac into '/bank_project/transaction_left_bal_required_out' using PigStorage(',');
Card details data:
card = load '/user/itsupport/data_berka/card.csv' using PigStorage(',');
card_fields = foreach card generate $1 as disposition_id, $2 as card_type;
grp_card_disp_id = group card_fields by $0;
flatten_card = foreach grp_card_disp_id generate FLATTEN(card_fields);
filtered_card = filter flatten_card by card_type != 'type';
store filtered_card into '/bank_project/card_required_out' using PigStorage(',');
District data:
Register the jar that calculates the difference in unemployment between the two consecutive years 95 and 96
REGISTER alljars/pig_substractjar.jar
district = load '/user/itsupport/data_berka/district.csv' using PigStorage(',') AS (dist_id:int,dist_name:chararray,region:chararray,no_inhabs:long,mun_499:int,mun_1999:int,mun_10k:int,mun_more:int,no_of_cities:int,no_of_urban_inhabs:double,avg_sal:int,unemp_95:double,unemp_96:double,entre_ratio:int);
district_fields = foreach district generate $0 as district_id, $1 as district_name, $2 as region, $10 as avg_salary, $11 as unemp_rate_95, $12 as unemp_rate_96, $13 as entrepreneur_per_1000;
grp_dist_id = group district_fields by $0;
MAX_grp_dist = FOREACH grp_dist_id GENERATE group, FLATTEN(district_fields);
B = filter MAX_grp_dist by unemp_rate_95 > 0.0 AND unemp_rate_96 > 0.0;
unem_percentage = foreach B generate $1, district_name, avg_salary, bank.substract(unemp_rate_95, unemp_rate_96), entrepreneur_per_1000;
store unem_percentage into '/bank_project/district_required_out' using PigStorage(',');
Disposition data:
disposition = load '/user/itsupport/data_berka/disposition.csv' using PigStorage(',');
disposition_fields = foreach disposition generate $2 as ac_id, $0 as disposition_id, $3 as disposition_type, $1 as client_id;
grp_disposition_disp_id = group disposition_fields by $1;
flatten_disposition_disp_id = foreach grp_disposition_disp_id generate FLATTEN(disposition_fields);
filtered_disposition_disp_id = filter flatten_disposition_disp_id by disposition_type != 'type';
Joining all the data :-
Before we start writing the script, we should design the approach/algorithm to be implemented:
- Join loan, transaction, account, and disposition on ac_id as ac_id_join
- Join ac_id_join, district_info, and client on district_id as include_district
- Join include_district and card on disposition_id as join_done
- Select loan_amount, loan_duration, loan_status, type, transaction_amount, date, owner_type, district_name, region, avg_salary, unemployment_rate_95, unemployment_rate_96, no_of_entrepreneur/1000, card_type, birthday
- Algorithm used to predict excellent, good and risky customers:
within 1 year {
    if transaction_amount > 10 lac (1,000,000) and avg_sal > 10k and loan_status == 'A' and (age >= 25 and age <= 65)
        write to the "good" file: more loans can be granted; the card can be upgraded
    if transaction_amount between 1.5 lac and 10 lac and avg_sal > 6k and (loan_status == 'A' or loan_status == 'C') and (age >= 25 and age <= 55) and unemployment_rate < 0.80
        write to the "ok" file: more loans can be granted after completion of the current loan; the card can be upgraded after completion of the loan
    if avg_sal > 6k and (loan_status == 'B' or loan_status == 'D') and age > 35 and no_of_entrepreneur > 100
        write to the "risky" file: no more loans; the card must be downgraded
}
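These rules map one-to-one onto the Pig filters that follow. As a plain-Java restatement (a sketch only: the class and method names are illustrative, and the thresholds are taken from the script's filter conditions):

```java
// Plain-Java restatement of the customer categorization rules.
// Thresholds mirror the Pig filters; names are illustrative.
public class CustomerClassifier {
    public static String categorize(long transactionSum, int avgSalary,
                                    String loanStatus, double age,
                                    double unempDiff, int entrepreneurs) {
        if (transactionSum > 1_000_000 && avgSalary > 10_000
                && "A".equals(loanStatus) && age >= 25 && age <= 65) {
            return "excellent"; // more loans may be granted, card upgradable
        }
        if (transactionSum > 150_000 && transactionSum < 1_000_000
                && avgSalary > 6_000
                && ("A".equals(loanStatus) || "C".equals(loanStatus))
                && age >= 25 && age <= 55 && unempDiff < 0.80) {
            return "good";      // upgrades after the current loan completes
        }
        if (avgSalary > 6_000
                && ("B".equals(loanStatus) || "D".equals(loanStatus))
                && age > 35 && entrepreneurs > 100) {
            return "risky";     // no more loans, downgrade the card
        }
        return "uncategorized";
    }
}
```

Customers matching none of the three rules are simply left uncategorized, which is also what the Pig filters do implicitly.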
Client_age = load '/bank_project/age_required_out' using PigStorage(',') AS (client_id:int,dist_id:int,age:double,sex:chararray);
card_type = load '/bank_project/card_required_out' using PigStorage(',') AS (disp_id:int,type:chararray);
transaction_sum = load '/bank_project/transaction_left_bal_required_out' using PigStorage(',') AS (ac_id:int,trans_sum:long,bal_sum:long);
loan_status = load '/bank_project/loan_required_out' using PigStorage(',') AS (ac_id:int,loan_id:int,amount:int,status:chararray);
district_info = load '/bank_project/district_required_out' using PigStorage(',') AS (district_id:int,dist_name:chararray,avg_sal:int,unemprate:double,entrepreneur:int);
join_disp_client = join filtered_disposition_disp_id by $3, Client_age by $0;
join_disp_client_card = join join_disp_client by $1, card_type by $0;
join_disp_client_card_district = join join_disp_client_card by $5, district_info by $0;
join_disp_client_card_district_trans_loan = join join_disp_client_card_district by $0, transaction_sum by $0, loan_status by $0;
pick_fields = foreach join_disp_client_card_district_trans_loan generate $0 as ac_id, $2 as disp_type, $6 as age, $7 as sex, $9 as card_type, $11 as dist_name, $12 as avg_sal, $13 as unemp_rate, $14 as no_of_entre, $16 as transaction_sum, $20 as loan_amount, $21 as loan_status;
store pick_fields into '/bank_project/combined_out' using PigStorage(',');
Good = filter pick_fields by $9 > 1000000 AND $6 > 10000 AND $11 == 'A' AND ($2 >= 25.0 AND $2 <= 65.0);
store Good into '/bank_project/VIP_customer' using PigStorage(',');
Normal = filter pick_fields by $9 < 1000000 AND $9 > 150000 AND $6 > 6000 AND ($11 == 'A' OR $11 == 'C') AND ($2 >= 25.0 AND $2 <= 55.0) AND $7 < 0.80;
store Normal into '/bank_project/good_customer' using PigStorage(',');
Risky = filter pick_fields by $6 > 6000 AND ($11 == 'B' OR $11 == 'D') AND $2 > 35.0 AND $8 > 100;
store Risky into '/bank_project/risky_customer' using PigStorage(',');
Now, we have three categories of customers.
The bank can prepare separate plans for three categories as per their business goals.
In the next blog, we will analyze the intermediate output of Hadoop, i.e., the output file “/bank_project/combined_out”, and build clusters on the variables using R.
Got a question for us? Please mention it in the comments section and we will get back to you.