Microsoft Certified Azure Data Engineer Assoc ...
- 14k Enrolled Learners
- Weekend
- Live Class
In this blog and the next few ones that will follow, we will analyze a banking domain dataset, which contains several files with details of its customers. This database was prepared by Petr Berka and Marta Sochorova.
The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.
By the time you finish reading this blog, you would have learned :
How to analyze a bank’s data to predict the customer’s quality
Prerequisite
Software Technology
1. Understanding the data
The berka dataset was prepared by Petr Berka and Marta Sochorova. We will refer to the dataset as the Berka Dataset throughout this analysis. The Berka Dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.
2. Data cleaning
Before we proceed to analysis phase we must understand the raw data. If the raw data needs formatting, we can do pre-processing to clean it.
Here, I have cleaned the data and re-saved as a CSV file.
sed 's/"//g' account.asc | sed -e '1d' |sed 's/;/,/g'> acccount.csv
Similarly, we can clean the rest of the data.
3. Moving the data to HDFS
hadoop dfs -put bank_project /user/abhay
4. Writing PIG script
Loan data: –
Loading and picking only required fields
loan = load '/user/itsupport/data_berka/loan.csv' using PigStorage(',') ; loan_fields = foreach loan generate $1 as ac_id,$0 as loan_id,$3 as amount,$6 as status;
Grouping on account Id
grp_loan_ac_id = group loan_fields by $0;
Breaking nested bag to store the fields as tuple
grp_loan_ac_id_flatten = foreach grp_loan_ac_id generate FLATTEN(loan_fields);
Removing the header
filtered_grp = filter grp_loan_ac_id_flatten by $3 != 'status';
Storing the short file back to HDFS
store filtered_grp into '/bank_project/loan_required_out' using PigStorage(','); Client data: -
Note : The date of birth (DOB) information is in the format yymm+50dd(for female)yymmdd(for male)
For calculating the age from the above date format a UDF is written and its jar is registered with PIG execution engine.
REGISTER alljars/pig_age_calculator.jar
Loading and picking only required fields
client = load '/user/itsupport/data_berka/client.csv' using PigStorage(',') AS (Client_id:int,dob:chararray,dist_id:int) ; client_fields = foreach client generate $0 as client_id,$2 as district_id,$1 as birthday_n_sex;
Grouping on client_id
grp_client_id = group client_fields by $0;
Breaking nested bag to store the fields as tuple
grp_client_flat = FOREACH grp_client_id GENERATE FLATTEN(client_fields);
Removing the header
B = filter grp_client_flat by $2 != 'birth_number;
Calling Java function to parse the date and finding age and sex of the customer
age = foreach B generate $0,$1, bank.Age_calculator(birthday_n_sex) ; store age into '/bank_project/age_required_out' using PigStorage(',');
Transaction data:
Loading and picking only required fields
transaction = load '/user/itsupport/data_berka/transaction.csv' using PigStorage(',') as (trans_id:int,ac_id:int,date:chararray,type:chararray,operation:chararray,amount:int,bal:int,k_sym:chararray,bank:int,account:int); transaction_fields = foreach transaction generate $1 as ac_id,$2 as date_of_transaction,$3 as transaction_type,$5 as amount,$6 as bal_post_trnsaction;
Picking only transaction of last one year
filtered_trans = filter transaction_fields by (int)SUBSTRING($1,0,2) > 97; grp_ac = group filtered_trans by $0;
Summing up the transaction carried out by the user in last one year
MAX_grp_ac = FOREACH grp_ac GENERATE group, SUM(filtered_trans.$3),SUM(filtered_trans.$4); store MAX_grp_ac into '/bank_project/transaction_left_bal_required_out' using PigStorage(',');
Card details data:-
card = load '/user/itsupport/data_berka/card.csv' using PigStorage(',') ; card_fields = foreach card generate $1 as disposition_id,$2 as card_type; grp_card_disp_id = group card_fields by $0; flatten_card = foreach grp_card_disp_id generate FLATTEN(card_fields); filtered_card = filter flatten_card by card_type != 'type'; store filtered_card into '/bank_project/card_required_out' using PigStorage(',');
District data:
Register the jar to calculate the difference in unemployment between two consecutive years 95 and 96
REGISTER alljars/pig_substractjar.jar district = load '/user/itsupport/data_berka/district.csv' using PigStorage(',') AS (dist_id:int,dist_name:chararray,region:chararray,no_inhabs:long,mun_499:int,mun_1999:int,mun_10k:int,mun_more:int,no_of_cities:int,no_of_urban_inhabs:double ,avg_sal:int,unemp_95:double,unemp_96:double,entre_ratio:int); district_fields = foreach district generate $0 as district_id,$1 as district_name,$2 as region,$10 as avg_salary,$11 as unemp_rate_95,$12 as unemp_rate_96,$13 as entrepreneur_per_1000; grp_dist_id = group district_fields by $0; MAX_grp_dist = FOREACH grp_dist_id GENERATE group,FLATTEN(district_fields); B = filter MAX_grp_dist by unemp_rate_95 > 0.0 AND unemp_rate_96 > 0.0; unem_percentage = foreach B generate $1, district_name,avg_salary,bank.substract(unemp_rate_95,unemp_rate_96),entrepreneur_per_1000 ; store unem_percentage into '/bank_project/district_required_out' using PigStorage(',');
Disposition data:
disposition = load '/user/itsupport/data_berka/disposition.csv' using PigStorage(',') ; disposition_fields = foreach disposition generate $2 as ac_id,$0 as disposition_id,$3 as disposition_type,$1 as client_id; grp_disposition_disp_id = group disposition_fields by $1; flatten_disposition_disp_id = foreach grp_disposition_disp_id generate FLATTEN(disposition_fields); filtered_disposition_disp_id = filter flatten_disposition_disp_id by disposition_type != 'type';
Joining all the data :-
Before we start writing script we should design our approach/Algorithm which should be implemented:
within 1 year {
if transaction_amount > 10 lac and avg_sal > 10k and loan_status==’A’ and (age >25 and age <=65)
write in a file called good more loan can be granted card can be upgrade
if transaction_amount > 10 lac and avg_sal > 6k and loan_status==’A’ and loan_status==’C’ and (age >25 and age <=55) and unemployment_rate < 0.80
write in a file called ok more loan can be granted after completion of the loan card can be upgraded after completion of the loan
if avg_sal > 6k and loan_status==’B’ and loan_status==’D’ and (age >35) and no_of_entrepreneur>100
write in a file called risky no more loans card must be downgraded
}
Client_age = load '/bank_project/age_required_out' using PigStorage(',') AS (client_id:int,dist_id:int,age:double,sex:chararray); card_type = load '/bank_project/card_required_out' using PigStorage(',') AS (disp_id:int,type:chararray); transaction_sum = load '/bank_project/transaction_left_bal_required_out' using PigStorage(',') AS (ac_id:int,trans_sum:long,bal_sum:long); loan_status = load '/bank_project/loan_required_out' using PigStorage(',') AS (ac_id:int,loan_id:int,amount:int,status:chararray); district_info = load '/bank_project/district_required_out' using PigStorage(',') AS (district_id:int,dist_name:chararray,avg_sal:int,unemprate:double,entrepreneur:int); join_disp_client = join filtered_disposition_disp_id by $3,Client_age by $0; join_disp_client_card = join join_disp_client by $1,card_type by $0; join_disp_client_card_district = join join_disp_client_card by $5,district_info by $0; join_disp_client_card_district_trans_loan = join join_disp_client_card_district by $0,transaction_sum by $0,loan_status by $0; pick_fields = foreach join_disp_client_card_district_trans_loan generate $0 as ac_id,$2 as disp_type,$6 as age,$7 as sex,$9 as card_type,$11 as dist_name,$12 as avg_sal,$13 as unemp_rate,$14 as no_of_entre,$16 as transaction_sum,$20 as loan_amount,$21 as loan_status; store pick_fields into '/bank_project/combined_out' using PigStorage(','); Good = filter pick_fields by $9 > 1000000 AND $6 > 10000 AND $11 == 'A' AND (($2 >= 25.0 AND $2 <=65.0)); store Good into '/bank_project/VIP_customer' using PigStorage(','); Normal = filter pick_fields by $9 < 1000000 AND $9 >150000 AND $6 > 6000 AND ($11=='A' OR $11=='C') AND (($2 <= 55.0) AND ($2 >=25.0)) AND $7 < 0.80; store Normal into '/bank_project/good_customer' using PigStorage(','); Risky = filter pick_fields by $6 > 6000 AND ($11 == 'B' OR $11 == 'D') AND $2 > 35.0 AND $8 > 100; store Risky into '/bank_project/risky_customer' using PigStorage(',');
Now, we have three categories of customers.
The bank can prepare separate plans for three categories as per their business goals.
In the next blog we will analyze the intermediate output of Hadoop i.e. output file “/bank_project/combined_out” and build clusters on the variable using R.
Got a question for us? Please mention it in the comments section and we will get back to you.
Related Posts:
edureka.co
getting error could not find age calculator.jar
getting error : bank.Age_calculator could not be found
+Lelouch_0, thanks for checking out our blog. The above error occurs if you haven’t created a User Defined Function for calculating the age from the format given in the data and its jar files need to be registered with the PIG execution engine. Hope this helps. Cheers!
Hi,
sed ‘s/”//g’ account.asc | sed -e ‘1d’ |sed ‘s/;/,/g’> acccount.csv is not working.. Can you please suggest anything?
Hey M, thanks for checking out our tutorial! Could you please share the details of the error you are getting so that we can support you better? Cheers!
Hy data is helpful for me but can i found a datset of Banking Domain??
Hey Raza, thanks for checking out our blog. We have uploaded the data set here: https://drive.google.com/file/d/0B5elLUlnHIK3RDNycnhkYTJ4UkU/view. Cheers!
Hi Abhay i m new to hadoop i m trying to do this process to get practice .can get some ideas how to create age calculation logic and code as well not as jar i want to see the code logic, also m new to java too.. please do me a favor.. thanks in advance
Hey Peter, thanks for checking out our blog. You can get a basic idea about logic and code from our R blogs here https://www.edureka.co/blog/category/big-data-analytics?s=Data%20Analytics%20with%20R and MapReduce blogs here https://www.edureka.co/blog?s=mapreduce. You can also check out our R tutorials in this playlist https://www.youtube.com/playlist?list=PL9ooVrP1hQOHAPQPcVgnVZjbrbKSckiii and Hadoop tutorials in this playlist: https://www.youtube.com/playlist?list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD. Please feel free to write to us if you have any questions. Hope this helps. Cheers!
hello friends
Im trying few statistical operations to calculate customer potentiality usimg R tool. I need information for literature survey ,such as , how far “R” tool is of use in bank domain.Areas where it is aggressively used, its.Merits compared to other tools in bank domain.any exclusive papers available on it. Few papers , i browse are not pinpoint to the concept.
Hey Meera, thanks for checking out our blog. Business analytics is a fast growing field and there are many tools available in the market to serve the needs of organizations. The range of analytical software goes from relatively simple statistical tools in spreadsheets (ex-MS Excel) to statistical software packages (ex-KXEN, Statistica) to sophisticated business intelligence suites (ex-SAS, Oracle, SAP, IBM among the big players). Open source tools like R and Weka are also gaining popularity. Besides these, companies develop in-house tools designed for specific purposes.
Merits of R:
R is the most comprehensive statistical analysis package available.It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data. New technology and ideas often appear first in R.
ˆ The graphical capabilities of R are outstanding, providing a fully programmable graphics language that surpasses most other statistical and graphical packages.
ˆ R is free and open source software, allowing anyone to use and, importantly, to modify it. R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing.
ˆ R has no license restrictions (other than ensuring our freedom to use it at our own discretion), and so we can run it anywhere and at any time, and even sell it under the conditions of the license.
ˆ Anyone is welcome to provide bug xes, code enhancements, and new packages, and the wealth of quality packages available for R is a testament to this approach to software development and sharing.
R has over 4800 packages available from multiple repositories specializing in topics like econometrics, data mining, spatial analysis, and bio-informatics.
R is cross-platform. R runs on many operating systems and diff erent hardware. It is popularly used on GNU/Linux, Macintosh, and Microsoft Windows, running on both 32 and 64 bit processors. ˆ
R plays well with many other tools, importing data, for example, from CSV les, SAS, and SPSS, or directly from Microsoft Excel, Microsoft Access, Oracle, MySQL, and SQLite. It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LATEX and HTML.
Demerits of R
R has a steep learning curve it does take a while to get used to the power of R but no steeper than for other statistical languages. ˆ R is not so easy to use for the novice. There are several simple-to use graphical user interfaces (GUIs) for R that encompass point and-click interactions, but they generally do not have the polish of the commercial offerings.
ˆ Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. However, some very high-standard books are increasingly plugging the documentation gaps.
ˆ The quality of some packages is less than perfect, although if a package is useful to many people, it will quickly evolve into a very robust product through collaborative efforts.
ˆ Many R commands give little thought to memory management, and so R can very quickly consume all available memory. This can be a restriction when doing data mining. There are various solutions, including using 64 bit operating systems that can access much more memory than 32 bit ones.
Here’s a related blog: https://www.edureka.co/blog/clustering-on-bank-data-using-r/. Cheers!
can i get jar files of age
Hey Jassy, thanks for checking out the blog. To calculate the age from the above date format, a UDF is written and its jar is registered with PIG execution engine.
REGISTER alljars/pig_age_calculator.jar
You can type this command in the terminal and proceed in accordance with the mentioned steps. Hope this helps.
grunt> REGISTER alljars/pig_age_calculator.jar
2016-12-09 12:16:34,002 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 101: file ‘alljars/pig_age_calculator.jar’ does not exist.
Details at logfile: /home/training/Desktop/pig_1481264251880.log
You could acheive a lot of the leg work here in RapidMiner and just incorporate your R scripts as an operator
Hi Abhay.. what is R and Rbase, Rstudio…?
Hi Suresh, R is a programming language which helps you to perform statistical analysis, data visualization on a given data. RBase in nothing but relational database. Rstudio gives you an UI to work with R.
i’m confused with the date format for the female for writing the UDF.could u please explain a bit about it
Hi Prashanth, the date of birth (DOB) information is in the format yymm+50dd(for female)yymmdd(for male)
Example : 701230 is 30th dec 1970 for a male customer.The same date for female customer would be : 706230 i.e. 50 is added in the month of a female customer.
Hope this helps!
thanks for your help sir.. :)
hi prashanth,
can you please share the udf codes? It would be a great help!