In this blog and the next few ones that will follow, we will analyze a banking domain dataset, which contains several files with details of its customers. This database was prepared by Petr Berka and Marta Sochorova.
The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.
By the time you finish reading this blog, you will have learned:
How to analyze a bank's data to predict the quality of its customers
1. Understanding the data
The Berka dataset was prepared by Petr Berka and Marta Sochorova, and we will refer to it as the Berka Dataset throughout this analysis. As summarized above, it covers over 5,300 clients of a Czech bank, approximately 1,000,000 transactions, close to 700 loans, and nearly 900 issued credit cards.
2. Data cleaning
Before we proceed to the analysis phase, we must understand the raw data. If the raw data needs formatting, we pre-process it to clean it.
Here, I have cleaned the data and re-saved it as a CSV file: the quotes are stripped, the header line is dropped, and the semicolon delimiters are converted to commas.
sed 's/"//g' account.asc | sed -e '1d' | sed 's/;/,/g' > account.csv
Similarly, we can clean the rest of the data.
3. Moving the data to HDFS
hadoop dfs -put bank_project /user/abhay
4. Writing the PIG scripts
Loan data:
Loading and picking only required fields
loan = load '/user/itsupport/data_berka/loan.csv' using PigStorage(',');
loan_fields = foreach loan generate $1 as ac_id, $0 as loan_id, $3 as amount, $6 as status;
Grouping on account Id
grp_loan_ac_id = group loan_fields by $0;
Breaking nested bag to store the fields as tuple
grp_loan_ac_id_flatten = foreach grp_loan_ac_id generate FLATTEN(loan_fields);
Removing the header
filtered_grp = filter grp_loan_ac_id_flatten by $3 != 'status';
Storing the short file back to HDFS
store filtered_grp into '/bank_project/loan_required_out' using PigStorage(',');
Client data:
Note: the date of birth (DOB) is encoded as yymmdd for males and yymm+50dd for females, i.e. 50 is added to the month for a female customer.
To calculate the age (and sex) from this format, a UDF is written and its jar is registered with the PIG execution engine.
REGISTER alljars/pig_age_calculator.jar
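The UDF jar itself is not bundled with this blog (several readers ask for it in the comments below), so here is a minimal plain-Java sketch of the parsing logic such a UDF would wrap. The class and method names and the 1999 reference year are my own assumptions; in a real Pig UDF this logic would sit inside an EvalFunc's exec method.

```java
// Hypothetical re-implementation of the age/sex logic the UDF would need.
// Assumptions: dates are yymmdd (male) or yymm+50dd (female), all birth
// years fall in the 1900s, and age is computed against a fixed reference
// year (the dataset ends in 1998, so 1999 is used here).
public class BirthNumber {

    // Females have 50 added to the month field, so any month > 50 means "F".
    public static String sexOf(String dob) {
        int month = Integer.parseInt(dob.substring(2, 4));
        return month > 50 ? "F" : "M";
    }

    // Age relative to the reference year; month and day ignored for brevity.
    public static int ageIn(int referenceYear, String dob) {
        int birthYear = 1900 + Integer.parseInt(dob.substring(0, 2));
        return referenceYear - birthYear;
    }

    public static void main(String[] args) {
        // 701230 = male born 30 Dec 1970; 706230 = female with the same date
        System.out.println(sexOf("701230") + " " + ageIn(1999, "701230"));
        System.out.println(sexOf("706230") + " " + ageIn(1999, "706230"));
    }
}
```

A real Age_calculator would additionally emit both values as a Pig tuple; the sketch keeps them as separate methods so the rules are easy to test in isolation.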
Loading and picking only required fields
client = load '/user/itsupport/data_berka/client.csv' using PigStorage(',') AS (Client_id:int,dob:chararray,dist_id:int);
client_fields = foreach client generate $0 as client_id, $2 as district_id, $1 as birthday_n_sex;
Grouping on client_id
grp_client_id = group client_fields by $0;
Breaking nested bag to store the fields as tuple
grp_client_flat = FOREACH grp_client_id GENERATE FLATTEN(client_fields);
Removing the header
B = filter grp_client_flat by $2 != 'birth_number';
Calling the Java function to parse the date and find the age and sex of the customer
age = foreach B generate $0, $1, bank.Age_calculator(birthday_n_sex);
store age into '/bank_project/age_required_out' using PigStorage(',');
Transaction data:
Loading and picking only required fields
transaction = load '/user/itsupport/data_berka/transaction.csv' using PigStorage(',') as (trans_id:int,ac_id:int,date:chararray,type:chararray,operation:chararray,amount:int,bal:int,k_sym:chararray,bank:int,account:int);
transaction_fields = foreach transaction generate $1 as ac_id, $2 as date_of_transaction, $3 as transaction_type, $5 as amount, $6 as bal_post_trnsaction;
Picking only the transactions of the last one year (dates are yymmdd, so a year field above 97 means 1998)
filtered_trans = filter transaction_fields by (int)SUBSTRING($1,0,2) > 97;
grp_ac = group filtered_trans by $0;
Summing up the transactions carried out by each user in the last one year
MAX_grp_ac = FOREACH grp_ac GENERATE group, SUM(filtered_trans.$3), SUM(filtered_trans.$4);
store MAX_grp_ac into '/bank_project/transaction_left_bal_required_out' using PigStorage(',');
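The SUBSTRING trick works because the dates are yymmdd strings: the first two characters are the year, and anything above 97 falls in 1998. A plain-Java sketch of that test (the class and method names are invented for illustration):

```java
// Mirrors the Pig filter (int)SUBSTRING($1,0,2) > 97 on a yymmdd date string.
public class TransactionFilter {

    // True when the two-digit year of a yymmdd date is after 97, i.e. 1998.
    public static boolean inLastYear(String yymmdd) {
        return Integer.parseInt(yymmdd.substring(0, 2)) > 97;
    }

    public static void main(String[] args) {
        System.out.println(inLastYear("980115")); // a 1998 transaction
        System.out.println(inLastYear("960115")); // a 1996 transaction
    }
}
```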
Card details data:
card = load '/user/itsupport/data_berka/card.csv' using PigStorage(',');
card_fields = foreach card generate $1 as disposition_id, $2 as card_type;
grp_card_disp_id = group card_fields by $0;
flatten_card = foreach grp_card_disp_id generate FLATTEN(card_fields);
filtered_card = filter flatten_card by card_type != 'type';
store filtered_card into '/bank_project/card_required_out' using PigStorage(',');
District data:
Register the jar that calculates the difference in the unemployment rate between the two consecutive years 95 and 96
REGISTER alljars/pig_substractjar.jar
district = load '/user/itsupport/data_berka/district.csv' using PigStorage(',') AS (dist_id:int,dist_name:chararray,region:chararray,no_inhabs:long,mun_499:int,mun_1999:int,mun_10k:int,mun_more:int,no_of_cities:int,no_of_urban_inhabs:double,avg_sal:int,unemp_95:double,unemp_96:double,entre_ratio:int);
district_fields = foreach district generate $0 as district_id, $1 as district_name, $2 as region, $10 as avg_salary, $11 as unemp_rate_95, $12 as unemp_rate_96, $13 as entrepreneur_per_1000;
grp_dist_id = group district_fields by $0;
MAX_grp_dist = FOREACH grp_dist_id GENERATE group, FLATTEN(district_fields);
B = filter MAX_grp_dist by unemp_rate_95 > 0.0 AND unemp_rate_96 > 0.0;
unem_percentage = foreach B generate $1, district_name, avg_salary, bank.substract(unemp_rate_95,unemp_rate_96), entrepreneur_per_1000;
store unem_percentage into '/bank_project/district_required_out' using PigStorage(',');
Disposition data:
disposition = load '/user/itsupport/data_berka/disposition.csv' using PigStorage(',');
disposition_fields = foreach disposition generate $2 as ac_id, $0 as disposition_id, $3 as disposition_type, $1 as client_id;
grp_disposition_disp_id = group disposition_fields by $1;
flatten_disposition_disp_id = foreach grp_disposition_disp_id generate FLATTEN(disposition_fields);
filtered_disposition_disp_id = filter flatten_disposition_disp_id by disposition_type != 'type';
Joining all the data:
Before we start writing the script, we should design the approach/algorithm to be implemented:
within the last 1 year {
if transaction_amount > 10 lac (1,000,000) and avg_sal > 10k and loan_status == 'A' and (age > 25 and age <= 65)
write to a file called good: more loans can be granted, card can be upgraded
if transaction_amount > 10 lac and avg_sal > 6k and (loan_status == 'A' or loan_status == 'C') and (age >= 25 and age <= 55) and unemployment_rate < 0.80
write to a file called ok: more loans can be granted after completion of the current loan, card can be upgraded after completion of the loan
if avg_sal > 6k and (loan_status == 'B' or loan_status == 'D') and age > 35 and no_of_entrepreneurs > 100
write to a file called risky: no more loans, card must be downgraded
}
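To make the rules concrete, here is a small plain-Java sketch of the same categorisation. The class and method names are invented; for the "ok" band it follows the 150,000 to 1,000,000 transaction range that the Pig filters actually apply, rather than the "> 10 lac" in the prose.

```java
// Invented helper that applies the three rules to one joined customer record.
// Loan status codes (Berka): A = finished OK, B = finished unpaid,
// C = running OK, D = running in debt.
public class CustomerGrade {

    public static String grade(long transactionSum, int avgSalary, String loanStatus,
                               double age, double unempRate, int entrepreneursPer1000) {
        // good: very active, well paid, loan repaid without problems, working age
        if (transactionSum > 1_000_000 && avgSalary > 10_000
                && loanStatus.equals("A") && age > 25 && age <= 65) {
            return "good";
        }
        // ok: moderately active, loan OK (repaid or running), low-unemployment district
        if (transactionSum > 150_000 && transactionSum < 1_000_000 && avgSalary > 6_000
                && (loanStatus.equals("A") || loanStatus.equals("C"))
                && age >= 25 && age <= 55 && unempRate < 0.80) {
            return "ok";
        }
        // risky: loan unpaid or in debt, older client, entrepreneurial district
        if (avgSalary > 6_000 && (loanStatus.equals("B") || loanStatus.equals("D"))
                && age > 35 && entrepreneursPer1000 > 100) {
            return "risky";
        }
        return "unclassified";
    }
}
```

Having the thresholds in one testable method also makes it obvious that the three categories are not exhaustive: a record that matches none of the rules falls through to "unclassified", just as it is stored in none of the three output files by the Pig filters.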
Client_age = load '/bank_project/age_required_out' using PigStorage(',') AS (client_id:int,dist_id:int,age:double,sex:chararray);
card_type = load '/bank_project/card_required_out' using PigStorage(',') AS (disp_id:int,type:chararray);
transaction_sum = load '/bank_project/transaction_left_bal_required_out' using PigStorage(',') AS (ac_id:int,trans_sum:long,bal_sum:long);
loan_status = load '/bank_project/loan_required_out' using PigStorage(',') AS (ac_id:int,loan_id:int,amount:int,status:chararray);
district_info = load '/bank_project/district_required_out' using PigStorage(',') AS (district_id:int,dist_name:chararray,avg_sal:int,unemprate:double,entrepreneur:int);
join_disp_client = join filtered_disposition_disp_id by $3, Client_age by $0;
join_disp_client_card = join join_disp_client by $1, card_type by $0;
join_disp_client_card_district = join join_disp_client_card by $5, district_info by $0;
join_disp_client_card_district_trans_loan = join join_disp_client_card_district by $0, transaction_sum by $0, loan_status by $0;
pick_fields = foreach join_disp_client_card_district_trans_loan generate $0 as ac_id, $2 as disp_type, $6 as age, $7 as sex, $9 as card_type, $11 as dist_name, $12 as avg_sal, $13 as unemp_rate, $14 as no_of_entre, $16 as transaction_sum, $20 as loan_amount, $21 as loan_status;
store pick_fields into '/bank_project/combined_out' using PigStorage(',');
Good = filter pick_fields by $9 > 1000000 AND $6 > 10000 AND $11 == 'A' AND ($2 >= 25.0 AND $2 <= 65.0);
store Good into '/bank_project/VIP_customer' using PigStorage(',');
Normal = filter pick_fields by $9 < 1000000 AND $9 > 150000 AND $6 > 6000 AND ($11 == 'A' OR $11 == 'C') AND ($2 <= 55.0 AND $2 >= 25.0) AND $7 < 0.80;
store Normal into '/bank_project/good_customer' using PigStorage(',');
Risky = filter pick_fields by $6 > 6000 AND ($11 == 'B' OR $11 == 'D') AND $2 > 35.0 AND $8 > 100;
store Risky into '/bank_project/risky_customer' using PigStorage(',');
Now, we have three categories of customers.
The bank can prepare separate plans for three categories as per their business goals.
In the next blog we will analyze the intermediate output of Hadoop, i.e. the output file '/bank_project/combined_out', and build clusters on the variables using R.
Got a question for us? Please mention it in the comments section and we will get back to you.
getting error could not find age calculator.jar
getting error : bank.Age_calculator could not be found
+Lelouch_0, thanks for checking out our blog. The above error occurs if you haven’t created a User Defined Function for calculating the age from the format given in the data and its jar files need to be registered with the PIG execution engine. Hope this helps. Cheers!
Hi,
sed ‘s/”//g’ account.asc | sed -e ‘1d’ |sed ‘s/;/,/g’> acccount.csv is not working.. Can you please suggest anything?
Hey M, thanks for checking out our tutorial! Could you please share the details of the error you are getting so that we can support you better? Cheers!
Hi, the data is helpful for me, but where can I find a dataset for the banking domain?
Hey Raza, thanks for checking out our blog. We have uploaded the data set here: https://drive.google.com/file/d/0B5elLUlnHIK3RDNycnhkYTJ4UkU/view. Cheers!
Hi Abhay, I'm new to Hadoop and am trying this process to get practice. Can I get some ideas on how to create the age calculation logic? I want to see the code logic, not just the jar; I'm new to Java too. Please do me a favor. Thanks in advance.
Hey Peter, thanks for checking out our blog. You can get a basic idea about logic and code from our R blogs here https://www.edureka.co/blog/category/big-data-analytics?s=Data%20Analytics%20with%20R and MapReduce blogs here https://www.edureka.co/blog?s=mapreduce. You can also check out our R tutorials in this playlist https://www.youtube.com/playlist?list=PL9ooVrP1hQOHAPQPcVgnVZjbrbKSckiii and Hadoop tutorials in this playlist: https://www.youtube.com/playlist?list=PL9ooVrP1hQOFrYxqxb0NJCdCABPZNo0pD. Please feel free to write to us if you have any questions. Hope this helps. Cheers!
Hello friends,
I'm trying a few statistical operations to calculate customer potential using the R tool. I need information for a literature survey, such as: how useful is R in the banking domain, the areas where it is aggressively used, and its merits compared to other tools in the banking domain. Are there any papers exclusively on this? The few papers I have browsed do not pinpoint the concept.
Hey Meera, thanks for checking out our blog. Business analytics is a fast growing field and there are many tools available in the market to serve the needs of organizations. The range of analytical software goes from relatively simple statistical tools in spreadsheets (ex-MS Excel) to statistical software packages (ex-KXEN, Statistica) to sophisticated business intelligence suites (ex-SAS, Oracle, SAP, IBM among the big players). Open source tools like R and Weka are also gaining popularity. Besides these, companies develop in-house tools designed for specific purposes.
Merits of R:
- R is the most comprehensive statistical analysis package available. It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data. New technology and ideas often appear first in R.
- The graphical capabilities of R are outstanding, providing a fully programmable graphics language that surpasses most other statistical and graphical packages.
- R is free and open source software, allowing anyone to use and, importantly, to modify it. R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing.
- R has no license restrictions (other than ensuring our freedom to use it at our own discretion), so we can run it anywhere and at any time, and even sell it under the conditions of the license.
- Anyone is welcome to provide bug fixes, code enhancements, and new packages, and the wealth of quality packages available for R is a testament to this approach to software development and sharing.
- R has over 4,800 packages available from multiple repositories specializing in topics like econometrics, data mining, spatial analysis, and bioinformatics.
- R is cross-platform: it runs on many operating systems and different hardware. It is popularly used on GNU/Linux, Macintosh, and Microsoft Windows, running on both 32- and 64-bit processors.
- R plays well with many other tools, importing data, for example, from CSV files, SAS, and SPSS, or directly from Microsoft Excel, Microsoft Access, Oracle, MySQL, and SQLite. It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LaTeX and HTML.
Demerits of R
- R has a steep learning curve: it does take a while to get used to the power of R, but no steeper than for other statistical languages.
- R is not so easy to use for the novice. There are several simple-to-use graphical user interfaces (GUIs) for R that encompass point-and-click interactions, but they generally do not have the polish of the commercial offerings.
- Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. However, some very high-standard books are increasingly plugging the documentation gaps.
- The quality of some packages is less than perfect, although if a package is useful to many people, it will quickly evolve into a very robust product through collaborative efforts.
- Many R commands give little thought to memory management, so R can very quickly consume all available memory. This can be a restriction when doing data mining. There are various solutions, including using 64-bit operating systems that can access much more memory than 32-bit ones.
Here’s a related blog: https://www.edureka.co/blog/clustering-on-bank-data-using-r/. Cheers!
Can I get the jar file of the age calculator?
Hey Jassy, thanks for checking out the blog. To calculate the age from the above date format, a UDF is written and its jar is registered with PIG execution engine.
REGISTER alljars/pig_age_calculator.jar
You can type this command in the terminal and proceed in accordance with the mentioned steps. Hope this helps.
grunt> REGISTER alljars/pig_age_calculator.jar
2016-12-09 12:16:34,002 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 101: file ‘alljars/pig_age_calculator.jar’ does not exist.
Details at logfile: /home/training/Desktop/pig_1481264251880.log
You could achieve a lot of the legwork here in RapidMiner and just incorporate your R scripts as an operator.
Hi Abhay.. what is R and Rbase, Rstudio…?
Hi Suresh, R is a programming language which helps you perform statistical analysis and data visualization on a given dataset. RBase is nothing but a relational database. RStudio gives you a UI to work with R.
I'm confused about the date format for females when writing the UDF. Could you please explain a bit about it?
Hi Prashanth, the date of birth (DOB) information is in the format yymm+50dd (for females) and yymmdd (for males).
Example: 701230 is 30th Dec 1970 for a male customer. The same date for a female customer would be 706230, i.e. 50 is added to the month for a female customer.
Hope this helps!
thanks for your help sir.. :)
Hi Prashanth,
Can you please share the UDF code? It would be a great help!