Write a Split PDF Back to HDFS Using the Python InsecureClient

0 votes

I have used PdfFileReader to read a PDF file from the Data Lake. My requirement is to split that PDF into individual pages and write each page as a separate file to a different folder in HDFS.

To read the file I used the code below, and it works.

from io import BytesIO
from json import dumps

import requests
from hdfs import InsecureClient
from PyPDF2 import PdfFileWriter, PdfFileReader

# Connect to the WebHDFS endpoint of the Data Lake
client = InsecureClient('http://datalake:50070')

client.status("/")
fnames = client.list('/shared/Team5162')

# Read the whole PDF into memory and hand it to PyPDF2
with client.read('/shared/Team5162/DemoCompany/Green Energy Limited.pdf') as reader:
    input_pdf = PdfFileReader(BytesIO(reader.read()))
print(input_pdf.getNumPages())


Now I want to split the PDF and write the pages back. With the code below I am able to create 136 individual files, one per page. However, the files have no content embedded, and I get no error either.


for i in range(input_pdf.getNumPages()):
    out_pdf  = PdfFileWriter()
    output   = out_pdf.addPage(input_pdf.getPage(i))
    #output   = out_pdf.appendPagesFromReader(input_pdf)
    filename = "/shared/Team5162/demopdf/"+"document-page%s.pdf" % i
    with client.write(filename) as writeStream:
        writeStream.write(output)


Could you please comment?

Nov 25, 2021 in Python by Kannan
• 120 points
594 views
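
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the PyPDF2 releases that provide PdfFileWriter, addPage does not appear to return the page data, so output in the loop above is most likely None and nothing useful reaches the HDFS stream. A common way around this is to let the writer serialise itself into an in-memory buffer and then send those bytes to HDFS. The function name split_pdf_to_hdfs and its writer_factory parameter are hypothetical names introduced for illustration, and passing overwrite=True is an assumption made so that the empty files created earlier get replaced.

```python
from io import BytesIO


def split_pdf_to_hdfs(client, input_pdf, target_dir, writer_factory):
    """Write each page of input_pdf as its own PDF file under target_dir.

    client         -- an hdfs client such as InsecureClient (assumed)
    input_pdf      -- a PdfFileReader-style object (getNumPages / getPage)
    writer_factory -- callable returning a new writer, e.g. PdfFileWriter
    Returns the list of HDFS paths that were written.
    """
    written = []
    for i in range(input_pdf.getNumPages()):
        out_pdf = writer_factory()
        # Do not keep the return value of addPage; the page now lives in out_pdf.
        out_pdf.addPage(input_pdf.getPage(i))

        # Serialise the one-page document into memory first.
        buffer = BytesIO()
        out_pdf.write(buffer)

        filename = "%s/document-page%s.pdf" % (target_dir, i)
        # overwrite=True is an assumption: it replaces files from earlier runs.
        with client.write(filename, overwrite=True) as write_stream:
            write_stream.write(buffer.getvalue())
        written.append(filename)
    return written
```

With the objects from the question, the call would look like split_pdf_to_hdfs(client, input_pdf, "/shared/Team5162/demopdf", PdfFileWriter). Newer PyPDF2 and pypdf releases rename these classes and methods, so the names may need adjusting for the installed version.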
