I have used PdfFileReader to read a PDF file from the Data Lake. My requirement is to split that PDF into individual pages and write each page back as a separate file to a different folder in HDFS.
To read the file I used the code below, and it works.
from PyPDF2 import PdfFileWriter, PdfFileReader
from io import BytesIO
from hdfs import InsecureClient
client = InsecureClient('http://datalake:50070')
import requests
from json import dumps
client.status("/")
fnames=client.list('/shared/Team5162')
with client.read('/shared/Team5162/DemoCompany/Green Energy Limited.pdf') as reader:
    input_pdf = PdfFileReader(BytesIO(reader.read()))
print(input_pdf.getNumPages())
Now I want to split the PDF and write the pages back. With the code below I am able to create 136 individual files, one per page. However, the files have no content, and I get no error either.
for i in range(input_pdf.getNumPages()):
    out_pdf = PdfFileWriter()
    output = out_pdf.addPage(input_pdf.getPage(i))
    #output = out_pdf.appendPagesFromReader(input_pdf)
    filename = "/shared/Team5162/demopdf/"+"document-page%s.pdf" % i
    with client.write(filename) as writeStream:
        writeStream.write(output)
Could you please comment on what I am missing?
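A likely cause, offered as a guess rather than a confirmed diagnosis: in PyPDF2 1.x, addPage() adds the page to the writer and returns None, so output in the loop above is None and no PDF bytes reach HDFS. The sketch below serializes each one-page writer into a BytesIO buffer and writes those bytes instead. The function name split_pdf_to_hdfs, its parameters, and the overwrite=True flag (added because the empty files already exist) are assumptions for illustration, not part of the original code.

```python
from io import BytesIO


def split_pdf_to_hdfs(input_pdf, client, out_dir, writer_factory):
    """Write each page of input_pdf to its own file under out_dir.

    input_pdf      -- a PyPDF2 PdfFileReader, as in the question
    client         -- an hdfs client such as InsecureClient
    out_dir        -- target HDFS folder
    writer_factory -- callable returning a fresh writer, e.g. PdfFileWriter
    """
    written = []
    for i in range(input_pdf.getNumPages()):
        out_pdf = writer_factory()
        # addPage() returns None; the page is kept inside out_pdf.
        out_pdf.addPage(input_pdf.getPage(i))

        # Serialize the one-page document into memory first.
        buffer = BytesIO()
        out_pdf.write(buffer)

        filename = "%s/document-page%s.pdf" % (out_dir.rstrip("/"), i)
        # overwrite=True is an assumption: it replaces the empty files
        # left behind by earlier runs.
        with client.write(filename, overwrite=True) as write_stream:
            write_stream.write(buffer.getvalue())
        written.append(filename)
    return written
```

With the objects from the question, this would be called as split_pdf_to_hdfs(input_pdf, client, "/shared/Team5162/demopdf", PdfFileWriter). Passing the writer class in as an argument keeps the helper free of a hard PyPDF2 import; a direct PdfFileWriter() call inside the loop would work equally well.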