Convert PDF into Excel data

Question

I have a PDF that contains the data below, attached as an image. How can I convert it into a tabular format such as CSV or Excel?

narikkadan · Answer 1 · Oct 3, 2022

To convert data from an image in a PDF to a tabular format using OpenCV and Python, you typically go through these steps:

1. Read the image: Use OpenCV to read the image extracted from the PDF.
2. Preprocess the image: Apply preprocessing such as grayscale conversion and thresholding to make the text easier for OCR to read.
3. OCR (Optical Character Recognition): Use an OCR tool, such as Tesseract OCR, to extract text from the preprocessed image.
4. Parse and structure the data: Parse the extracted text into rows and columns. This usually requires custom code that depends on the layout of the data in the image.
5. Export to CSV/Excel: Finally, use a Python library such as Pandas to export the structured data to a CSV or Excel file.

Here's a basic outline of how you could do this in Python:

import cv2
import pytesseract
import pandas as pd

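# Load the image (cv2.imread returns None instead of raising if the file cannot be read)
image_path = 'path_to_your_image.jpg'
image = cv2.imread(image_path)
if image is None:
    raise FileNotFoundError(f'Could not read image: {image_path}')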

# Preprocess the image (example: convert to grayscale)
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# You can add more preprocessing steps like thresholding here

# Use Tesseract OCR to extract text
text = pytesseract.image_to_string(gray_image)

# Parse the text into a structured format (this part depends on your specific data)
# Example: drop blank lines, then split each remaining line into columns on whitespace
lines = [line for line in text.splitlines() if line.strip()]
data = [line.split() for line in lines]

# Convert the structured data into a DataFrame (shorter rows are padded with None)
df = pd.DataFrame(data)
# Once you know how many columns your table has, name them, e.g.:
# df.columns = ['Column1', 'Column2', 'Column3']

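# Save the DataFrame to a CSV file
df.to_csv('output.csv', index=False)
# For an Excel file instead (requires the openpyxl package):
# df.to_excel('output.xlsx', index=False)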

Please replace 'path_to_your_image.jpg' with the path to your image file and adjust the column names and data parsing logic according to your specific data format.
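As an illustration of adjusting the parsing logic, here is a minimal sketch for tables whose cells contain spaces (for example "New York"). It assumes the OCR text keeps a gap of at least two spaces between columns; Tesseract can be asked to keep such gaps with a config string like `--psm 6 -c preserve_interword_spaces=1` passed to `image_to_string`, though results vary by image. The sample text and the helper names `parse_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
import re

# Hypothetical OCR output: columns are separated by two or more spaces,
# while single spaces occur inside a cell ("New York").
SAMPLE_TEXT = """Name        City        Amount
Alice Roy   New York    1,200
Bob         Paris       950

Carol Diaz  Rome
"""


def parse_rows(text):
    """Split OCR text into rows of cells, padding short rows with ''."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines produced by OCR
        # Split on runs of 2+ whitespace characters so "New York" stays one cell
        rows.append(re.split(r"\s{2,}", line))
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


def rows_to_csv(rows):
    """Serialize rows to a CSV string; cells containing commas get quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = parse_rows(SAMPLE_TEXT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The padded rows can also be passed straight to `pd.DataFrame(rows[1:], columns=rows[0])` if the first row is a header.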

You need Python, OpenCV (opencv-python), Pytesseract (pytesseract), and Pandas (pandas) installed to run this script. Pytesseract is only a Python wrapper, so the Tesseract OCR engine itself must also be installed separately.
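Before running the script, a quick check along these lines can confirm the setup. This is a minimal sketch: the helper name `missing_requirements` is hypothetical, and it assumes the Tesseract program is reachable as `tesseract` on the PATH. If it is installed somewhere else, `pytesseract.pytesseract.tesseract_cmd` can be set to its full path instead, and this check will report a false alarm.

```python
import importlib.util
import shutil

# Import names of the Python packages the script uses
REQUIRED_MODULES = ["cv2", "pytesseract", "pandas"]


def missing_requirements():
    """Return the names of required modules/programs that were not found."""
    missing = [
        name for name in REQUIRED_MODULES
        if importlib.util.find_spec(name) is None
    ]
    # pytesseract runs the external "tesseract" program, so look for it on PATH
    if shutil.which("tesseract") is None:
        missing.append("tesseract (executable)")
    return missing


missing = missing_requirements()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("All requirements found.")
```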

To fine-tune this process, you might need to experiment with different image preprocessing techniques and adjust the data parsing logic to match the layout of your data.
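When splitting plain text does not match the layout, one option is to work from word positions instead. `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)` returns parallel lists that include "text", "left" and "top" for each detected word. The sketch below groups such words into rows by vertical position. The helper name `group_words_into_rows`, the `row_tolerance` value, and the sample dictionary are hypothetical, and the sample is hand-written rather than real OCR output.

```python
def group_words_into_rows(ocr, row_tolerance=10):
    """Group word boxes into rows by their "top" value, each row sorted left to right.

    `ocr` is a dict of parallel lists with the keys "text", "left" and "top".
    A word starts a new row when it sits more than `row_tolerance` pixels
    below the first word of the current row.
    """
    words = [
        (top, left, text.strip())
        for text, left, top in zip(ocr["text"], ocr["left"], ocr["top"])
        if text.strip()  # drop entries that carry no text
    ]
    words.sort()  # top to bottom, then left to right

    rows, current, current_top = [], [], None
    for top, left, text in words:
        if current_top is not None and top - current_top > row_tolerance:
            # Close the current row, ordering its words by x position
            rows.append([t for _, t in sorted(current)])
            current, current_top = [], None
        if current_top is None:
            current_top = top
        current.append((left, text))
    if current:
        rows.append([t for _, t in sorted(current)])
    return rows


# Hand-written sample in the same shape as the OCR dictionary (not real output)
sample = {
    "text": ["Item", "Qty", "", "Pen", "12", "Ink", "3"],
    "left": [10, 200, 0, 12, 198, 11, 201],
    "top": [20, 22, 0, 60, 58, 101, 99],
}
rows = group_words_into_rows(sample)
print(rows)
```

Each word becomes its own cell here, so cells with several words would still need extra logic, such as merging words whose "left" values fall within the same column range.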

To learn more, check the OpenCV Tutorial with Python.