How to Convert PDF File to Text File Using Python
Converting a PDF to a text document does not require any additional software or a dedicated application; Python is enough.
The process of converting a PDF file into a text document using Python is as follows:
Steps to Convert PDF to Text in Python
Step 1- Install the library
We need only one external package for this: PyPDF2.
PyPDF2 is a Python module used to perform various operations on PDF files, such as extracting document information from a file, merging PDFs, splitting PDFs, overlaying and watermarking pages, and encrypting and decrypting PDF files.
First, we install the library with pip by executing the following command in the command prompt.
pip install PyPDF2
Step 2- Import the installed library
After installing the PyPDF2 module, we import it using the import keyword.
import PyPDF2
Step 3- Open your PDF file to read
Now we open the source PDF file (PACKING SLIP.pdf in this example) by calling the open() function in 'rb' (read binary) mode.
read_pdf = open(r"D:\Practice\PACKING SLIP.pdf", 'rb')
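The call above leaves the file open until it is closed explicitly. A common alternative is a with block, which closes the file automatically. The sketch below uses a small stand-in file (only a PDF header, not a real document) so that it runs anywhere; the path and file name are hypothetical.

```python
import os
import tempfile

# Hypothetical stand-in file: just a PDF header, enough to show binary reading.
pdf_path = os.path.join(tempfile.mkdtemp(), "sample.pdf")
with open(pdf_path, "wb") as f:
    f.write(b"%PDF-1.4\n")

# 'rb' returns raw bytes; PDF is a binary format, so text mode would try to decode it.
with open(pdf_path, "rb") as read_pdf:
    header = read_pdf.read(5)

print(header)           # b'%PDF-'
print(read_pdf.closed)  # True: the with block closed the file
```

Any work that needs the open file, such as creating the reader and extracting text, has to happen inside the with block.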
Step 4- Create a PdfReader object
We create a pdfReader object using the PdfReader class defined in the PyPDF2 module. (Older code calls PdfFileReader() here; that name is deprecated and no longer works in PyPDF2 3.0 and later.)
The pdfReader object reads the file opened in the previous step.
pdfReader = PyPDF2.PdfReader(read_pdf)
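Which reader class is usable depends on the installed PyPDF2 version: recent releases provide PdfReader, while very old ones only have PdfFileReader. The following is a minimal sketch of a version-tolerant helper; the name make_reader is hypothetical, and a stand-in module object is used so the example runs without PyPDF2 or a real PDF.

```python
from types import SimpleNamespace

def make_reader(pdf_module, stream):
    """Create a reader with PdfReader when the module has it, else PdfFileReader."""
    reader_cls = getattr(pdf_module, "PdfReader", None)
    if reader_cls is None:
        # Only very old releases, which lack PdfReader, reach this branch.
        reader_cls = pdf_module.PdfFileReader
    return reader_cls(stream)

# Stand-in for the PyPDF2 module: records which class name was chosen.
fake_module = SimpleNamespace(PdfReader=lambda stream: ("PdfReader", stream))
result = make_reader(fake_module, "stream")
print(result)  # ('PdfReader', 'stream')
```

With the real library the call would be make_reader(PyPDF2, read_pdf).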
Step 5- Loop to get all pages from the PDF
The number of pages in the PDF file is given by len(pdfReader.pages) (numPages in older PyPDF2 versions). Each page is retrieved by its index from pdfReader.pages (getPage(i) in older versions), which returns a page object that we store in the pageObject variable. We want the text from page 1 to page 5, so we use a for loop with the range() function to visit every page in the PDF file.
for i in range(len(pdfReader.pages)):
    pageObject = pdfReader.pages[i]
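Page indexes start at 0, while the text speaks of pages 1 to 5. The sketch below shows one way to convert a 1-based page range into the 0-based indexes the reader expects; the helper name page_indexes is hypothetical.

```python
def page_indexes(first_page, last_page, total_pages):
    """Convert a 1-based inclusive page range to 0-based indexes."""
    if first_page < 1 or last_page > total_pages or first_page > last_page:
        raise ValueError("page range is outside the document")
    return range(first_page - 1, last_page)

print(list(page_indexes(1, 5, 5)))  # [0, 1, 2, 3, 4]
print(list(page_indexes(2, 3, 5)))  # [1, 2]
```

For the whole document this is the same as range(total_pages); the helper only matters when a subset of pages is wanted.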
Step 6- Extract text from each page using the extract_text() method
After getting pageObject, we use its extract_text() method (extractText() in older PyPDF2 versions) to extract the text of that page.
extract_text = pageObject.extract_text()
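Text extraction works page by page, and a page that holds only a scanned image may yield an empty string. The sketch below collects the text of every page and skips empty ones. FakePage and FakeReader are stand-ins written for this example so it runs without PyPDF2; they imitate only the pages attribute and the extract_text() method, and extract_all_text is a hypothetical helper name.

```python
class FakePage:
    """Stand-in for a PyPDF2 page; only extract_text() is imitated."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakeReader:
    """Stand-in for a PyPDF2 reader; only the pages list is imitated."""
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


def extract_all_text(reader):
    """Join the text of every page, skipping pages that yield no text."""
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append(f"--- page {page_number} ---\n{text}")
    return "\n".join(chunks)


demo = extract_all_text(FakeReader(["Invoice 1", "", "Total: 5 items"]))
print(demo)
```

Passing a real reader object instead of FakeReader should work the same way, as long as it exposes pages and extract_text().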
Here is the Complete Code for extracting text from a PDF file using the PyPDF2 module in Python:-
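A minimal sketch that combines Steps 1 to 6 into one function, assuming a PyPDF2 release that provides PdfReader, pages and extract_text(). The function name pdf_to_text, the UTF-8 output encoding and the one-newline-per-page layout are assumptions made for illustration.

```python
def pdf_to_text(pdf_path, txt_path):
    """Write the text of every page in pdf_path to txt_path; return the page count."""
    # Imported here so a missing install only fails when the function is called.
    import PyPDF2

    with open(pdf_path, "rb") as read_pdf:
        pdfReader = PyPDF2.PdfReader(read_pdf)
        total_pages = len(pdfReader.pages)
        with open(txt_path, "w", encoding="utf-8") as out_file:
            for i in range(total_pages):
                pageObject = pdfReader.pages[i]
                # extract_text() can return an empty string for image-only pages.
                out_file.write((pageObject.extract_text() or "") + "\n")
    return total_pages

# Example call (adjust the paths to your own files):
# pages = pdf_to_text(r"D:\Practice\PACKING SLIP.pdf", "PACKINGSLIP.txt")
# print("Total pages:", pages)
```

The function returns the page count so the caller can print it, as the output terminal does in this walkthrough.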
This is the source PDF file location.
The source PDF file we are using is PACKING SLIP.pdf, which will be converted into a text file.
Opening it shows that it is a 5-page document.
The output terminal shows the total number of pages, which is 5, followed by the extracted text.
Extracted data file:-
This is the location where the extracted data is stored, in a text file named PACKINGSLIP.txt.
When we executed the script, it generated the PACKINGSLIP.txt file. Opening this file, we can see that it carries all the content from our source PDF file.
This is the final step to transform our PDF into a text file.
This blog explains how to convert a file from PDF format to text format. We used Python for this purpose because it has a wide variety of libraries and built-in modules that make the work simpler and easier. Using Python is justified since the script automates the conversion and runs the whole process in a single go.