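The first library we'll look at is Pdf2Dom. Let's start with the Maven dependencies we need to add to our project:

    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox-tools</artifactId>
        <version>2.0.3</version>
    </dependency>
    <dependency>
        <groupId>net.sf.cssbox</groupId>
        <artifactId>pdf2dom</artifactId>
        <version>1.6</version>
    </dependency>

We're going to use the first dependency to load the selected PDF file. The second dependency is responsible for the conversion itself.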
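The latest versions of pdfbox-tools and pdf2dom can be found on Maven Central.

What's more, we'll use iText to extract the text from a PDF file and POI to create the .docx document.

Let's take a look at the Maven dependencies that we need to include in our project:

    <dependency>
        <groupId>com.itextpdf</groupId>
        <artifactId>itextpdf</artifactId>
        <version>5.5.10</version>
    </dependency>
    <dependency>
        <groupId>com.itextpdf.tool</groupId>
        <artifactId>xmlworker</artifactId>
        <version>5.5.10</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.15</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.15</version>
    </dependency>

The latest versions of iText and Apache POI can also be found on Maven Central.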
PDF and HTML Conversions.

We created a method named generateTxtFromPDF and divided it into three main parts: loading the PDF file, extracting the text, and creating the final file.

Let's start with the loading part:

    File f = new File(filename);
    String parsedText;
    PDFParser parser = new PDFParser(new RandomAccessFile(f, "r"));
    parser.parse();

In order to read a PDF file, we use PDFParser on top of a RandomAccessFile opened with the "r" (read) option. Note that this is PDFBox's own org.apache.pdfbox.io.RandomAccessFile, not the java.io class of the same name. Moreover, we need to call the parser.parse() method, which causes the PDF to be parsed as a stream and populated into the COSDocument object.

Let's take a look at the text extraction part:

    COSDocument cosDoc = parser.getDocument();
    PDFTextStripper pdfStripper = new PDFTextStripper();
    PDDocument pdDoc = new PDDocument(cosDoc);
    parsedText = pdfStripper.getText(pdDoc);

In the first line, we save the COSDocument in the cosDoc variable. It is then used to construct a PDDocument, which is the in-memory representation of the PDF document. Finally, we use PDFTextStripper to return the raw text of the document. After all of those operations, we need to call the close method to close all the streams we used.

In the last part, we save the text into a newly created file using a simple Java PrintWriter:

    PrintWriter pw = new PrintWriter("src/output/pdf.txt");
    pw.print(parsedText);
    pw.close();

Please note that you cannot preserve formatting in a plain text file because it contains text only.

Converting text files to PDF is a bit tricky.
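Before turning to that, the file-writing step of generateTxtFromPDF can be sketched so that the writer is always closed, even when an exception is thrown. This is a minimal illustration using only the JDK. The class name TextFileWriter and the helper saveText are hypothetical and not part of the original example, and the UTF-8 encoding is an assumption.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TextFileWriter {

    // Hypothetical helper (not from the original article): writes the
    // extracted text to the target path and closes the writer automatically.
    static void saveText(String text, Path target) throws IOException {
        // Create the output directory (e.g. src/output) if it is missing.
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // try-with-resources calls pw.close() for us on every exit path.
        try (PrintWriter pw = new PrintWriter(
                Files.newBufferedWriter(target, StandardCharsets.UTF_8))) {
            pw.print(text);
            // PrintWriter swallows write errors, so check for them explicitly.
            if (pw.checkError()) {
                throw new IOException("Failed to write " + target);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder string standing in for the parsedText variable.
        saveText("Sample text extracted from a PDF", Paths.get("src/output/pdf.txt"));
    }
}
```

The same try-with-resources pattern may also be applied to the PDFBox objects that implement Closeable, so that an explicit close call is not needed after the extraction step.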
In order to maintain the file formatting, you'll need to apply additional rules. In the following example, we are not taking the formatting of the file into consideration.

First, we need to define the page size of the PDF file, its version, and the output file. Let's have a look at the code example:

    Document pdfDoc = new Document(PageSize.A4);
    PdfWriter.getInstance(pdfDoc, new FileOutputStream("src/output/txt.pdf"))
      .setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    pdfDoc.open();

In the next step, we'll define the font and also the command that is used to generate a new paragraph.