Simplified Learning Blog

Learning made easy

Converting PDF to JSON in Java: Top 3 Ways to Code

Posted on February 26, 2024 (updated December 13, 2025) by Govind

Converting PDF to JSON in Java Top 3 ways: A Beginner’s Guide

Introduction:

In today’s digital era, handling data in various formats is crucial for developers. One common task is converting PDF files into JSON, a format widely used for data interchange. In this blog post, we’ll explore how to convert PDF to JSON using Java, a versatile programming language.

Why Convert PDF to JSON?

PDF (Portable Document Format) is excellent for preserving the layout and structure of documents, but extracting data programmatically from PDFs can be challenging. JSON (JavaScript Object Notation), on the other hand, is a lightweight and human-readable data interchange format. Converting PDFs to JSON allows developers to extract and process information more efficiently.

Tools and Libraries:

To make our task of converting PDF to JSON easier, we’ll use a Java library called Apache PDFBox. Apache PDFBox is an open-source Java tool for working with PDF documents. It provides functionalities to extract text and metadata from PDFs, which we can then convert to JSON.

Step 1: Set Up Your Project:

Start by creating a new Java project in your favorite Integrated Development Environment (IDE) or a simple text editor. Make sure to include the Apache PDFBox library in your project. You can add the library manually or use a dependency management tool like Maven or Gradle.

Step 2: Read PDF File:

Use PDFBox to read the content of the PDF file. Open the PDF document and extract the text content. This can be done using the following code snippet:

java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfToJsonConverter {

    public static void main(String[] args) {
        // PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF(file) instead.
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(new File("path/to/your/file.pdf"))) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            String pdfText = pdfTextStripper.getText(document);

            // Now, you have the text content of the PDF in the 'pdfText' variable.
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Step 3: Convert to JSON:

Once you have the text content from the PDF, you can use a third-party JSON library such as JSON-java, Jackson, or Gson to convert the text into JSON format (the Java standard library does not include a JSON API). Here’s a simple example using the JSONObject class from the JSON-java library:

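import org.json.JSONObject;

public class PdfToJsonConverter {

    // Wraps the extracted text in a JSON object under the "pdfText" key
    static String toJson(String pdfText) {
        JSONObject json = new JSONObject();
        json.put("pdfText", pdfText);
        return json.toString();
    }

    public static void main(String[] args) {
        // Placeholder value: in the real program this is the text extracted in Step 2
        String pdfText = "text extracted in Step 2";

        // Print or save the JSON as needed
        System.out.println(toJson(pdfText));
    }
}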

Here is the full code snippet:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.json.JSONObject;

import java.io.File;
import java.io.IOException;

public class PdfToJsonConverter {

    public static void main(String[] args) {
        // Step 1: Load the PDF document (try-with-resources closes it automatically)
        try (PDDocument document = PDDocument.load(new File("path/to/your/file.pdf"))) {

            // Step 2: Extract text content from the PDF
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            String pdfText = pdfTextStripper.getText(document);

            // Step 3: Convert text content to JSON
            JSONObject json = new JSONObject();
            json.put("pdfText", pdfText);

            // Step 4: Print or save the JSON as needed
            System.out.println(json.toString());

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Overview and ways to convert

Converting a PDF (Portable Document Format) to JSON (JavaScript Object Notation) in Java is a common but non-trivial task. The challenge lies in the fundamental difference in their structure:

  • PDF: A fixed-layout format designed for presentation and human readability, focusing on layout, fonts, and positioning. Its content is inherently unstructured or semi-structured.
  • JSON: A text format designed for data exchange, focusing on hierarchical key-value pairs. It is inherently structured.

The conversion process, therefore, requires two distinct steps: Extraction (getting the data out of the PDF) and Serialization (mapping the data into JSON objects).


Step 1: PDF Data Extraction Libraries

Since there is no single library that perfectly converts any PDF’s content directly into a reliable JSON structure, the first step is always parsing the PDF to extract raw text and metadata.

1. Apache PDFBox (Recommended Open-Source)

This is the de facto standard open-source Java library for PDF manipulation. It is robust, well-maintained, and freely available under the Apache License.

  • Primary Use: Reading, manipulating, rendering, and generating PDF documents.
  • Extraction: PDFTextStripper is the core class for simple, sequential text extraction.
  • Advantage: Free, comprehensive, and handles most standard text-based PDFs reliably.

2. iText / OpenPDF

iText is a powerful, enterprise-grade PDF library; current versions are dual-licensed under the AGPL and a commercial license. OpenPDF is a fork of an older iText release (iText 4) and is free under the LGPL and MPL licenses.

  • Primary Use: Generating PDFs from scratch, signing, and advanced form handling.
  • Extraction: Provides utilities for extracting text and managing PDF forms (AcroForm).
  • Advantage: Excellent for PDFs with digital signatures or complex form fields.

3. Apache Tika (Content Detection Focus)

Tika is a toolkit primarily designed for content detection and metadata extraction from various file formats, including PDFs.

  • Primary Use: Identifying and parsing document types (e.g., DOCX, PDF, XLS).
  • Extraction: Uses existing PDF parsers (like PDFBox) internally, focusing on returning the document's text and metadata in a uniform way.
  • Advantage: Good for uniform text extraction across different document types.

Step 2: JSON Serialization Libraries

Once the raw data is extracted from the PDF into Java objects (e.g., Strings, Lists, Maps), a dedicated JSON library is used to structure and serialize these objects into a valid JSON string.
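
To see why a dedicated library is worth using here, the sketch below hand-rolls the one part of serialization that PDF text most often breaks: escaping quotes, backslashes, and control characters. It is an illustration only, not a replacement for Jackson or Gson, and the class and method names (JsonEscapeSketch, escape, toJson) are hypothetical.

```java
// Illustration only: real code should let Jackson, Gson, or JSON-java do this.
class JsonEscapeSketch {

    // Escapes a Java string so it can sit between double quotes in a JSON document.
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                case '\b': out.append("\\b");  break;
                case '\f': out.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // Other control characters need the six-character hex escape form
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Builds a one-field JSON object by hand, using the same "pdfText" key as the earlier example
    static String toJson(String pdfText) {
        return "{\"pdfText\":\"" + escape(pdfText) + "\"}";
    }

    public static void main(String[] args) {
        // Text extracted from a PDF routinely contains quotes, tabs, and line breaks
        String extracted = "Invoice \"A-17\"\nTotal:\t42.00";
        System.out.println(toJson(extracted));
    }
}
```

A JSON library handles all of this, plus nested objects, arrays, and numbers, which is why the examples in this post delegate the serialization step.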

1. Jackson (The Industry Standard)

Jackson is one of the most widely used and feature-rich libraries for JSON processing in Java.

  • Primary Classes: ObjectMapper for reading and writing JSON.
  • Key Capability: Converts standard Java objects (POJOs) directly to JSON and vice-versa (Data Binding).

2. Gson (Google Library)

Gson is Google’s library, often chosen for its simplicity and ease of use compared to Jackson, particularly for basic serialization tasks.

  • Primary Classes: Gson object for serialization/deserialization.
  • Key Capability: Minimal setup required to serialize custom objects.

Beginner’s Implementation Example (PDFBox & Jackson)

This example demonstrates the simplest conversion: extracting all text from a PDF and mapping it into a simple JSON structure.

Project Setup (Maven Dependencies)

You need the core PDF parser and a JSON serialization library:

XML

<dependencies>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.31</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.16.1</version>
    </dependency>
</dependencies>

Java Code Example

We create a simple Java class (PdfData) to hold the extracted information and use Jackson to serialize it.

Java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;

// 1. Simple POJO to structure the output data
class PdfData {
    public String fileName;
    public String extractedText;

    public PdfData(String fileName, String extractedText) {
        this.fileName = fileName;
        this.extractedText = extractedText;
    }
}

public class PdfToJsonConverter {

    public static String convertPdfToJson(String pdfPath) throws IOException {
        
        File file = new File(pdfPath);
        
        // 2. Use PDFBox to extract the text
        try (PDDocument document = PDDocument.load(file)) {
            
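            // Check if the document allows content extraction
            // (an encrypted PDF may still permit it, so test the permission itself)
            if (!document.getCurrentAccessPermission().canExtractContent()) {
                throw new IOException("Document permissions do not allow text extraction.");
            }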
            
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            
            // 3. Create the data object
            PdfData data = new PdfData(file.getName(), text);
            
            // 4. Use Jackson to serialize the object to JSON
            ObjectMapper mapper = new ObjectMapper();
            return mapper.writerWithDefaultPrettyPrinter().writeValueAsString(data);
        }
    }

    public static void main(String[] args) {
        try {
            String pdfFilePath = "path/to/your/document.pdf"; // CHANGE THIS PATH
            String jsonOutput = convertPdfToJson(pdfFilePath);
            System.out.println(jsonOutput);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Key Challenge: Structuring Complex Data

The example above only extracts raw, sequential text. For most real-world scenarios, you need to extract structured data (e.g., key-value pairs, tables, invoices). This is the hardest part of PDF to JSON conversion.
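
As a first step beyond a single text blob, here is a minimal sketch that picks "Label: value" lines out of the extracted text. It assumes the PDF text has already been read into a String (for example with PDFTextStripper, as shown earlier) and that the labels of interest really are followed by a colon; the class name KeyValueSketch and the sample text are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class KeyValueSketch {

    // Turns lines of the form "Label: value" into an ordered map; other lines are skipped.
    static Map<String, String> parse(String pdfText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : pdfText.split("\\R")) {   // \R matches any line break
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no label before a colon, so not a key-value line
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!key.isEmpty() && !value.isEmpty()) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Sample text standing in for the output of the extraction step
        String text = "INVOICE\nInvoice No: A-17\nDate: 2024-02-26\nThank you";
        System.out.println(parse(text));
    }
}
```

The resulting Map can be passed to a JSON library in place of the single-string PdfData object, so each label becomes its own JSON field.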

1. Tabular Data Extraction

PDFs do not inherently understand “tables.” They just contain lines and positioned text. To extract tables, you must:

  • Determine Coordinates: Use PDFBox’s advanced APIs to get the (x, y) coordinates of individual text elements.
  • Reconstruct Structure: Write custom logic to infer rows and columns based on the relative positioning of those coordinates.
  • Specialized Libraries: Consider using commercial tools or sophisticated open-source libraries that sit on top of PDFBox/iText specifically designed for table parsing, as this is too complex for basic code.
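
The "reconstruct structure" step above can be sketched in plain Java. The example assumes the coordinates have already been collected into a hypothetical Fragment holder (with PDFBox they could come from the TextPosition objects passed to an overridden PDFTextStripper.writeString), and it uses a simple tolerance on y to decide what belongs to the same row. Real tables with wrapped cells or merged columns need more than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TableRowSketch {

    // Hypothetical holder for one piece of text and its position on the page
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups fragments into rows by y (within a tolerance), then orders each row by x.
    static List<List<String>> toRows(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(f -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > yTolerance) {
                rows.add(new ArrayList<>());   // y moved far enough: start a new row
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble(f -> f.x));   // left-to-right cell order
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Four fragments forming a 2x2 table, deliberately out of order
        List<Fragment> page = List.of(
                new Fragment("Qty", 200, 100),
                new Fragment("Item", 50, 100.4),
                new Fragment("Widget", 50, 120),
                new Fragment("3", 200, 119.8));
        System.out.println(toRows(page, 2.0));
    }
}
```

Each inner list is one row, so the result serializes naturally as a JSON array of arrays; mapping cells to column names would be a further step.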

2. Form (AcroForm) Data

If the PDF is an interactive (fillable) form, the process is easier. Libraries like iText or PDFBox can directly access the form fields (AcroForm) and extract their associated values and keys, which map easily to JSON fields.
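
A minimal sketch of the mapping step, assuming the field names and values have already been read into a flat map (with PDFBox this would typically mean walking the document's PDAcroForm and reading each field's fully qualified name and value). Form field names are often hierarchical and dot-separated, so the hypothetical nest method below expands them into nested maps that mirror a JSON object.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FormFieldSketch {

    // Expands dotted field names ("applicant.name") into nested maps.
    @SuppressWarnings("unchecked")
    static Map<String, Object> nest(Map<String, String> flatFields) {
        Map<String, Object> root = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : flatFields.entrySet()) {
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> node = root;
            // Walk (or create) one nested map per name segment except the last
            for (int i = 0; i < path.length - 1; i++) {
                Object child = node.get(path[i]);
                if (!(child instanceof Map)) {
                    child = new LinkedHashMap<String, Object>();
                    node.put(path[i], child);
                }
                node = (Map<String, Object>) child;
            }
            // The last segment holds the field value
            node.put(path[path.length - 1], entry.getValue());
        }
        return root;
    }

    public static void main(String[] args) {
        // Sample field names and values standing in for data read from an AcroForm
        Map<String, String> flat = new LinkedHashMap<>();
        flat.put("applicant.name", "Ada");
        flat.put("applicant.email", "ada@example.com");
        flat.put("signed", "Yes");
        System.out.println(nest(flat));
    }
}
```

Passing the nested map to a JSON library would produce an "applicant" object with "name" and "email" members next to a top-level "signed" field.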

3. Layout-Based Tagging

For documents with a consistent layout (like receipts or invoices), sophisticated solutions often use machine learning (ML) or template matching to identify fields based on surrounding text labels (e.g., finding the text next to the “Total:” label) before serialization into JSON.
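
The simplest form of template matching is a regular expression anchored on the label. The sketch below assumes the extracted text keeps the label and its value on the same line and that amounts use a dot as the decimal separator; the class and method names (LabelMatchSketch, amountAfter) are hypothetical, and an ML-based approach would look quite different.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelMatchSketch {

    // Finds the first amount that follows the given label, e.g. "Total: $1,234.50".
    static Optional<BigDecimal> amountAfter(String label, String pdfText) {
        Pattern pattern = Pattern.compile(
                // \b keeps "Total" from matching inside "Subtotal"
                "\\b" + Pattern.quote(label)
                        + "\\s*:?\\s*\\$?\\s*([0-9][0-9,]*(?:\\.[0-9]+)?)",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(pdfText);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Drop thousands separators before parsing
        return Optional.of(new BigDecimal(matcher.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        // Sample text standing in for the output of the extraction step
        String text = "Subtotal: 1,200.00\nTax: 34.50\nTotal: $1,234.50";
        System.out.println(amountAfter("Total", text));
        System.out.println(amountAfter("Discount", text));
    }
}
```

Returning an Optional makes the "label not found" case explicit, which matters when the same template is run against documents that do not all contain every field.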

Conclusion:

Converting PDF to JSON in Java is a valuable skill for handling data efficiently. With Apache PDFBox and a JSON library, you can extract text from PDF documents and represent it in a format that is both machine-readable and human-friendly; structured data such as tables and forms takes additional work. This blog has provided a beginner-friendly guide to get you started on this journey. If you want to format the generated JSON, you can use our JSON Formatter tool. Happy coding!

FAQ

Q: What purpose does converting PDF to JSON in Java serve? A: Converting PDF to JSON in Java allows developers to efficiently extract and process data from PDF files. While PDFs are great for preserving document layout, JSON offers a lightweight and human-readable format, making data interchange and manipulation more accessible.

Q: Which libraries are used in the Java program for PDF to JSON conversion? A: The first program uses Apache PDFBox for PDF document handling and JSON-java for converting extracted text into JSON format. The PDFBox & Jackson example uses Jackson for the JSON step instead.

Q: How can I set up a Java project for PDF to JSON conversion? A: Start by creating a new Java project in your preferred IDE or text editor. Include Apache PDFBox and JSON-java libraries. You can manually add these libraries or use dependency management tools like Maven or Gradle.

Q: Can you explain the steps involved in the PDF to JSON conversion process? A:

  1. Load PDF Document: Use Apache PDFBox to load the PDF document.
  2. Extract Text Content: Utilize PDFTextStripper to extract text content from the PDF.
  3. Convert to JSON: Use a JSON library (e.g., JSON-java) to convert the extracted text into JSON format.
  4. Print or Save JSON: Print or save the resulting JSON as needed.

Q: How can I specify the PDF file path in the Java program? A: Replace the placeholder "path/to/your/file.pdf" with the actual path to the PDF file you want to convert. Ensure that the file path is correctly specified to access the target PDF document.

Q: Are there any specific considerations for adding dependencies to the project? A: Yes, include the necessary dependencies for Apache PDFBox and JSON-java in your project. You can find these dependencies on Maven Central Repository or include them in your project’s pom.xml if you’re using Maven.

Q: Can I customize the program for my specific requirements? A: Absolutely! The provided code is a basic example. Depending on your needs, you may need to customize the program, handling different PDF structures or adding additional functionalities.

Q: How do I handle errors during the PDF to JSON conversion process? A: The provided code includes a try-catch block to handle IOExceptions. Enhance error handling based on your requirements, such as logging, displaying user-friendly messages, or implementing retry mechanisms.


Govind

For over 15 years, I have worked as a hands-on Java Architect and Senior Engineer, specializing in building and scaling high-performance, enterprise-level applications. My career has been focused primarily within the FinTech, Telecommunications, or E-commerce sector, where I’ve led teams in designing systems that handle millions of transactions per day.

Check out my profile here: https://simplifiedlearningblog.com/author/

Copyright © 2025 Simplified Learning Blog.
