Getting Started with PDF to Markdown API (Python)

To get started with the Adobe PDF to Markdown API, let's walk through a simple scenario: taking an input PDF document and converting its contents to Markdown. Once the PDF has been converted, we'll save the Markdown output to a file. This guide walks through the complete process of creating a program that accomplishes this task.

Prerequisites

To complete this guide, you will need:

Python - Python 3.10 or higher is required.
An Adobe ID. If you do not have one, the credential setup will walk you through creating one.
A way to edit code. No specific editor is required for this guide.
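
One quick way to confirm the first prerequisite is to ask the interpreter itself. The following is a minimal sketch, not part of the SDK; the helper name `meets_minimum_version` is made up for illustration.

```python
import sys

# Minimum version stated in the prerequisites above.
MINIMUM_VERSION = (3, 10)


def meets_minimum_version(version_info=None, minimum=MINIMUM_VERSION) -> bool:
    """Return True if a (major, minor, ...) tuple is at least `minimum`.

    Hypothetical helper for this guide; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    if meets_minimum_version():
        print(f"Python {current} meets the 3.10+ requirement.")
    else:
        print(f"Python {current} is too old; this guide requires 3.10 or higher.")
```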

Step One: Getting credentials

1) To begin, open your browser to https://acrobatservices.adobe.com/dc-integration-creation-app-cdn/main.html?api=pdf-extract-api. If you are not already logged in to Adobe.com, you will need to sign in or create a new user. Using a personal email account, rather than a federated ID, is recommended.

2) After registering or logging in, you will then be asked to name your new credentials. Use the name, "New Project".

3) Change the "Choose language" setting to "Python".

4) Also note the checkbox by, "Create personalized code sample." This will include a large set of samples along with your credentials. These can be helpful for learning more later.

5) Click the checkbox saying you agree to the developer terms and then click "Create credentials."

6) After your credentials are created, they are automatically downloaded.

Step Two: Setting up the project

1) In your Downloads folder, find the ZIP file with your credentials: PDFServicesSDK-Python Samples.zip. If you unzip that archive, you will find a README file, a folder of samples and the pdfservices-api-credentials.json file.

2) Take the pdfservices-api-credentials.json file and place it in a new directory. Remember that these credential files are important and should be stored safely.

3) At the command line, change to the directory you created, and run the following command to install the Python SDK: pip install pdfservices-sdk.

At this point, we've installed the Python SDK for Adobe PDF Services API as a dependency for our project and have copied over our credentials files.
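
If you want to confirm that the install succeeded before writing any code, the standard library can report the installed version of a distribution. This is a small sketch that assumes the distribution name is `pdfservices-sdk`, matching the pip command above; the function name `installed_version` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "pdfservices-sdk"):
    """Return the installed version string of a distribution, or None if absent.

    The default name is taken from the `pip install` command in this guide.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pdfservices-sdk is not installed in this environment.")
    else:
        print(f"pdfservices-sdk {found} is installed.")
```

If the script reports that the package is missing, check that pip and the Python interpreter you are running belong to the same environment.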

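Our application will take a PDF, Adobe Extract API Sample.pdf (downloadable from [here](/ Adobe%20Extract%20API%20Sample.pdf)), and convert its contents to Markdown. The results will be saved as a .md file with a timestamp in the filename. Note that the code in this guide opens ./pdfToMarkdownInput.pdf, so either save the PDF under that name in your project directory or change the path in the code.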

4) In your editor, open the directory where you previously copied the credentials. Create a new file, extract.py.

Now you're ready to begin coding.

Step Three: Creating the application

1) We'll begin by including our required dependencies:

import logging
import os
from datetime import datetime

from adobe.pdfservices.operation.auth.service_principal_credentials import ServicePrincipalCredentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.io.cloud_asset import CloudAsset
from adobe.pdfservices.operation.io.stream_asset import StreamAsset
from adobe.pdfservices.operation.pdf_services import PDFServices
from adobe.pdfservices.operation.pdf_services_media_type import PDFServicesMediaType
from adobe.pdfservices.operation.pdfjobs.jobs.pdf_to_markdown_job import PDFToMarkdownJob
from adobe.pdfservices.operation.pdfjobs.result.pdf_to_markdown_result import PDFToMarkdownResult

These imports bring in the Adobe PDF Services SDK components needed for PDF to Markdown conversion.

2) Next, we set up the SDK to use our credentials.

# Initial setup, create credentials instance
credentials = ServicePrincipalCredentials(
    client_id=os.getenv('PDF_SERVICES_CLIENT_ID'),
    client_secret=os.getenv('PDF_SERVICES_CLIENT_SECRET')
)

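This code reads the client ID and client secret from the PDF_SERVICES_CLIENT_ID and PDF_SERVICES_CLIENT_SECRET environment variables and wraps them in a credentials object that is used later to create the PDF Services instance. Before running the program, set both variables to the values found in the pdfservices-api-credentials.json file you downloaded.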
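
Because the snippet above relies on environment variables, one option is to populate them from the downloaded JSON file at startup. The sketch below is not part of the SDK: the function name `load_credentials_into_env` is hypothetical, and it assumes the file keeps the values under a top-level `client_credentials` object with `client_id` and `client_secret` keys. Open your own file to confirm the key names before relying on it.

```python
import json
import os


def load_credentials_into_env(path: str = "pdfservices-api-credentials.json") -> None:
    """Copy the client ID and secret from the credentials file into os.environ.

    ASSUMPTION: the JSON has the shape
    {"client_credentials": {"client_id": "...", "client_secret": "..."}}.
    A KeyError is raised if the file is laid out differently.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)

    client = data["client_credentials"]  # assumed key name, verify in your file
    os.environ["PDF_SERVICES_CLIENT_ID"] = client["client_id"]
    os.environ["PDF_SERVICES_CLIENT_SECRET"] = client["client_secret"]
```

Calling `load_credentials_into_env()` before constructing `ServicePrincipalCredentials` would make the `os.getenv` lookups succeed. Exporting the two variables in your shell works equally well and keeps the secret out of your code path entirely.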

3) Now, let's create the operation:

# Reads the input PDF into memory as bytes
with open('./pdfToMarkdownInput.pdf', 'rb') as file:
    input_stream = file.read()

# Creates a PDF Services instance
pdf_services = PDFServices(credentials=credentials)

# Creates an asset from the source file and uploads it
input_asset = pdf_services.upload(input_stream=input_stream, mime_type=PDFServicesMediaType.PDF)

# Creates a new job instance
pdf_to_markdown_job = PDFToMarkdownJob(input_asset=input_asset)

This code uploads the PDF as an asset and creates a PDF to Markdown conversion job for it. The job converts the PDF content to Markdown, preserving document structure and formatting.

4) The next code block executes the operation:

# Submit the job and gets the job result
location = pdf_services.submit(pdf_to_markdown_job)
pdf_services_response = pdf_services.get_job_result(location, PDFToMarkdownResult)

# Get content from the resulting asset(s)
result_asset: CloudAsset = pdf_services_response.get_result().get_asset()
stream_asset: StreamAsset = pdf_services.get_content(result_asset)

# Writes the resulting Markdown to a timestamped output file
# (create_output_file_path is a helper method defined in the complete application below)
output_file_path = self.create_output_file_path()
with open(output_file_path, "wb") as file:
    file.write(stream_asset.get_input_stream())

This code runs the PDF to Markdown conversion and then saves the resulting Markdown file to the file system.
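
The output path in this step comes from a helper that builds a timestamped filename. As a standalone sketch of that idea, the function below mirrors the helper from the complete application; the `base_dir`, `prefix`, and `now` parameters are additions for illustration and testing, not part of the original sample.

```python
import os
from datetime import datetime


def create_output_file_path(base_dir="output/PDFToMarkdown", prefix="markdown", now=None) -> str:
    """Create `base_dir` if needed and return a timestamped .md path inside it.

    The parameters are illustrative additions; the original helper takes none.
    """
    if now is None:
        now = datetime.now()
    # Hyphens instead of colons keep the name valid on every major file system.
    time_stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    os.makedirs(base_dir, exist_ok=True)
    return f"{base_dir}/{prefix}{time_stamp}.md"
```

Because the timestamp has one-second resolution, two runs within the same second would write to the same file.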

Here's the complete application (extract.py):

import logging
import os
from datetime import datetime

from adobe.pdfservices.operation.auth.service_principal_credentials import ServicePrincipalCredentials
from adobe.pdfservices.operation.exception.exceptions import ServiceApiException, ServiceUsageException, SdkException
from adobe.pdfservices.operation.io.cloud_asset import CloudAsset
from adobe.pdfservices.operation.io.stream_asset import StreamAsset
from adobe.pdfservices.operation.pdf_services import PDFServices
from adobe.pdfservices.operation.pdf_services_media_type import PDFServicesMediaType
from adobe.pdfservices.operation.pdfjobs.jobs.pdf_to_markdown_job import PDFToMarkdownJob
from adobe.pdfservices.operation.pdfjobs.result.pdf_to_markdown_result import PDFToMarkdownResult

# Initialize the logger
logging.basicConfig(level=logging.INFO)

# This sample illustrates how to convert a PDF file to Markdown format.
#
# Refer to README.md for instructions on how to run the samples.

class PDFToMarkdown:
    def __init__(self):
        try:
            with open('./pdfToMarkdownInput.pdf', 'rb') as file:
                input_stream = file.read()

            # Initial setup, create credentials instance
            credentials = ServicePrincipalCredentials(
                client_id=os.getenv('PDF_SERVICES_CLIENT_ID'),
                client_secret=os.getenv('PDF_SERVICES_CLIENT_SECRET')
            )

            # Creates a PDF Services instance
            pdf_services = PDFServices(credentials=credentials)

            # Creates an asset(s) from source file(s) and upload
            input_asset = pdf_services.upload(input_stream=input_stream,
                                              mime_type=PDFServicesMediaType.PDF)

            # Creates a new job instance
            pdf_to_markdown_job = PDFToMarkdownJob(input_asset=input_asset)

            # Submit the job and gets the job result
            location = pdf_services.submit(pdf_to_markdown_job)
            pdf_services_response = pdf_services.get_job_result(location, PDFToMarkdownResult)

            # Get content from the resulting asset(s)
            result_asset: CloudAsset = pdf_services_response.get_result().get_asset()
            stream_asset: StreamAsset = pdf_services.get_content(result_asset)

            # Creates an output stream and copy stream asset's content to it
            output_file_path = self.create_output_file_path()
            with open(output_file_path, "wb") as file:
                file.write(stream_asset.get_input_stream())

        except (ServiceApiException, ServiceUsageException, SdkException) as e:
            logging.exception(f'Exception encountered while executing operation: {e}')

    # Generates a string containing a directory structure and file name for the output file
    @staticmethod
    def create_output_file_path() -> str:
        now = datetime.now()
        time_stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
        os.makedirs("output/PDFToMarkdown", exist_ok=True)
        return f"output/PDFToMarkdown/markdown{time_stamp}.md"


if __name__ == "__main__":
    PDFToMarkdown()
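
The `except` clause in the application only catches the SDK's own exceptions, so a missing input file surfaces as an uncaught `FileNotFoundError`, and unset environment variables mean the credentials are built from `None` values. A small pre-flight check can report both problems up front. This is a sketch under the guide's own names for the environment variables and input path; the function `find_setup_problems` is hypothetical.

```python
import os

# Names used by the application above.
REQUIRED_ENV_VARS = ("PDF_SERVICES_CLIENT_ID", "PDF_SERVICES_CLIENT_SECRET")


def find_setup_problems(input_path="./pdfToMarkdownInput.pdf", environ=None) -> list:
    """Return a list of human-readable setup problems (empty if none found).

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    if environ is None:
        environ = os.environ
    problems = []
    for name in REQUIRED_ENV_VARS:
        if not environ.get(name):
            problems.append(f"Environment variable {name} is not set.")
    if not os.path.isfile(input_path):
        problems.append(f"Input file not found: {input_path}")
    return problems


if __name__ == "__main__":
    issues = find_setup_problems()
    for issue in issues:
        print(issue)
    if not issues:
        print("Setup looks complete; run extract.py to convert the PDF.")
```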

Next Steps

Now that you've successfully performed your first operation, review the documentation for many other examples and reach out on our forums with any questions. Remember that the samples you downloaded while creating your credentials also include many demos.
