PDF to Markdown
The PDF to Markdown API (included with the PDF Services API) is a cloud-based web service that automatically converts PDF documents – native or scanned – into well-formatted LLM-friendly Markdown text. This service preserves the document's structure and formatting while converting it into a format that's widely used for LLM flows, content authoring and documentation.
Structured Information Output Format
The output of a PDF to Markdown operation includes:
- A primary .md file containing the converted Markdown content
Output Structure
The following is a summary of key elements in the converted Markdown:
Elements
Ordered list of semantic elements converted from the PDF document, preserving the natural reading order and document structure. The conversion handles:
- Text content with proper Markdown syntax
- Document hierarchy and structure
- Inline formatting and emphasis
- Links and references
- Images and figures
- Tables and complex layouts
Content Types
The API processes various content types as follows:
Text Elements
- Headings: Converted to appropriate Markdown heading levels (H1-H6)
- Paragraphs: Preserved with proper spacing and formatting
- Lists: Both ordered and unordered lists with proper nesting
- Text Emphasis: Bold, italic, and other text formatting
- Links: Preserved with proper Markdown link syntax
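Because headings arrive as Markdown heading levels, the converted text can be split into sections for downstream processing. The following is a minimal sketch, assuming the output uses ATX-style headings (`#` through `######`); the function name `split_by_headings` is hypothetical and not part of the PDF Services SDK.

```python
import re
from typing import List, Tuple

# ATX heading: one to six '#' characters, whitespace, then the heading text.
# Assumption: the converted Markdown uses this heading style.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")


def split_by_headings(markdown: str) -> List[Tuple[int, str, str]]:
    """Return (level, title, body) for each heading-delimited section.

    Text before the first heading is returned with level 0 and an empty title.
    Lines inside fenced code blocks are never treated as headings.
    """
    sections: List[Tuple[int, str, str]] = []
    level, title, body = 0, "", []
    in_fence = False

    def flush() -> None:
        # Store the current section unless it is an empty preamble.
        if title or any(ln.strip() for ln in body):
            sections.append((level, title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()
            level, title, body = len(match.group(1)), match.group(2), []
        else:
            body.append(line)
    flush()
    return sections
```

Each returned tuple can be used as one chunk, with the heading level and title kept as metadata.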
Images and Figures
- Provided as base64-embedded images in the Markdown output
- Referenced correctly in the Markdown output
- Original quality preserved
- Proper alt-text and captions maintained
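Base64-embedded images can make the Markdown file large, which matters when the text is sent to an LLM. The sketch below moves embedded images out of the text and replaces them with file references. It assumes images appear as standard data-URI links of the form `![alt](data:image/<subtype>;base64,<payload>)`; that exact shape is an assumption about the output, and `externalize_images` is a hypothetical helper name.

```python
import base64
import re
from typing import List, Tuple

# Assumed shape of an embedded image: ![alt](data:image/<subtype>;base64,<payload>)
DATA_URI_IMAGE_RE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[a-zA-Z0-9.+-]+);base64,(?P<payload>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown: str, prefix: str = "figure") -> Tuple[str, List[Tuple[str, bytes]]]:
    """Replace embedded data-URI images with file references.

    Returns the rewritten Markdown and a list of (file_name, image_bytes)
    pairs that the caller can write to disk next to the Markdown file.
    The file extension is taken directly from the data-URI subtype.
    """
    images: List[Tuple[str, bytes]] = []

    def replace(match: "re.Match") -> str:
        data = base64.b64decode(match.group("payload"))
        name = f"{prefix}-{len(images) + 1}.{match.group('subtype')}"
        images.append((name, data))
        return f"![{match.group('alt')}]({name})"

    return DATA_URI_IMAGE_RE.sub(replace, markdown), images
```

The alt text is kept in the rewritten link, so captions remain available to text-only consumers.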
Tables
- Converted to Markdown table syntax
- Column alignment preserved
- Cell content formatting maintained
- Complex table structures supported
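Once a table is in Markdown table syntax, it can be read back into structured rows. The sketch below parses a simple pipe table, including the alignment markers in the delimiter row. It assumes the common pipe-table layout (header row, delimiter row, data rows) and does not handle escaped pipes inside cells; `parse_table` and `split_row` are hypothetical helper names.

```python
from typing import Dict, List


def split_row(line: str) -> List[str]:
    """Split one pipe-delimited row into trimmed cell strings."""
    line = line.strip()
    if line.startswith("|"):
        line = line[1:]
    if line.endswith("|"):
        line = line[:-1]
    return [cell.strip() for cell in line.split("|")]


def parse_table(lines: List[str]) -> Dict[str, list]:
    """Parse a simple Markdown pipe table into headers, alignments, and rows.

    Expects the header row first and the delimiter row
    (---, :---, ---:, or :---:) second.
    """
    headers = split_row(lines[0])
    alignments = []
    for spec in split_row(lines[1]):
        if spec.startswith(":") and spec.endswith(":"):
            alignments.append("center")
        elif spec.endswith(":"):
            alignments.append("right")
        elif spec.startswith(":"):
            alignments.append("left")
        else:
            alignments.append("default")
    # Map each data row onto the header names; blank lines are skipped.
    rows = [dict(zip(headers, split_row(line))) for line in lines[2:] if line.strip()]
    return {"headers": headers, "alignments": alignments, "rows": rows}
```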
Element Types and Paths
The API recognizes and converts the following structural elements:
| Category | Element Type | Description |
|---|---|---|
| Aside | Aside | Content that is not part of the regular content flow |
| Figure | Figure | Non-reflowable constructs such as graphs, images, and flowcharts |
| Footnote | Footnote | Footnote |
| Headings | H, H1, H2, etc. | Heading levels |
| List | L, Li, Lbl, Lbody | List and list item elements |
| Paragraph | P, ParagraphSpan | Paragraphs and paragraph segments |
| Reference | Reference | Links |
| Section | Sect | Logical section of the document |
| StyleSpan | StyleSpan | Styling variations within text |
| Table | Table, TD, TH, TR | Table elements |
| Title | Title | Document title |
Reading Order
The reading order in the output Markdown maintains:
- Natural document flow
- Proper content hierarchy
- Column-based layouts
- Page transitions
- Inline elements and references
Use Cases
The PDF to Markdown API is particularly valuable for:
- LLM and RAG ingestion: Convert PDFs to Markdown for chunking, embeddings, and retrieval-augmented generation (RAG).
- Prompt and context packaging: Produce Markdown that is easy to paste, structure, and cite in prompts and agent workflows.
- Training data preparation: Create LLM fine-tuning datasets from PDF content after review, cleanup, and labeling.
- Doc-as-code workflows: Bring PDF content into Git-based review, versioning, diffing, and static-site generators.
- Knowledge base publishing: Migrate PDFs into documentation platforms and internal wikis as clean, editable Markdown.
- Legacy and archive modernization: Convert historical PDFs so they become searchable, editable, and maintainable.
- Automated document processing: Standardize PDF-to-text conversion inside ETL and document-processing pipelines.
- Enterprise search and indexing: Feed converted Markdown into internal search systems and knowledge repositories.
- Compliance and audit readiness: Make PDF policies, SOPs, and manuals searchable and easier to review for changes.
- Content QA and change tracking: Compare converted Markdown across document versions to detect updates and regressions.
- Analytics and classification: Use Markdown output for topic modeling, tagging, deduplication, and routing workflows.
- Localization workflows: Convert to Markdown as a starting point for translation and multi-language documentation.
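As an illustration of the change-tracking use case above, two converted versions of a document can be compared with Python's standard `difflib` module. This is a minimal sketch that works on Markdown strings already obtained from the service; the function names and the default file labels `v1.md` and `v2.md` are hypothetical.

```python
import difflib
from typing import List


def markdown_diff(old_md: str, new_md: str,
                  old_name: str = "v1.md", new_name: str = "v2.md") -> List[str]:
    """Return a unified diff between two Markdown strings, one entry per diff line."""
    return list(
        difflib.unified_diff(
            old_md.splitlines(),
            new_md.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


def changed_lines(diff: List[str]) -> List[str]:
    """Keep only added or removed content lines.

    The first two entries of a unified diff are the file headers,
    so they are skipped by position.
    """
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

An empty diff means the two conversions produced identical text.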
API Limitations
For File Constraints and Processing Limits, see Licensing and Usage Limits.
Document Requirements
- Files must be unprotected or allow content copying
- No support for:
  - Hidden objects (JavaScript, OCG)
  - XFA and fillable forms
  - Complex annotations
  - CAD drawings or vector art
  - Password-protected content
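Some of these conditions can be flagged before upload with a rough byte-level scan. The sketch below is a heuristic only, not validation: it looks for PDF name tokens that usually accompany encryption, forms, JavaScript, and optional content groups, and it misses anything stored inside compressed object streams. The marker list and the `preflight` function are assumptions for illustration and are not part of the PDF Services SDK.

```python
from typing import Dict

# PDF name tokens that hint at features listed as unsupported.
# Heuristic only: a clean report does not guarantee the file will convert.
MARKERS: Dict[str, bytes] = {
    "encrypted": b"/Encrypt",
    "xfa_form": b"/XFA",
    "fillable_form": b"/AcroForm",
    "javascript": b"/JavaScript",
    "optional_content": b"/OCProperties",
}


def preflight(pdf_bytes: bytes) -> Dict[str, bool]:
    """Report which marker tokens appear anywhere in the raw PDF bytes."""
    return {label: token in pdf_bytes for label, token in MARKERS.items()}
```

A flagged file can be routed to manual review instead of being submitted.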
REST API
See our public API Reference for the PDF to Markdown API.
Get Markdown from a PDF
Use the sample below to convert a PDF to Markdown.
Refer to the API usage guide to understand how to use our APIs.
```csharp
// Get the samples from https://github.com/adobe/PDFServices.NET.SDK.Samples
// Run the sample:
// cd PDFToMarkdown/
// dotnet run PDFToMarkdown.csproj

namespace PDFToMarkdown
{
    class Program
    {
        private static readonly ILog log = LogManager.GetLogger(typeof(Program));

        static void Main()
        {
            ConfigureLogging();
            try
            {
                ICredentials credentials = new ServicePrincipalCredentials(
                    Environment.GetEnvironmentVariable("PDF_SERVICES_CLIENT_ID"),
                    Environment.GetEnvironmentVariable("PDF_SERVICES_CLIENT_SECRET"));

                PDFServices pdfServices = new PDFServices(credentials);

                using Stream inputStream = File.OpenRead(@"pdfToMarkdownInput.pdf");
                IAsset asset = pdfServices.Upload(inputStream, PDFServicesMediaType.PDF.GetMIMETypeValue());

                PDFToMarkdownJob pdfToMarkdownJob = new PDFToMarkdownJob(asset);

                String location = pdfServices.Submit(pdfToMarkdownJob);
                PDFServicesResponse<PDFToMarkdownResult> pdfServicesResponse =
                    pdfServices.GetJobResult<PDFToMarkdownResult>(location, typeof(PDFToMarkdownResult));

                IAsset resultAsset = pdfServicesResponse.Result.Asset;
                StreamAsset streamAsset = pdfServices.GetContent(resultAsset);

                String outputFilePath = CreateOutputFilePath();
                new FileInfo(Directory.GetCurrentDirectory() + outputFilePath).Directory.Create();
                Stream outputStream = File.OpenWrite(Directory.GetCurrentDirectory() + outputFilePath);
                streamAsset.Stream.CopyTo(outputStream);
                outputStream.Close();
            }
            catch (ServiceUsageException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (ServiceApiException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (SDKException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (IOException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (Exception ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
        }

        static void ConfigureLogging()
        {
            ILoggerRepository logRepository = LogManager.GetRepository(Assembly.GetEntryAssembly());
            XmlConfigurator.Configure(logRepository, new FileInfo("log4net.config"));
        }

        private static String CreateOutputFilePath()
        {
            String timeStamp = DateTime.Now.ToString("yyyy'-'MM'-'dd'T'HH'-'mm'-'ss");
            return ("/output/pdfToMarkdown" + timeStamp + ".md");
        }
    }
}
```
```python
# Get the samples https://github.com/adobe/pdfservices-python-sdk-samples
# Run the sample:
# python src/pdftomarkdown/pdf_to_markdown.py

# Initialize the logger
logging.basicConfig(level=logging.INFO)


class PDFToMarkdown:
    def __init__(self):
        try:
            file = open('src/resources/pdfToMarkdownInput.pdf', 'rb')
            input_stream = file.read()
            file.close()

            # Initial setup, create credentials instance
            credentials = ServicePrincipalCredentials(
                client_id=os.getenv('PDF_SERVICES_CLIENT_ID'),
                client_secret=os.getenv('PDF_SERVICES_CLIENT_SECRET')
            )

            # Creates a PDF Services instance
            pdf_services = PDFServices(credentials=credentials)

            # Creates an asset(s) from source file(s) and upload
            input_asset = pdf_services.upload(input_stream=input_stream,
                                              mime_type=PDFServicesMediaType.PDF)

            # Creates a new job instance
            pdf_to_markdown_job = PDFToMarkdownJob(input_asset=input_asset)

            # Submit the job and gets the job result
            location = pdf_services.submit(pdf_to_markdown_job)
            pdf_services_response = pdf_services.get_job_result(location, PDFToMarkdownResult)

            # Get content from the resulting asset(s)
            result_asset: CloudAsset = pdf_services_response.get_result().get_asset()
            stream_asset: StreamAsset = pdf_services.get_content(result_asset)

            # Creates an output stream and copy stream asset's content to it
            output_file_path = self.create_output_file_path()
            with open(output_file_path, "wb") as file:
                file.write(stream_asset.get_input_stream())

        except (ServiceApiException, ServiceUsageException, SdkException) as e:
            logging.exception(f'Exception encountered while executing operation: {e}')

    # Generates a string containing a directory structure and file name for the output file
    @staticmethod
    def create_output_file_path() -> str:
        now = datetime.now()
        time_stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
        os.makedirs("output/PDFToMarkdown", exist_ok=True)
        return f"output/PDFToMarkdown/markdown{time_stamp}.md"


if __name__ == "__main__":
    PDFToMarkdown()
```
```bash
# Refer to our REST API docs for more information:
# https://developer.adobe.com/document-services/docs/apis/#tag/PDF-To-Markdown

curl --location --request POST 'https://pdf-services.adobe.io/operation/pdftomarkdown' \
--header 'x-api-key: {{Placeholder for client_id}}' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {{Placeholder for token}}' \
--data-raw '{
    "assetID": "urn:aaid:AS:UE1:23c30ee0-2e4d-46d6-87f2-087832fca718"
}'
```
Get Markdown from a PDF with Figures
Use the sample below to convert a PDF to Markdown and include the figures embedded in the PDF.
Refer to the API usage guide to understand how to use our APIs.
```csharp
// Get the samples from https://github.com/adobe/PDFServices.NET.SDK.Samples
// Run the sample:
// cd PDFToMarkdownWithFigures/
// dotnet run PDFToMarkdownWithFigures.csproj

namespace PDFToMarkdownWithFigures
{
    class Program
    {
        private static readonly ILog log = LogManager.GetLogger(typeof(Program));

        static void Main()
        {
            ConfigureLogging();
            try
            {
                ICredentials credentials = new ServicePrincipalCredentials(
                    Environment.GetEnvironmentVariable("PDF_SERVICES_CLIENT_ID"),
                    Environment.GetEnvironmentVariable("PDF_SERVICES_CLIENT_SECRET"));

                PDFServices pdfServices = new PDFServices(credentials);

                using Stream inputStream = File.OpenRead(@"pdfToMarkdownInput.pdf");
                IAsset asset = pdfServices.Upload(inputStream, PDFServicesMediaType.PDF.GetMIMETypeValue());

                // Create parameters for the job (include figure renditions in the output)
                PDFToMarkdownParams pdfToMarkdownParams = PDFToMarkdownParams.PDFToMarkdownParamsBuilder()
                    .WithGetFigures(true)
                    .Build();

                PDFToMarkdownJob pdfToMarkdownJob = new PDFToMarkdownJob(asset)
                    .SetParams(pdfToMarkdownParams);

                String location = pdfServices.Submit(pdfToMarkdownJob);
                PDFServicesResponse<PDFToMarkdownResult> pdfServicesResponse =
                    pdfServices.GetJobResult<PDFToMarkdownResult>(location, typeof(PDFToMarkdownResult));

                IAsset resultAsset = pdfServicesResponse.Result.Asset;
                StreamAsset streamAsset = pdfServices.GetContent(resultAsset);

                String outputFilePath = CreateOutputFilePath();
                new FileInfo(Directory.GetCurrentDirectory() + outputFilePath).Directory.Create();
                Stream outputStream = File.OpenWrite(Directory.GetCurrentDirectory() + outputFilePath);
                streamAsset.Stream.CopyTo(outputStream);
                outputStream.Close();
            }
            catch (ServiceUsageException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (ServiceApiException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (SDKException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (IOException ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
            catch (Exception ex)
            {
                log.Error("Exception encountered while executing operation", ex);
            }
        }

        static void ConfigureLogging()
        {
            ILoggerRepository logRepository = LogManager.GetRepository(Assembly.GetEntryAssembly());
            XmlConfigurator.Configure(logRepository, new FileInfo("log4net.config"));
        }

        private static String CreateOutputFilePath()
        {
            String timeStamp = DateTime.Now.ToString("yyyy'-'MM'-'dd'T'HH'-'mm'-'ss");
            return ("/output/pdfToMarkdownWithFigures" + timeStamp + ".md");
        }
    }
}
```
```python
# Get the samples https://github.com/adobe/pdfservices-python-sdk-samples
# Run the sample:
# python src/pdftomarkdown/pdf_to_markdown.py

# Initialize the logger
logging.basicConfig(level=logging.INFO)


class PDFToMarkdownWithOptions:
    def __init__(self):
        try:
            file = open('src/resources/pdfToMarkdownInput.pdf', 'rb')
            input_stream = file.read()
            file.close()

            # Initial setup, create credentials instance
            credentials = ServicePrincipalCredentials(
                client_id=os.getenv('PDF_SERVICES_CLIENT_ID'),
                client_secret=os.getenv('PDF_SERVICES_CLIENT_SECRET')
            )

            # Creates a PDF Services instance
            pdf_services = PDFServices(credentials=credentials)

            # Creates an asset(s) from source file(s) and upload
            input_asset = pdf_services.upload(input_stream=input_stream,
                                              mime_type=PDFServicesMediaType.PDF)

            # Create parameters for the job with figures extraction enabled
            pdf_to_markdown_params = PDFToMarkdownParams(get_figures=True)

            # Creates a new job instance
            pdf_to_markdown_job = PDFToMarkdownJob(input_asset=input_asset,
                                                   pdf_to_markdown_params=pdf_to_markdown_params)

            # Submit the job and gets the job result
            location = pdf_services.submit(pdf_to_markdown_job)
            pdf_services_response = pdf_services.get_job_result(location, PDFToMarkdownResult)

            # Get content from the resulting asset(s)
            result_asset: CloudAsset = pdf_services_response.get_result().get_asset()
            stream_asset: StreamAsset = pdf_services.get_content(result_asset)

            # Creates an output stream and copy stream asset's content to it
            output_file_path = self.create_output_file_path()
            with open(output_file_path, "wb") as file:
                file.write(stream_asset.get_input_stream())

        except (ServiceApiException, ServiceUsageException, SdkException) as e:
            logging.exception(f'Exception encountered while executing operation: {e}')

    # Generates a string containing a directory structure and file name for the output file
    @staticmethod
    def create_output_file_path() -> str:
        now = datetime.now()
        time_stamp = now.strftime("%Y-%m-%dT%H-%M-%S")
        os.makedirs("output/PDFToMarkdownWithOptions", exist_ok=True)
        return f"output/PDFToMarkdownWithOptions/markdown{time_stamp}.md"


if __name__ == "__main__":
    PDFToMarkdownWithOptions()
```
```bash
# Refer to our REST API docs for more information:
# https://developer.adobe.com/document-services/docs/apis/#tag/PDF-To-Markdown

curl --location --request POST 'https://pdf-services.adobe.io/operation/pdftomarkdown' \
--header 'x-api-key: {{Placeholder for client_id}}' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {{Placeholder for token}}' \
--data-raw '{
    "assetID": "urn:aaid:AS:UE1:23c30ee0-2e4d-46d6-87f2-087832fca718",
    "getFigures": true
}'
```
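For callers who prefer not to shell out to curl, the same request can be assembled with Python's standard library. The sketch below only builds the request object, using the endpoint, headers, and body fields shown in the curl samples above; `build_request` is a hypothetical helper name, and response handling is left to the API usage guide.

```python
import json
import urllib.request
from typing import Optional

# Endpoint taken from the curl samples.
ENDPOINT = "https://pdf-services.adobe.io/operation/pdftomarkdown"


def build_request(client_id: str, token: str, asset_id: str,
                  get_figures: Optional[bool] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl samples."""
    body = {"assetID": asset_id}
    # Only include getFigures when the caller sets it explicitly.
    if get_figures is not None:
        body["getFigures"] = get_figures
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": client_id,
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` requires valid credentials and the ID of an asset that has already been uploaded.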

