LLM Document Parsing API Endpoint

Version 2.197 (Release Notes ↗)

Description

The LLM Parse API converts documents into LLM-ready output for retrieval, summarization, RAG ingestion, search, and automation workflows. Send a document URL to https://api.pixlab.io/llmparse, choose an output format, and PixLab queues a parsing job that extracts the document into clean Markdown, structured JSON, or plain text.

The endpoint is backed by the docparse operation. It downloads the source document, runs layout-aware document conversion, and returns a jobId immediately. Poll the job until its status is completed or failed.

  • Parse PDF, DOCX, PPTX, XLSX, HTML, and other office-style documents into LLM-friendly content
  • Export parsed documents as md, json, or text
  • Preserve useful document structure such as headings, reading order, tables, lists, and sections where possible
  • Queue long-running document parsing jobs asynchronously and poll by jobId
  • Use a simple SDK-free POST endpoint for backend services, automation workers, RAG pipelines, and document ingestion systems
  • Reduce file-format noise before sending content to LLMs, vector databases, search indexes, or downstream analysis tools
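
As an illustration of the RAG ingestion use case listed above, the following minimal sketch splits Markdown output into heading-delimited sections before embedding or indexing. The function name split_markdown_sections and the shape of the returned dictionaries are hypothetical and are not part of the PixLab API. The sketch assumes ATX-style headings (lines starting with #) and does not special-case # lines inside fenced code blocks.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_sections(markdown):
    """Split Markdown text into a list of {"title", "text"} sections.

    Text that appears before the first heading is kept with title None.
    """
    sections = []
    title, lines = None, []

    def flush():
        # Store the section collected so far, skipping an empty preamble.
        if title is not None or any(line.strip() for line in lines):
            sections.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each section can then be embedded or indexed on its own, with the heading kept as metadata for retrieval.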

For image analysis, use the QUERY, TAG-IMG, and DESCRIBE endpoints. For raw OCR extraction from images, use the OCR endpoint. For embedded image text translation, use Image Text Translation.

HTTP Methods

POST

HTTP Parameters

Required

Fields Type Description
key String Your PixLab API Key ↗. Alternatively, send your key in the WWW-Authenticate: HTTP header and omit this parameter.
url URL Publicly reachable URL to the input document to parse. The backend also accepts downloadUrl as an alias. The document should be a PDF, DOCX, PPTX, XLSX, HTML, text, or another supported office/document format.

Optional

Fields Type Description
format String Desired output format. Supported values are md, json, and text. Defaults to md.
extension String File extension hint used by the parser when opening the downloaded document, for example pdf, docx, xlsx, pptx, or html. Defaults to pdf.
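
Because extension defaults to pdf, callers parsing other document types may want to set the hint explicitly. The sketch below builds the request parameters and guesses the hint from the URL path when none is supplied. The helper name build_payload and the guessing logic are assumptions made for illustration; this page does not say whether the service performs a similar inference on its own.

```python
import posixpath
from urllib.parse import urlparse

# Output formats listed in the parameter table above.
ALLOWED_FORMATS = ("md", "json", "text")


def build_payload(key, url, fmt="md", extension=None):
    """Return the parameter dictionary for an llmparse request."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    if extension is None:
        # Guess the hint from the URL path; fall back to the documented default.
        suffix = posixpath.splitext(urlparse(url).path)[1]
        extension = suffix or "pdf"
    return {
        "key": key,
        "url": url,
        "format": fmt,
        # Normalize ".DOCX" or "DOCX" to "docx".
        "extension": extension.lower().lstrip("."),
    }
```

For example, build_payload("PIXLAB_API_KEY", "https://example.com/slides.pptx") yields an extension of pptx, while a URL whose path has no file suffix falls back to pdf.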

POST Request Body

The llmparse endpoint accepts POST requests only. Submit a JSON body containing the input document URL and the desired output format.

Allowed Content-Types:

  • application/json
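
For services that prefer to avoid third-party HTTP libraries, the request can be prepared with the Python standard library alone. This is a minimal sketch: the helper name make_request is hypothetical, and the function only constructs the request object without sending it.

```python
import json
import urllib.request


def make_request(payload, endpoint="https://api.pixlab.io/llmparse"):
    """Build a POST request carrying the payload as an application/json body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned object can be sent with urllib.request.urlopen(request, timeout=60), and the response body decoded with json.loads.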

Large documents are processed asynchronously. The initial response returns a jobId. Poll the job endpoint named in the response message (/job/{jobId}) until the job status becomes completed or failed.
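
A polling loop is usually safer with an overall deadline and a growing delay between attempts. The sketch below shows one way to do that. The function name wait_for_job, the fetch_status callable, and the default timeout and interval values are all assumptions chosen for illustration, not documented limits. fetch_status is expected to take a jobId and return the decoded JSON status as a dictionary.

```python
import time


def wait_for_job(fetch_status, job_id, timeout=300.0, interval=2.0,
                 max_interval=30.0, sleep=time.sleep):
    """Poll until the job is completed or failed, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        status = fetch_status(job_id)
        state = status.get("status")
        if state in ("completed", "failed"):
            return status
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Job {job_id} still {state!r} after {timeout} seconds")
        sleep(delay)
        # Exponential backoff, capped at max_interval.
        delay = min(delay * 2, max_interval)
```

Passing the fetch step as a callable keeps the loop independent of any particular HTTP client and makes it straightforward to exercise with canned responses.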

HTTP Response

application/json

The LLM Parse API starts an asynchronous document parsing job. A successful POST returns a queued job identifier immediately. Use the returned jobId to poll for completion. When the job is completed, the result contains the requested output format and parsed document data.

Accepted Job Response


{
  "rc": true,
  "status": "accepted",
  "jobId": "doc_01hx9z3p9r6n6k2a",
  "message": "Job queued. Poll /job/{jobId} for results."
}

Completed Job Result


{
  "status": "completed",
  "result": {
    "format": "md",
    "data": "# Parsed document\n\nClean LLM-ready Markdown output..."
  }
}

Fields Type Description
rc Boolean True when the parsing job was accepted. False when the request failed validation.
status String Initial response status is accepted. Job polling can return queued, processing, completed, or failed.
jobId String Identifier used to poll the document parsing job until completion.
result.format String Output format returned by the completed job: md, json, or text.
result.data String | Object Parsed document output. Markdown and text formats return strings; JSON format returns structured document data.
result.error String Error message when the job fails, for example if the document cannot be downloaded or exceeds the maximum allowed size.
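
Since result.data is a string for md and text but structured data for json, client code typically branches on its type. The following sketch turns a polled job response into a single text value using the fields described above. The function name extract_output is hypothetical, and serializing JSON output with json.dumps is only one possible way to handle it.

```python
import json


def extract_output(job):
    """Return the parsed document as text from a polled job response."""
    status = job.get("status")
    result = job.get("result") or {}
    if status == "failed":
        raise RuntimeError(result.get("error", "Document parsing failed"))
    if status != "completed":
        raise ValueError(f"Job is not finished yet (status={status!r})")
    data = result.get("data")
    if isinstance(data, str):
        # md and text formats are already plain strings.
        return data
    # json format: serialize the structured document for storage or logging.
    return json.dumps(data, indent=2, ensure_ascii=False)
```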

Code Samples


import time
import requests

API_KEY = "PIXLAB_API_KEY"
SUBMIT_URL = "https://api.pixlab.io/llmparse"
JOB_URL = "https://api.pixlab.io/job"

# Start an async LLM document parsing job.
submit = requests.post(
    SUBMIT_URL,
    json={
        "key": API_KEY,
        "url": "https://example.com/report.pdf",
        "format": "md",       # Optional: md, json, or text. Defaults to md.
        "extension": "pdf"    # Optional parser hint. Defaults to pdf.
    },
    timeout=60
)

job = submit.json()
if not job.get("rc"):
    raise RuntimeError(job.get("err") or job.get("error") or "LLM parse job was not accepted")

job_id = job["jobId"]
print(f"Queued document parsing job: {job_id}")

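# Poll until the parser completes. The 10-minute cap is an arbitrary
# client-side safeguard, not a documented service limit.
deadline = time.monotonic() + 600
while time.monotonic() < deadline:
    status = requests.get(f"{JOB_URL}/{job_id}", params={"key": API_KEY}, timeout=30).json()
    state = status.get("status")

    if state == "completed":
        result = status["result"]
        print(result["format"])
        print(result["data"])
        break

    if state == "failed":
        raise RuntimeError((status.get("result") or {}).get("error", "Document parsing failed"))

    time.sleep(2)
else:
    # The loop ended without a break: the job never reached a final state.
    raise TimeoutError(f"Job {job_id} did not finish within 10 minutes")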
