SharePoint workflow

Two-stage workflow: pull binary documents from SharePoint, then convert them to markdown for downstream processing.

Pipeline shape

version: "1.0"

routes:
  sharepoint_ingest:
    inbound:
      queue: "/queue/sharepoint.enumerate"
      subscription: "sp-ingestors"
    adapters:
      - type: "sharepoint_ingest"
        config:
          site_id: "${SHAREPOINT_SITE_ID}"
          drive_id: "${SHAREPOINT_DRIVE_ID}"
          path: "/Shared Documents/Policies"

  document_converter:
    inbound:
      queue: "/queue/sharepoint.converter.in"
      subscription: "converters"
    adapters:
      - type: "document_converter"

init_message:
  route: "sharepoint_ingest"
  payload:
    full_sync: true
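The ${SHAREPOINT_SITE_ID} and ${SHAREPOINT_DRIVE_ID} placeholders above are presumably filled from the environment when the config loads. A minimal sketch of that substitution, assuming a ${NAME} syntax and fail-fast behaviour on missing variables (expand_placeholders is a hypothetical helper, not part of the workflow runtime):

```python
import os
import re

# Matches ${NAME} placeholders as they appear in the route config.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_placeholders(value, env=None):
    """Recursively replace ${NAME} in strings with values from env.

    Hypothetical helper: the real loader may resolve placeholders differently.
    """
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: expand_placeholders(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_placeholders(v, env) for v in value]
    if isinstance(value, str):
        def _sub(match):
            name = match.group(1)
            if name not in env:
                # Assumption: fail fast rather than pass an empty ID to Graph.
                raise KeyError(f"missing environment variable: {name}")
            return env[name]
        return _PLACEHOLDER.sub(_sub, value)
    return value  # numbers, booleans, None are left untouched


adapter_config = {
    "site_id": "${SHAREPOINT_SITE_ID}",
    "drive_id": "${SHAREPOINT_DRIVE_ID}",
    "path": "/Shared Documents/Policies",
}
```

Calling expand_placeholders(adapter_config) before starting the route surfaces a missing variable at startup instead of on the first Graph call.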

Authentication

Authenticates to Microsoft Graph through an app registration. Required environment variables:

SHAREPOINT_TENANT_ID=...
SHAREPOINT_CLIENT_ID=...
SHAREPOINT_CLIENT_SECRET=...
SHAREPOINT_SITE_ID=...
SHAREPOINT_DRIVE_ID=...

Required Graph permissions: Sites.Read.All and Files.Read.All, granted as application permissions (not delegated). Application permissions need admin consent in the tenant.
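Application permissions imply the OAuth 2.0 client credentials flow. A minimal sketch of the token request, using only the standard library and the environment variables listed above. The function names are illustrative, and the adapter itself may use a library such as MSAL instead:

```python
import json
import os
import urllib.parse
import urllib.request


def build_token_request(tenant_id, client_id, client_secret):
    """Build the client-credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # .default requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_token(env=os.environ):
    """Exchange the app credentials for a bearer token (performs a network call)."""
    request = build_token_request(
        env["SHAREPOINT_TENANT_ID"],
        env["SHAREPOINT_CLIENT_ID"],
        env["SHAREPOINT_CLIENT_SECRET"],
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

The returned token goes in an Authorization: Bearer header on subsequent Graph requests.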

Why separate ingest from conversion

The binary document is persisted between the two stages. That lets you re-run conversion (e.g., after upgrading the converter library) against the same source without re-hitting Graph.
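A sketch of what that persistence buys, assuming a plain directory as the store and content-hash filenames (both are illustrative choices; the actual storage backend is not specified here):

```python
import hashlib
from pathlib import Path


def persist_binary(store_dir, name, data):
    """Store fetched bytes under their SHA-256 digest and return the path.

    Hypothetical layout: <store_dir>/<sha256><original extension>.
    """
    digest = hashlib.sha256(data).hexdigest()
    suffix = Path(name).suffix.lower()
    target = Path(store_dir) / f"{digest}{suffix}"
    if not target.exists():  # identical content is written only once
        target.write_bytes(data)
    return target


def reconvert_all(store_dir, convert):
    """Re-run `convert` over every stored binary without contacting Graph."""
    results = {}
    for path in sorted(Path(store_dir).iterdir()):
        if path.is_file():
            results[path.name] = convert(path.read_bytes())
    return results
```

After a converter upgrade, only reconvert_all needs to run; the ingest stage and its Graph calls are untouched.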

Converter coverage

document_converter handles:

.docx → markdown
.pdf → markdown (text-based; not OCR)
.pptx → markdown (headings + body text)
.txt, .md → passed through unchanged

Formats outside this list are logged and skipped.
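The coverage rules above amount to a dispatch on file extension. A minimal sketch, assuming case-insensitive matching and a warning-level log for skipped files (plan_conversion and the returned labels are hypothetical, not the adapter's API):

```python
import logging
from pathlib import Path

logger = logging.getLogger("document_converter")

PASSTHROUGH = {".txt", ".md"}            # already text, forwarded as-is
CONVERTIBLE = {".docx", ".pdf", ".pptx"}  # converted to markdown


def plan_conversion(filename):
    """Return 'passthrough', 'convert', or 'skip' for a given file name."""
    ext = Path(filename).suffix.lower()  # assumption: extension match ignores case
    if ext in PASSTHROUGH:
        return "passthrough"
    if ext in CONVERTIBLE:
        return "convert"
    logger.warning("skipping unsupported format: %s", filename)
    return "skip"
```

Note that a scanned PDF would still be routed to 'convert' by extension alone; without OCR it yields little or no text.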

Typical downstream

The converted markdown feeds the markdown workflow for segmentation, then embeddings, then knowledge for concept extraction.