SharePoint workflow
Two-stage workflow: pull binary documents from SharePoint, convert them to markdown for downstream processing.
Pipeline shape
version: "1.0"

routes:
  sharepoint_ingest:
    inbound:
      queue: "/queue/sharepoint.enumerate"
      subscription: "sp-ingestors"
    adapters:
      - type: "sharepoint_ingest"
        config:
          site_id: "${SHAREPOINT_SITE_ID}"
          drive_id: "${SHAREPOINT_DRIVE_ID}"
          path: "/Shared Documents/Policies"

  document_converter:
    inbound:
      queue: "/queue/sharepoint.converter.in"
      subscription: "converters"
    adapters:
      - type: "document_converter"

init_message:
  route: "sharepoint_ingest"
  payload:
    full_sync: true

Authentication
Microsoft Graph via app registration. Environment variables:
SHAREPOINT_TENANT_ID=...
SHAREPOINT_CLIENT_ID=...
SHAREPOINT_CLIENT_SECRET=...
SHAREPOINT_SITE_ID=...
SHAREPOINT_DRIVE_ID=...

Required Graph permissions: Sites.Read.All, Files.Read.All (application, not delegated).
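Application permissions are normally used with the OAuth 2.0 client-credentials flow. The sketch below is an illustration, not the adapter's actual code. It checks that the five variables are set and builds (without sending) a token request against the Microsoft identity platform v2.0 endpoint. The helper names `load_settings` and `build_token_request` are hypothetical.

```python
import os
import urllib.parse
import urllib.request

# Variable names come from the list above.
REQUIRED_VARS = (
    "SHAREPOINT_TENANT_ID",
    "SHAREPOINT_CLIENT_ID",
    "SHAREPOINT_CLIENT_SECRET",
    "SHAREPOINT_SITE_ID",
    "SHAREPOINT_DRIVE_ID",
)


def load_settings(env=None):
    """Return the required settings, failing early if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def build_token_request(settings):
    """Build (but do not send) a client-credentials token request.

    Hypothetical helper: how the real adapter obtains tokens is not
    described in this page.
    """
    tenant = urllib.parse.quote(settings["SHAREPOINT_TENANT_ID"], safe="")
    url = f"https://login.microsoftonline.com/{tenant}/oauth2/v2.0/token"
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": settings["SHAREPOINT_CLIENT_ID"],
            "client_secret": settings["SHAREPOINT_CLIENT_SECRET"],
            # ".default" requests the application permissions granted
            # to the app registration.
            "scope": "https://graph.microsoft.com/.default",
        }
    ).encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to `urllib.request.urlopen` would perform the exchange. Validating the environment before the first Graph call makes misconfiguration surface at startup rather than mid-sync.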
Why separate ingest from conversion
The binary document is persisted between the two stages. That lets you re-run conversion (e.g., after upgrading the converter library) against the same source without re-hitting Graph.
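To illustrate the idea, here is a minimal sketch that uses a local directory as a stand-in for whatever storage the pipeline really uses. The functions `persist_binary` and `convert_all` are hypothetical names, and the `convert` callable stands in for the converter library.

```python
import hashlib
from pathlib import Path


def persist_binary(store: Path, name: str, data: bytes) -> Path:
    """Stage 1: write the downloaded bytes once, keyed by content hash."""
    digest = hashlib.sha256(data).hexdigest()[:16]
    # Keep only the final path component so the file stays inside the store.
    target = store / f"{digest}_{Path(name).name}"
    if not target.exists():
        target.write_bytes(data)
    return target


def convert_all(store: Path, convert) -> dict:
    """Stage 2: run a converter over every persisted binary.

    This reads only from the store, so it can be repeated with a newer
    `convert` function without contacting the source system again.
    """
    return {
        path.name: convert(path.read_bytes())
        for path in sorted(store.iterdir())
        if path.is_file()
    }
```

Calling `convert_all` a second time with an upgraded `convert` callable reprocesses the same stored bytes, which is the property the two-stage split is meant to provide.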
Converter coverage
document_converter handles:
- .docx → markdown
- .pdf → markdown (text-based; not OCR)
- .pptx → markdown (headings + body text)
- Plain .txt, .md: passed through
Formats outside this list are logged and skipped.
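The coverage rules above amount to a dispatch on file extension. A minimal sketch follows; the function name `plan_action` and the case-insensitive matching are assumptions, not documented behaviour of document_converter.

```python
import logging
from pathlib import PurePosixPath

logger = logging.getLogger("document_converter_sketch")

# Extensions taken from the coverage list above.
CONVERTED = {".docx", ".pdf", ".pptx"}
PASSTHROUGH = {".txt", ".md"}


def plan_action(filename: str) -> str:
    """Return "convert", "passthrough", or "skip" for a file name.

    Assumption: extensions are compared case-insensitively.
    """
    suffix = PurePosixPath(filename).suffix.lower()
    if suffix in CONVERTED:
        return "convert"
    if suffix in PASSTHROUGH:
        return "passthrough"
    # Unsupported formats are logged and skipped rather than raising.
    logger.warning("skipping unsupported format: %s", filename)
    return "skip"
```

Returning a value instead of raising for unknown formats keeps one unsupported file from halting the rest of a sync.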
Typical downstream
Feeds the markdown workflow for segmentation, then embeddings, then knowledge for concept extraction.