
Building a RAG Pipeline on SharePoint: Architecture Guide


SharePoint is where enterprise knowledge goes to live — and, too often, to die. I have yet to work with an organization of meaningful size that does not have thousands of documents scattered across SharePoint sites, libraries, and subsites, representing years of accumulated institutional knowledge that employees cannot effectively search. Building a RAG pipeline on this data is one of the highest-value GenAI investments an enterprise can make, but the architecture decisions you make in the first two weeks determine whether the system reaches production or stalls in pilot.

This is not a "What is RAG?" primer. If you are reading this, you understand retrieval-augmented generation at a conceptual level. This is an architecture guide for building a RAG SharePoint enterprise pipeline that handles the realities of corporate document management: inconsistent metadata, complex permission models, documents that change weekly, and users who expect sub-second responses. I have built these pipelines for organizations with 5,000 to 25,000 employees, and the patterns here reflect what actually works at that scale.

The RAG SharePoint Enterprise Pipeline: End-to-End Architecture

The pipeline has six stages, and the temptation is to focus on the two that feel most like "AI" — embedding and generation. Resist that temptation. The stages that determine success or failure are ingestion and retrieval, and they deserve the majority of your engineering attention.

The stages in order: Discovery → Ingestion → Chunking → Embedding → Retrieval → Generation. Each stage has architecture decisions that compound downstream. A poor chunking strategy cannot be compensated for by a better embedding model. A retrieval pipeline that ignores permissions will be shut down by your security team regardless of how good the generated answers are.

Discovery: Mapping the SharePoint Landscape

Before writing any pipeline code, you need a complete inventory of what you are connecting to. SharePoint environments in large enterprises are sprawling. A typical organization I work with has 200-400 SharePoint sites, 50,000-200,000 documents, and a permission model that even the SharePoint administrators do not fully understand.

I use the Microsoft Graph API for discovery, specifically the /sites and /drives endpoints to enumerate site collections, document libraries, and their content. The critical output of this phase is not just a document inventory — it is a permission map. Every document library has an associated set of SharePoint permission groups, and these groups map (often inconsistently) to Azure AD security groups.
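
A minimal discovery sketch. The `fetch` parameter is an assumption: any callable that GETs a Graph URL with a valid bearer token and returns the parsed JSON body, injected so the enumeration logic runs without network access. The `/sites?search=*` and `/sites/{id}/drives` endpoints are the Graph calls described above.

```python
from typing import Callable, Iterator

GRAPH = "https://graph.microsoft.com/v1.0"

def paged(fetch: Callable[[str], dict], url: str) -> Iterator[dict]:
    """Follow Graph's @odata.nextLink pagination until exhausted."""
    while url:
        page = fetch(url)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")

def enumerate_libraries(fetch: Callable[[str], dict]) -> list[dict]:
    """Return one record per document library (drive) across all sites."""
    inventory = []
    for site in paged(fetch, f"{GRAPH}/sites?search=*"):
        for drive in paged(fetch, f"{GRAPH}/sites/{site['id']}/drives"):
            inventory.append({
                "site_id": site["id"],
                "site_name": site.get("displayName", ""),
                "drive_id": drive["id"],
                "drive_name": drive.get("name", ""),
            })
    return inventory
```

The output of this pass, joined with per-library permission data, becomes the permission map described above.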

Document your permission model before you build anything. I have seen two RAG projects stall for months because the team built the entire pipeline, then discovered that implementing permission-aware retrieval required restructuring how they stored document metadata. That restructuring cascaded through the chunking layer, the vector store schema, and the retrieval logic.

Ingestion Architecture for RAG SharePoint Enterprise Data

Ingestion is where the pipeline connects to SharePoint and extracts document content in a form suitable for processing. The architecture decisions here center on three questions: how to handle the variety of document formats, how to implement incremental sync, and how to preserve metadata through the pipeline.

Document Parsing at Enterprise Scale

SharePoint stores documents in every format that Microsoft Office produces, plus PDFs, images, and occasionally HTML. Your parsing layer needs to handle all of them reliably. I standardize on a parsing stack built around Apache Tika for format detection and extraction, with specialized handlers for the formats that Tika handles poorly — complex PowerPoint layouts, scanned PDFs requiring OCR, and Visio diagrams.

The key architectural decision is whether to parse in-pipeline (synchronous, simpler) or as a separate service (asynchronous, more resilient). For corpora under 50,000 documents, in-pipeline parsing works fine. Above that, I run a dedicated parsing service with a job queue (Redis or RabbitMQ) so that parsing failures for individual documents do not block the rest of the pipeline.
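
The worker pattern can be sketched with an in-memory stand-in (Python's `queue.Queue` in place of Redis or RabbitMQ). The `parse_worker` name, the sentinel shutdown, and the dead-letter list are all illustrative, not a real library API; the point is that one corrupt document lands in the error list instead of blocking the queue.

```python
import queue
import threading

def parse_worker(jobs: "queue.Queue", results: list, errors: list, parse) -> None:
    """Drain parse jobs; a failure on one document never blocks the rest."""
    while True:
        doc = jobs.get()
        if doc is None:  # sentinel: shut down this worker
            jobs.task_done()
            return
        try:
            results.append(parse(doc))
        except Exception as exc:
            errors.append((doc, str(exc)))  # dead-letter for later retry
        finally:
            jobs.task_done()
```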

Every parsed document produces a standardized output: raw text content, structural metadata (headings, sections, tables), and source metadata (SharePoint site, library, last modified date, author, permission groups). The structural metadata is essential for the next stage — it is what makes intelligent chunking possible.
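
The standardized output can be sketched as a dataclass; every field name here is illustrative, not a fixed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ParsedDocument:
    """Normalized output of the parsing stage."""
    text: str                       # raw extracted text
    structure: list[dict]           # headings, sections, tables with offsets
    site: str                       # SharePoint site
    library: str                    # document library
    path: str                       # source document path
    author: str
    last_modified: datetime
    permission_groups: list[str] = field(default_factory=list)  # Azure AD group IDs
```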

Incremental Sync: The Feature That Makes or Breaks Production

Your initial ingestion processes the full document corpus. Every subsequent run must process only what changed. Without incremental sync, you are re-embedding your entire corpus on every pipeline run, which at enterprise scale means hours of GPU time and a knowledge base that is always stale by the time it finishes updating.

I implement incremental sync using the delta endpoint on the Graph API, which returns every change since the opaque delta token issued at the end of the previous query (the API is token-based, not timestamp-based). The sync service persists the latest delta token per document library and requests only the additions, modifications, and deletions since that point. Modified documents are re-parsed, re-chunked, and re-embedded. Deleted documents have their chunks removed from the vector store.

The edge case that catches most teams: document moves and renames. SharePoint's delta API reports these as a delete-then-add pair, which means your pipeline must handle the case where a document's vector store entries are removed and recreated with a new source path. If you do not handle this explicitly, users will see duplicate or orphaned results for moved documents.
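
A sketch of one delta pass, assuming a hypothetical `store` interface with `remove(item_id)` and `upsert(item)`. Keying both deletions and upserts on the stable item `id` means the delete-then-add pair for a moved document resolves to a clean re-index rather than duplicate or orphaned chunks.

```python
def apply_delta(fetch, state: dict, store) -> None:
    """One incremental sync pass over a document library.

    state["delta_link"] holds the opaque deltaLink saved after the last
    successful pass; `fetch` GETs a Graph URL and returns parsed JSON.
    """
    url = state["delta_link"]
    while url:
        page = fetch(url)
        for item in page.get("value", []):
            if "deleted" in item:
                store.remove(item["id"])   # drop all chunks for this item
            else:
                store.remove(item["id"])   # idempotent: clear stale chunks first
                store.upsert(item)         # re-parse, re-chunk, re-embed
        if "@odata.nextLink" in page:
            url = page["@odata.nextLink"]
        else:
            # persist the new deltaLink only after the full pass succeeds
            state["delta_link"] = page.get("@odata.deltaLink")
            url = None
```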

Chunking Strategy: Where RAG Quality is Won or Lost

Chunking determines the quality ceiling of your entire RAG pipeline. I have seen retrieval relevance improve by 30-40% from chunking strategy changes alone, with no changes to the embedding model or retrieval logic.

Semantic Chunking Over Fixed-Size Windows

Fixed-size token windows (512 or 1024 tokens with overlap) are the default in most RAG tutorials. They are also the wrong choice for enterprise documents. Corporate documents have structure — headings, sections, numbered lists, tables — and that structure carries semantic meaning that fixed-size windows destroy.

I implement semantic chunking that respects document structure. Headings define chunk boundaries. A section titled "Q3 Financial Summary" stays together as a single chunk (or set of chunks if it exceeds the maximum size) rather than being split arbitrarily at a token boundary. Tables are chunked as complete units with their headers preserved. List items that belong to a common parent stay grouped.

The implementation uses the structural metadata extracted during parsing. If a section exceeds the maximum chunk size (I default to 1,500 tokens for pgvector with text-embedding-3-large), it is split at paragraph boundaries rather than mid-sentence. Every chunk carries its section heading lineage as metadata — "Q3 Financial Summary > Revenue by Region > EMEA" — which becomes part of the embedded text and dramatically improves retrieval for hierarchical queries.
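
The rules above can be sketched as follows. The `(lineage, text)` section input is an assumption about the parser's output, and the whitespace word counter is a stand-in for a real tokenizer such as the embedding model's own.

```python
def chunk_sections(sections, max_tokens=1500, tokens=lambda s: len(s.split())):
    """Split (lineage, text) sections into chunks at paragraph boundaries.

    Each chunk keeps its full heading lineage and prefixes it into the
    text that gets embedded.
    """
    chunks = []
    for lineage, text in sections:
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        buf, size = [], 0
        for para in paragraphs:
            n = tokens(para)
            if buf and size + n > max_tokens:
                # flush the current chunk before it exceeds the budget
                chunks.append({"lineage": lineage,
                               "text": lineage + "\n" + "\n\n".join(buf)})
                buf, size = [], 0
            buf.append(para)
            size += n
        if buf:
            chunks.append({"lineage": lineage,
                           "text": lineage + "\n" + "\n\n".join(buf)})
    return chunks
```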

Metadata Enrichment per Chunk

Every chunk in the vector store carries metadata beyond the text content: source document path, SharePoint site, document library, last modified date, section heading lineage, and — critically — the permission groups that govern access to the source document. This metadata is not embedded (it is stored as filterable attributes in the vector store), but it enables the permission-aware retrieval that enterprise deployments require.
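
In PostgreSQL, the chunk record might look like the following sketch; column names and types are illustrative, and the embedding dimensionality must match your model (pgvector's HNSW index currently caps the `vector` type at 2,000 dimensions, so very wide embeddings need `halfvec` or reduced-dimension output).

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id                bigserial PRIMARY KEY,
    text              text NOT NULL,
    embedding         vector(1536),      -- match your embedding model's dimensionality
    site              text NOT NULL,     -- SharePoint site
    library           text NOT NULL,     -- document library
    path              text NOT NULL,     -- source document path
    lineage           text,              -- "Q3 Financial Summary > Revenue by Region > EMEA"
    last_modified     timestamptz,
    permission_groups text[] NOT NULL    -- Azure AD group IDs with read access
);
```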

Vector Storage and Permission-Aware Retrieval

The vector store is the central data structure of the RAG pipeline, and for enterprise deployments the requirements go well beyond "store embeddings and do similarity search."

Why pgvector for Enterprise RAG SharePoint Deployments

I standardize on pgvector (the PostgreSQL vector extension) for enterprise deployments because it eliminates an entire category of operational complexity. Your organization already runs PostgreSQL. Your DBAs already know how to back it up, monitor it, and fail it over. Adding a vector extension to an existing PostgreSQL deployment is operationally trivial compared to introducing a dedicated vector database like Milvus, Qdrant, or Weaviate that your operations team has never managed.

pgvector with HNSW indexing handles corpora of up to two million vectors with sub-100ms query latency on modest hardware (32GB RAM, NVMe storage). For most enterprise SharePoint deployments — which translate to 500,000 to 1,500,000 chunks — this is more than sufficient. If your corpus exceeds this range, consider pgvector with table partitioning by SharePoint site before reaching for a dedicated vector database.
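
An illustrative HNSW index definition against a hypothetical `chunks` table; `m = 16` and `ef_construction = 64` are pgvector's defaults, offered as starting points rather than tuned values.

```sql
-- ANN index for cosine similarity over the chunk embeddings.
CREATE INDEX chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
```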

Permission-Aware Retrieval: The Enterprise Requirement

This is the feature that separates enterprise RAG from tutorial RAG. When a user queries the system, the retrieval layer must return only chunks from documents that the user is authorized to access. Ignoring this requirement will get your platform shut down by the security team regardless of how good the answers are.

I implement permission-aware retrieval as a filter on the vector similarity search. Each chunk's metadata includes the Azure AD security group IDs that grant read access to the source document. At query time, the retrieval layer resolves the querying user's group memberships (cached from Azure AD with a fifteen-minute refresh interval) and adds a WHERE clause to the vector search that restricts results to chunks whose permission groups intersect with the user's groups.

In PostgreSQL with pgvector, this looks like a filtered approximate nearest neighbor search: the HNSW index handles the vector similarity, and a GIN index on the permission group array handles the access control filter. The combined query adds approximately 15-20ms of latency compared to unfiltered search — acceptable for interactive use.
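
Sketched as SQL against a hypothetical `chunks` table: `&&` is PostgreSQL's array-overlap operator (served by the GIN index), `<=>` is pgvector's cosine distance operator (served by the HNSW index), and `:query_embedding` / `:user_groups` are bind parameters supplied at query time.

```sql
-- GIN index to serve the permission-group overlap filter.
CREATE INDEX chunks_permission_groups_gin
    ON chunks USING gin (permission_groups);

-- Permission-filtered approximate nearest neighbor search.
SELECT id, path, lineage, text,
       embedding <=> :query_embedding AS distance
FROM chunks
WHERE permission_groups && :user_groups      -- user's cached Azure AD group IDs
ORDER BY embedding <=> :query_embedding
LIMIT 20;
```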

Hybrid Retrieval: Dense and Sparse Together

Pure vector similarity search works well for semantic queries ("What is our policy on remote work?") but fails for exact-match queries ("Find document TPS-2024-0847" or "What does policy section 4.3.2 say?"). Enterprise users ask both types of queries, and the system must handle both without requiring users to switch modes.

I implement hybrid retrieval using pgvector for dense vector search combined with PostgreSQL full-text search (tsvector/tsquery) for sparse keyword matching. Results from both retrieval paths are combined using reciprocal rank fusion, which merges two ranked lists into a single ranking without requiring score normalization between the different search methods. The fusion weight is tunable — I typically start at 0.6 dense / 0.4 sparse and adjust based on query log analysis after the first month of production use.
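
A minimal weighted reciprocal rank fusion implementation; `k = 60` is the smoothing constant from the original RRF formulation, and the 0.6/0.4 weights mirror the starting point above.

```python
def reciprocal_rank_fusion(dense: list[str], sparse: list[str],
                           w_dense: float = 0.6, w_sparse: float = 0.4,
                           k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk ids with weighted RRF.

    Each appearance contributes weight / (k + rank), so no score
    normalization between the two retrieval methods is needed.
    """
    scores: dict[str, float] = {}
    for weight, ranking in ((w_dense, dense), (w_sparse, sparse)):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```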

Generation Layer and Response Quality

The generation layer is the most visible part of the pipeline, but by the time you reach it, the hard architectural decisions are behind you. The quality of generated responses is bounded by the quality of retrieved chunks — no prompt engineering rescues irrelevant retrieval results.

Model Selection for On-Prem RAG

For on-prem deployments, I run open-weight models served through vLLM. The current default choice for enterprise RAG is a large open-weight model (Llama 3 70B, or a sparse mixture-of-experts model such as Mixtral 8x7B) running on a cluster of four NVIDIA A100 80GB GPUs. These models produce response quality comparable to GPT-4 for document-grounded Q&A tasks, which is the primary use case in enterprise RAG.

The generation prompt includes the retrieved chunks with their source metadata, an instruction to ground answers in the provided context, and a directive to cite source documents. Citation is not optional in enterprise deployments — users need to verify answers against source documents, and the system must make that verification easy.
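
A sketch of the prompt assembly; the bracketed-number citation convention and the chunk dict fields (`text`, `path`, `lineage`) are illustrative choices, not a fixed format.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded generation prompt with numbered sources.

    Numbering lets the model cite [1], [2], ... and lets the UI link
    each citation back to its SharePoint document.
    """
    sources = "\n\n".join(
        f"[{i}] {c['path']} ({c['lineage']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. Cite every claim with the "
        "matching source number, e.g. [1]. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )
```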

Production Deployment and Operational Patterns

Monitoring Beyond Uptime

Production RAG monitoring requires metrics beyond standard application monitoring. I track retrieval relevance (measured by click-through on cited sources), answer groundedness (percentage of responses that cite at least one retrieved chunk), query latency at each pipeline stage, and user feedback signals. These metrics feed a weekly quality review that identifies degrading document categories, underperforming retrieval patterns, and emerging query types that the system handles poorly.
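
The groundedness metric can be computed from response logs with a simple scan, assuming responses cite sources with bracketed numbers like `[1]` (an illustrative convention).

```python
import re

def groundedness(responses: list[str]) -> float:
    """Share of responses that cite at least one retrieved chunk,
    detected via a bracketed-number citation like [1]."""
    if not responses:
        return 0.0
    cited = sum(1 for r in responses if re.search(r"\[\d+\]", r))
    return cited / len(responses)
```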

The Feedback Loop That Improves Everything

The single most valuable production feature is a lightweight feedback mechanism — thumbs up/thumbs down on generated responses with an optional text comment. This data, combined with query logs and retrieval metrics, creates a continuous improvement loop. Low-rated responses are analyzed to determine whether the issue was retrieval (wrong chunks), generation (right chunks but poor synthesis), or knowledge gap (relevant documents not yet ingested). Each root cause has a different remediation path, and the feedback data tells you which path to take.

Starting Your RAG SharePoint Enterprise Pipeline

If you are planning a RAG pipeline on SharePoint data, start with the permission model. Map your SharePoint permission groups to Azure AD groups, understand the inheritance patterns, and document the edge cases. Then build the ingestion pipeline for a single SharePoint site — 500 to 2,000 documents — and iterate on chunking and retrieval quality with real users before scaling to the full corpus.

The organizations that reach production with enterprise RAG are the ones that treat it as an architecture discipline, not an AI experiment. The model is the easiest part to swap. The ingestion pipeline, the permission layer, and the operational monitoring — those are the investments that determine whether your RAG deployment becomes infrastructure or remains a demo.

I help enterprise technology leaders design and deploy RAG pipelines that handle the realities of corporate data at scale. If you are evaluating RAG for your SharePoint environment, book a discovery call and I will walk through the architecture decisions specific to your organization.

Cristian Lazar

AI & Technology Operations Advisory

Enterprise architect and AI advisor helping organizations design, build, and operationalize intelligent systems. Specializing in governance-first AI platforms, on-prem RAG architectures, and IT operations transformation.
