Strategy, Design & Engineering Partner

CTC · Medical AI · March 2026


CTC AI Knowledge Base:
Architecture Proposal

Prepared by Rigoris · March 2026 · Confidential

Background and what we investigated

Over the past period, Rigoris conducted a detailed discovery of CTC's SharePoint environment, content structure, and workflow patterns. We reviewed the full Projects folder across multiple clients, mapped the internal folder template, analyzed the HubSpot deal history, and studied examples of real medical briefs and the content outline templates used by CTC's writers.

What we found changed our thinking significantly, and in a good way. CTC's data is far more structured than it appears at the surface level. Every numbered project folder follows the same nine-subfolder template across all clients. Inside each project, the Medical content folder contains a clean separation between outlines, drafts, and finals. That structure is the foundation of what makes a precise, low-noise retrieval system possible.

How ChatGPT Business was being used and why it hit a ceiling

ChatGPT Business was integrated with SharePoint via Microsoft's native connector. When a writer asked a question, the system searched recently connected documents and returned an answer based on a small window of content, typically three to five results. Here is what that flow looked like:

Step 1 · Input
Writer types a question in ChatGPT
Free-form text, no structure. Example: "do we have anything on COPD advisory boards?"
Step 2 · Retrieval
ChatGPT searches connected SharePoint files
Searches only recently synced or recently opened documents. Hard cap of 3–5 results. No filtering by date, client, or content type. No awareness of project relationships.
Step 3 · Generation
ChatGPT generates a response
Summarizes what it found. Cannot produce a structured content outline. Cannot reference files it did not retrieve. Treats a 2019 draft and a 2025 final deliverable as equal.
The core problem: ChatGPT Business was not built for deep library retrieval. It was built for conversational Q&A over a small document set. When CTC's library has hundreds of relevant past projects, returning three results means missing the majority of useful content every single time.

Beyond the retrieval ceiling, there are three structural gaps ChatGPT cannot solve. It has no concept of project relationships. It cannot filter by date, client, or therapeutic area. And it cannot generate CTC's specific output format: the Content Outline Template with program overview, section names, topic descriptions, and slide counts.

A modern four-layer architecture

The proposed system adds two layers of intelligence before any document search happens, with a generation layer at the end of the pipeline. This is what fundamentally separates it from ChatGPT's approach. Each layer narrows the search space so the final retrieval works on a small, highly relevant pool rather than the entire library.

Layer 1: Structured filtering (SQL)

CTC's HubSpot data gives us 830 projects with client name, therapeutic area, service type, and close date. This layer handles structured queries instantly, narrowing the candidate pool to only relevant projects before any document search happens. Fast, cheap, and deterministic.

Recommended tool: PostgreSQL or equivalent
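As a minimal sketch of Layer 1, the filter is an ordinary parameterized SQL query. The schema, column names, and sample rows below are illustrative assumptions, not the final HubSpot field mapping; sqlite3 stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical schema: real table and column names will be defined
# once the HubSpot fields are mapped.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        client TEXT,
        therapeutic_area TEXT,
        service_type TEXT,
        close_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Client A", "COPD", "Advisory board", "2024-06-01"),
        (2, "Client B", "Oncology", "Symposium", "2023-02-15"),
        (3, "Client C", "COPD", "Advisory board", "2021-09-30"),
    ],
)

def filter_projects(area, service, since):
    """Layer 1: deterministic narrowing before any document search."""
    rows = conn.execute(
        "SELECT id FROM projects "
        "WHERE therapeutic_area = ? AND service_type = ? AND close_date >= ?",
        (area, service, since),
    ).fetchall()
    return [r[0] for r in rows]

print(sorted(filter_projects("COPD", "Advisory board", "2020-01-01")))  # → [1, 3]
```

Because this step is plain SQL over 830 rows, it runs in milliseconds and costs effectively nothing, which is why it sits first in the pipeline.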
Layer 2: Relationship awareness (Graph database)

A graph database models how things connect. A therapeutic area links to multiple clients. A client links to multiple program types. This layer expands the candidate pool intelligently, surfacing relevant work from adjacent clients and related therapeutic areas that a keyword or semantic search would never find on its own. This is what gives the system genuine institutional memory.

Recommended tool: Graph database (options to be evaluated)
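To make the expansion concrete, here is a toy sketch using a plain adjacency map; a real graph database would replace this, and all node names and edges are invented for illustration.

```python
# Illustrative relationship map: therapeutic areas linked to related
# areas. In the real system, edges would also connect clients and
# program types, and would live in the graph database.
GRAPH = {
    "COPD": ["respiratory", "asthma"],
    "asthma": ["respiratory", "COPD"],
    "respiratory": ["COPD", "asthma"],
}

def expand(areas, hops=1):
    """Layer 2: grow the candidate pool through related neighbours."""
    pool = set(areas)
    frontier = set(areas)
    for _ in range(hops):
        frontier = {n for a in frontier for n in GRAPH.get(a, [])} - pool
        pool |= frontier
    return pool

print(sorted(expand({"COPD"})))  # → ['COPD', 'asthma', 'respiratory']
```

The `hops` parameter controls how far the expansion reaches; one hop keeps results tightly related, which is likely the right default.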
Layer 3: Semantic content retrieval (RAG)

Only final deliverables (PPTX files inside each project's 3. Final subfolder) are embedded and searched semantically. Because Layers 1 and 2 have already narrowed the candidate set, RAG works on a small, highly relevant pool rather than the entire library. Each slide deck is processed with its project context preserved so retrieval always includes the metadata needed to understand what the content is.

Recommended tool: Vector database (options to be evaluated)
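The retrieval step itself reduces to similarity ranking over embeddings. In this sketch the embeddings are hand-made 3-dimensional vectors and the deck names are invented; in practice an embedding model and the vector database supply both.

```python
import math

# Toy stand-in for the vector store: deck name -> embedding.
DECKS = {
    "copd_advisory_2024.pptx": [0.9, 0.1, 0.0],
    "oncology_symposium_2023.pptx": [0.1, 0.9, 0.1],
    "copd_standalone_2022.pptx": [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, candidates, k=2):
    """Layer 3: rank only the decks that Layers 1 and 2 let through."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, DECKS[name]),
        reverse=True,
    )
    return ranked[:k]

print(retrieve([1.0, 0.0, 0.0], list(DECKS)))
# → ['copd_advisory_2024.pptx', 'copd_standalone_2022.pptx']
```

Note that `candidates` is the already-narrowed pool, not the whole library: that is what keeps this step fast and low-noise.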
Layer 4: Generation (LLM)

A large language model sits at the end of the pipeline. It never touches the full library; it only receives the brief and the small set of retrieved documents. It generates a structured content outline in CTC's exact template format: program overview, learning objectives, section names, topic descriptions, and approximate slide counts.

Recommended tool: Claude Sonnet, selected for its strength in structured document output
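The generation step is essentially prompt assembly: the brief plus the retrieved examples, with the template format spelled out. The function and field names below are illustrative, and the model call itself is deliberately stubbed out.

```python
# Sketch of Layer 4 prompt assembly. "project" and "excerpt" are
# assumed field names for retrieved context, not a final schema.
def build_outline_prompt(brief, retrieved):
    context = "\n\n".join(
        f"[{doc['project']}] {doc['excerpt']}" for doc in retrieved
    )
    return (
        "Using only the examples below, draft a content outline with: "
        "program overview, learning objectives, section names, "
        "topic descriptions, and approximate slide counts.\n\n"
        f"Brief:\n{brief}\n\nRetrieved examples:\n{context}"
    )

prompt = build_outline_prompt(
    "COPD advisory board, 2 learning objectives",
    [{"project": "P-104", "excerpt": "Section 1: Disease burden (6 slides)"}],
)
print(prompt.splitlines()[0])
```

Keeping the model at the end of the pipeline, fed only this assembled prompt, is what guarantees it never sees the full library.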

System architecture diagram

This diagram shows where each data source lives, what our system creates, and how the pieces connect at runtime when a writer submits a brief.

Data sources and system components

  • SharePoint: source of truth for all files; files never leave SharePoint.
  • HubSpot: 830 project records, synced to our SQL database once.

What we build (read-only, hosted on AWS)

  • SQL database: client and project metadata copy
  • Graph database: relationship map across projects
  • Vector database: embedded content from 3. Final only
  • LLM (Sonnet): generates the content outline

Runtime query flow

Writer submits brief → SQL filters projects → Graph expands pool → RAG retrieves decks → LLM writes outline

The full end-to-end flow

Step 1 · Input
Writer submits a structured brief
A simple web form captures: client objectives, therapeutic area, program type, faculty, learning objectives, and key data. Takes 2–3 minutes to fill out.
Step 2 · Layer 1: SQL filter
HubSpot data narrows the project pool
Queries 830 projects and returns only those matching the therapeutic area, service type, and date range. Eliminates irrelevant content before any expensive search happens.
Step 3 · Layer 2: Graph expansion
Graph database finds related work
The filtered project list is expanded through the relationship graph. Similar clients, adjacent therapeutic areas, and related program types are surfaced, giving the retrieval layer a richer candidate pool.
Step 4 · Layer 3: Semantic retrieval
Vector search finds the most relevant final decks
The embedded 3. Final PPTX files from candidate projects are searched semantically against the brief. Top matching slides and decks are retrieved with full project context attached.
Step 5 · Layer 4: Generation
LLM writes the content outline
The brief and retrieved examples are passed to the language model. It generates a structured content outline in CTC's template format: program overview, section names, topics, slide counts.
Step 6 · Output
Writer receives a ready-to-edit outline
The system returns the generated outline along with the source documents it referenced, so the writer can verify and build on the result with full transparency.
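The six steps above can be sketched as a single orchestration function. Every stage here is a placeholder stub standing in for the layer described above; the function names and return shapes are illustrative, not final APIs.

```python
# End-to-end pipeline sketch; each stub would be replaced by the
# real SQL, graph, vector, and LLM components.
def run_pipeline(brief):
    candidates = sql_filter(brief)               # Layer 1: deterministic narrowing
    candidates = graph_expand(candidates)        # Layer 2: related work
    retrieved = rag_retrieve(brief, candidates)  # Layer 3: semantic search
    return llm_outline(brief, retrieved)         # Layer 4: generation

# Stub implementations so the sketch runs end to end.
def sql_filter(brief):
    return ["P-104"]

def graph_expand(project_ids):
    return project_ids + ["P-207"]

def rag_retrieve(brief, project_ids):
    return [{"project": pid} for pid in project_ids]

def llm_outline(brief, docs):
    # Returning the sources alongside the outline is what gives the
    # writer the transparency described in Step 6.
    return {"sources": [d["project"] for d in docs]}

print(run_pipeline({"area": "COPD"}))  # → {'sources': ['P-104', 'P-207']}
```

The key property to preserve in the real build is that each stage only ever receives the output of the previous one, never the full library.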

What gets indexed and what gets ignored

The full SharePoint library is 4.5TB. The vast majority of that is noise: logistics spreadsheets, design files, invoices, budget trackers, and archived content going back to 2010. Our folder structure analysis identified a precise ingestion path: every numbered project folder across all 75 clients follows the same nine-subfolder template. Only one subfolder is relevant: Medical content / 3. Final.

Excluded from index

  • Logistics
  • Project and financials
  • Design and digital content
  • Client updates
  • External experts
  • Communications materials
  • Drafts and outlines
  • Department Folders entirely
  • Pre-2020 archive content

Indexed

  • Medical content / 3. Final only
  • PPTX and Word files only
  • Inside numbered project folders
  • Modified 2020 onwards
  • Linked to HubSpot project metadata
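The index/skip decision above can be expressed as a small predicate. The path structure and matching rules below follow the folder template described in this proposal, but the exact strings are assumptions to be confirmed against the live SharePoint tree.

```python
from pathlib import PurePosixPath

# Assumed ingestion rules: only PPTX/Word files under
# "Medical content/3. Final", modified 2020 onwards, never _Archive.
INDEXED_SUFFIXES = {".pptx", ".docx"}

def should_index(path, modified_year):
    p = PurePosixPath(path)
    if "_Archive" in p.parts:
        return False
    if "Medical content" not in p.parts or "3. Final" not in p.parts:
        return False
    if p.suffix.lower() not in INDEXED_SUFFIXES:
        return False
    return modified_year >= 2020

print(should_index("Projects/0412 Client A/Medical content/3. Final/deck.pptx", 2023))  # → True
print(should_index("Projects/0412 Client A/Logistics/budget.xlsx", 2023))               # → False
```

Centralizing the rule in one predicate means the webhook handler, the backfill job, and the nightly reconciliation all apply identical filtering.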

How the system stays current

The system uses SharePoint webhooks to stay up to date in real time. Every time a file is added, updated, or deleted inside a monitored folder, SharePoint fires a signal automatically. The ingestion service picks it up and updates the index within minutes, with no manual re-indexing needed for day-to-day changes. A nightly reconciliation job runs independently as a safety net to catch anything the webhook missed.
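A minimal sketch of the webhook-driven update, assuming a simplified event shape (`type` and `path` fields); the actual SharePoint notification payload differs and the embedding call is a placeholder.

```python
# Stand-in for the vector store: path -> vector id.
INDEX = {}

def embed(path):
    # Placeholder for real embedding + upsert into the vector database.
    return f"vec:{path}"

def handle_event(event):
    """Apply one change notification to the index."""
    kind, path = event["type"], event["path"]
    if kind == "deleted":
        INDEX.pop(path, None)       # no ghost vectors remain
    elif kind in ("created", "updated"):
        INDEX.pop(path, None)       # drop any stale embedding first
        INDEX[path] = embed(path)   # only the latest version lives in the index

handle_event({"type": "created", "path": "3. Final/deck.pptx"})
handle_event({"type": "deleted", "path": "3. Final/deck.pptx"})
print(INDEX)  # → {}
```

Deleting before re-inserting on every update is what enforces the "only the latest version lives in the index" guarantee from the edge-case list below.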

We have mapped out every edge case and how the system handles each one:

  • HubSpot lag: File is indexed immediately and flagged as metadata pending. Once the HubSpot deal is created, the sync job links them automatically.
  • HubSpot rate limits: We maintain our own SQL copy of the project data rather than querying HubSpot live. Syncs run on a scheduled basis, avoiding rate limit issues entirely.
  • File updated: Old vector is deleted, new one is created. Only the latest version lives in the index.
  • File moved: Webhook fires delete at the old path and create at the new path. The vector is re-embedded at the new location and the old one is purged.
  • Folder renamed: All child file vectors are re-linked to the new path. SQL metadata is updated to match.
  • Retrospective additions: The system uses the file's last-modified date, not the ingestion date. A 2022 deck added today is correctly weighted as older content.
  • French duplicates: The ingestion script detects the EN/FR subfolder pattern. Only EN files are embedded; FR versions are skipped automatically.
  • Non-PPTX finals: Word docs are included and chunk cleanly. PDFs are included but flagged as lower confidence. Everything else is skipped.
  • Empty Final folders: Nothing to index until content arrives. Project metadata is already linked and waiting from HubSpot.
  • Deleted files: Webhook fires a delete event. The vector is removed from the index immediately; no ghost vectors remain.
  • Webhook failure: The nightly reconciliation job compares current SharePoint state against the index and patches any gaps missed by the webhook.
  • Large file spikes: The ingestion queue processes files asynchronously in batches. The existing index remains fully operational while new files are processed.
  • Archive folders: _Archive folders are excluded from the webhook watch path and not indexed. A one-time backfill can be run on demand if archive content is needed later.
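The nightly reconciliation reduces to a three-way diff between the current SharePoint listing and the index. In this sketch both inputs are dicts of path to last-modified stamp; real listings would come from the SharePoint and vector-store APIs.

```python
def reconcile(sharepoint, index):
    """Diff SharePoint state against the index; return the patch lists."""
    to_add = [p for p in sharepoint if p not in index]
    to_remove = [p for p in index if p not in sharepoint]
    to_refresh = [
        p for p in sharepoint
        if p in index and sharepoint[p] != index[p]
    ]
    return to_add, to_remove, to_refresh

print(reconcile(
    {"a.pptx": "2026-03-01", "b.pptx": "2026-03-02"},
    {"b.pptx": "2026-02-20", "c.pptx": "2026-01-05"},
))  # → (['a.pptx'], ['c.pptx'], ['b.pptx'])
```

Because the job computes the patch from current state rather than replaying events, it is safe even when multiple webhook notifications were lost.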

Estimated monthly infrastructure cost

The system is designed to be built in two phases. Phase 1 delivers immediate value with SQL and RAG. Phase 2 adds the graph layer once the core system is proven and in use. Estimated indexable content is 5–15% of total SharePoint storage, likely 200–500GB of actual final deliverables.

Component | What it does | Phase 1 | Phase 2
SQL database | Stores project and client metadata copy | $0–25/mo | $0–25/mo
Vector database | Stores embedded content for semantic search | $50–100/mo | $50–100/mo
LLM API (generation) | Generates content outlines from retrieved context | $30–60/mo | $30–60/mo
Hosting (AWS) | App server, webhooks, logging, data transfer | $60–150/mo | $60–150/mo
Graph database | Relationship-aware retrieval layer | N/A | $65–100/mo
Estimated monthly total | | $140–335/mo | $205–435/mo

Both phases sit well within a $1,000/month infrastructure budget, with Phase 1 running $140–335/mo and Phase 2 at $205–435/mo. Exact figures will be confirmed once a full content inventory is available and specific tooling is selected.
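As a quick sanity check, summing the low and high ends of each component range reproduces the phase totals:

```python
# Component ranges in $/mo: SQL, vector, LLM, hosting (+ graph in Phase 2).
phase1 = [(0, 25), (50, 100), (30, 60), (60, 150)]
phase2 = phase1 + [(65, 100)]

def total(components):
    return sum(lo for lo, _ in components), sum(hi for _, hi in components)

print(total(phase1))  # → (140, 335)
print(total(phase2))  # → (205, 435)
```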

Open questions for discussion

Before finalizing the build plan, we want to align on a few points:

  • Monthly figures reflect current provider pricing as of March 2026; exact infrastructure costs and tooling will be confirmed after alignment on the proposed architecture.
  • All retrieval is read-only: no CTC files are modified, moved, or stored outside the existing SharePoint environment.
  • The system accesses only numbered project folders, and only the Medical content / 3. Final subfolder within each one.
