Strategy, Design & Engineering Partner

CTC · Medical AI · March 2026


CTC AI Knowledge Base:
Architecture Proposal

Prepared by Rigoris · March 2026 · Confidential

Background and what we investigated

Over the past period, Rigoris conducted a detailed discovery of CTC's SharePoint environment, content structure, and workflow patterns. We reviewed the full Projects folder across multiple clients, mapped the internal folder template, analyzed the HubSpot deal history, and studied examples of real medical briefs and the content outline templates used by CTC's writers.

What we found changed our thinking significantly, and in a good way. CTC's data is far more structured than it appears at the surface level. Every numbered project folder follows the same nine-subfolder template across all clients. Inside each project, the Medical content folder contains a clean separation between outlines, drafts, and finals. That structure is the foundation of what makes a precise, low-noise retrieval system possible.

How ChatGPT Business was being used and why it hit a ceiling

ChatGPT Business was integrated with SharePoint via Microsoft's native connector. When a writer asked a question, the system searched recently connected documents and returned an answer based on a small window of content, typically three to five results. Here is what that flow looked like:

Step 1 · Input
Writer types a question in ChatGPT
Free-form text, no structure. Example: "do we have anything on COPD advisory boards?"
Step 2 · Retrieval
ChatGPT searches connected SharePoint files
Searches only recently synced or recently opened documents. Hard cap of 3–5 results. No filtering by date, client, or content type. No awareness of project relationships.
Step 3 · Generation
ChatGPT generates a response
Summarizes what it found. Cannot produce a structured content outline. Cannot reference files it did not retrieve. Treats a 2019 draft and a 2025 final deliverable as equal.
The core problem: ChatGPT Business was not built for deep library retrieval. It was built for conversational Q&A over a small document set. When CTC's library has hundreds of relevant past projects, returning three results means missing the majority of useful content every single time.

Beyond the retrieval ceiling, there are three structural gaps ChatGPT cannot solve. It has no concept of project relationships. It cannot filter by date, client, or therapeutic area. And it cannot generate CTC's specific output format: the Content Outline Template with program overview, section names, topic descriptions, and slide counts.

A modern four-layer architecture

The proposed system adds two layers of intelligence before any document search happens, with a generation layer at the end of the pipeline. This is what fundamentally separates it from ChatGPT's approach. Each layer narrows the search space so the final retrieval works on a small, highly relevant pool rather than the entire library.

Layer 1: Structured filtering (SQL)

CTC's HubSpot data gives us 830 projects with client name, therapeutic area, service type, and close date. This layer handles structured queries instantly, narrowing the candidate pool to only relevant projects before any document search happens. Fast, cheap, and deterministic.

Recommended tool: PostgreSQL or equivalent
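As a minimal sketch of Layer 1, the filter is an ordinary parameterized SQL query. The schema, column names, and sample rows below are illustrative assumptions, not the final HubSpot field mapping; sqlite3 stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical schema: real table and column names will be defined
# once the HubSpot fields are mapped.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE projects (
        id INTEGER PRIMARY KEY,
        client TEXT,
        therapeutic_area TEXT,
        service_type TEXT,
        close_date TEXT
    )
""")
conn.executemany(
    "INSERT INTO projects VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Client A", "COPD", "Advisory board", "2024-06-01"),
        (2, "Client B", "Oncology", "Symposium", "2023-02-15"),
        (3, "Client C", "COPD", "Advisory board", "2021-09-30"),
    ],
)

def filter_projects(area, service, since):
    """Layer 1: deterministic narrowing before any document search."""
    rows = conn.execute(
        "SELECT id FROM projects "
        "WHERE therapeutic_area = ? AND service_type = ? AND close_date >= ?",
        (area, service, since),
    ).fetchall()
    return [r[0] for r in rows]

print(sorted(filter_projects("COPD", "Advisory board", "2020-01-01")))  # → [1, 3]
```

Because this step is plain SQL over 830 rows, it runs in milliseconds and costs effectively nothing, which is why it sits first in the pipeline.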
Layer 2: Relationship awareness (Graph database)

A graph database models how things connect. A therapeutic area links to multiple clients. A client links to multiple program types. This layer expands the candidate pool intelligently, surfacing relevant work from adjacent clients and related therapeutic areas that a keyword or semantic search would never find on its own. This is what gives the system genuine institutional memory.

Recommended tool: Graph database (options to be evaluated)
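To make the expansion concrete, here is a toy sketch using a plain adjacency map; a real graph database would replace this, and all node names and edges are invented for illustration.

```python
# Illustrative relationship map: therapeutic areas linked to related
# areas. In the real system, edges would also connect clients and
# program types, and would live in the graph database.
GRAPH = {
    "COPD": ["respiratory", "asthma"],
    "asthma": ["respiratory", "COPD"],
    "respiratory": ["COPD", "asthma"],
}

def expand(areas, hops=1):
    """Layer 2: grow the candidate pool through related neighbours."""
    pool = set(areas)
    frontier = set(areas)
    for _ in range(hops):
        frontier = {n for a in frontier for n in GRAPH.get(a, [])} - pool
        pool |= frontier
    return pool

print(sorted(expand({"COPD"})))  # → ['COPD', 'asthma', 'respiratory']
```

The `hops` parameter controls how far the expansion reaches; one hop keeps results tightly related, which is likely the right default.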
Layer 3: Semantic content retrieval (RAG)

Only final deliverables (PPTX files inside each project's 3. Final subfolder) are embedded and searched semantically. Because Layers 1 and 2 have already narrowed the candidate set, RAG works on a small, highly relevant pool rather than the entire library. Each slide deck is processed with its project context preserved so retrieval always includes the metadata needed to understand what the content is.

Recommended tool: Vector database (options to be evaluated)
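The retrieval step itself reduces to similarity ranking over embeddings. In this sketch the embeddings are hand-made 3-dimensional vectors and the deck names are invented; in practice an embedding model and the vector database supply both.

```python
import math

# Toy stand-in for the vector store: deck name -> embedding.
DECKS = {
    "copd_advisory_2024.pptx": [0.9, 0.1, 0.0],
    "oncology_symposium_2023.pptx": [0.1, 0.9, 0.1],
    "copd_standalone_2022.pptx": [0.8, 0.2, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vec, candidates, k=2):
    """Layer 3: rank only the decks that Layers 1 and 2 let through."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, DECKS[name]),
        reverse=True,
    )
    return ranked[:k]

print(retrieve([1.0, 0.0, 0.0], list(DECKS)))
# → ['copd_advisory_2024.pptx', 'copd_standalone_2022.pptx']
```

Note that `candidates` is the already-narrowed pool, not the whole library: that is what keeps this step fast and low-noise.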
Layer 4: Generation (LLM)

A large language model sits at the end of the pipeline. It never touches the full library; it only receives the brief and the small set of retrieved documents. It generates a structured content outline in CTC's exact template format: program overview, learning objectives, section names, topic descriptions, and approximate slide counts.

Recommended tool: Claude Sonnet, selected for its strength in structured document output
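The generation step is essentially prompt assembly: the brief plus the retrieved examples, with the template format spelled out. The function and field names below are illustrative, and the model call itself is deliberately stubbed out.

```python
# Sketch of Layer 4 prompt assembly. "project" and "excerpt" are
# assumed field names for retrieved context, not a final schema.
def build_outline_prompt(brief, retrieved):
    context = "\n\n".join(
        f"[{doc['project']}] {doc['excerpt']}" for doc in retrieved
    )
    return (
        "Using only the examples below, draft a content outline with: "
        "program overview, learning objectives, section names, "
        "topic descriptions, and approximate slide counts.\n\n"
        f"Brief:\n{brief}\n\nRetrieved examples:\n{context}"
    )

prompt = build_outline_prompt(
    "COPD advisory board, 2 learning objectives",
    [{"project": "P-104", "excerpt": "Section 1: Disease burden (6 slides)"}],
)
print(prompt.splitlines()[0])
```

Keeping the model at the end of the pipeline, fed only this assembled prompt, is what guarantees it never sees the full library.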

System architecture diagram

This diagram shows where each data source lives, what our system creates, and how the pieces connect at runtime when a writer submits a brief.

Data sources and system components

  • SharePoint: source of truth for all files; files never leave SharePoint.
  • HubSpot: 830 project records, synced to our SQL database once.

What we build (read-only, hosted on AWS)

  • SQL database: client and project metadata copy
  • Graph database: relationship map across projects
  • Vector database: embedded content from 3. Final only
  • LLM (Sonnet): generates the content outline

Runtime query flow

Writer submits brief → SQL filters projects → Graph expands pool → RAG retrieves decks → LLM writes outline

The full end-to-end flow

Step 1 · Input
Writer submits a structured brief
A simple web form captures: client objectives, therapeutic area, program type, faculty, learning objectives, and key data. Takes 2–3 minutes to fill out.
Step 2 · Layer 1: SQL filter
HubSpot data narrows the project pool
Queries 830 projects and returns only those matching the therapeutic area, service type, and date range. Eliminates irrelevant content before any expensive search happens.
Step 3 · Layer 2: Graph expansion
Graph database finds related work
The filtered project list is expanded through the relationship graph. Similar clients, adjacent therapeutic areas, and related program types are surfaced, giving the retrieval layer a richer candidate pool.
Step 4 · Layer 3: Semantic retrieval
Vector search finds the most relevant final decks
The embedded 3. Final PPTX files from candidate projects are searched semantically against the brief. Top matching slides and decks are retrieved with full project context attached.
Step 5 · Layer 4: Generation
LLM writes the content outline
The brief and retrieved examples are passed to the language model. It generates a structured content outline in CTC's template format: program overview, section names, topics, slide counts.
Step 6 · Output
Writer receives a ready-to-edit outline
The system returns the generated outline along with the source documents it referenced, so the writer can verify and build on the result with full transparency.
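The six steps above can be sketched as a single orchestration function. Every stage here is a placeholder stub standing in for the layer described above; the function names and return shapes are illustrative, not final APIs.

```python
# End-to-end pipeline sketch; each stub would be replaced by the
# real SQL, graph, vector, and LLM components.
def run_pipeline(brief):
    candidates = sql_filter(brief)               # Layer 1: deterministic narrowing
    candidates = graph_expand(candidates)        # Layer 2: related work
    retrieved = rag_retrieve(brief, candidates)  # Layer 3: semantic search
    return llm_outline(brief, retrieved)         # Layer 4: generation

# Stub implementations so the sketch runs end to end.
def sql_filter(brief):
    return ["P-104"]

def graph_expand(project_ids):
    return project_ids + ["P-207"]

def rag_retrieve(brief, project_ids):
    return [{"project": pid} for pid in project_ids]

def llm_outline(brief, docs):
    # Returning the sources alongside the outline is what gives the
    # writer the transparency described in Step 6.
    return {"sources": [d["project"] for d in docs]}

print(run_pipeline({"area": "COPD"}))  # → {'sources': ['P-104', 'P-207']}
```

The key property to preserve in the real build is that each stage only ever receives the output of the previous one, never the full library.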

What gets indexed and what gets ignored

The full SharePoint library is 4.5TB. The vast majority of that is noise: logistics spreadsheets, design files, invoices, budget trackers, and archived content going back to 2010. Our folder structure analysis identified a precise ingestion path: every numbered project folder across all 75 clients follows the same nine-subfolder template. Only one subfolder is relevant: Medical content / 3. Final.

Excluded from index

  • Logistics
  • Project and financials
  • Design and digital content
  • Client updates
  • External experts
  • Communications materials
  • Drafts and outlines
  • Department Folders entirely
  • Pre-2020 archive content

Indexed

  • Medical content / 3. Final only
  • PPTX and Word files only
  • Inside numbered project folders
  • Modified 2020 onwards
  • Linked to HubSpot project metadata
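The index/skip decision above can be expressed as a small predicate. The path structure and matching rules below follow the folder template described in this proposal, but the exact strings are assumptions to be confirmed against the live SharePoint tree.

```python
from pathlib import PurePosixPath

# Assumed ingestion rules: only PPTX/Word files under
# "Medical content/3. Final", modified 2020 onwards, never _Archive.
INDEXED_SUFFIXES = {".pptx", ".docx"}

def should_index(path, modified_year):
    p = PurePosixPath(path)
    if "_Archive" in p.parts:
        return False
    if "Medical content" not in p.parts or "3. Final" not in p.parts:
        return False
    if p.suffix.lower() not in INDEXED_SUFFIXES:
        return False
    return modified_year >= 2020

print(should_index("Projects/0412 Client A/Medical content/3. Final/deck.pptx", 2023))  # → True
print(should_index("Projects/0412 Client A/Logistics/budget.xlsx", 2023))               # → False
```

Centralizing the rule in one predicate means the webhook handler, the backfill job, and the nightly reconciliation all apply identical filtering.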

How the system stays current

The system uses SharePoint webhooks to stay up to date in real time. Every time a file is added, updated, or deleted inside a monitored folder, SharePoint fires a signal automatically. The ingestion service picks it up and updates the index within minutes, with no manual re-indexing needed for day-to-day changes. A nightly reconciliation job runs independently as a safety net to catch anything the webhook missed.
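A minimal sketch of the webhook-driven update, assuming a simplified event shape (`type` and `path` fields); the actual SharePoint notification payload differs and the embedding call is a placeholder.

```python
# Stand-in for the vector store: path -> vector id.
INDEX = {}

def embed(path):
    # Placeholder for real embedding + upsert into the vector database.
    return f"vec:{path}"

def handle_event(event):
    """Apply one change notification to the index."""
    kind, path = event["type"], event["path"]
    if kind == "deleted":
        INDEX.pop(path, None)       # no ghost vectors remain
    elif kind in ("created", "updated"):
        INDEX.pop(path, None)       # drop any stale embedding first
        INDEX[path] = embed(path)   # only the latest version lives in the index

handle_event({"type": "created", "path": "3. Final/deck.pptx"})
handle_event({"type": "deleted", "path": "3. Final/deck.pptx"})
print(INDEX)  # → {}
```

Deleting before re-inserting on every update is what enforces the "only the latest version lives in the index" guarantee from the edge-case list below.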

We have mapped out every edge case and how the system handles each one:

  • HubSpot lag: File is indexed immediately and flagged as metadata pending. Once the HubSpot deal is created, the sync job links them automatically.
  • HubSpot rate limits: We maintain our own SQL copy of the project data rather than querying HubSpot live. Syncs run on a scheduled basis, avoiding rate limit issues entirely.
  • File updated: Old vector is deleted, new one is created. Only the latest version lives in the index.
  • File moved: Webhook fires delete at the old path and create at the new path. The vector is re-embedded at the new location and the old one is purged.
  • Folder renamed: All child file vectors are re-linked to the new path. SQL metadata is updated to match.
  • Retrospective additions: The system uses the file's last-modified date, not the ingestion date. A 2022 deck added today is correctly weighted as older content.
  • French duplicates: The ingestion script detects the EN/FR subfolder pattern. Only EN files are embedded; FR versions are skipped automatically.
  • Non-PPTX finals: Word docs are included and chunk cleanly. PDFs are included but flagged as lower confidence. Everything else is skipped.
  • Empty Final folders: Nothing to index until content arrives. Project metadata is already linked and waiting from HubSpot.
  • Deleted files: Webhook fires a delete event. The vector is removed from the index immediately; no ghost vectors remain.
  • Webhook failure: The nightly reconciliation job compares current SharePoint state against the index and patches any gaps missed by the webhook.
  • Large file spikes: The ingestion queue processes files asynchronously in batches. The existing index remains fully operational while new files are processed.
  • Archive folders: _Archive folders are excluded from the webhook watch path and not indexed. A one-time backfill can be run on demand if archive content is needed later.
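The nightly reconciliation reduces to a three-way diff between the current SharePoint listing and the index. In this sketch both inputs are dicts of path to last-modified stamp; real listings would come from the SharePoint and vector-store APIs.

```python
def reconcile(sharepoint, index):
    """Diff SharePoint state against the index; return the patch lists."""
    to_add = [p for p in sharepoint if p not in index]
    to_remove = [p for p in index if p not in sharepoint]
    to_refresh = [
        p for p in sharepoint
        if p in index and sharepoint[p] != index[p]
    ]
    return to_add, to_remove, to_refresh

print(reconcile(
    {"a.pptx": "2026-03-01", "b.pptx": "2026-03-02"},
    {"b.pptx": "2026-02-20", "c.pptx": "2026-01-05"},
))  # → (['a.pptx'], ['c.pptx'], ['b.pptx'])
```

Because the job computes the patch from current state rather than replaying events, it is safe even when multiple webhook notifications were lost.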

Estimated monthly infrastructure cost

The system is designed to be built in two phases. Phase 1 delivers immediate value with SQL and RAG. Phase 2 adds the graph layer once the core system is proven and in use. Estimated indexable content is 5–15% of total SharePoint storage, likely 200–500GB of actual final deliverables.

Component | What it does | Phase 1 | Phase 2
SQL database | Stores project and client metadata copy | $0–25/mo | $0–25/mo
Vector database | Stores embedded content for semantic search | $50–100/mo | $50–100/mo
LLM API (generation) | Generates content outlines from retrieved context | $30–60/mo | $30–60/mo
Hosting (AWS) | App server, webhooks, logging, data transfer | $60–150/mo | $60–150/mo
Graph database | Relationship-aware retrieval layer | N/A | $65–100/mo
Estimated monthly total | | $140–335/mo | $205–435/mo

Both phases sit well within a $1,000/month infrastructure budget, with Phase 1 running $140–335/mo and Phase 2 at $205–435/mo. Exact figures will be confirmed once a full content inventory is available and specific tooling is selected.
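As a quick sanity check, summing the low and high ends of each component range reproduces the phase totals:

```python
# Component ranges in $/mo: SQL, vector, LLM, hosting (+ graph in Phase 2).
phase1 = [(0, 25), (50, 100), (30, 60), (60, 150)]
phase2 = phase1 + [(65, 100)]

def total(components):
    return sum(lo for lo, _ in components), sum(hi for _, hi in components)

print(total(phase1))  # → (140, 335)
print(total(phase2))  # → (205, 435)
```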

Open questions for discussion

Before finalizing the build plan, we want to align on a few points:

  • Monthly figures reflect current provider pricing as of March 2026; exact infrastructure costs and tooling will be confirmed after alignment on the proposed architecture.
  • All retrieval is read-only: no CTC files are modified, moved, or stored outside the existing SharePoint environment.
  • The system accesses only numbered project folders, and only the Medical content / 3. Final subfolder within each one.
