USE CASE · PROVIDER DATA FOR AI & RAG
Ground your AI agent in provider data it can cite back to the source.
A chunk-ready federal provider corpus with stable IDs and 14-field provenance on every chunk — drop it into a retrieval pipeline over REST or MCP without your model inventing provider facts. Built for AI teams shipping agents and the governance owners who have to stand behind every answer.
providers federal source familiesProvenance on every chunkREST + MCP
✓ No PHI✓ Citation-stable chunk IDs✓ 14-field provenance✓ Deterministic re-pull
The AI-governance exposure
An ungrounded provider fact is a liability.
When a model answers with a provider's enrollment status, a sanction, or an NPI and cannot point to a source, that answer cannot be defended. In healthcare, an invented or stale provider fact is not a typo — it is a claim no one can stand behind. The fix is not a bigger model; it is retrieval where each chunk already carries its source and date.
Fonteum attaches a federal source, snapshot date, and methodology version to every chunk, so a grounded answer can footnote the exact record it relied on — and a governance review can re-derive that fact months later.
The developer pain
Reference-heavy FHIR and index churn.
Raw FHIR JSON is reference-heavy — a Practitioner points to a PractitionerRole that points to an Organization that points to a Location — so building one chunk of context takes several round trips, and the coding-system URIs and extension blocks bloat every token budget. Worse, the raw files carry no chunk identity, so each re-pull re-embeds everything and churns your vector store.
One paginated GET returns pre-resolved, pre-chunked text with a stable chunk_id and the provenance block inline. The flattening, the chunking, and the citation are the response — not something you reconstruct downstream.
How it works
The corpus shape, and how a chunk stays citable.
Flat
Pre-resolved
Nested FHIR references are resolved server-side into a flat, fully-populated chunk — no client-side round trips to assemble context.
Stable
Chunk IDs
A deterministic chunk_id per record. Use it as the vector primary key; a re-pull upserts in place instead of duplicating vectors.
14
Provenance fields
Source, source URL, snapshot date, last-checked date, methodology, confidence — the full contract rides on every chunk for downstream citation.
MCP
Live agent path
The same graph is exposed as Model Context Protocol tools, so an agent can query live with the same provenance the static index carries.
Integration & workflow
One corpus. Two ways in.
Developers page the chunks endpoint into a vector store. Governance owners stand up a corpus where every retrieved fact carries a re-derivable citation. Same data, same provenance, same source snapshot.
GET /api/v1/rag/chunks
curl "https://fonteum.com/api/v1/rag/chunks?limit=50&cursor=0" \
-H "Accept: application/json"Response
{
"total": 124817,
"next_cursor": 50,
"chunks": [
{
"chunk_id": "source:nppes#overview",
"text": "The CMS NPPES registry enumerates US healthcare providers ...",
"cite": "CMS NPPES, snapshot 2026-05-01",
"source_url": "https://npiregistry.cms.hhs.gov/",
"dataset_id": "nppes/v1",
"provenance": {
"_source": "CMS NPPES",
"_source_url": "https://npiregistry.cms.hhs.gov/",
"_snapshot": "2026-05-01",
"_last_checked": "2026-06-17",
"_methodology": "rag-chunks/v1",
"_confidence": 1.0
}
}
]
}Public endpoint, rate limited per source IP. Map each chunk to a LangChain Document or LlamaIndex TextNode — chunk_id as the id, text as the embed body, the provenance block as metadata. Full LangChain / LlamaIndex / MCP walkthroughs live in /docs/integrations.
- 01Page the corpus with ?limit= and the returned next_cursor — one deterministic GET per page, no language model in the pipeline.
- 02Map each chunk to a vector node: chunk_id as the primary key, text as the embed body, the provenance block as metadata.
- 03Embed and index. Because chunk IDs are stable, a later re-pull upserts in place rather than duplicating vectors.
- 04At answer time, carry the chunk's cite string and source_url into the response so every grounded fact footnotes its federal source.
- 05Keep the provenance record with the answer — the audit trail that shows a model-surfaced provider fact is real, dated, and re-derivable.
Sample audit-evidence artifact
RAG CITATION — PROVENANCE RECORD
Answer fact ......... NPI 1003894328 is enrolled in Medicare
Grounded on chunk ... source:pecos#1003894328
Source .............. CMS PECOS
Source URL .......... https://data.cms.gov/provider-enrollment
Source snapshot ..... 2026-05-01
Last checked (UTC) .. 2026-06-17T14:02:11Z
Methodology ......... rag-chunks/v1
Re-derivable ........ GET /api/v1/rag/chunks (deterministic chunk_id)The provenance record is the governance artifact: it shows a model-surfaced provider fact, the chunk it was grounded on, and the federal source and snapshot it re-derives from.
Proof — not logos
Every chunk traces to a public source row.
Providers
Unique NPIs enumerated from the CMS NPI Registry — the identity backbone the corpus is chunked from.
Source families
Every federal source the graph draws from is named and dated — a retrieved chunk cites the originating file, never a black box.
rag-chunks/v1
Pinned methodology
The chunking methodology is versioned and stamped on every chunk, so the same version reproduces the same corpus for an audit.
Signed
Attestation
Snapshots are tied to Fonteum's Ed25519 witness chain, so the integrity of the data behind a citation is provable after the fact.
“A retrieved fact is only as good as the source it carries. The chunk ID, the source, and the snapshot are the product.”
PROVIDER DATA FOR AI & RAG
Index a citation-stable provider corpus over REST or MCP.
Questions
Before the security questionnaire.
Why not just embed the raw NPPES and CMS files myself?
You can. The raw FHIR and CSV files are reference-heavy and carry no chunk identity, so a re-pull churns your vector index and a retrieved fact has no source attached. Fonteum returns pre-resolved, pre-chunked text with a stable chunk_id and the full provenance block inline, so embeddings stay deterministic and every retrieved chunk already names its source.
How does a chunk keep its citation?
Every chunk arrives with a 14-field provenance contract — source name, source URL, snapshot date, last-checked date, methodology version — plus a ready-to-print cite string. Carry that metadata onto your vector node, and any answer your model grounds on the chunk can footnote the exact federal record and snapshot it came from.
Will the chunk IDs stay stable across pulls?
Yes. Chunk IDs are deterministic — the same record produces the same chunk_id on every pull, so an incremental re-pull upserts cleanly instead of duplicating vectors. Re-embedding only touches chunks whose underlying federal record actually changed.
Can my agent call this over MCP instead of REST?
Yes. The same provider graph is exposed as Model Context Protocol tools — search, resolve-by-NPI, exclusion check, dataset info, source list — and every tool response carries the same provenance block as the REST chunks endpoint. Use REST to build a static index, MCP to let an agent query live.
Is any patient data in the corpus?
No. The corpus is built only from public federal and state provider records keyed by NPI, CCN, and PECOS-ID. There is no PHI in the pipeline and none is required to index, retrieve, or cite a provider fact.
Go deeper
The data and platform behind the corpus.
Solutions
All solutions — by use case & buyer →
Platform
The capability layer — API, MCP, FHIR →
Developers
Docs, quickstart, and SDKs — the dev hub →
Data catalog
NPPES provider registry →
Data catalog
Browse every federal dataset →
Research
Original studies built on the graph →
Use case
Credentialing & provider-data enrichment →
Use case
Exclusion & sanctions screening →
For developers
Fonteum for developers & AI teams →
FONTEUM · PROVIDER DATA FOR AI
Ground your agent on public data only. No PHI.