By Dwight Gunning — Mar 31, 2026

The AI Ecosystem Around EdgarTools

See how the most popular Python library for SEC EDGAR data became the default financial data layer for AI projects: from model training to RAG to MCP servers.

In March 2026, Chroma published Context-1 — a 20-billion-parameter search agent trained on four domains. For the finance domain, they parsed 7,264 SEC filings using edgartools. Finance hit 93% verification accuracy, the second-highest of any domain. Training on that data nearly doubled retrieval performance.

That would make a good story on its own, but Chroma is one of 67+ AI and ML projects that depend on edgartools. Microsoft's Azure-Samples org uses it in an agentic RAG workbench and SurrealDB uses it in their RAG and graph examples. At present there are at least eight separate MCP servers built around edgartools, not counting the official MCP server inside the library. ChatGPT and Claude both refer users to it when asked about SEC data.

The library was created in late 2022 and has grown alongside the post-ChatGPT AI boom. AI builders adopted it because nothing else parsed SEC filings reliably at scale. This article maps that ecosystem for the first time.

When Does EdgarTools Touch Your Data?

The easiest way to understand the ecosystem is to ask when edgartools runs relative to your AI system: before training, before indexing, during inference, or inside a shipped product. Each layer solves a different problem.

[Figure: EdgarTools AI ecosystem. Four layers from model training to shipped products, built on edgartools as the foundation.]

SEC Filings as AI Training Data

At this layer, edgartools parses SEC filings into the training corpora, fine-tuning datasets, and evaluation benchmarks that shape model weights. This happens once, offline, and the model carries the knowledge forward.

Chroma Context-1 is the headline example. Their data pipeline:

- Sampled 1,707 random companies from SEC EDGAR
- Parsed 7,264 10-K and 20-F filings using edgartools
- Chunked the filings (average: 31,500 tokens per filing)
- Generated synthetic multi-hop questions requiring facts from multiple chunks
- Verified extraction accuracy at 93%
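
The chunking step is easy to picture in code. What follows is a minimal sketch, not Chroma's actual chunker, which the write-up above doesn't describe. It assumes whitespace-separated words as a stand-in for tokens, and `chunk_words`, `max_words`, and `overlap` are hypothetical names chosen for the example.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into overlapping windows of at most max_words words.

    Words stand in for tokens here (an assumption); a real pipeline
    would count tokens with the tokenizer of the target model.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


demo = chunk_words(" ".join(str(i) for i in range(10)), max_words=4, overlap=1)
```

The overlap is there so that a fact straddling a chunk boundary still appears whole in at least one chunk, which matters when questions are built to need facts from several chunks.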

The result: Context-1's finance retrieval F1 score jumped from 0.42 (untrained baseline) to 0.82 after training — and 97% of queries found the correct final answer.

[Figure: Context-1 retrieval F1 score. Base model 0.42, trained on SEC filings 0.82, a 95% improvement.]
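
For readers who want to check the arithmetic, here is a small sketch of how a retrieval F1 score and the relative improvement can be computed. It assumes a set-based F1 over chunk IDs, which may differ from how Chroma scored Context-1; `retrieval_f1` and `relative_gain` are names made up for this example.

```python
from typing import Set


def retrieval_f1(retrieved: Set[str], relevant: Set[str]) -> float:
    """Harmonic mean of precision and recall over sets of chunk IDs."""
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# F1 figures reported for Context-1's finance domain
gain = relative_gain(0.42, 0.82)
print(f"{gain:.0%}")  # 95%
```

A gain of about 95% is what "nearly doubled" means here: the score went up by 0.40 on a baseline of 0.42.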

from edgar import Company, set_identity

# SEC EDGAR expects a contact identity on every request
set_identity("Your Name your.email@example.com")

# This is the kind of pipeline Chroma built:
# edgartools handles the parsing, you handle the rest
company = Company("MSFT")
filing = company.get_filings(form="10-K").latest(1)
tenk = filing.obj()

# Structured sections ready for chunking
md_a = tenk["Item 7"]   # Management's Discussion and Analysis
risk = tenk["Item 1A"]  # Risk Factors

The finance domain had the second-highest data quality of Context-1's four training domains:

Domain	Source	Verification Accuracy
Legal	USPTO patents	98.3%
Finance	SEC filings via edgartools	93.0%
Email	Epstein + Enron emails	87.5%
Web	Wikipedia + web scraping	84.4%

Why so high? SEC filings are legally mandated disclosures. Companies are liable for accuracy. Auditors verify the numbers. The data has been through more quality control than almost anything on the open web.

Other build-time projects:

- virattt/financial-datasets — curated financial datasets for ML training
- CagataySavasli/MergerGraphNet — graph neural networks for M&A prediction using SEC filing data
- yxc20089/TraderBench — financial benchmarks for LLMs

SEC Filings for RAG and Knowledge Bases

edgartools populates the vector store, search index, or knowledge graph that your system queries over. The filings are parsed, chunked, embedded, and stored before any user question arrives. This is the classic RAG architecture.

Microsoft chose edgartools for adaptive-rag-workbench, their official Azure AI Foundry reference architecture for context-aware agentic RAG with multi-source verification. When Microsoft needed to show developers how to build production RAG over financial documents, they reached for edgartools.

SurrealDB uses it in their official examples — SEC filings ingested into SurrealDB as vectors and knowledge graphs for financial Q&A. When a multi-model database company needs to demonstrate RAG and graph queries together, SEC filings are the dataset and edgartools is the parser.

UC Berkeley's EPIC Lab uses edgartools in docetl-examples to download 8-K filings and process them through their LLM-powered document transformation framework.

The RAG category alone includes 10+ repos: mindria-ai/sec-rag-app, akshata29/agenticrag, avnlp/rag-pipelines, xoxo121/SECure-RAG, Neel-K26/Advanced-RAG-Pipeline, Kurry/chromadb-company-sec-filings, and others. The pattern is always the same: edgartools parses the filing, a chunker splits it, an embedding model embeds the chunks, a vector database stores them, and an LLM answers from whatever is retrieved.

from edgar import Company, set_identity

set_identity("Your Name your.email@example.com")  # expected by SEC EDGAR

# A typical ingest pipeline starts here
company = Company("NVDA")
filings = company.get_filings(form="10-K").head(5)

for filing in filings:
    tenk = filing.obj()
    # Each section is already extracted and cleaned
    for item in ["Item 1", "Item 1A", "Item 7", "Item 8"]:
        text = tenk[item]
        if not text:
            continue  # skip sections missing from this filing
        # → chunk, embed, store in your vector DB
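
To make the embed, store, and retrieve steps concrete, here is a toy version that runs without any external services. Everything in it is a stand-in: word counts replace a real embedding model, a Python list replaces a vector database, and the two section strings are invented placeholders for text that would come from a parsed filing.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Minimal stand-in for a vector database."""

    def __init__(self) -> None:
        self.rows: List[Tuple[str, str, Counter]] = []

    def add(self, section: str, chunk: str) -> None:
        """Embed a chunk and keep it alongside its section label."""
        self.rows.append((section, chunk, embed(chunk)))

    def search(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[2]), reverse=True)
        return [(section, chunk) for section, chunk, _ in ranked[:k]]


store = InMemoryStore()
# Placeholder strings; in a real pipeline these would be filing sections
store.add("Item 1A", "Supply chain disruptions could harm our business.")
store.add("Item 7", "Revenue grew due to strong data center demand.")

top = store.search("What drove revenue growth?")
```

In a real system the retrieved chunks would be passed to an LLM as context. Keeping the section label next to each chunk is a cheap way to let the answer cite where in the filing it came from.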

MCP Servers and AI Agents for Live SEC Data

This is the fastest-growing layer. edgartools is called during inference — the LLM reaches out to SEC EDGAR on demand, in response to a user's question. No pre-indexing required. The standard that makes this work is MCP (Model Context Protocol).
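
Under the hood, MCP messages are JSON-RPC 2.0 requests: a client lists the tools a server offers with `tools/list`, then invokes one with `tools/call`. The sketch below only builds those two messages and sends nothing. The tool name `edgar_company` is one the edgartools server exposes, but the argument key `identifier` is a hypothetical placeholder; real argument names come from the schema the server returns.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

_ids = count(1)  # each request in a session gets its own ID


def rpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request envelope as used by MCP."""
    request: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


# Step 1: the client asks the server which tools it offers
list_tools = rpc_request("tools/list")

# Step 2: the client invokes the tool the model selected.
# "identifier" is a hypothetical argument key used for illustration;
# the server's tool schema from step 1 defines the real ones.
call_tool = rpc_request(
    "tools/call",
    {"name": "edgar_company", "arguments": {"identifier": "AAPL"}},
)

print(json.dumps(call_tool, indent=2))
```

The model never talks to SEC EDGAR directly. It picks a tool and arguments, the client sends the request, and the server does the fetching and parsing.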

The MCP Server Phenomenon

Eight independent teams built MCP servers around edgartools — without coordination, without being asked. When that many people solve the same problem the same way, you're looking at a genuine gap in the infrastructure.

EdgarTools ships its own MCP server — 13 intent-based tools, zero API keys, structured access to every SEC filing ever made. Unlike API wrappers that expose endpoints, the tools are organized around what analysts actually ask:

Category	Tools	What They Do
Discover	`edgar_company`, `edgar_search`, `edgar_screen`, `edgar_text_search`, `edgar_monitor`	Find companies, search filings, screen by industry, monitor live SEC feeds
Examine	`edgar_filing`, `edgar_read`, `edgar_notes`	Parse any filing into structured data, extract specific sections, drill into disclosures
Analyze	`edgar_trends`, `edgar_compare`, `edgar_ownership`, `edgar_fund`, `edgar_proxy`	Revenue time series, peer comparison, insider trades, fund holdings, executive comp

Setup takes two minutes:

{
  "mcpServers": {
    "edgartools": {
      "command": "uvx",
      "args": ["--from", "edgartools[ai]", "edgartools-mcp"],
      "env": {
        "EDGAR_IDENTITY": "Your Name your.email@example.com"
      }
    }
  }
}

That's it. No API key, no signup, no usage dashboard. Ask Claude "What did Apple report last quarter?" and it pulls the real numbers from the actual filing.
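
If your client config already lists other MCP servers, you'll want to merge the edgartools entry in rather than replace the file. Here is a minimal sketch of that merge on plain dictionaries. `add_mcp_server` is a hypothetical helper, `other-server` is an invented existing entry, and where the config file lives depends on your client, so no path is assumed.

```python
import copy
import json
from typing import Any, Dict

# Same entry as the JSON config shown above
EDGARTOOLS_ENTRY: Dict[str, Any] = {
    "command": "uvx",
    "args": ["--from", "edgartools[ai]", "edgartools-mcp"],
    "env": {"EDGAR_IDENTITY": "Your Name your.email@example.com"},
}


def add_mcp_server(
    config: Dict[str, Any], name: str, entry: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of config with entry registered under mcpServers[name]."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    merged.setdefault("mcpServers", {})[name] = copy.deepcopy(entry)
    return merged


# An invented existing config with one unrelated server
existing = {"mcpServers": {"other-server": {"command": "npx", "args": ["other"]}}}
updated = add_mcp_server(existing, "edgartools", EDGARTOOLS_ENTRY)

print(json.dumps(updated, indent=2))
```

Working on a copy means a failed write can't leave you with a half-edited config in memory, and the original stays available for a diff.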

The community MCP servers show the breadth of demand:

- stefanoamorelli/sec-edgar-mcp (243 stars, 64 forks) — the most popular third-party implementation. PyPI, Docker, and conda distribution. Academic DOI citation. Promptfoo evaluation suite.
- dynamicdeploy/edgartools-mcpserver — Docker-based wrapper
- sareegpt/edgartools-mcp, huweihua123/stock-mcp, cotrane/mcp-edgar-sec, druce/deep-research-machine, and others

The ai-integration.md page has the highest unique-view ratio of any page in the edgartools repository. The .claude/ directory is a top-10 viewed path — people are studying the Claude Code integration to build their own.

AI Agents

Beyond MCP servers, 16+ agent projects use edgartools as their SEC data backbone:

- KevorkSulahian/FinMas — agentic LLM for financial analysis
- Hariharan-afk/Automated-Due-Diligence-Agent — automated due diligence and market intelligence
- jacob187/analyst-agent, jabelk/advisor-agent — financial analyst and advisor agents
- flash131307/multi-agent-investment — multi-agent investment system
- sosodennis/value-investment-agent, dagudelo88/A.R.G.U.S, lucasastorian/intellifin-agent, and more

LLMs as a Discovery Channel

This one surprised me. ChatGPT and Claude both send users to edgartools when they ask about SEC data:

Referrer	Views (14-day window)
chatgpt.com	25–58
claude.ai	23–26

LLMs have become a discovery channel for the library itself. Users ask "How do I get SEC filings in Python?" and the model recommends edgartools by name.

Fintech Products Built on EdgarTools

edgartools is a dependency in full applications that end users interact with. At this layer, the library becomes invisible — users see the product, not the parser underneath.

- ValueCell-ai/valuecell (9,893 stars) — community-driven multi-agent platform for financial applications. The largest indirect distribution channel. Every ValueCell user is an edgartools consumer.
- ginlix-ai/LangAlpha (55 stars) — "From Vibe Coding to Vibe Investing"
- DDMYmia/FusionQuantAI — quantitative AI platform
- evgenyigumnov/ai-asset-screener — AI-powered asset screening
- ralliesai/tenk — AI analysis of 10-K filings

Why SEC Filings Make Better AI Training Data Than Web Scrapes

Chroma's Context-1 results quantify something practitioners have been discovering on their own: SEC filings make better AI training data than most web sources.

[Figure: Context-1 verification accuracy by domain. Finance 93.0% via edgartools vs. Web 84.4% via scraping.]

Here's why:

Regulated and verified. Companies are legally liable for what they file. The data has been reviewed by lawyers, auditors, and compliance teams before it reaches EDGAR. There is no equivalent quality control for web-scraped text.

Structured by design. XBRL tags, standardized section headers (Item 1, Item 1A, Item 7...), consistent HTML formatting across form types. SEC filings are machine-readable not by accident but by regulation.

Massive scale. 667,000+ reporting entities. Decades of filings. Multiple form types — 10-K, 10-Q, 8-K, 13F, Form 4, DEF 14A, and dozens more. Multi-modal: text narratives, financial tables, exhibits, all in the same document.

Free and public. No licensing barriers. SEC explicitly provides bulk access. No terms-of-service restrictions on training data usage.

Proven quality. Context-1's 93% verification accuracy on finance vs. 84.4% on web data isn't a fluke. It's what you'd expect when your source material has been through auditor review and legal signoff.
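
The "structured by design" point shows up quickly in code. Because 10-K section headers follow a fixed "Item N" pattern, even a naive regex can split a plain-text filing into sections. This sketch is only an illustration and is not how edgartools parses filings; real filings have tables of contents, cross-references, and inconsistent formatting that a regex like this would trip over. `split_items` and the sample text are invented for the example.

```python
import re
from typing import Dict

# Matches headers such as "Item 1.", "ITEM 1A.", or "Item 7:" at the start of a line
ITEM_HEADER = re.compile(
    r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:]", re.IGNORECASE | re.MULTILINE
)


def split_items(text: str) -> Dict[str, str]:
    """Map 'Item N' labels to the text between consecutive headers.

    The section title that follows the header stays at the start of
    each section's text.
    """
    matches = list(ITEM_HEADER.finditer(text))
    sections: Dict[str, str] = {}
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        label = f"Item {match.group(1).upper()}"
        sections[label] = text[match.end():end].strip()
    return sections


sample = """Item 1. Business
We design and sell widgets.
Item 1A. Risk Factors
Demand for widgets may decline.
Item 7. Management's Discussion and Analysis
Widget revenue increased.
"""

sections = split_items(sample)
```

Try the same trick on an arbitrary scraped web page and there is no shared header convention to key on, which is the difference the paragraph above is pointing at.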

The Numbers

Metric	Value
GitHub stars	1,923
Forks	334
PyPI downloads/month	~485K (~348K organic)
Dependent repos (total)	245+
AI/ML dependent repos	67+ (~27% of all dependents)
Independent MCP servers	8
Tier-1 org dependents	Microsoft, Chroma, SurrealDB
LLM referral sources	ChatGPT, Claude
Competitors with AI ecosystems	None

That last row matters. sec-edgar-downloader has ~5K monthly downloads and no AI dependents. secedgar has similar numbers. The competitive field for "SEC data library that AI teams actually use" is empty.

The Community Is Shaping EdgarTools for AI

The ecosystem isn't just consuming edgartools — it's pushing the library toward AI-native capabilities. Recent community contributions:

- #732 — to_markdown() methods for LLM-optimized output
- #731 — keyword-based cover page detection (improves text extraction for AI pipelines)
- #735 — Claude Desktop MCP schema compatibility fixes
- #654 — MCP tool responses for 10-Q narrative sections

When your users start submitting PRs to make your library work better with LLMs, you know the center of gravity has shifted.

Get Involved

EdgarTools is MIT-licensed, free to use, and requires no API key. Whether you're training a model, building a RAG pipeline, or wiring up an MCP server, the starting point is the same:

- GitHub — source code, issues, and discussions
- Documentation — quickstart guides, API reference, and the MCP server setup
- PyPI — pip install edgartools
- edgar.tools — a hosted platform built on edgartools with a visual filing browser, REST API, and MCP server — no infrastructure required

If you're using edgartools in an AI project, we'd like to hear about it. Open a discussion or give the repo a star — it helps others find the library the same way 67+ AI teams already have.