Edgartools 5
The fifth major version of edgartools is now live on PyPI. It features a complete rewrite of the HTML parser and several additional API improvements.
Last week I released EdgarTools v5.0. This was not the most significant architectural update since the library's inception, but it is still important. Under the hood is a ground-up rewrite of how edgartools parses SEC HTML filings. This is a feature I am immensely proud of, though annoyingly Claude Sonnet 4 has been claiming co-author status on every git commit.
The Problem We Solved
Parsing and rendering HTML is an extremely hard problem, and SEC filings come in remarkably diverse formats. A 10-K from Apple looks completely different from one filed by General Electric. Some use tables of contents, others use cross-reference indexes. Some have clear section headers, others bury them in nested tables.
Our original parser handled the common cases well, but struggled with edge cases:
- GE-style filings: Companies using cross-reference indexes instead of traditional sections
- 10-Q ambiguity: "Item 1" appears in both Part I (Financial Statements) and Part II (Legal Proceedings)
- Complex nesting: Sections buried deep in table structures
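To make the 10-Q ambiguity concrete, here is a minimal sketch using a toy plain-text outline. The sample text, the regular expressions, and the `index_items` helper are all illustrative assumptions, not edgartools code; real filings are HTML and considerably messier.

```python
import re

# Hypothetical, simplified 10-Q outline used only for illustration.
SAMPLE_10Q = """\
PART I
Item 1. Financial Statements
Item 2. Management's Discussion and Analysis
PART II
Item 1. Legal Proceedings
Item 1A. Risk Factors
"""

PART_RE = re.compile(r"^PART\s+(I{1,2})\b", re.IGNORECASE)
ITEM_RE = re.compile(r"^Item\s+(\d+[A-Z]?)\.\s*(.+)$", re.IGNORECASE)


def index_items(text, part_aware):
    """Map section keys to titles, optionally qualifying each key by its part."""
    sections = {}
    current_part = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        part_match = PART_RE.match(line)
        if part_match:
            # Remember which part we are in for the items that follow.
            current_part = part_match.group(1).upper()
            continue
        item_match = ITEM_RE.match(line)
        if item_match:
            item, title = item_match.groups()
            key = f"Item {item.upper()}"
            if part_aware and current_part:
                key = f"Part {current_part}, {key}"
            # Without part qualification, a later "Item 1" overwrites the earlier one.
            sections[key] = title.strip()
    return sections


naive = index_items(SAMPLE_10Q, part_aware=False)
qualified = index_items(SAMPLE_10Q, part_aware=True)
```

In the naive index, the Part II "Item 1" silently replaces the Part I entry, so one section is lost. Tracking the current part while scanning keeps both.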
Under the Hood: The edgar.documents Module
The edgar.documents module is a complete HTML parsing infrastructure built specifically for SEC filings. At its core is HTMLParser, which transforms raw HTML into a structured Document object containing a tree of typed nodes—HeadingNode, ParagraphNode, TableNode, SectionNode, and XBRLNode.
The parser uses a multi-strategy architecture for section detection: it first attempts table-of-contents parsing, falls back to header pattern matching, and can detect cross-reference indexes used by some large filers like GE. Tables are semantically classified (financial statements, compensation tables, ownership tables) and can be extracted directly as pandas DataFrames. The module also includes DocumentSearch, which offers four search modes for finding content within a filing.
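The fallback idea can be sketched as an ordered list of strategies, where the first one that returns sections wins. Everything below is a hypothetical illustration: the function names, the toy text formats, and the position of the cross-reference strategy in the chain are assumptions, not the actual edgar.documents implementation.

```python
import re


def detect_from_toc(text):
    """Hypothetical strategy: use a table of contents if one is present."""
    if "table of contents" not in text.lower():
        return None
    # Toy TOC rows look like "Item 1 .... Business".
    rows = re.findall(r"^(Item \d+[A-Z]?) \.+ (.+)$", text, flags=re.MULTILINE)
    return {item: title.strip() for item, title in rows} or None


def detect_from_headers(text):
    """Hypothetical strategy: match header lines such as "Item 1. Business"."""
    rows = re.findall(r"^(Item \d+[A-Z]?)\. (.+)$", text, flags=re.MULTILINE)
    return {item: title.strip() for item, title in rows} or None


def detect_from_cross_reference(text):
    """Hypothetical strategy: read a cross-reference index ("Item 1 | Pages 4-12")."""
    if "cross-reference index" not in text.lower():
        return None
    rows = re.findall(r"^(Item \d+[A-Z]?) \| (.+)$", text, flags=re.MULTILINE)
    return {item: target.strip() for item, target in rows} or None


# Order matters: earlier strategies are preferred. Placing the cross-reference
# strategy last is an assumption made for this sketch.
STRATEGIES = [
    ("toc", detect_from_toc),
    ("headers", detect_from_headers),
    ("cross_reference", detect_from_cross_reference),
]


def detect_sections(text):
    """Return (strategy_name, sections) from the first strategy that succeeds."""
    for name, strategy in STRATEGIES:
        sections = strategy(text)
        if sections:
            return name, sections
    return "none", {}


toc_doc = "Table of Contents\nItem 1 .... Business\nItem 1A .... Risk Factors\n"
header_doc = "Item 1. Business\nSome text\nItem 7. Management's Discussion\n"
xref_doc = "Form 10-K Cross-Reference Index\nItem 1 | Pages 4-12\nItem 1A | Pages 30-41\n"

toc_result = detect_sections(toc_doc)
header_result = detect_sections(header_doc)
xref_result = detect_sections(xref_doc)
```

Keeping each strategy as a plain function that returns `None` on failure means a new filing layout can be supported by appending one more entry to the list.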
For large filings exceeding the streaming threshold (default 50MB), the parser switches to chunk-based processing to avoid memory exhaustion. The node tree supports visitor pattern traversal, enabling custom analysis pipelines without modifying the core parser.
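As a rough illustration of visitor-style traversal, the sketch below defines stand-in node classes that borrow names from this post. They are not the real edgar.documents classes, and the `accept`, `generic_visit`, and `visit_<ClassName>` conventions are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in tree node; not the real edgar.documents node type."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def accept(self, visitor):
        # Dispatch on the concrete class name, e.g. visit_HeadingNode,
        # falling back to generic_visit when no specific method exists.
        method = getattr(visitor, f"visit_{type(self).__name__}", visitor.generic_visit)
        method(self)
        for child in self.children:
            child.accept(visitor)


class HeadingNode(Node):
    pass


class ParagraphNode(Node):
    pass


class TableNode(Node):
    pass


class NodeCounter:
    """Example visitor: counts nodes by type and collects heading text."""

    def __init__(self):
        self.counts = {}
        self.headings = []

    def generic_visit(self, node):
        name = type(node).__name__
        self.counts[name] = self.counts.get(name, 0) + 1

    def visit_HeadingNode(self, node):
        self.generic_visit(node)
        self.headings.append(node.text)


tree = Node(children=[
    HeadingNode("Item 1. Business"),
    ParagraphNode("We make things."),
    HeadingNode("Item 1A. Risk Factors"),
    TableNode("(table)", children=[ParagraphNode("cell text")]),
])

counter = NodeCounter()
tree.accept(counter)
```

The analysis lives entirely in `NodeCounter`, so a different pipeline (say, collecting only tables) is just another visitor class and the node classes stay untouched.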
Part-Aware 10-Q Section Access
Previously, requesting "Item 1" from a 10-Q was ambiguous—it could mean Financial Statements (Part I) or Legal Proceedings (Part II). Now sections are properly qualified:
from edgar import Company
filing = Company("AAPL").get_filings(form="10-Q").latest()
tenq = filing.obj()
# Access Part I sections
financial_statements = tenq["Part I, Item 1"]
md_and_a = tenq["Part I, Item 2"]
# Access Part II sections
legal_proceedings = tenq["Part II, Item 1"]
risk_factors = tenq["Part II, Item 1A"]
# List all available sections
print(tenq.sections)
Cross-Reference Index Support
Some large filers (like GE) use cross-reference indexes instead of embedding content directly. The new parser detects and handles this pattern:
from edgar import Company
ge = Company("GE")
filing = ge.get_filings(form="10-K").latest()
tenk = filing.obj()
# Works even for cross-reference style filings
business_section = tenk["Item 1"]
Enhanced Document API
The filing document now exposes a richer API for analysis:
from edgar import Company
filing = Company("MSFT").get_filings(form="10-K").latest()
doc = filing.document
# Search within the document
results = doc.search("climate risk")
# Extract all tables
tables = doc.tables
# Access XBRL facts inline with the document
facts = doc.xbrl_facts
Conclusion
EdgarTools is open source and we welcome contributions:
- GitHub: https://github.com/dgunning/edgartools
- Issues: Report bugs or request features
- Discussions: Share how you're using EdgarTools
EdgarTools makes SEC EDGAR data accessible to Python developers. Whether you're building financial models, conducting research, or automating compliance workflows, EdgarTools provides a simple, powerful interface to SEC filings.