Edgartools 5

The fifth major version of EdgarTools is now live on PyPI. It features a complete rewrite of the HTML parser and several API improvements.

Last week I released EdgarTools v5.0. This was not the most significant architectural update since the library's inception, but it is still important. Under the hood is a ground-up rewrite of how EdgarTools parses SEC HTML filings. This is a feature I am immensely proud of, though annoyingly Claude Sonnet 4 has been claiming co-author status on every git commit.

The Problem We Solved

Parsing and rendering HTML is an extremely hard problem, and SEC filings come in remarkably diverse formats. A 10-K from Apple looks completely different from one filed by General Electric. Some use tables of contents, others use cross-reference indexes. Some have clear section headers, others bury them in nested tables.

Our original parser handled the common cases well, but struggled with edge cases:

  • GE-style filings: Companies using cross-reference indexes instead of traditional sections
  • 10-Q ambiguity: "Item 1" appears in both Part I (Financial Statements) and Part II (Legal Proceedings)
  • Complex nesting: Sections buried deep in table structures

Under the Hood: The edgar.documents Module

The edgar.documents module is a complete HTML parsing infrastructure built specifically for SEC filings. At its core is HTMLParser, which transforms raw HTML into a structured Document object containing a tree of typed nodes: HeadingNode, ParagraphNode, TableNode, SectionNode, and XBRLNode.
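
To make the idea of a typed node tree concrete, here is a minimal sketch. The class names mirror the ones above, but the fields, the walk() helper, and the sample content are my own assumptions for illustration, not the actual definitions in edgar.documents.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Base class for every node in this illustrative tree."""
    children: list[Node] = field(default_factory=list)

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class HeadingNode(Node):
    text: str = ""
    level: int = 1  # assumed field: heading depth


@dataclass
class ParagraphNode(Node):
    text: str = ""


@dataclass
class TableNode(Node):
    rows: list[list[str]] = field(default_factory=list)  # assumed field


@dataclass
class SectionNode(Node):
    name: str = ""


# A tiny hand-built document: one section holding a heading, a paragraph, a table.
doc = SectionNode(
    name="Item 1",
    children=[
        HeadingNode(text="Item 1. Business", level=2),
        ParagraphNode(text="We design and sell widgets."),
        TableNode(rows=[["Segment", "Revenue"], ["Widgets", "100"]]),
    ],
)

# Because nodes are typed, picking out one kind is a simple isinstance filter.
tables = [node for node in doc.walk() if isinstance(node, TableNode)]
node_types = [type(node).__name__ for node in doc.walk()]
print(node_types)
```

The point of the typing is that downstream code can ask for "all tables" or "all headings" without re-inspecting HTML tags.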

The parser uses a multi-strategy architecture for section detection: it first attempts table-of-contents parsing, falls back to header pattern matching, and can detect cross-reference indexes used by some large filers like GE. Tables are semantically classified (financial statements, compensation tables, ownership tables) and can be extracted directly as pandas DataFrames. The module includes DocumentSearch with four search modes that should make it easier to search content in filings.
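
The fallback chain can be sketched roughly as follows. This is a simplified illustration of the "try each strategy in order" idea, not the parser's real code: the function names, the regular expressions, and the return shapes are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Sections = Dict[str, str]
Strategy = Callable[[str], Optional[Sections]]

# Hypothetical patterns: a TOC link to an anchor, and an "Item N" heading.
TOC_LINK = re.compile(r'<a href="#([\w-]+)">\s*(Item\s+\d+[A-C]?)\b', re.IGNORECASE)
HEADER = re.compile(r"<(?:h[1-6]|b)>\s*(Item\s+\d+[A-C]?)\b", re.IGNORECASE)


def normalize(item: str) -> str:
    """Collapse whitespace and standardize case, e.g. 'ITEM  1a' -> 'Item 1A'."""
    return " ".join(item.split()).title()


def from_table_of_contents(html: str) -> Optional[Sections]:
    """Map each item to the anchor its table-of-contents link points at."""
    found = {normalize(item): f"anchor:{anchor}" for anchor, item in TOC_LINK.findall(html)}
    return found or None


def from_header_patterns(html: str) -> Optional[Sections]:
    """Map each item to the character offset of its heading."""
    found = {normalize(m.group(1)): f"offset:{m.start()}" for m in HEADER.finditer(html)}
    return found or None


def detect_sections(html: str, strategies: List[Strategy]) -> Tuple[str, Sections]:
    """Return the name of the first strategy that finds sections, plus its result."""
    for strategy in strategies:
        sections = strategy(html)
        if sections:
            return strategy.__name__, sections
    return "none", {}


STRATEGIES: List[Strategy] = [from_table_of_contents, from_header_patterns]

with_toc = '<a href="#i1">Item 1. Business</a><a href="#i1a">Item 1A. Risk Factors</a>'
headers_only = "<h2>Item 1. Business</h2><p>text</p><h2>Item 2. Properties</h2>"

print(detect_sections(with_toc, STRATEGIES))
print(detect_sections(headers_only, STRATEGIES)[0])
```

Keeping each strategy as a plain function that returns None on failure makes it easy to add another one, such as a cross-reference index detector, without touching the others.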

For large filings exceeding the streaming threshold (default 50MB), the parser switches to chunk-based processing to avoid memory exhaustion. The node tree supports visitor pattern traversal, enabling custom analysis pipelines without modifying the core parser.
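
For readers unfamiliar with the visitor pattern, here is a small self-contained sketch of how a custom analysis can ride along a node tree without changing the node classes. The Node, Visitor, and WordCounter classes below are hypothetical stand-ins, not the real edgar.documents types or method names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Stand-in node type; accept() drives the traversal."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)

    def accept(self, visitor: "Visitor") -> None:
        visitor.visit(self)
        for child in self.children:
            child.accept(visitor)


class Heading(Node):
    pass


class Paragraph(Node):
    pass


class Table(Node):
    pass


class Visitor:
    """Dispatch to visit_<classname> if it exists, otherwise do nothing."""

    def visit(self, node: Node) -> None:
        handler = getattr(self, f"visit_{type(node).__name__.lower()}", self.generic_visit)
        handler(node)

    def generic_visit(self, node: Node) -> None:
        pass


class WordCounter(Visitor):
    """Example pipeline: count paragraph words and tables."""

    def __init__(self) -> None:
        self.words = 0
        self.tables = 0

    def visit_paragraph(self, node: Node) -> None:
        self.words += len(node.text.split())

    def visit_table(self, node: Node) -> None:
        self.tables += 1


tree = Node(
    children=[
        Heading("Risk Factors"),
        Paragraph("Our business faces competition."),
        Table(),
        Paragraph("Demand may decline."),
    ]
)

counter = WordCounter()
tree.accept(counter)
print(counter.words, counter.tables)
```

A new analysis is just a new Visitor subclass; the tree and the parser stay untouched.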

Part-Aware 10-Q Section Access

Previously, requesting "Item 1" from a 10-Q was ambiguous—it could mean Financial Statements (Part I) or Legal Proceedings (Part II). Now sections are properly qualified:

from edgar import Company

filing = Company("AAPL").get_filings(form="10-Q").latest()
tenq = filing.obj()

# Access Part I sections
financial_statements = tenq["Part I, Item 1"]
md_and_a = tenq["Part I, Item 2"]

# Access Part II sections
legal_proceedings = tenq["Part II, Item 1"]
risk_factors = tenq["Part II, Item 1A"]

# List all available sections
print(tenq.sections)
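
To show why the part qualifier matters, here is a rough sketch of resolving section keys against a table keyed by (part, item). The parse_key and lookup helpers and the sections dictionary are hypothetical; they illustrate the disambiguation idea and are not how EdgarTools implements it.

```python
import re
from typing import Dict, Optional, Tuple

# Accepts "Part I, Item 1", "Part II, Item 1A", or a bare "Item 2".
KEY = re.compile(
    r"^(?:Part\s+(?P<part>I{1,2})\s*,\s*)?Item\s+(?P<item>\d+[A-C]?)$",
    re.IGNORECASE,
)


def parse_key(key: str) -> Tuple[Optional[str], str]:
    """Split a section key into (part, item); part is None when omitted."""
    match = KEY.match(key.strip())
    if match is None:
        raise KeyError(key)
    part = match.group("part")
    return (part.upper() if part else None, match.group("item").upper())


def lookup(sections: Dict[Tuple[str, str], str], key: str) -> str:
    """Resolve a key, refusing bare items that exist in more than one part."""
    part, item = parse_key(key)
    if part is not None:
        return sections[(part, item)]
    candidates = [title for (_, i), title in sections.items() if i == item]
    if not candidates:
        raise KeyError(key)
    if len(candidates) > 1:
        raise KeyError(f"{key!r} is ambiguous; qualify it with a part")
    return candidates[0]


# Illustrative 10-Q layout, using the section names from this post.
sections = {
    ("I", "1"): "Financial Statements",
    ("I", "2"): "Management's Discussion and Analysis",
    ("II", "1"): "Legal Proceedings",
    ("II", "1A"): "Risk Factors",
}

print(lookup(sections, "Part I, Item 1"))
print(lookup(sections, "Part II, Item 1"))
print(lookup(sections, "Item 1A"))
```

In this sketch a bare "Item 1A" still resolves because only one part contains it, while a bare "Item 1" raises instead of silently picking one.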

Cross-Reference Index Support

Some large filers (like GE) use cross-reference indexes instead of embedding content directly. The new parser detects and handles this pattern:

from edgar import Company

ge = Company("GE")
filing = ge.get_filings(form="10-K").latest()
tenk = filing.obj()

# Works even for cross-reference style filings
business_section = tenk["Item 1"]
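
Conceptually, a cross-reference index maps each item to the pages where its content lives elsewhere in the filing. The sketch below assumes the index has already been reduced to a mapping from item to a page-range string; the helper names, the data shapes, and the sample pages are made up for illustration and say nothing about the parser's internals.

```python
from typing import Dict, List


def parse_page_ranges(spec: str) -> List[int]:
    """Expand a spec such as '4-5, 9' into [4, 5, 9]."""
    pages: List[int] = []
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            start, end = (int(part) for part in chunk.split("-", 1))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(chunk))
    return pages


def resolve_item(index: Dict[str, str], pages: Dict[int, str], item: str) -> str:
    """Stitch together the text of every page the index lists for an item."""
    wanted = parse_page_ranges(index[item])
    return "\n".join(pages[number] for number in wanted if number in pages)


# Hypothetical index rows and page contents.
index = {"Item 1": "4-5, 9", "Item 1A": "12"}
pages = {
    4: "Overview of the business.",
    5: "Segments.",
    9: "Competition.",
    12: "Risk factors.",
}

print(resolve_item(index, pages, "Item 1"))
```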

Enhanced Document API

The filing document now exposes a richer API for analysis:

from edgar import Company

filing = Company("MSFT").get_filings(form="10-K").latest()
doc = filing.document

# Search within the document
results = doc.search("climate risk")

# Extract all tables
tables = doc.tables

# Access XBRL facts inline with the document
facts = doc.xbrl_facts

Conclusion

EdgarTools is open source, and we welcome contributions.

EdgarTools makes SEC EDGAR data accessible to Python developers. Whether you're building financial models, conducting research, or automating compliance workflows, EdgarTools provides a simple, powerful interface to SEC filings.
