I learnt XBRL mappings from 32,000 SEC Filings

EdgarTools 5.22.0 introduces data-driven XBRL standardization built from 32,240 real SEC filings, with industry-aware concept mappings, multi-year statement stitching, and IFRS support for Python developers working with SEC EDGAR data.

EdgarTools ships frequently, typically two or three releases per week. Release 5.22.0 deserves its own post because it significantly upgrades the standardized XBRL mappings.

To build the new mappings, I analyzed every 10-K and 10-Q filed with XBRL over the past decade: 32,240 filings spanning thousands of companies across every industry. The analysis expanded concept coverage from 96 to 234 standardized financial concepts, revealed industry-specific differences in how companies use the same XBRL tags, and gave every mapping a confidence score backed by real filing data.

This post walks through what I found and how it improves multi-year financial comparisons, IFRS coverage, and stitched statement reliability.

Data-Driven XBRL Concept Mappings

Every mapping in the new gaap_mappings.json carries embedded metadata derived from real filings — not editorial judgment:

{
  "NetCashProvidedByUsedInOperatingActivities": {
    "standard_tags": ["NetCashFromOperatingActivities"],
    "display_name": "Net Cash from Operating Activities",
    "statement": "CashFlowStatement",
    "is_total": true,
    "confidence": 1.0,
    "company_count": 8137,
    "occurrence_rate": 0.905,
    "temporal_consistency": 0.917,
    "industry_overrides": { ... }
  }
}

Each entry answers questions that hand-curation can't:

- How confident is the mapping? A confidence of 1.0 means every filing I observed used this tag the same way. Lower scores flag tags that need industry-specific handling.
- How many companies use it? company_count: 8137 means this isn't a niche tag — it's a backbone concept. Tags used by only a handful of filers get lower weight.
- How stable is it over time? temporal_consistency: 0.917 means 91.7% of companies that used this tag in one year used it again the next. Low temporal consistency is an early warning that a tag is being phased out or renamed.

The result is 2,770 tags mapped to 234 standard concepts — up from 2,077 tags and 96 concepts in the hand-maintained file.
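
As a rough illustration of how this metadata could be consumed, here is a minimal sketch that loads an entry shaped like the one above and flags tags that fall below some thresholds. The helper names (needs_review, standard_concept) and the threshold values are my own assumptions for the example, not part of the EdgarTools API, and the elided industry_overrides block is left empty because its layout is not shown here.

```python
import json

# Hypothetical excerpt shaped like the entry shown above. The elided
# industry_overrides block is left empty because its layout is not shown.
SAMPLE = json.loads("""
{
  "NetCashProvidedByUsedInOperatingActivities": {
    "standard_tags": ["NetCashFromOperatingActivities"],
    "display_name": "Net Cash from Operating Activities",
    "statement": "CashFlowStatement",
    "is_total": true,
    "confidence": 1.0,
    "company_count": 8137,
    "occurrence_rate": 0.905,
    "temporal_consistency": 0.917,
    "industry_overrides": {}
  }
}
""")


def needs_review(mappings, min_confidence=0.9, min_consistency=0.8):
    """Return tags whose confidence or temporal consistency is below a threshold.

    The thresholds are arbitrary values chosen for this example.
    """
    return sorted(
        tag
        for tag, meta in mappings.items()
        if meta["confidence"] < min_confidence
        or meta["temporal_consistency"] < min_consistency
    )


def standard_concept(mappings, tag):
    """Return the first standard concept for a raw tag, or None if unmapped."""
    meta = mappings.get(tag)
    return meta["standard_tags"][0] if meta else None


print(standard_concept(SAMPLE, "NetCashProvidedByUsedInOperatingActivities"))
print(needs_review(SAMPLE, min_consistency=0.95))
```

Raising min_consistency to 0.95 flags the sample tag, because its temporal_consistency is 0.917.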

Industry-Aware XBRL Standardization

The analysis revealed something I couldn't have found by hand: 42 XBRL tags that mean different things depending on the industry filing them.

Take RetainedEarningsAccumulatedDeficit. For most companies, this is a total — the cumulative bottom line of the equity section. But for aerospace companies, banks, and agricultural firms, it's a line item within a larger equity structure. If your code treats it as a total everywhere, you'll silently double-count equity for entire industries.

To resolve these ambiguities, I needed an industry classification system. I chose the Fama-French 48 — a well-established framework from academic finance that groups companies into 48 industries based on SIC codes. Its granularity matches the level at which XBRL ambiguity actually occurs: banks behave differently from insurers, pharma differently from biotech.

The system works automatically:

  1. Each filing carries a SIC code. EdgarTools maps it to one of 48 industries (6022 → Banks, 3720 → Aero, 2830 → Drugs).
  2. When standardizing a tag, the system checks for an industry override. If one exists, it uses the industry-specific classification.
  3. 769 industry overrides across all 48 industries ensure that tags like RetainedEarningsAccumulatedDeficit are correctly treated as a total or a line item depending on who's filing.

No configuration required. If you call company.get_financials(), the industry detection happens behind the scenes.
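
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the SIC table holds only the three codes mentioned above (the real Fama-French 48 scheme assigns ranges of SIC codes to each industry), and the layout of the override entry and the is_total helper are hypothetical.

```python
# Toy SIC table with only the three codes mentioned above. The real
# Fama-French 48 scheme maps ranges of SIC codes to each industry.
SIC_TO_INDUSTRY = {6022: "Banks", 3720: "Aero", 2830: "Drugs"}

# Hypothetical mapping entry. The override layout is an assumption.
MAPPINGS = {
    "RetainedEarningsAccumulatedDeficit": {
        "is_total": True,
        "industry_overrides": {
            "Banks": {"is_total": False},
            "Aero": {"is_total": False},
        },
    }
}


def is_total(tag, sic_code, mappings=MAPPINGS):
    """Resolve is_total for a tag, preferring an industry override if present."""
    entry = mappings[tag]
    industry = SIC_TO_INDUSTRY.get(sic_code)  # None for unknown SIC codes
    override = entry["industry_overrides"].get(industry, {})
    return override.get("is_total", entry["is_total"])


print(is_total("RetainedEarningsAccumulatedDeficit", 6022))  # bank: override applies
print(is_total("RetainedEarningsAccumulatedDeficit", 2830))  # no override: default
```

Falling back to the default classification when no override exists keeps the common case cheap and makes an unknown SIC code harmless.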

Stitching That Survives Concept Evolution

Companies don't just differ from each other — they differ from their past selves. When you stitch together multiple years of filings to build a time series, you're assuming the same financial line item has the same tag across years. It often doesn't. Apple used aapl:DerivativeInstrument in one year and switched to us-gaap:CashFlowHedge the next. Disney broke CostOfGoodsAndServicesSold into Service and Product components on a dimensional axis, with no aggregate total at all.

The stitching engine now handles these shifts with four new strategies:

Same-label merging. When a company switches XBRL concepts between fiscal years but the financial line item hasn't changed, the stitcher detects complementary rows (same label, non-overlapping periods) and merges them. Apple's derivative instrument tag change no longer produces two half-populated rows.
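
A minimal sketch of the complementary-row idea, assuming each row is a dict with a label, a list of concepts, and a period-to-value mapping. The row layout, the label text, the fiscal years, and the values are all invented for illustration; the real stitcher's data structures may differ.

```python
def merge_same_label_rows(rows):
    """Merge rows that share a label and have no periods in common."""
    merged = []
    for row in rows:
        for target in merged:
            same_label = target["label"] == row["label"]
            disjoint = not (target["values"].keys() & row["values"].keys())
            if same_label and disjoint:
                target["values"].update(row["values"])
                target["concepts"].extend(row["concepts"])
                break
        else:
            # No complementary row found: keep a copy so inputs are not mutated.
            merged.append({
                "label": row["label"],
                "concepts": list(row["concepts"]),
                "values": dict(row["values"]),
            })
    return merged


# Invented label, periods, and values, using the concept names from the text.
rows = [
    {"label": "Derivative instruments",
     "concepts": ["aapl:DerivativeInstrument"], "values": {"FY1": 1.0}},
    {"label": "Derivative instruments",
     "concepts": ["us-gaap:CashFlowHedge"], "values": {"FY2": 2.0}},
]
print(merge_same_label_rows(rows))
```

Rows with the same label but overlapping periods are deliberately left apart, since overlapping values mean the two concepts were reported side by side.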

Concept alias detection. XBRL has a verbosity problem. NetCashFromOperatingActivitiesContinuingOperations is often just a longer name for NetCashFromOperatingActivities. The stitcher uses substring containment to identify these aliases, with a critical safety guard: overlapping period values must agree. This prevents false merges between genuinely different line items that happen to share name fragments.

# These are aliases — one name contains the other, and values agree:
NetCashFromOperatingActivities
NetCashFromOperatingActivitiesContinuingOperations

# These are NOT aliases: the names share a long fragment, but they are different economic items:
NetCashProvidedByUsedInOperatingActivities
CashProvidedByUsedInOperatingActivitiesDiscontinuedOperations
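
Here is a minimal sketch of the alias check described above: substring containment on the names, plus agreement on every period both rows report. The function name, the period keys, and the values are assumptions for illustration. I also assume that rows with no shared periods pass the value guard; the real stitcher may be stricter.

```python
import math


def is_alias(name_a, values_a, name_b, values_b):
    """Treat two concepts as aliases if one name contains the other
    and their values agree in every period they both report."""
    if not (name_a in name_b or name_b in name_a):
        return False
    shared = values_a.keys() & values_b.keys()
    # With no shared periods this is vacuously True (an assumption here).
    return all(math.isclose(values_a[p], values_b[p]) for p in shared)


short = "NetCashFromOperatingActivities"
long_name = "NetCashFromOperatingActivitiesContinuingOperations"

# Invented values: the overlapping period agrees, so the pair is an alias.
print(is_alias(short, {"FY1": 100.0, "FY2": 120.0}, long_name, {"FY2": 120.0}))

# Same names, but the overlapping period disagrees, so the merge is refused.
print(is_alias(short, {"FY2": 120.0}, long_name, {"FY2": 95.0}))
```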

Equivalent standard concepts. Some concept pairs are economically identical but have different standard names: CashAndCashEquivalents and CashAndMarketableSecurities describe the same balance sheet line. A declared equivalence table tells the stitcher to merge these without requiring the substring heuristic.
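
A declared equivalence table can be as simple as a list of sets. The sketch below is an assumption about shape only: the EQUIVALENT_CONCEPTS name, the canonical helper, and the choice of representative are mine, and the single group comes from the pair mentioned above.

```python
# Hypothetical equivalence table with the one pair named in the text.
EQUIVALENT_CONCEPTS = [
    {"CashAndCashEquivalents", "CashAndMarketableSecurities"},
]


def canonical(concept):
    """Map a concept to one representative of its equivalence group."""
    for group in EQUIVALENT_CONCEPTS:
        if concept in group:
            # min() gives a deterministic representative for the group.
            return min(group)
    return concept


print(canonical("CashAndMarketableSecurities"))
print(canonical("Revenue"))
```

Rows whose concepts share a canonical name can then be merged without relying on the substring heuristic.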

Graceful gaps. Not every filing has every statement. VALE's 20-F filings lack a cash flow presentation role. Previously, stitching would abort. Now it skips the missing period and continues, producing a time series with a gap instead of no time series at all.
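
The skip-and-continue behavior can be sketched like this. The stitch function, the load_statement callback, and the sample data are hypothetical; the point is only that a missing statement leaves a gap rather than aborting the whole series.

```python
def stitch(periods, load_statement):
    """Build {concept: {period: value}}, skipping periods with no statement."""
    series = {}
    for period in periods:
        statement = load_statement(period)
        if statement is None:
            continue  # missing statement: leave a gap and keep going
        for concept, value in statement.items():
            series.setdefault(concept, {})[period] = value
    return series


# Invented data: the middle period has no cash flow statement.
STATEMENTS = {
    "2021": {"NetCashFromOperatingActivities": 10.0},
    "2023": {"NetCashFromOperatingActivities": 12.0},
}

print(stitch(["2021", "2022", "2023"], STATEMENTS.get))
```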

IFRS Support: International Filers

Companies that file on U.S. exchanges using IFRS — Novo Nordisk, Sony, Sanofi — use an entirely separate tag vocabulary. I added 150 IFRS tag mappings that normalize these to their US-GAAP equivalents:

ifrs-full_Revenue           → Revenue
ifrs-full_CostOfSales       → CostOfGoodsAndServicesSold
ifrs-full_GrossProfit       → GrossProfit
ifrs-full_ProfitLoss        → NetIncome

I verified on Novo Nordisk's 20-F filing: 93% of income statement concepts, 78% of balance sheet concepts, and 76% of cash flow concepts now resolve correctly. That's the difference between a statement with scattered blanks and one that's actually usable.

IFRS concepts are also integrated into the stitching engine's ordering templates, so multi-year international statements display line items in the expected financial order.
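
To make the normalization concrete, here is a small sketch using only the four mappings listed above. The normalize helper and the coverage calculation are my own illustration of how a resolution percentage could be computed, and the unmapped tag in the example is a made-up placeholder.

```python
# The four IFRS mappings listed above (the full table has 150).
IFRS_TO_STANDARD = {
    "ifrs-full_Revenue": "Revenue",
    "ifrs-full_CostOfSales": "CostOfGoodsAndServicesSold",
    "ifrs-full_GrossProfit": "GrossProfit",
    "ifrs-full_ProfitLoss": "NetIncome",
}


def normalize(tags):
    """Return (resolved mapping, share of tags that resolved)."""
    resolved = {t: IFRS_TO_STANDARD[t] for t in tags if t in IFRS_TO_STANDARD}
    coverage = len(resolved) / len(tags) if tags else 0.0
    return resolved, coverage


tags = [
    "ifrs-full_Revenue",
    "ifrs-full_CostOfSales",
    "ifrs-full_ProfitLoss",
    "ifrs-full_HypotheticalUnmappedTag",  # placeholder, not a real tag
]
resolved, coverage = normalize(tags)
print(resolved)
print(coverage)
```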

Dimensional Total Synthesis

Some companies report a financial line item only as dimensional components — without an aggregate total. Disney reports CostOfGoodsAndServicesSold broken into Service costs and Product costs on a ProductOrServiceAxis dimension, but never as a single number.

EdgarTools now synthesizes the total by summing the dimensional members:

  1. Group all dimensional facts by their axis
  2. Pick the axis with the most members (the most complete breakdown)
  3. Sum the values
  4. Guard against nonsensical aggregation — per-share values and ratios are excluded

This means CostOfGoodsAndServicesSold now appears on Disney's income statement even though Disney never reported it as a single number.
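
The four steps above can be sketched as follows. This is a simplified illustration, not the library's code: the fact layout, the member names, and the values are invented, and I assume the guard against per-share values and ratios can be approximated by checking the fact's unit.

```python
from collections import defaultdict

# Assumption: per-share values and ratios are identified by their unit.
NON_ADDITIVE_UNITS = {"USD/shares", "pure"}


def synthesize_total(facts):
    """Sum the members of the axis that has the most members."""
    by_axis = defaultdict(dict)
    for fact in facts:
        if fact["unit"] in NON_ADDITIVE_UNITS:
            continue  # step 4: never aggregate per-share values or ratios
        by_axis[fact["axis"]][fact["member"]] = fact["value"]  # step 1
    if not by_axis:
        return None
    best_axis = max(by_axis, key=lambda axis: len(by_axis[axis]))  # step 2
    return sum(by_axis[best_axis].values())  # step 3


# Invented values on the axis named in the text, plus a sparser second axis.
facts = [
    {"axis": "ProductOrServiceAxis", "member": "ServiceMember", "value": 100.0, "unit": "USD"},
    {"axis": "ProductOrServiceAxis", "member": "ProductMember", "value": 50.0, "unit": "USD"},
    {"axis": "HypotheticalSegmentAxis", "member": "OnlyMember", "value": 70.0, "unit": "USD"},
]
print(synthesize_total(facts))
```

Picking a single axis matters: summing across axes would count the same cost once per breakdown.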

What This Means for You

If you use company.get_financials(), multi-year financial comparisons are significantly more reliable. Stitched statements have fewer gaps, fewer duplicate rows, and correct industry-specific classification. International filers work. Dimensional breakdowns aggregate properly.

You don't need to know any of the above to benefit from it. The intelligence is behind the same API:

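from edgar import Company, set_identity

# The SEC asks API clients to identify themselves; replace with your own details.
set_identity("Your Name your.email@example.com")

company = Company("DIS")
financials = company.get_financials()
income = financials.income_statement()
print(income)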

The statement that comes back is cleaner, more complete, and more correct than it was in 5.21. That's the goal — the complexity stays in the library so it doesn't have to live in your code.

You can use the new standardization now by upgrading edgartools:

pip install -U edgartools

See what I'm building

The improvements in this release are already running on edgar.tools:

  • Disney — dimensional total synthesis on CostOfGoodsAndServicesSold
  • Apple — stitched multi-year statements with concept evolution handled

No code, same engine. Browse any company →
