What's in SEC EDGAR? Nearly 1 Million Entities

The SEC EDGAR database has almost a million registered entities, and nearly a third of them are individual people. Here's what the data looks like, and how to navigate it with Python.

The SEC assigns a Central Index Key (CIK) to every entity that files with it. As of early 2026, there are 980,843 of them.

[Figure: SEC EDGAR entity breakdown, from 980K CIKs to 8K listed companies]

284,000 of Them Are People

Almost a third of SEC CIKs belong to individual people, not companies. The individuals are corporate insiders filing Forms 3, 4, and 5 when they buy or sell stock, investment adviser principals named on Form ADV registrations, and beneficial owners filing Schedule 13D/G. Jamie Dimon has a CIK. So does every director who's ever filed a stock transaction.

Jensen Huang, NVIDIA's CEO, actually has two CIKs: 1106277 "HUANG JEN-HSUN" from his first 5 filings in 2010, and 1197649 "HUANG JEN HSUN" (without the hyphen). It's the same person, registered under two slightly different spellings of his name. The two records will probably never be merged, since SEC records are mostly immutable, and his later filings use the second CIK.
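
One way to surface duplicates like this is to group CIKs by a normalized form of the filer name. The sketch below is illustrative only: normalize_name and find_duplicate_names are hypothetical helpers, not part of EdgarTools, and the "DIMON JAMES" name string is assumed for the example. Name matching alone will also flag unrelated people who happen to share a name, so treat any hit as a candidate for review rather than a confirmed duplicate.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Uppercase, turn punctuation into spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(cleaned.split())


def find_duplicate_names(records):
    """Group (cik, name) pairs by normalized name; keep names with 2+ CIKs."""
    groups = defaultdict(list)
    for cik, name in records:
        groups[normalize_name(name)].append(cik)
    return {name: ciks for name, ciks in groups.items() if len(ciks) > 1}


records = [
    (1106277, "HUANG JEN-HSUN"),
    (1197649, "HUANG JEN HSUN"),
    (1195345, "DIMON JAMES"),  # name string assumed for illustration
]

duplicates = find_duplicate_names(records)
print(duplicates)  # {'HUANG JEN HSUN': [1106277, 1197649]}
```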

The SEC doesn't have a clean flag to distinguish people from companies, so EdgarTools had to reverse-engineer it. The is_company and is_individual checks use a priority-based signal system: whether the entity has tickers, a state of incorporation, company-only form types like 10-K or S-1, an EIN, or name keywords like "INC", "CORP", "LLC", or "FUND". There are edge cases everywhere (Reed Hastings has a state of incorporation, Warren Buffett has an EIN), but the combination of signals works reliably across the full dataset.

from edgar import Entity

entity = Entity("0000019617")    # JPMorgan Chase
entity.is_company                # True

entity = Entity("0001195345")    # Jamie Dimon
entity.is_individual             # True
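
To make the idea of a priority-based signal system concrete, here is a minimal standalone sketch. It is not the EdgarTools implementation: EntitySignals and looks_like_company are hypothetical names, the ordering of the checks is an assumption, and so is the rule that a state of incorporation or an EIN is ignored when an entity has only ever filed insider forms. The example entities are made up.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set

COMPANY_ONLY_FORMS = {"10-K", "S-1"}   # form types named in the text
INSIDER_FORMS = {"3", "4", "5"}        # insider ownership forms
NAME_KEYWORDS = re.compile(r"\b(INC|CORP|LLC|FUND)\b")


@dataclass
class EntitySignals:
    name: str
    tickers: List[str] = field(default_factory=list)
    forms: Set[str] = field(default_factory=set)
    state_of_incorporation: Optional[str] = None
    ein: Optional[str] = None


def looks_like_company(e: EntitySignals) -> bool:
    # Strong signals decide immediately, checked in priority order.
    if e.tickers:
        return True
    if e.forms & COMPANY_ONLY_FORMS:
        return True
    if NAME_KEYWORDS.search(e.name.upper()):
        return True
    # Weak signals: some individuals carry a state of incorporation or an EIN,
    # so (assumption) ignore them when the entity has filed only insider forms.
    only_insider_forms = bool(e.forms) and e.forms <= INSIDER_FORMS
    if (e.state_of_incorporation or e.ein) and not only_insider_forms:
        return True
    return False


# Made-up entities for illustration
listed = EntitySignals("EXAMPLE HOLDINGS INC", tickers=["EXMP"], forms={"10-K"})
insider = EntitySignals("DOE JANE", forms={"4"}, state_of_incorporation="DE")
private = EntitySignals("EXAMPLE PARTNERS", forms={"D"}, ein="00-0000000")

print(looks_like_company(listed))   # True
print(looks_like_company(insider))  # False
print(looks_like_company(private))  # True
```

The point of ordering the checks is that a weak signal never overrides a strong one, which is how an insider with a state of incorporation can still come out as an individual.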

From 697K Companies to 8K Listed Stocks

With individuals filtered out, we're left with ~697K company entities. EdgarTools exposes 563,611 of those with metadata (name, SIC code, exchange, state of incorporation). Most of them are private companies, defunct shells, SPVs, or entities that never traded publicly.

Only 7,928 have a ticker symbol at all. Those break down across exchanges like this:

Filter                      Count
All SEC company entities    563,611
With any ticker             7,928
Nasdaq                      3,372
NYSE                        2,601
OTC                         1,738
CBOE                        16
No exchange listed          201

The OTC companies are a mix of penny stocks, foreign ADRs, and companies that were delisted from a major exchange. Depending on what you're building, you might want them or you might not. For most screening and research work, NYSE + Nasdaq (5,973 companies) is the practical universe.

from edgar.reference import get_companies_by_exchanges

# Major exchanges only
listed = get_companies_by_exchanges(['NYSE', 'Nasdaq'])
print(len(listed))  # ~5,973

# Include OTC
all_traded = get_companies_by_exchanges(['NYSE', 'Nasdaq', 'OTC'])
print(len(all_traded))  # ~7,711
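
As a quick sanity check, the counts in the table above can be added up without touching the SEC at all. This sketch just hardcodes the published figures; the variable names are arbitrary.

```python
# Counts copied from the exchange breakdown table
ticker_counts = {
    "Nasdaq": 3372,
    "NYSE": 2601,
    "OTC": 1738,
    "CBOE": 16,
    "No exchange listed": 201,
}
all_companies = 563_611

with_ticker = sum(ticker_counts.values())
major = ticker_counts["NYSE"] + ticker_counts["Nasdaq"]
major_plus_otc = major + ticker_counts["OTC"]

print(with_ticker)                            # 7928
print(major)                                  # 5973
print(major_plus_otc)                         # 7711
print(f"{with_ticker / all_companies:.1%}")   # 1.4%
```

Only about 1.4% of the company entities with metadata have a ticker at all.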

Classifying SEC Entities: 12 Business Categories

Even within those ~8,000 companies with tickers, you can't treat them all the same. The SEC assigns every entity a type (operating, investment, or other), but it's not much help for classification. A quarter of listed companies end up as "other," including real pharma and software companies. And a quarter of what the SEC calls "operating" is actually asset-backed securities vehicles. The labels were never designed for analytical use.

EdgarTools classifies every SEC company into one of 12 business categories using multiple signals: SIC codes first, then SEC form types, then name patterns. Each layer catches what the previous one missed.

Category            What it is                        Examples
Operating Company   Standard businesses               AAPL, MSFT, TSLA
ETF                 Exchange-traded funds             SPY, QQQ, ARKB
Mutual Fund         Open-end funds                    VFIAX
Closed-End Fund     Closed-end funds                  UTF, GOF
BDC                 Business development companies    ARCC, MAIN
REIT                Real estate investment trusts     O, AMT, PLD
Investment Manager  Asset managers                    BLK, BX, KKR
Bank                Commercial banks                  JPM, BAC, WFC
Insurance Company   Life, health, P&C                 ALL, MET, PRU
SPAC                Blank-check companies             Various
Holding Company     Pure holding companies            BRK-A
Unknown             Unclassifiable

This matters because different entity types need completely different analytical frameworks. P/E ratios don't mean anything for an ETF, and revenue growth is meaningless for a SPAC shell. You screen REITs on FFO and dividend yield, banks on net interest margin, and operating companies on earnings and cash flow. The category tells you which framework to use.
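
In a screener, that choice of framework can be as simple as a lookup keyed on the category. The mapping below is a sketch that covers only the three categories and metrics mentioned above; SCREENING_METRICS and metrics_for are hypothetical names, not part of EdgarTools.

```python
# Metrics per category, taken from the paragraph above
SCREENING_METRICS = {
    "REIT": ["FFO", "dividend yield"],
    "Bank": ["net interest margin"],
    "Operating Company": ["earnings", "cash flow"],
}


def metrics_for(category: str) -> list:
    """Return the metrics to screen on, or an empty list if none are defined."""
    return SCREENING_METRICS.get(category, [])


print(metrics_for("REIT"))  # ['FFO', 'dividend yield']
print(metrics_for("SPAC"))  # []
```

Returning an empty list for categories like SPAC makes it explicit that no standard operating metric applies, instead of silently falling back to P/E or revenue growth.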

Why SIC codes alone aren't enough

No single signal is enough on its own. SIC codes are the most reliable indicator (6798 is always a REIT, 6770 is always a SPAC), but many entities have misleading SIC codes.

Crypto and commodity ETFs are a good example. The Bitwise Bitcoin ETF, the ARK 21Shares Bitcoin ETF, and the Goldman Sachs Physical Gold ETF all have SIC code 6211: "Security Brokers, Dealers and Flotation Companies." That's the same code as actual broker-dealers. These ETFs don't file investment company forms (N-CSR, NPORT-P) because they're structured as grantor trusts, not 1940 Act funds, but they do have "ETF" right in their name.

SPACs have a similar problem. Many file under their target sector's SIC code rather than 6770 (Blank Checks). "Altenergy Acquisition Corp" has SIC 3711 (Motor Vehicles). "Best SPAC I Acquisition Corp" has SIC 8200 (Educational Services). Their names follow the unmistakable "Acquisition Corp" pattern, but the SIC codes say otherwise.

The classifier handles these with a priority chain:

  1. Definitive SIC codes: REIT, SPAC, Bank, Insurance
  2. Investment company forms: N-CSR/NPORT filings signal funds
  3. SPAC name patterns: "Acquisition Corp", "Blank Check"
  4. BDC indicators: N-2 forms or 814- file numbers
  5. Investment managers: 13F filers with broker/adviser SIC
  6. Holding companies: SIC 6719
  7. ETF names: "ETF" in the company name
  8. Fund/trust names: SIC 6200s with "Trust" or "Fund" in name
  9. Default: Operating Company

Each layer catches entities that slipped through the one above it.
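
The chain above can be sketched as an ordered series of checks where the first match wins. This is a simplified illustration, not the EdgarTools classifier: classify and its parameters are hypothetical, the bank and insurance SIC codes are single example codes rather than full ranges, the 6282 adviser code is an assumed addition, and the categories returned by steps 2 and 8 are assumptions because the list does not say which fund category those signals map to.

```python
import re

DEFINITIVE_SIC = {
    6798: "REIT",
    6770: "SPAC",
    6021: "Bank",               # example code only: national commercial banks
    6311: "Insurance Company",  # example code only: life insurance
}
FUND_FORMS = {"N-CSR", "NPORT-P"}
MANAGER_SIC = {6211, 6282}      # broker-dealers, investment advice (assumed set)
SPAC_PATTERNS = ("ACQUISITION CORP", "BLANK CHECK")
ETF_WORD = re.compile(r"\bETF\b")  # word match, so "NETFLIX" does not count


def classify(name, sic=None, forms=(), file_numbers=()):
    upper = name.upper()
    forms = set(forms)
    # 1. Definitive SIC codes
    if sic in DEFINITIVE_SIC:
        return DEFINITIVE_SIC[sic]
    # 2. Investment company forms (ETF vs. mutual fund split is a simplification)
    if forms & FUND_FORMS:
        return "ETF" if ETF_WORD.search(upper) else "Mutual Fund"
    # 3. SPAC name patterns
    if any(pattern in upper for pattern in SPAC_PATTERNS):
        return "SPAC"
    # 4. BDC indicators: N-2 forms or 814- file numbers
    if "N-2" in forms or any(num.startswith("814-") for num in file_numbers):
        return "BDC"
    # 5. Investment managers: 13F filers with a broker/adviser SIC
    if sic in MANAGER_SIC and any(form.startswith("13F") for form in forms):
        return "Investment Manager"
    # 6. Holding companies
    if sic == 6719:
        return "Holding Company"
    # 7. ETF names
    if ETF_WORD.search(upper):
        return "ETF"
    # 8. Fund/trust names in the SIC 6200s (target category is an assumption)
    if sic is not None and 6200 <= sic <= 6299 and re.search(r"\b(TRUST|FUND)\b", upper):
        return "ETF"
    # 9. Default
    return "Operating Company"


print(classify("Bitwise Bitcoin ETF", sic=6211))         # ETF
print(classify("Altenergy Acquisition Corp", sic=3711))  # SPAC
print(classify("Netflix Inc"))                           # Operating Company
```

The Bitcoin ETF falls all the way through to step 7 because its SIC code and filings say nothing useful, while the SPAC is caught at step 3 despite its Motor Vehicles SIC code.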

SIC Code Industry Distribution on NYSE and Nasdaq

With ~8,000 companies classified and filtered, the next question is where they cluster by industry. The ~6,000 listed on major exchanges span 395 unique SIC codes, but the distribution is heavily concentrated.

[Figure: SIC code industry concentration on NYSE and Nasdaq]

The top 10 industries cover 37.5% of all listed companies. Pharmaceutical Preparations alone accounts for 528 companies, nearly 9% of the exchange-listed universe. Add biotech (161) and medical devices (123), and life sciences represents over 13% of all listed entities.

from edgar.reference import (
    get_pharmaceutical_companies,
    get_biotechnology_companies,
    get_semiconductor_companies,
    get_banking_companies,
    get_companies_by_industry,
)

pharma = get_pharmaceutical_companies()         # SIC 2834
biotech = get_biotechnology_companies()         # SIC 2833-2836
semis = get_semiconductor_companies()           # SIC 3674
banks = get_banking_companies()                 # commercial banks

# By SIC range or keyword
healthcare_devices = get_companies_by_industry(sic_range=(3841, 3845))
software = get_companies_by_industry(sic_description_contains="software")

NYSE vs Nasdaq: Sector Composition Compared

That industry concentration looks very different depending on which exchange you're looking at. NYSE and Nasdaq are not interchangeable pools.

[Figure: NYSE vs Nasdaq sector mix comparison]

Nasdaq is the life sciences exchange. Nearly 1 in 3 Nasdaq-listed companies is in healthcare, mostly small-cap pharma and biotech. Add technology (19%) and those two sectors alone account for over half of Nasdaq.

NYSE tilts toward financials and industrials. Financials (30%), industrials (12%), materials (11%), and energy (9%) give NYSE a more diversified, old-economy profile.

Sector        NYSE     Nasdaq
Healthcare    7.4%     31.4%
Technology    9.1%     19.1%
Financials    29.8%    22.7%
Energy        8.8%     1.4%
Materials     10.6%    3.5%

If your screener returns nothing but biotech micro-caps, you're probably scanning Nasdaq without realizing it. If it's heavy on REITs and utilities, you're likely in NYSE-only territory.

from edgar.reference import get_companies_by_exchanges

nyse = get_companies_by_exchanges(['NYSE'])
nasdaq = get_companies_by_exchanges(['Nasdaq'])
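
To see the tilt at a glance, the percentages from the sector table can be compared directly. This sketch hardcodes the published figures rather than recomputing them from the two DataFrames, since the grouping of SIC codes into sectors is not shown here.

```python
# (NYSE %, Nasdaq %) per sector, copied from the table above
sector_mix = {
    "Healthcare": (7.4, 31.4),
    "Technology": (9.1, 19.1),
    "Financials": (29.8, 22.7),
    "Energy": (8.8, 1.4),
    "Materials": (10.6, 3.5),
}

# Which exchange has the larger share of each sector
tilts = {}
for sector, (nyse_pct, nasdaq_pct) in sector_mix.items():
    gap = nasdaq_pct - nyse_pct
    tilts[sector] = "Nasdaq" if gap > 0 else "NYSE"
    print(f"{sector:<11} tilts {tilts[sector]:<6} by {abs(gap):.1f} points")

nasdaq_health_tech = sector_mix["Healthcare"][1] + sector_mix["Technology"][1]
print(f"Healthcare + technology on Nasdaq: {nasdaq_health_tech:.1f}%")  # 50.5%
```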

Putting It Together

The SEC universe starts at nearly a million CIKs. Nearly a third are individuals. Of the companies, about 99% don't have tickers. Of the ones that do, a quarter are ETFs, SPACs, BDCs, or other entity types that need completely different analysis from operating companies. And the SIC codes that are supposed to tell you what industry a company is in can be misleading.

That's a lot of layers to peel back before you can do any real analysis, and it's why EdgarTools exists:

from edgar import Company
from edgar.reference import get_companies_by_exchanges, CompanySubset

# Start with listed companies
listed = get_companies_by_exchanges(['NYSE', 'Nasdaq'])

# Or use the fluent builder for complex filters
tech_companies = (CompanySubset(use_comprehensive=True)
    .from_exchange('Nasdaq')
    .filter_by(lambda df: df[df['sic'].between(7371, 7379)])
    .get())

# Classify any company
company = Company('AAPL')
company.business_category        # 'Operating Company'
company.is_operating_company()   # True
company.is_fund()                # False
company.industry                 # 'Electronic Computers'

Beyond the Library: The Company Directory Dataset

Everything in this post — the entity classification, the business categories, the ticker mappings, the SIC enrichment — runs on demand when you call EdgarTools. That works great for interactive research and small-scale analysis. But if you're building a screener, a data pipeline, or anything that needs the full universe at once, calling the SEC for every entity doesn't scale.

So we built the Company Directory as a structured dataset. It's the entire SEC entity universe, pre-classified and enriched, delivered as Parquet files you can drop into your own infrastructure.

[Figure: Company Directory dataset preview, 500K+ entities with SIC codes, Fama-French industries, business categories, and ticker mappings]

What's in it:

Asset            Records    What you get
Companies        500,000+   Every SEC company entity with SIC, Fama-French 12 & 48 industries, entity type, jurisdiction, fiscal year end
Persons          284,000+   Individual filers: insiders, beneficial owners, adviser principals
Ticker mappings  10,000+    CIK-to-ticker with exchange attribution (NYSE, Nasdaq, OTC, CBOE)
Former names     100,000+   Historical company names with date ranges, essential for backtest matching
Security master  9,000+     CIK-CUSIP-ticker mapping for cross-dataset joins

The whole thing ships as Parquet + a pre-built DuckDB database with analytical views like TickerToCIK, ActiveOperatingCompanies, and SectorDirectory. It gets a full rebuild every Sunday, plus daily incremental updates that pick up new entities within 24 hours of their first SEC filing.

This is the same data that powers all of our other datasets — insider trading, institutional holdings, material events. The Company Directory is the reference layer that ties them together through CIK and CUSIP mappings.

If you're building something that needs the full SEC entity universe — a screener, a compliance system, a research pipeline — take a look at our data products catalog or tell us what you're building and we'll get you set up.

In the next post, we'll use these building blocks to construct an actual stock screener: pulling financials for each company, calculating growth and profitability metrics, and filtering down to a shortlist of stocks that meet specific criteria.
