What's in SEC EDGAR? Nearly 1 Million Entities
The SEC EDGAR database has almost a million registered entities, and nearly a third of them are individual people. Here's what the data looks like, and how to navigate it with Python.
The SEC assigns a Central Index Key (CIK) to every entity that files with it. As of early 2026, there are 980,843 of them.
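CIKs are plain integers, but EDGAR paths and endpoints generally expect them zero-padded to ten digits, which is why JPMorgan's CIK appears later in this post as 0000019617. A minimal sketch of that normalization follows; `normalize_cik` is a hypothetical helper name, not part of EdgarTools.

```python
def normalize_cik(cik) -> str:
    """Return a CIK as a 10-digit, zero-padded string."""
    text = str(cik).strip()
    if not text.isdigit():
        raise ValueError(f"not a valid CIK: {cik!r}")
    # int() drops any existing leading zeros before re-padding
    return str(int(text)).zfill(10)


print(normalize_cik(19617))         # 0000019617
print(normalize_cik("0001195345"))  # 0001195345
```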

284,000 of Them Are People
Almost a third of SEC CIKs belong to individual people, not companies. The individuals are corporate insiders filing Forms 3, 4, and 5 when they buy or sell stock, investment adviser principals named on Form ADV registrations, and beneficial owners filing Schedule 13D/G. Jamie Dimon has a CIK. So does every director who's ever filed a stock transaction.
Jensen Huang, NVIDIA's CEO, actually has two CIKs: 1106277 "HUANG JEN-HSUN", from his first 5 filings in 2010, and 1197649 "HUANG JEN HSUN" (without the hyphen). It's the same person, registered under two slightly different name spellings. These records will probably never be merged, since SEC records are mostly immutable, and his later filings use the latter CIK.
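If you need to spot this kind of duplicate yourself, one rough approach is to normalize names and group CIKs by the result. The sketch below is only an illustration: `name_key` is a hypothetical helper, and the third row's name spelling is made up for contrast rather than checked against EDGAR.

```python
import re
from collections import defaultdict


def name_key(name: str) -> str:
    """Collapse punctuation and whitespace so spelling variants compare equal."""
    return " ".join(re.sub(r"[^A-Z0-9]+", " ", name.upper()).split())


filers = [
    (1106277, "HUANG JEN-HSUN"),
    (1197649, "HUANG JEN HSUN"),
    (1195345, "DIMON JAMES"),  # illustrative spelling only
]

groups = defaultdict(list)
for cik, name in filers:
    groups[name_key(name)].append(cik)

# Keys shared by more than one CIK are candidates for the same person
candidates = {key: ciks for key, ciks in groups.items() if len(ciks) > 1}
print(candidates)  # {'HUANG JEN HSUN': [1106277, 1197649]}
```

A shared key is only a hint. Different people can have identical names, so a match still needs a manual check against the filings.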
The SEC doesn't have a clean flag to distinguish people from companies, so EdgarTools had to reverse-engineer it. The is_company and is_individual checks use a priority-based signal system: whether the entity has tickers, a state of incorporation, company-only form types like 10-K or S-1, an EIN, or name keywords like "INC", "CORP", "LLC", or "FUND". There are edge cases everywhere (Reed Hastings has a state of incorporation, Warren Buffett has an EIN), but the combination of signals works reliably across the full dataset.
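To make the idea concrete, here is a minimal sketch of a signal-based check. It is not EdgarTools' implementation: the record layout, the `looks_like_company` name, and the rule that the two weaker signals only count when both are present are all assumptions made for illustration.

```python
import re

COMPANY_ONLY_FORMS = {"10-K", "S-1"}
COMPANY_KEYWORDS = {"INC", "CORP", "LLC", "FUND"}


def looks_like_company(record: dict) -> bool:
    """Guess whether a filer record is a company (hypothetical record layout)."""
    # Strong signals: any one of these is treated as decisive.
    if record.get("tickers"):
        return True
    if COMPANY_ONLY_FORMS & set(record.get("forms", [])):
        return True
    # Match whole words so "INC" does not fire on a surname like "LINCOLN".
    words = set(re.findall(r"[A-Z]+", record.get("name", "").upper()))
    if COMPANY_KEYWORDS & words:
        return True
    # Weak signals: individuals occasionally carry one of these,
    # so this sketch (an assumption) requires both.
    return bool(record.get("state_of_incorporation")) and bool(record.get("ein"))


bank = {"name": "JPMORGAN CHASE & CO", "tickers": ["JPM"], "forms": ["10-K"]}
# "XX" is a placeholder value, not the real state on file
insider = {"name": "HASTINGS REED", "forms": ["4"], "state_of_incorporation": "XX"}

print(looks_like_company(bank))     # True
print(looks_like_company(insider))  # False
```

EdgarTools exposes its own version of these checks as properties: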
from edgar import Entity
entity = Entity("0000019617") # JPMorgan Chase
entity.is_company # True
entity = Entity("0001195345") # Jamie Dimon
entity.is_individual # True

From 697K Companies to 8K Listed Stocks
With individuals filtered out, we're left with ~697K company entities. EdgarTools exposes 563,611 of those with metadata (name, SIC code, exchange, state of incorporation). Most of them are private companies, defunct shells, SPVs, or entities that never traded publicly.
Only 7,928 have a ticker symbol at all. Those break down across exchanges like this:
| Filter | Count |
|---|---|
| All SEC company entities | 563,611 |
| With any ticker | 7,928 |
| Nasdaq | 3,372 |
| NYSE | 2,601 |
| OTC | 1,738 |
| CBOE | 16 |
| No exchange listed | 201 |
The OTC companies are a mix of penny stocks, foreign ADRs, and companies that were delisted from a major exchange. Depending on what you're building, you might want them or you might not. For most screening and research work, NYSE + Nasdaq (5,973 companies) is the practical universe.
from edgar.reference import get_companies_by_exchanges
# Major exchanges only
listed = get_companies_by_exchanges(['NYSE', 'Nasdaq'])
print(len(listed)) # ~5,973
# Include OTC
all_traded = get_companies_by_exchanges(['NYSE', 'Nasdaq', 'OTC'])
print(len(all_traded)) # ~7,711

Classifying SEC Entities: 12 Business Categories
Even within those ~8,000 companies with tickers, you can't treat them all the same. The SEC assigns every entity a type (operating, investment, or other), but it's not much help for classification. A quarter of listed companies end up as "other," including real pharma and software companies. And a quarter of what the SEC calls "operating" is actually asset-backed securities vehicles. The labels were never designed for analytical use.
EdgarTools classifies every SEC company into one of 12 business categories using multiple signals: SIC codes first, then SEC form types, then name patterns. Each layer catches what the previous one missed.
| Category | What it is | Examples |
|---|---|---|
| Operating Company | Standard businesses | AAPL, MSFT, TSLA |
| ETF | Exchange-traded funds | SPY, QQQ, ARKB |
| Mutual Fund | Open-end funds | VFIAX |
| Closed-End Fund | Closed-end funds | UTF, GOF |
| BDC | Business development companies | ARCC, MAIN |
| REIT | Real estate investment trusts | O, AMT, PLD |
| Investment Manager | Asset managers | BLK, BX, KKR |
| Bank | Commercial banks | JPM, BAC, WFC |
| Insurance Company | Life, health, P&C | ALL, MET, PRU |
| SPAC | Blank-check companies | Various |
| Holding Company | Pure holding companies | BRK-A |
| Unknown | Unclassifiable | |

This matters because different entity types need completely different analytical frameworks. P/E ratios don't mean anything for an ETF, and revenue growth is meaningless for a SPAC shell. You screen REITs on FFO and dividend yield, banks on net interest margin, and operating companies on earnings and cash flow. The category tells you which framework to use.
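In a screener, that usually turns into a lookup from category to metrics. The mapping below is a small sketch built only from the examples in the paragraph above; `SCREENING_METRICS` and `metrics_for` are hypothetical names, and categories without an entry simply return an empty list.

```python
SCREENING_METRICS = {
    "REIT": ["FFO", "dividend yield"],
    "Bank": ["net interest margin"],
    "Operating Company": ["earnings", "cash flow"],
}


def metrics_for(category: str) -> list[str]:
    """Return the screening metrics for a category, or [] if none are defined."""
    return SCREENING_METRICS.get(category, [])


print(metrics_for("REIT"))  # ['FFO', 'dividend yield']
print(metrics_for("ETF"))   # []
```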
Why SIC codes alone aren't enough
No single signal is enough on its own. SIC codes are the most reliable indicator (6798 is always a REIT, 6770 is always a SPAC), but many entities have misleading SIC codes.
Crypto and commodity ETFs are a good example. The Bitwise Bitcoin ETF, the ARK 21Shares Bitcoin ETF, and the Goldman Sachs Physical Gold ETF all have SIC code 6211: "Security Brokers, Dealers and Flotation Companies." That's the same code as actual broker-dealers. These ETFs don't file investment company forms (N-CSR, NPORT-P) because they're structured as grantor trusts, not 1940 Act funds, but they do have "ETF" right in their name.
SPACs have a similar problem. Many file under their target sector's SIC code rather than 6770 (Blank Checks). "Altenergy Acquisition Corp" has SIC 3711 (Motor Vehicles). "Best SPAC I Acquisition Corp" has SIC 8200 (Educational Services). Their names follow the unmistakable "Acquisition Corp" pattern, but the SIC codes say otherwise.
The classifier handles these with a priority chain:
- Definitive SIC codes: REIT, SPAC, Bank, Insurance
- Investment company forms: N-CSR/NPORT filings signal funds
- SPAC name patterns: "Acquisition Corp", "Blank Check"
- BDC indicators: N-2 forms or 814- file numbers
- Investment managers: 13F filers with broker/adviser SIC
- Holding companies: SIC 6719
- ETF names: "ETF" in the company name
- Fund/trust names: SIC 6200s with "Trust" or "Fund" in name
- Default: Operating Company
Each layer catches entities that slipped through the one above it.
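The chain above can be sketched as a sequence of early returns. This is an illustration of the ordering, not the library's code. Several details are assumptions: the bank and insurance SIC codes are example picks, fund subtypes are collapsed into a generic "Fund" label because the steps above don't say how they are told apart, and the `classify` signature is hypothetical.

```python
import re

# 6798 and 6770 come from the text; the other codes are illustrative picks.
DEFINITIVE_SIC = {
    6798: "REIT",
    6770: "SPAC",
    6021: "Bank",               # national commercial banks
    6022: "Bank",               # state commercial banks
    6311: "Insurance Company",  # life insurance
    6331: "Insurance Company",  # fire, marine and casualty insurance
}
FUND_FORMS = {"N-CSR", "NPORT-P"}
ADVISER_SIC = {6211, 6282}  # security brokers/dealers, investment advice


def classify(name, sic=None, forms=(), file_numbers=()):
    """Walk the priority chain and return the first matching category."""
    forms = set(forms)
    upper = name.upper()
    if sic in DEFINITIVE_SIC:
        return DEFINITIVE_SIC[sic]
    if forms & FUND_FORMS:
        return "Fund"
    if "ACQUISITION CORP" in upper or "BLANK CHECK" in upper:
        return "SPAC"
    if "N-2" in forms or any(n.startswith("814-") for n in file_numbers):
        return "BDC"
    if sic in ADVISER_SIC and any(f.startswith("13F") for f in forms):
        return "Investment Manager"
    if sic == 6719:
        return "Holding Company"
    if re.search(r"\bETF\b", upper):
        return "ETF"
    if sic is not None and 6200 <= sic <= 6299 and re.search(r"\b(TRUST|FUND)\b", upper):
        return "Fund"
    return "Operating Company"


print(classify("Altenergy Acquisition Corp", sic=3711))  # SPAC
print(classify("Bitwise Bitcoin ETF", sic=6211))         # ETF
print(classify("Apple Inc.", sic=3571, forms=["10-K"]))  # Operating Company
```

The order matters: the Bitcoin ETF only reaches the name check because it has no definitive SIC code and files neither fund forms nor a 13F.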
SIC Code Industry Distribution on NYSE and Nasdaq
With ~8,000 companies classified and filtered, the next question is where they cluster by industry. The ~6,000 listed on major exchanges span 395 unique SIC codes, but the distribution is heavily concentrated.

The top 10 industries cover 37.5% of all listed companies. Pharmaceutical Preparations alone accounts for 528 companies, nearly 9% of the exchange-listed universe. Add biotech (161) and medical devices (123), and life sciences represents over 13% of all listed entities.
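The life sciences share is easy to verify from those counts. The snippet below assumes the denominator is the 5,973 NYSE and Nasdaq companies from the exchange table earlier; the industry labels are shortened for readability.

```python
LISTED_TOTAL = 5_973  # NYSE + Nasdaq companies (assumed denominator)

life_sciences = {
    "Pharmaceutical Preparations": 528,
    "Biotech": 161,
    "Medical devices": 123,
}

# Share of the listed universe for each industry
for industry, count in life_sciences.items():
    print(f"{industry}: {count / LISTED_TOTAL:.1%}")

combined = sum(life_sciences.values())
print(f"Combined: {combined} companies, {combined / LISTED_TOTAL:.1%}")
# Combined: 812 companies, 13.6%
```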
from edgar.reference import (
    get_pharmaceutical_companies,
    get_biotechnology_companies,
    get_semiconductor_companies,
    get_banking_companies,
    get_companies_by_industry,
)
pharma = get_pharmaceutical_companies() # SIC 2834
biotech = get_biotechnology_companies() # SIC 2833-2836
semis = get_semiconductor_companies() # SIC 3674
# By SIC range or keyword
healthcare_devices = get_companies_by_industry(sic_range=(3841, 3845))
software = get_companies_by_industry(sic_description_contains="software")

NYSE vs Nasdaq: Sector Composition Compared
That industry concentration looks very different depending on which exchange you're looking at. NYSE and Nasdaq are not interchangeable pools.

Nasdaq is the life sciences exchange. Nearly 1 in 3 Nasdaq-listed companies is in healthcare, mostly small-cap pharma and biotech. Add technology (19%) and those two sectors alone account for over half of Nasdaq.
NYSE tilts toward financials and industrials. Financials (30%), industrials (12%), materials (11%), and energy (9%) give NYSE a more diversified, old-economy profile.
| Sector | NYSE | Nasdaq |
|---|---|---|
| Healthcare | 7.4% | 31.4% |
| Technology | 9.1% | 19.1% |
| Financials | 29.8% | 22.7% |
| Energy | 8.8% | 1.4% |
| Materials | 10.6% | 3.5% |
If your screener returns nothing but biotech micro-caps, you're probably scanning Nasdaq without realizing it. If it's heavy on REITs and utilities, you're likely in NYSE-only territory.
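To see the tilt at a glance, you can rank the sectors by the gap between the two exchanges. This sketch just does arithmetic on the percentages in the table above; the variable names are illustrative.

```python
# sector: (NYSE %, Nasdaq %), copied from the table
shares = {
    "Healthcare": (7.4, 31.4),
    "Technology": (9.1, 19.1),
    "Financials": (29.8, 22.7),
    "Energy": (8.8, 1.4),
    "Materials": (10.6, 3.5),
}

# Positive gap means the sector is a larger share of Nasdaq than of NYSE
gaps = {sector: round(nasdaq - nyse, 1) for sector, (nyse, nasdaq) in shares.items()}

for sector, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    tilt = "Nasdaq" if gap > 0 else "NYSE"
    print(f"{sector:<11} {gap:+5.1f} pts ({tilt}-heavy)")
```

Healthcare shows the widest gap, at 24 points in Nasdaq's favor, while energy is the most NYSE-heavy sector in the table. To pull each exchange's companies for your own comparison: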
from edgar.reference import get_companies_by_exchanges
nyse = get_companies_by_exchanges(['NYSE'])
nasdaq = get_companies_by_exchanges(['Nasdaq'])

Putting It Together
The SEC universe starts at nearly a million CIKs. Almost a third are individuals. Of the companies, about 99% don't have tickers. Of the ones that do, a quarter are ETFs, SPACs, BDCs, or other entity types that need completely different analysis from operating companies. And the SIC codes that are supposed to tell you what industry a company is in can be misleading.
That's a lot of layers to peel back before you can do any real analysis, and it's why EdgarTools exists:
from edgar import Company
from edgar.reference import get_companies_by_exchanges, CompanySubset
# Start with listed companies
listed = get_companies_by_exchanges(['NYSE', 'Nasdaq'])
# Or use the fluent builder for complex filters
tech_companies = (CompanySubset(use_comprehensive=True)
    .from_exchange('Nasdaq')
    .filter_by(lambda df: df[df['sic'].between(7371, 7379)])
    .get())
# Classify any company
company = Company('AAPL')
company.business_category # 'Operating Company'
company.is_operating_company() # True
company.is_fund() # False
company.industry # 'Electronic Computers'

Beyond the Library: The Company Directory Dataset
Everything in this post — the entity classification, the business categories, the ticker mappings, the SIC enrichment — runs on demand when you call EdgarTools. That works great for interactive research and small-scale analysis. But if you're building a screener, a data pipeline, or anything that needs the full universe at once, calling the SEC for every entity doesn't scale.
So we built the Company Directory as a structured dataset. It's the entire SEC entity universe, pre-classified and enriched, delivered as Parquet files you can drop into your own infrastructure.

What's in it:
| Asset | Records | What you get |
|---|---|---|
| Companies | 500,000+ | Every SEC company entity with SIC, Fama-French 12 & 48 industries, entity type, jurisdiction, fiscal year end |
| Persons | 284,000+ | Individual filers — insiders, beneficial owners, adviser principals |
| Ticker mappings | 10,000+ | CIK-to-ticker with exchange attribution (NYSE, Nasdaq, OTC, CBOE) |
| Former names | 100,000+ | Historical company names with date ranges — essential for backtest matching |
| Security master | 9,000+ | CIK-CUSIP-ticker mapping for cross-dataset joins |
The whole thing ships as Parquet plus a pre-built DuckDB database with analytical views like TickerToCIK, ActiveOperatingCompanies, and SectorDirectory. It is fully rebuilt every Sunday, and daily incremental updates add new entities within 24 hours of their first SEC filing.
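To give a feel for what a view like TickerToCIK might provide, here is a toy version. It uses Python's built-in sqlite3 as a stand-in for DuckDB so it runs without extra dependencies, and the table and column names are hypothetical; the real dataset's schema may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical schema, for illustration only
    CREATE TABLE companies (cik INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE ticker_mappings (cik INTEGER, ticker TEXT, exchange TEXT);
    CREATE VIEW TickerToCIK AS
        SELECT t.ticker, t.exchange, c.cik, c.name
        FROM ticker_mappings AS t
        JOIN companies AS c ON c.cik = t.cik;
""")
con.executemany(
    "INSERT INTO companies VALUES (?, ?)",
    [(19617, "JPMORGAN CHASE & CO"), (320193, "Apple Inc.")],
)
con.executemany(
    "INSERT INTO ticker_mappings VALUES (?, ?, ?)",
    [(19617, "JPM", "NYSE"), (320193, "AAPL", "Nasdaq")],
)

# Resolve a ticker to its CIK and company name through the view
row = con.execute(
    "SELECT cik, name FROM TickerToCIK WHERE ticker = ?", ("JPM",)
).fetchone()
print(row)  # (19617, 'JPMORGAN CHASE & CO')
```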
This is the same data that powers all of our other datasets — insider trading, institutional holdings, material events. The Company Directory is the reference layer that ties them together through CIK and CUSIP mappings.
If you're building something that needs the full SEC entity universe — a screener, a compliance system, a research pipeline — take a look at our data products catalog or tell us what you're building and we'll get you set up.
In the next post, we'll use these building blocks to construct an actual stock screener: pulling financials for each company, calculating growth and profitability metrics, and filtering down to a shortlist of stocks that meet specific criteria.