Short answer: Use a layered approach — regex for the obvious patterns (NASDAQ: XYZ, NYSE:ABC, (OTCQB: TICK)), then validate hits against an exchange ticker reference list to filter out false positives like dollar amounts, abbreviations, and English words that look like tickers. Full Python implementation below. Works with PR Newswire releases pulled via the Apify scraper or any other release source.
Why naive regex fails
The first instinct is something like r"\b[A-Z]{1,5}\b" and bucketing every uppercase 1–5 character sequence as a ticker. This catches everything: U.S., CEO, SEC, EBITDA, USD, NYSE itself, and literally every capitalised English word in a headline. False-positive rate from real PR Newswire releases is roughly 95%. Unusable.
The next instinct is to require an exchange prefix: r"(NASDAQ|NYSE|OTCQB|OTCMKTS|TSX|TSXV|LSE|ASX):\s*([A-Z]{1,5})". This works for the formal disclosure conventions ("Acme Corp (NASDAQ: ACME)") but misses two real cases: cashtags ($AAPL) and bare tickers in body paragraphs ("AAPL shares rose 3%"). It also misses the increasingly common dual-listing format (NYSE: ABC; TSX: ABC.TO).
The reliable approach
Three layers:
- Pattern extraction. Match all four canonical formats: exchange-prefixed, cashtag, parenthesised, and bare-ticker-in-body.
- Reference validation. Check candidate tickers against a known exchange listing reference. NASDAQ Trader, NYSE, and FINRA all publish daily ticker lists; the SEC's company tickers JSON is the cleanest free source.
- Context filter. Require either an exchange context within ~50 characters, OR a known company-name co-occurrence (look up issuer name → expected ticker map).
Reference data
The SEC publishes a free, regularly updated mapping of CIK → ticker → company name at https://www.sec.gov/files/company_tickers.json. Roughly 12,000 active US-listed entities. Download nightly, cache as a dict.
import json, urllib.request
def load_sec_tickers():
url = "https://www.sec.gov/files/company_tickers.json"
req = urllib.request.Request(url, headers={"User-Agent": "your-email@example.com"})
with urllib.request.urlopen(req) as r:
data = json.loads(r.read())
# Returns: {"AAPL": {"cik": 320193, "name": "Apple Inc."}, ...}
return {e["ticker"]: {"cik": e["cik_str"], "name": e["title"]} for e in data.values()}
Extractor implementation
import re
from collections import defaultdict
EXCHANGE_PATTERNS = re.compile(
r"\((?:NASDAQ|NYSE|NYSEAMERICAN|OTCQB|OTCQX|OTCMKTS|OTC|TSX|TSXV|LSE|ASX|HKEX|SGX|FRA|XETRA)\s*:?\s*([A-Z][A-Z0-9.\-]{0,5})\)?",
re.IGNORECASE,
)
PREFIX_PATTERN = re.compile(
r"\b(?:NASDAQ|NYSE|NYSEAMERICAN|OTCQB|OTCQX|OTCMKTS|OTC|TSX|TSXV|LSE|ASX)\s*:\s*([A-Z][A-Z0-9.\-]{0,5})",
re.IGNORECASE,
)
CASHTAG_PATTERN = re.compile(r"\$([A-Z]{1,5})\b")
def extract_tickers(text, sec_tickers, issuer_hint=None):
candidates = defaultdict(lambda: {"hits": 0, "exchange": None})
for m in EXCHANGE_PATTERNS.finditer(text):
t = m.group(1).upper()
candidates[t]["hits"] += 2
candidates[t]["exchange"] = m.group(0).split(":")[0].strip("(").upper()
for m in PREFIX_PATTERN.finditer(text):
t = m.group(1).upper()
candidates[t]["hits"] += 2
candidates[t]["exchange"] = m.group(0).split(":")[0].upper()
for m in CASHTAG_PATTERN.finditer(text):
t = m.group(1).upper()
candidates[t]["hits"] += 1
# Validate: keep only those in SEC ticker reference OR strongly contextual
validated = []
for t, meta in candidates.items():
if t in sec_tickers and meta["hits"] >= 1:
validated.append({
"ticker": t,
"company": sec_tickers[t]["name"],
"exchange_in_text": meta["exchange"],
"confidence": "high" if meta["hits"] >= 2 else "medium",
})
elif meta["hits"] >= 2 and meta["exchange"]:
# Likely non-US listing
validated.append({
"ticker": t,
"company": None,
"exchange_in_text": meta["exchange"],
"confidence": "medium",
})
# Tie-break: if issuer hint matches one candidate, boost it to primary
if issuer_hint:
for v in validated:
if v["company"] and issuer_hint.lower() in v["company"].lower():
v["primary"] = True
return validated
End-to-end with the Apify PR Newswire scraper
import os, json, urllib.request
APIFY_TOKEN = os.environ["APIFY_TOKEN"]
ACTOR = "nexgendata~pr-newswire-press-releases-scraper"
def fetch_releases(category="financial-services-latest-news", n=100):
payload = json.dumps({"category": category, "maxResults": n, "includeBody": True}).encode()
url = f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items?token={APIFY_TOKEN}"
req = urllib.request.Request(url, data=payload, method="POST",
headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req, timeout=600) as r:
return json.loads(r.read())
sec_tickers = load_sec_tickers()
releases = fetch_releases()
for rel in releases:
text = (rel.get("headline","") + "\n" + rel.get("body",""))
tickers = extract_tickers(text, sec_tickers, issuer_hint=rel.get("issuer"))
if tickers:
print(rel["publishedAt"], rel["issuer"], "->", [t["ticker"] for t in tickers])
Edge cases worth handling
-
Class-A vs Class-B shares.
BRK.A,BRK.B,GOOG/GOOGL. The regex above handles the dot suffix; the SEC reference handles the dual mapping. -
Foreign listings with dot suffixes.
SHOP.TOfor Shopify on TSX. Allowed by the regex. - Multi-listing announcements. A single release may mention 3–8 tickers if it covers a basket or a partnership. Return the list and let downstream consumers pick.
-
Mentions of competitors. A press release from Acme might name Globex's ticker incidentally. Use the
primaryflag (issuer-name match) to distinguish the issuer's own ticker from incidental mentions. - Embargoed releases. Some wires include "EMBARGOED UNTIL" notices; tickers in those should be flagged for restricted use.
What to do with the tickers
Once you have a clean (release, ticker) table you can:
- Join against price data and run an event study on cumulative abnormal returns over T+0 to T+5 days.
- Aggregate by issuer to see release frequency, which sometimes correlates with corporate activity intensity.
- Feed into the trading-signal layer described in Building Event-Driven Trading Signals from PR Newswire Data.
- Pipe into the competitor-monitoring stack from How to Monitor Competitor Press Releases Automatically for ticker-aware alerting.
Performance notes
The extractor above runs at roughly 8,000 releases/second on a single core. The SEC reference dict is ~1MB. For 1M releases, you can do the whole extraction in under three minutes on a laptop. Validation rate against SEC reference: roughly 85–90% of high-confidence candidates match an active ticker; the remaining 10–15% are usually foreign listings, recently delisted names, or formatting artefacts.
Try it
Pull a sample of releases with the NexGenData PR Newswire scraper, run the extractor, and inspect. The combination produces a clean structured ticker-mention table in about 10 minutes of setup.
Related Reading
- PR Newswire API: The 2026 Complete Guide
- 7 PR Newswire Alternatives Compared (2026)
- Cision Alternative for Small PR Agencies in 2026
- How to Monitor Competitor Press Releases Automatically (Python Guide)
- PR Newswire vs BusinessWire vs GlobeNewswire: Data Coverage Compared
- Building Event-Driven Trading Signals from PR Newswire Data
- How to Scrape PR Newswire Legally (and Without Getting Blocked)
Top comments (0)