agenthustler

Yahoo Finance Scraping: Extract Stock Prices, Financial Data and Market News

Yahoo Finance is one of the most popular financial data platforms on the internet, offering a wealth of information including real-time stock quotes, historical price data, financial statements, earnings reports, analyst ratings, and market news. For data analysts, quantitative researchers, and fintech developers, being able to extract this data programmatically is invaluable.

In this comprehensive guide, we'll explore Yahoo Finance's structure, demonstrate how to scrape stock prices, financial statements, and news feeds using both Python and Node.js, and show how to scale your extraction using Apify's cloud platform.

Understanding Yahoo Finance's Data Structure

Yahoo Finance organizes financial data around ticker symbols. Each company page (finance.yahoo.com/quote/{TICKER}) serves as a hub linking to multiple data views:

Quote Page

The main quote page shows the current price, daily change, volume, market cap, P/E ratio, dividend yield, and 52-week range. It also includes a mini chart and recent news.

Historical Data

Available at /quote/{TICKER}/history/, this section provides daily, weekly, or monthly OHLCV (Open, High, Low, Close, Volume) data going back decades for most stocks.

Financial Statements

Under the Financials tab (/quote/{TICKER}/financials/), you'll find:

  • Income Statement: Revenue, operating income, net income, EPS
  • Balance Sheet: Assets, liabilities, equity
  • Cash Flow Statement: Operating, investing, and financing cash flows

Each can be viewed annually or quarterly.

Analysis & Earnings

The Analysis page shows analyst recommendations, price targets, earnings estimates, and revenue estimates. The earnings calendar shows upcoming and past earnings dates with EPS estimates vs actuals.

News Feed

Yahoo Finance aggregates financial news from multiple sources, with both general market news and stock-specific news on each quote page.
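That URL layout is worth capturing in a small helper before writing any scraping code — a sketch (paths follow the layout described above; Yahoo may reorganize them at any time):

```python
def yahoo_urls(ticker: str) -> dict:
    """Build the main Yahoo Finance data-view URLs for a ticker."""
    base = f"https://finance.yahoo.com/quote/{ticker}"
    return {
        "quote": f"{base}/",
        "history": f"{base}/history/",
        "financials": f"{base}/financials/",
        "balance_sheet": f"{base}/balance-sheet/",
        "cash_flow": f"{base}/cash-flow/",
        "analysis": f"{base}/analysis/",
        "news": f"{base}/news/",
    }

print(yahoo_urls("AAPL")["history"])
# https://finance.yahoo.com/quote/AAPL/history/
```

Centralizing the paths here means a Yahoo redesign only breaks one function.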

Method 1: Using yfinance (Python Library)

The yfinance library is the easiest way to get started with Yahoo Finance data:

import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta

class YahooFinanceExtractor:
    def __init__(self):
        self.cache = {}

    def get_stock_info(self, ticker):
        """Get comprehensive stock information."""
        stock = yf.Ticker(ticker)
        info = stock.info

        return {
            "symbol": ticker,
            "name": info.get("longName"),
            "sector": info.get("sector"),
            "industry": info.get("industry"),
            "market_cap": info.get("marketCap"),
            "current_price": info.get("currentPrice"),
            "pe_ratio": info.get("trailingPE"),
            "forward_pe": info.get("forwardPE"),
            "dividend_yield": info.get("dividendYield"),
            "fifty_two_week_high": info.get("fiftyTwoWeekHigh"),
            "fifty_two_week_low": info.get("fiftyTwoWeekLow"),
            "avg_volume": info.get("averageVolume"),
            "beta": info.get("beta"),
            "earnings_date": info.get("earningsTimestamp"),
            "target_mean_price": info.get("targetMeanPrice"),
            "recommendation": info.get("recommendationKey"),
        }

    def get_historical_prices(self, ticker, period="1y", interval="1d"):
        """Get historical OHLCV data."""
        stock = yf.Ticker(ticker)
        hist = stock.history(period=period, interval=interval)

        records = []
        for date, row in hist.iterrows():
            records.append({
                "date": date.strftime("%Y-%m-%d"),
                "open": round(row["Open"], 2),
                "high": round(row["High"], 2),
                "low": round(row["Low"], 2),
                "close": round(row["Close"], 2),
                "volume": int(row["Volume"]),
            })

        return records

    def get_financial_statements(self, ticker):
        """Get income statement, balance sheet, and cash flow."""
        stock = yf.Ticker(ticker)

        return {
            "income_statement": stock.financials.to_dict() if stock.financials is not None else {},
            "balance_sheet": stock.balance_sheet.to_dict() if stock.balance_sheet is not None else {},
            "cash_flow": stock.cashflow.to_dict() if stock.cashflow is not None else {},
        }

    def get_earnings_data(self, ticker):
        """Get earnings history and estimates."""
        stock = yf.Ticker(ticker)

        earnings_hist = stock.earnings_history
        if earnings_hist is not None and not earnings_hist.empty:
            earnings_list = earnings_hist.to_dict("records")
        else:
            earnings_list = []

        # reset_index() keeps the report dates (the DataFrame index) in the records
        dates = stock.earnings_dates
        return {
            "earnings_history": earnings_list,
            "earnings_dates": dates.reset_index().to_dict("records") if dates is not None and not dates.empty else [],
        }

    def get_news(self, ticker):
        """Get recent news for a stock."""
        stock = yf.Ticker(ticker)
        news = stock.news

        articles = []
        for item in news:
            # Recent yfinance versions nest article fields under "content";
            # older versions keep them at the top level, so handle both.
            content = item.get("content", item)
            articles.append({
                "title": content.get("title"),
                "publisher": content.get("publisher") or (content.get("provider") or {}).get("displayName"),
                "link": content.get("link") or (content.get("canonicalUrl") or {}).get("url"),
                "published": content.get("providerPublishTime") or content.get("pubDate"),
                "type": content.get("type") or content.get("contentType"),
            })

        return articles

# Usage
extractor = YahooFinanceExtractor()

# Get Apple stock info
info = extractor.get_stock_info("AAPL")
print(f"Company: {info['name']}")
print(f"Price: ${info['current_price']}")
print(f"P/E Ratio: {info['pe_ratio']}")
print(f"Market Cap: ${info['market_cap']:,}")

# Get historical prices
prices = extractor.get_historical_prices("AAPL", period="6mo")
print(f"\nHistorical data points: {len(prices)}")
print(f"Latest: {prices[-1]['date']} - Close: ${prices[-1]['close']}")

# Get financials
financials = extractor.get_financial_statements("AAPL")
print(f"\nFinancial statements retrieved successfully")
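One loose end in the class above: `self.cache` is initialized but never used. A sketch of how it might be wired up — the TTL value and cache-key scheme here are our own choices, not part of yfinance:

```python
import time

class TTLCache:
    """Tiny time-bounded cache for expensive lookups."""

    def __init__(self, fetch, ttl_seconds=300):
        self.fetch = fetch          # e.g. lambda t: yf.Ticker(t).info
        self.ttl = ttl_seconds
        self.cache = {}             # key -> (fetched_at, value)

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]         # still fresh: skip the network call
        value = self.fetch(key)
        self.cache[key] = (time.time(), value)
        return value
```

Wrapping the extractor's lookups (`TTLCache(lambda t: yf.Ticker(t).info)`) avoids hammering Yahoo when the same ticker is requested repeatedly.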

Method 2: Direct Web Scraping with Python

For data that yfinance doesn't expose, or when you need more control, you can scrape Yahoo Finance directly:

import requests
from bs4 import BeautifulSoup
import json
import re
import time

class YahooFinanceScraper:
    BASE_URL = "https://finance.yahoo.com"

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/120.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        })

    def scrape_quote_page(self, ticker):
        """Scrape the main quote page for real-time data."""
        url = f"{self.BASE_URL}/quote/{ticker}/"
        response = self.session.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Extract price data from the page
        price_el = soup.select_one('[data-testid="qsp-price"]')
        change_el = soup.select_one('[data-testid="qsp-price-change"]')

        # Extract key statistics
        stats = {}
        stat_rows = soup.select('[data-testid="quote-statistics"] li')
        for row in stat_rows:
            label = row.select_one("span:first-child")
            value = row.select_one("span:last-child")
            if label and value:
                stats[label.text.strip()] = value.text.strip()

        return {
            "ticker": ticker,
            "price": price_el.text.strip() if price_el else None,
            "change": change_el.text.strip() if change_el else None,
            "statistics": stats,
        }

    def scrape_historical_data(self, ticker, period1=None, period2=None):
        """Fetch historical prices via Yahoo's v8 chart API.

        The older v7 /download CSV endpoint now requires a cookie-and-crumb
        handshake, so we use the public JSON chart API instead.
        """
        if period2 is None:
            period2 = int(time.time())
        if period1 is None:
            period1 = period2 - (365 * 24 * 60 * 60)  # 1 year ago

        url = (f"https://query1.finance.yahoo.com/v8/finance/chart/{ticker}"
               f"?period1={period1}&period2={period2}&interval=1d")

        response = self.session.get(url)
        if response.status_code != 200:
            return []

        result = response.json()["chart"]["result"][0]
        timestamps = result.get("timestamp", [])
        quote = result["indicators"]["quote"][0]

        data = []
        for i, ts in enumerate(timestamps):
            data.append({
                "date": time.strftime("%Y-%m-%d", time.gmtime(ts)),
                "open": quote["open"][i],
                "high": quote["high"][i],
                "low": quote["low"][i],
                "close": quote["close"][i],
                "volume": quote["volume"][i],
            })
        return data

    def scrape_financials(self, ticker, statement="income"):
        """Scrape financial statements from the financials page."""
        statement_map = {
            "income": "financials",
            "balance": "balance-sheet",
            "cashflow": "cash-flow",
        }

        slug = statement_map.get(statement, "financials")
        url = f"{self.BASE_URL}/quote/{ticker}/{slug}/"

        response = self.session.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        # Yahoo Finance renders financials via JavaScript. Older page
        # builds embedded the data as a "root.App.main" JSON blob; newer
        # builds may structure it differently, so treat this as best-effort.
        scripts = soup.find_all("script")
        for script in scripts:
            if script.string and "root.App.main" in script.string:
                json_str = re.search(
                    r"root\.App\.main\s*=\s*({.*?});",
                    script.string,
                    re.DOTALL,
                )
                if json_str:
                    # Return the raw parsed blob; drill into it as needed.
                    return json.loads(json_str.group(1))

        return {}

    def scrape_news(self, ticker):
        """Scrape news articles for a specific stock."""
        url = f"{self.BASE_URL}/quote/{ticker}/news/"
        response = self.session.get(url)
        soup = BeautifulSoup(response.text, "html.parser")

        articles = []
        news_items = soup.select("section.container li")

        for item in news_items:
            title_el = item.select_one("h3")
            link_el = item.select_one("a")
            source_el = item.select_one(".publishing")

            if title_el:
                articles.append({
                    "title": title_el.text.strip(),
                    "url": link_el["href"] if link_el else None,
                    "source": source_el.text.strip() if source_el else None,
                })

        return articles

# Usage
scraper = YahooFinanceScraper()

# Scrape Tesla quote
quote = scraper.scrape_quote_page("TSLA")
print(f"TSLA Price: {quote['price']}")
print(f"Change: {quote['change']}")
for key, value in quote['statistics'].items():
    print(f"  {key}: {value}")

# Get historical data
history = scraper.scrape_historical_data("TSLA")
print(f"\nHistorical records: {len(history)}")

Method 3: Node.js Scraping

For JavaScript developers, here's a Node.js approach:

const axios = require('axios');
const cheerio = require('cheerio');

class YahooFinanceScraper {
    constructor() {
        this.baseUrl = 'https://finance.yahoo.com';
        this.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
        };
    }

    async getQuote(ticker) {
        const url = `${this.baseUrl}/quote/${ticker}/`;
        const { data } = await axios.get(url, { headers: this.headers });
        const $ = cheerio.load(data);

        const price = $('[data-testid="qsp-price"]').text().trim();
        const change = $('[data-testid="qsp-price-change"]').text().trim();

        const stats = {};
        $('[data-testid="quote-statistics"] li').each((_, el) => {
            const spans = $(el).find('span');
            if (spans.length >= 2) {
                const label = spans.first().text().trim();
                const value = spans.last().text().trim();
                stats[label] = value;
            }
        });

        return { ticker, price, change, stats };
    }

    async getHistoricalPrices(ticker, range = '1y') {
        // Use Yahoo Finance API v8 for historical data
        const url = `https://query1.finance.yahoo.com/v8/finance/chart/${ticker}`;
        const params = { range, interval: '1d' };

        try {
            const { data } = await axios.get(url, {
                headers: this.headers,
                params,
            });

            const result = data.chart.result[0];
            const timestamps = result.timestamp;
            const quote = result.indicators.quote[0];

            return timestamps.map((ts, i) => ({
                date: new Date(ts * 1000).toISOString().split('T')[0],
                open: quote.open[i]?.toFixed(2),
                high: quote.high[i]?.toFixed(2),
                low: quote.low[i]?.toFixed(2),
                close: quote.close[i]?.toFixed(2),
                volume: quote.volume[i],
            }));
        } catch (error) {
            console.error(`Error fetching historical data: ${error.message}`);
            return [];
        }
    }

    async getNews(ticker) {
        const url = `${this.baseUrl}/quote/${ticker}/news/`;
        const { data } = await axios.get(url, { headers: this.headers });
        const $ = cheerio.load(data);

        const articles = [];
        $('section.container li').each((_, el) => {
            const title = $(el).find('h3').text().trim();
            const link = $(el).find('a').attr('href');
            const source = $(el).find('.publishing').text().trim();

            if (title) {
                articles.push({ title, url: link, source });
            }
        });

        return articles;
    }

    async getMultipleQuotes(tickers) {
        const quotes = await Promise.all(
            tickers.map(async (ticker) => {
                try {
                    const quote = await this.getQuote(ticker);
                    return quote;
                } catch (err) {
                    return { ticker, error: err.message };
                }
            })
        );
        return quotes;
    }
}

// Usage
(async () => {
    const scraper = new YahooFinanceScraper();

    // Get multiple quotes
    const tickers = ['AAPL', 'GOOGL', 'MSFT', 'AMZN'];
    const quotes = await scraper.getMultipleQuotes(tickers);

    quotes.forEach(q => {
        if (!q.error) {
            console.log(`${q.ticker}: $${q.price} (${q.change})`);
        }
    });

    // Get historical data
    const history = await scraper.getHistoricalPrices('AAPL', '6mo');
    console.log(`\nHistorical data points: ${history.length}`);
    if (history.length) {
        const latest = history[history.length - 1];
        console.log(`Latest: ${latest.date} - $${latest.close}`);
    }
})();

Extracting Financial Statements in Detail

Financial statements are among the most valuable data on Yahoo Finance. Here's a specialized approach:

import yfinance as yf
import pandas as pd

def extract_detailed_financials(ticker):
    """Extract and structure detailed financial data."""
    stock = yf.Ticker(ticker)

    # Income Statement
    income = stock.financials
    quarterly_income = stock.quarterly_financials

    # Balance Sheet
    balance = stock.balance_sheet
    quarterly_balance = stock.quarterly_balance_sheet

    # Cash Flow
    cashflow = stock.cashflow
    quarterly_cashflow = stock.quarterly_cashflow

    # Key metrics derived from financial data
    if income is not None and not income.empty:
        latest = income.iloc[:, 0]  # Most recent year

        revenue = latest.get("Total Revenue", 0)
        net_income = latest.get("Net Income", 0)
        operating_income = latest.get("Operating Income", 0)

        metrics = {
            "revenue": revenue,
            "net_income": net_income,
            "operating_income": operating_income,
            "profit_margin": round(net_income / revenue * 100, 2) if revenue else 0,
            "operating_margin": round(operating_income / revenue * 100, 2) if revenue else 0,
        }
    else:
        metrics = {}

    # Growth rates (year over year)
    if income is not None and income.shape[1] >= 2:
        current_rev = income.iloc[:, 0].get("Total Revenue", 0)
        prev_rev = income.iloc[:, 1].get("Total Revenue", 0)
        if prev_rev:
            metrics["revenue_growth"] = round(
                (current_rev - prev_rev) / prev_rev * 100, 2
            )

    return {
        "ticker": ticker,
        "key_metrics": metrics,
        "annual_income_statement": income.to_dict() if income is not None else {},
        "quarterly_income_statement": quarterly_income.to_dict() if quarterly_income is not None else {},
        "annual_balance_sheet": balance.to_dict() if balance is not None else {},
        "annual_cash_flow": cashflow.to_dict() if cashflow is not None else {},
    }

# Extract and display financials
data = extract_detailed_financials("AAPL")
print(f"Revenue: ${data['key_metrics'].get('revenue', 0):,.0f}")
print(f"Net Income: ${data['key_metrics'].get('net_income', 0):,.0f}")
print(f"Profit Margin: {data['key_metrics'].get('profit_margin', 0)}%")
print(f"Revenue Growth: {data['key_metrics'].get('revenue_growth', 'N/A')}%")

Scaling with Apify

For production-grade Yahoo Finance scraping, Apify provides the infrastructure to handle high volumes reliably. Here's an Apify actor for Yahoo Finance:

const { Actor } = require('apify');
const { CheerioCrawler } = require('crawlee');

Actor.main(async () => {
    const input = await Actor.getInput();
    const {
        tickers = ['AAPL', 'GOOGL', 'MSFT'],
        scrapeHistorical = true,
        scrapeFinancials = true,
        scrapeNews = true,
    } = input;

    const dataset = await Actor.openDataset('yahoo-finance-data');

    const crawler = new CheerioCrawler({
        maxConcurrency: 3,  // Be gentle with Yahoo Finance
        maxRequestRetries: 3,

        async requestHandler({ request, $, log }) {
            const { ticker, dataType } = request.userData;

            if (dataType === 'quote') {
                const price = $('[data-testid="qsp-price"]').text().trim();
                const change = $('[data-testid="qsp-price-change"]').text().trim();

                const stats = {};
                $('[data-testid="quote-statistics"] li').each((_, el) => {
                    const spans = $(el).find('span');
                    if (spans.length >= 2) {
                        stats[spans.first().text().trim()] = spans.last().text().trim();
                    }
                });

                await dataset.pushData({
                    type: 'quote',
                    ticker,
                    price,
                    change,
                    statistics: stats,
                    scrapedAt: new Date().toISOString(),
                });

                log.info(`Scraped quote for ${ticker}: $${price}`);
            } else if (dataType === 'news') {
                const articles = [];
                $('section.container li').each((_, el) => {
                    const title = $(el).find('h3').text().trim();
                    const link = $(el).find('a').attr('href');
                    if (title) {
                        articles.push({ title, url: link });
                    }
                });

                await dataset.pushData({
                    type: 'news',
                    ticker,
                    articles,
                    scrapedAt: new Date().toISOString(),
                });

                log.info(`Scraped ${articles.length} news articles for ${ticker}`);
            }
        },
    });

    // Build request list
    const requests = [];
    for (const ticker of tickers) {
        requests.push({
            url: `https://finance.yahoo.com/quote/${ticker}/`,
            userData: { ticker, dataType: 'quote' },
        });

        if (scrapeNews) {
            requests.push({
                url: `https://finance.yahoo.com/quote/${ticker}/news/`,
                userData: { ticker, dataType: 'news' },
            });
        }
    }

    await crawler.run(requests);
    console.log(`Scraping complete for ${tickers.length} tickers`);
});

Why Use Apify for Yahoo Finance?

  1. Proxy management: Yahoo Finance aggressively blocks scrapers. Apify's proxy pool ensures consistent access.

  2. Scheduling: Set up daily or hourly scraping runs to maintain fresh market data.

  3. Data export: Export to JSON, CSV, or push directly to your database via webhooks.

  4. Monitoring: Get alerts when scraping fails, so you never miss market data.

  5. Scalability: Scrape hundreds of tickers simultaneously without infrastructure headaches.

Building a Stock Screener

Combine all the techniques above to build a powerful stock screener:

import yfinance as yf
import json

def screen_stocks(tickers, criteria):
    """Screen stocks based on financial criteria."""
    results = []

    for ticker in tickers:
        try:
            stock = yf.Ticker(ticker)
            info = stock.info

            # Apply screening criteria
            passes = True
            stock_data = {"ticker": ticker, "name": info.get("longName")}

            for metric, (min_val, max_val) in criteria.items():
                value = info.get(metric)
                if value is None:
                    passes = False
                    break
                if min_val is not None and value < min_val:
                    passes = False
                    break
                if max_val is not None and value > max_val:
                    passes = False
                    break
                stock_data[metric] = value

            if passes:
                results.append(stock_data)

        except Exception as e:
            print(f"Error processing {ticker}: {e}")

    return results

# Define screening criteria
criteria = {
    "trailingPE": (5, 25),           # P/E between 5 and 25
    "dividendYield": (0.02, None),   # Dividend yield > 2% (check your yfinance version: fraction vs percent units have changed)
    "marketCap": (10e9, None),       # Market cap > $10B
    "beta": (None, 1.5),             # Beta < 1.5
}

# Screen S&P 500 stocks (sample)
tickers = ["AAPL", "MSFT", "JNJ", "PG", "KO", "PEP", "XOM", "CVX", "JPM", "BAC"]

matches = screen_stocks(tickers, criteria)
print(f"Stocks matching criteria: {len(matches)}")
for stock in matches:
    print(f"  {stock['ticker']}: {stock['name']}")
    print(f"    P/E: {stock.get('trailingPE', 'N/A'):.1f}")
    print(f"    Div Yield: {stock.get('dividendYield', 0)*100:.1f}%")
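Screener output is easy to persist for later analysis — for example with pandas (the `screen_results.csv` filename is arbitrary):

```python
import pandas as pd

def save_screen_results(matches, path="screen_results.csv"):
    """Write screen_stocks() output to CSV, one row per passing ticker."""
    df = pd.DataFrame(matches)
    df.to_csv(path, index=False)
    return df
```

Each screening metric becomes a column, so the file can be sorted or re-filtered in a spreadsheet.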

Handling Common Challenges

Rate Limiting

Yahoo Finance rate-limits aggressive requests. Solutions include adding delays between requests (2-5 seconds), rotating user agents, using proxy services, and caching responses to avoid redundant requests.
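The delay and retry pieces can be combined in one wrapper — a sketch with illustrative timing values (tune `base_delay` and the retried status codes to your own traffic):

```python
import random
import time

def polite_get(session, url, max_retries=3, base_delay=2.0):
    """GET with randomized pacing and exponential backoff on 429/5xx."""
    response = None
    for attempt in range(max_retries):
        # Pace requests: between base_delay and 2*base_delay seconds apart.
        time.sleep(base_delay + random.uniform(0, base_delay))
        response = session.get(url, timeout=30)
        if response.status_code not in (429, 500, 502, 503):
            return response
        time.sleep(base_delay * 2 ** attempt)  # back off before retrying
    return response
```

It drops straight into Method 2: call `polite_get(scraper.session, url)` instead of `scraper.session.get(url)`.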

Dynamic Content

Some Yahoo Finance data loads via JavaScript. For these sections, consider using Puppeteer or Playwright to render the page, extracting data from embedded JSON in script tags, or using Yahoo Finance's undocumented API endpoints.
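The embedded-JSON route can be factored into a reusable helper — a sketch (the `root.App.main` marker matched Yahoo's older page design; current builds may use a different assignment, so check the page source first):

```python
import json
import re

def extract_embedded_json(html, marker="root.App.main"):
    """Pull a JavaScript-embedded JSON blob out of page HTML.

    Looks for an assignment like `<marker> = {...};` in the page source
    and parses the right-hand side as JSON. Returns None if absent.
    """
    pattern = re.escape(marker) + r"\s*=\s*(\{.*?\});"
    match = re.search(pattern, html, re.DOTALL)
    return json.loads(match.group(1)) if match else None
```

The same helper works for any site that hydrates its pages from an inline state object — only the marker changes.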

Data Accuracy

Always cross-reference scraped financial data with official SEC filings. Market data may have slight delays. Use multiple data sources for critical financial decisions.

Ethical Considerations

  1. Terms of Service: Review Yahoo Finance's ToS regarding automated data collection.

  2. Rate limiting: Always implement respectful delays. Don't hammer their servers.

  3. Data usage: Financial data may have redistribution restrictions. Check licensing.

  4. Not financial advice: Scraped data should supplement, not replace, professional financial analysis.

  5. Personal data: Avoid scraping or storing user comments or profile data without consent.

Conclusion

Yahoo Finance offers an incredible depth of financial data that, when extracted programmatically, can power everything from personal stock screeners to institutional-grade research platforms. Whether you start with the yfinance Python library for quick prototyping, build custom scrapers for specialized needs, or scale up with Apify's cloud infrastructure, the techniques covered in this guide provide a solid foundation.

Remember to scrape responsibly, respect rate limits, verify your data against multiple sources, and always comply with terms of service. The financial data landscape is rich — with the right tools and approach, you can build powerful data pipelines that keep you ahead of the market.

Happy scraping, and may your portfolios prosper!
