Crawlbase

Posted on Jan 25, 2024 • Edited on Feb 27, 2024 • Originally published at crawlbase.com

Extract Latest News Articles from Bloomberg

#javascript #html #scraping #python

This blog was originally posted to Crawlbase Blog

Scrape Bloomberg to discover the most recent news highlights from this influential global financial information and media powerhouse, established in 1981. With an expansive user base worldwide, Bloomberg offers real-time financial data, market insights, and breaking news. Investors, analysts, and businesses rely on its comprehensive coverage of diverse markets, making Bloomberg an essential tool for informed decision-making in the dynamic world of finance.

In this blog post, we explore web scraping to gather current news from Bloomberg. We're utilizing sophisticated technologies such as the Crawlbase Crawling API and JavaScript for this endeavor. Our focus will be on extracting key information like leading news stories, financial data, market trends, and additional relevant details. Join us as we outline the steps involved in extracting data from Bloomberg, emphasizing its significance for obtaining timely updates and valuable financial insights.

Bloomberg’s Website Structure
Data to Scrape
- Headline News
- Financial Insights
- Market Trends
- Additional Relevant Data Sets
Prerequisites
- Learn Basic JavaScript
- Get Crawlbase API Token
- Set Up the Coding Environment
Scrape Bloomberg Using Crawlbase
Scrape Bloomberg News Article Data
Conclusion
Frequently Asked Questions

Bloomberg's Website Structure

Bloomberg's website is thoughtfully designed, reflecting its commitment to providing a seamless user experience in accessing financial news and market insights. The homepage typically features sections dedicated to various financial instruments, market indices, and headline news. Navigational elements are strategically placed, offering users easy access to different segments, such as stocks, commodities, and currencies.

The layout is often dynamic, with real-time updates and a user-friendly interface that caters to both novice and seasoned investors. Sections like market summaries, top news, and analysis are usually prominently displayed, ensuring users can quickly access key information upon landing on the site.

Data to Scrape:

To successfully extract data from Bloomberg, it's essential to pinpoint the specific elements within the website's structure that house the desired information. This involves understanding the HTML structure of the pages and identifying the unique identifiers associated with each data point.

Headline News:

When scraping headline news from Bloomberg, developers must identify the specific HTML tags containing crucial information such as article headlines, timestamps, and related metadata. This involves meticulously examining the website's source code to pinpoint the exact elements representing breaking news.
The scraping process focuses on retrieving real-time updates and capturing the latest and most relevant news articles. By constantly monitoring and extracting data from the identified HTML tags, users can stay abreast of breaking news developments in the financial world.

Financial Insights:

Extracting financial insights involves locating and isolating sections within Bloomberg's website specifically dedicated to comprehensive financial data. This could include areas that provide in-depth analyses, stock prices, and other critical financial metrics.
The web scraping script targets areas dedicated to financial insights, allowing for the extraction of detailed information on market trends, live stock prices, and thorough financial analyses. This data can be invaluable for making informed investment decisions.

Market Trends:

In scraping for market trends, developers need to pinpoint the HTML tags that encapsulate data related to the performance of various financial instruments. This involves identifying elements that display trends, charts, and other visual representations of market movements.
The scraping process aims to extract detailed insights into the performance of diverse financial instruments. This could include data on stock movements, commodity prices, and other market indicators, providing users with a comprehensive view of current market trends.

Additional Relevant Data Sets:

Beyond headline news and financial insights, web scraping can be extended to explore and identify additional HTML elements that house valuable data sets. This could include information on commodity prices, currency exchange rates, economic indicators, and more.
The scraping script can be configured to collect a wide array of data, ranging from commodity prices to currency exchange rates and any other relevant information. This enhances the breadth of insights that users can gather from the Bloomberg platform.

Prerequisites

Learn Basic JavaScript:

To scrape data from Bloomberg, start by understanding basic JavaScript concepts. Get familiar with DOM manipulation, which allows you to interact with different parts of a webpage. Learn how to make HTTP requests to fetch data and handle asynchronous operations for smoother coding. Knowing these basics will be essential for our project.

Get Crawlbase API Token:

To enable Bloomberg scraping, obtain a token from Crawlbase.

Log in to your Crawlbase account.
Go to the "Account Documentation" page in your Crawlbase dashboard.
Look for the "JavaScript token" on that page and copy it. It works like a private key that authenticates your requests to the Crawlbase Crawling API, so keep it out of public repositories.
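Rather than pasting the token directly into source code, one option is to read it from an environment variable. The sketch below assumes a variable named CRAWLBASE_JS_TOKEN, which is a hypothetical name chosen for this example and not something Crawlbase requires.

```javascript
// Minimal sketch: load the Crawlbase token from the environment.
// CRAWLBASE_JS_TOKEN is a hypothetical variable name used for illustration.
function getCrawlbaseToken(env = process.env) {
  const token = (env.CRAWLBASE_JS_TOKEN || '').trim();
  if (!token) {
    throw new Error('Set the CRAWLBASE_JS_TOKEN environment variable first.');
  }
  return token;
}

// Example usage with an explicit object instead of process.env:
const token = getCrawlbaseToken({ CRAWLBASE_JS_TOKEN: ' my-token ' });
console.log(token); // the trimmed token value
```

On macOS or Linux you could then start a script with the variable set inline, for example CRAWLBASE_JS_TOKEN=your_token node scraper.js.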

Set Up the Coding Environment:

Prepare your tools for the JavaScript code. Follow these steps:

Create Project Folder: Open your terminal and type "mkdir bloomberg_scraper" to make a new project folder.

mkdir bloomberg_scraper

Navigate to Project Folder: Type "cd bloomberg_scraper" to enter the new folder, making it easier to manage project files.

cd bloomberg_scraper

Create JavaScript File: Type "touch scraper.js" to create a new file named scraper.js (you can choose a different name).

touch scraper.js

Install Crawlbase Package: Type "npm install crawlbase" to add the Crawlbase tool to your project. This tool is important because it helps you talk to the Crawlbase Crawling API, making it easier to get information from websites.

npm install crawlbase

By following these steps, you're setting up the foundation for your Bloomberg scraping project. You'll have a dedicated folder, a JavaScript file for your code, and the necessary Crawlbase tool for organized and efficient scraping.

Scrape Bloomberg Using Crawlbase

Once you have your API token and the Crawlbase Node.js library installed, it's time to start working on the "scraper.js" file. Choose the Bloomberg page that you want to scrape. In this example, we'll focus on the Bloomberg technology page. In the "scraper.js" file, use the Crawlbase library to fetch the selected Bloomberg page and Node's built-in fs module to save the returned HTML to disk. Be sure to replace the URL in the code with the actual URL of the page you want to scrape.

To use the Crawlbase Crawling API, follow these steps:

Ensure you have the "scraper.js" file ready, as instructed earlier.
Copy and paste the provided script into that file.
Run the script in your terminal by typing "node scraper.js".

const { CrawlingAPI } = require('crawlbase'),
  fs = require('fs'),
  crawlbaseToken = 'YOUR_CRAWLBASE_JS_TOKEN',
  api = new CrawlingAPI({ token: crawlbaseToken }),
  bloombergPageURL = 'https://www.bloomberg.com/technology';

api.get(bloombergPageURL).then(handleCrawlResponse).catch(handleCrawlError);

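function handleCrawlResponse(response) {
  // Save the page only when the request succeeded
  if (response.statusCode === 200) {
    fs.writeFileSync('response.html', response.body);
    console.log('HTML saved to response.html');
  } else {
    console.error('Request failed with status code', response.statusCode);
  }
}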

function handleCrawlError(error) {
  console.error(error);
}

HTML Response: the raw HTML of the page is written to response.html in the project folder.
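Requests can fail intermittently, so it may help to retry before giving up. The following is a minimal sketch of a retry wrapper with exponential backoff, not part of the Crawlbase library. The choice of which status codes to retry (429 and 5xx) and the helper names shouldRetry, backoffDelay, and fetchWithRetry are assumptions made for this example.

```javascript
// Sketch of a retry wrapper. Which status codes are worth retrying is an
// assumption here (429 and 5xx); adjust to what you observe in practice.
function shouldRetry(statusCode) {
  return statusCode === 429 || (statusCode >= 500 && statusCode < 600);
}

// Exponential backoff: 1s, 2s, 4s, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage is any function that returns a promise of { statusCode, body },
// for example (url) => api.get(url) from the script above.
async function fetchWithRetry(fetchPage, url, maxAttempts = 3) {
  let response;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    response = await fetchPage(url);
    if (!shouldRetry(response.statusCode)) return response;
    // Wait before the next attempt, but not after the final one
    if (attempt < maxAttempts - 1) await sleep(backoffDelay(attempt));
  }
  return response; // last response after exhausting retries
}
```

With the script above, the call could become fetchWithRetry((url) => api.get(url), bloombergPageURL).then(handleCrawlResponse).catch(handleCrawlError).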

Scrape Bloomberg News Article Data

This section will show you how to gather information from a Bloomberg news article page. The data we aim to collect includes the article's headline, abstract, image URL, authors, publication date, and more. To achieve this, we'll start by obtaining the HTML code of the Bloomberg news article page. Then, we'll create a custom JavaScript scraper using two libraries: cheerio, commonly used for web scraping, and fs, which helps with file operations. cheerio is not bundled with Node.js, so install it first with "npm install cheerio". The provided script goes through the HTML code of the Bloomberg news article page, picks out the necessary data, and stores it in a JSON object.

const { CrawlingAPI } = require('crawlbase'),
  fs = require('fs'),
  crawlbaseToken = 'YOUR_CRAWLBASE_JS_TOKEN',
  api = new CrawlingAPI({ token: crawlbaseToken }),
  bloombergPageURL =
    'https://www.bloomberg.com/news/articles/2024-01-18/tsmc-s-second-fab-in-arizona-delayed-as-us-grants-remain-in-flux?srnd=technology-vp';

api.get(bloombergPageURL).then(handleCrawlResponse).catch(handleCrawlError);

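function handleCrawlResponse(response) {
  // Save the page only when the request succeeded
  if (response.statusCode === 200) {
    fs.writeFileSync('response.html', response.body);
    console.log('HTML saved to response.html');
  } else {
    console.error('Request failed with status code', response.statusCode);
  }
}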

function handleCrawlError(error) {
  console.error(error);
}

// Second script: save it in its own file and run it after response.html has been created
const fs = require('fs'),
  cheerio = require('cheerio');

try {
  // Read HTML content from the response.html file
  const htmlContent = fs.readFileSync('response.html', 'utf-8');

  const $ = cheerio.load(htmlContent);

  // Extracting article category, headline, and abstract
  const category = $('.Eyebrow_sectionTitle-Wew2fboZsjA- a').text().trim();
  const headline = $('.HedAndDek_headline-D19MOidHYLI-').text().replace(/\n\s+/g, ' ').trim();
  const abstractItems = [];
  $('.HedAndDek_abstract-XX636-2bHQw- li').each((index, element) => {
    abstractItems.push($(element).text().trim().replace(/\n\s+/g, ' '));
  });

  const imageUrl = $('div.ledeImage_ledeImage__nrpgq img.ui-image').attr('src');

  const author = $('.Byline_bylineAuthors-Ts-ifi4q-HY- a')
    .map((index, element) => $(element).text().trim())
    .get();

  // Extract publish date (empty string if the page has no <time datetime="..."> element)
  const publishDate = ($('time').attr('datetime') || '').split('T')[0];

  // Creating a JSON object with abstract as an array
  const jsonData = {
    category: category,
    headline: headline,
    abstract: abstractItems,
    imageUrl: imageUrl,
    author: author,
    publishDate: publishDate,
  };

  // Displaying the scraped data in JSON format
  console.log(JSON.stringify(jsonData, null, 2));
} catch (error) {
  console.error('Error reading or parsing the HTML file:', error);
}

In the first code block, the JavaScript code uses the Crawlbase Crawling API to fetch the HTML content of a Bloomberg news article page. The response is then saved to a local file named "response.html" if the HTTP status code is 200. The second block of code utilizes the "cheerio" library to parse the saved HTML file, extracting relevant information such as the article's category, headline, abstract, image URL, author information, and publish date. The extracted data is then organized into a JSON object and displayed in a structured format as shown below:

JSON Response:

{
  "category": "Technology",
  "headline": "TSMC’s Second Fab in Arizona Delayed as US Grants Remain in Flux",
  "abstract": [
    "The firm’s first fab in Arizona has been pushed back to 2025",
    "Biden White House has yet to hand out promised chip subsidies"
  ],
  "author": ["Jane Lanhee Lee", "Debby Wu"],
  "publishDate": "2024-01-18"
}
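Note that JSON.stringify omits properties whose value is undefined, which is likely why imageUrl does not appear in the output above when its selector matches nothing. The sketch below shows one way to normalize the extracted fields so that missing values stay visible as null, and to write the record to disk instead of only logging it. The helper names (cleanText, toDateOnly, normalizeArticle, saveArticle) and the article.json file name are hypothetical, and the sample values are hand-written rather than scraped.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Collapse runs of whitespace; return '' for missing values.
function cleanText(value) {
  return typeof value === 'string' ? value.replace(/\s+/g, ' ').trim() : '';
}

// Keep only the date part of an ISO timestamp, or null when it is absent.
function toDateOnly(datetime) {
  return typeof datetime === 'string' && datetime
    ? datetime.split('T')[0]
    : null;
}

// Build a record with explicit nulls so missing fields stay visible in the JSON.
function normalizeArticle(raw) {
  return {
    category: cleanText(raw.category),
    headline: cleanText(raw.headline),
    abstract: (raw.abstract || []).map(cleanText).filter(Boolean),
    imageUrl: raw.imageUrl || null,
    author: (raw.author || []).map(cleanText).filter(Boolean),
    publishDate: toDateOnly(raw.publishDate),
  };
}

// Write the record as pretty-printed JSON and return the path used.
function saveArticle(article, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(article, null, 2));
  return filePath;
}

// Usage with hand-written sample values (not scraped output):
const sample = normalizeArticle({
  headline: '  Example \n   headline ',
  abstract: ['First point', ''],
  publishDate: '2024-01-18T05:00:00.000Z',
});
saveArticle(sample, path.join(os.tmpdir(), 'article.json'));
```

In the parser script, the object passed to normalizeArticle would hold the raw values read with cheerio, with the full datetime attribute passed as publishDate.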

Conclusion

In conclusion, this tutorial showed how to scrape Bloomberg data using JavaScript and the Crawlbase Crawling API. The Crawling API fetches the raw HTML from Bloomberg pages, and a cheerio-based parser extracts fields from news articles, including category, headline, abstract, image URL, author, and publication date. Explore our additional guides for similar procedures on Yandex, Bing, FlipKart, and Product Hunt. These guides are valuable resources to enhance your data scraping skills across various platforms.

Explore additional scraping guides from Crawlbase:

Web Scrape Expedia Using JavaScript
Web Scrape Booking.com with JavaScript
How to Scrape Glassdoor
Scrape Questions and Answers with Quora Scraper

Frequently Asked Questions

What types of data can be scraped from Bloomberg using Crawlbase?

Crawlbase simplifies Bloomberg scraping, offering a robust solution for extracting diverse financial and market data. The Bloomberg scraper allows users to access real-time information on stocks, investments, and financial markets, ensuring accuracy and timeliness. The tool works for various Bloomberg sections, including Markets, Technology, Politics, Pursuits, Businessweek, Green, and CityLab. Through advanced capabilities and AI integration, Crawlbase enables efficient scraping, covering areas such as economics, deals, fixed income, ETFs, foreign exchange, and more.

Can API requests in Crawlbase be geolocated to a specific country?

Crawlbase gives users the flexibility to geolocate API requests to a specific country. By passing the &country parameter in their requests, users can tailor the API to extract data relevant to their targeted geographic location. This ensures that users obtain region-specific information from Bloomberg. Whether you want to focus on markets in America, Europe, or Asia-Pacific, Crawlbase lets you refine your scraping efforts and obtain location-specific data.
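As an illustration of the country parameter, here is a minimal sketch that builds a raw Crawling API request URL. The endpoint form (https://api.crawlbase.com/ with token and url query parameters) and the two-letter country code are assumptions based on Crawlbase's public documentation, and buildCrawlUrl is a hypothetical helper, so verify both against the documentation for your account.

```javascript
// Sketch: build a raw Crawling API request URL with an optional country.
// The endpoint form and the country code format are assumptions; check
// them against the Crawlbase documentation before relying on this.
function buildCrawlUrl({ token, url, country }) {
  // URLSearchParams percent-encodes the target URL for us
  const params = new URLSearchParams({ token, url });
  if (country) params.set('country', country);
  return `https://api.crawlbase.com/?${params.toString()}`;
}

const requestUrl = buildCrawlUrl({
  token: 'YOUR_CRAWLBASE_JS_TOKEN',
  url: 'https://www.bloomberg.com/technology',
  country: 'US',
});
console.log(requestUrl);
```

When using the Node.js library instead of raw URLs, the same option can probably be passed as a second argument, as in api.get(url, { country: 'US' }), though that call shape is also an assumption here.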

Can I customize Bloomberg scraping in Crawlbase for specific news categories?

In Crawlbase, the Bloomberg scraping process is customizable, allowing you to target specific news categories like finance or technology. This flexibility ensures that you extract only the data relevant to your needs, enhancing the efficiency and precision of the scraping experience. With this adaptability, users can focus on gathering the latest news articles from Bloomberg that align with their specific areas of interest or analysis requirements.

How does Crawlbase comply with Bloomberg's terms and legal regulations while scraping?

Crawlbase is careful about following Bloomberg's rules and applicable legal regulations when scraping data. The platform takes measures to stick to Bloomberg's guidelines, such as monitoring for changes and adjusting as needed. Crawlbase is committed to high legal standards to avoid problems and gives users an ethical scraping solution. By following the rules, Crawlbase reduces the chance of legal issues, making it a reliable and trustworthy tool for Bloomberg scraping.

How fast is the Crawlbase API in responding to requests?

The Crawlbase API is quick and responsive, with an average response time ranging from 4 to 10 seconds when users make requests to scrape Bloomberg. Users can further optimize their results by leveraging parallel requests, as the API accommodates up to 20 requests per second by default. Additionally, Crawlbase offers the flexibility for users to contact support if a rate limit increase is needed to meet specific production requirements, ensuring a responsive and efficient scraping experience.
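To stay under the default limit of 20 requests per second while still sending requests in parallel, one simple approach is to split the URL list into batches and start at most one batch per second. The sketch below assumes this batching strategy and uses hypothetical helper names (chunk, crawlInBatches); it is not a feature of the Crawlbase library.

```javascript
// Split a list into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run fetchPage over all URLs, starting at most `perSecond` requests per second.
// fetchPage is any promise-returning function, e.g. (url) => api.get(url).
async function crawlInBatches(urls, fetchPage, perSecond = 20) {
  const results = [];
  const batches = chunk(urls, perSecond);
  for (let i = 0; i < batches.length; i++) {
    const started = Date.now();
    // allSettled keeps going even if some requests in the batch fail
    const settled = await Promise.allSettled(batches[i].map((url) => fetchPage(url)));
    results.push(...settled);
    const elapsed = Date.now() - started;
    // Wait out the rest of the second before starting the next batch
    if (i < batches.length - 1 && elapsed < 1000) await sleep(1000 - elapsed);
  }
  return results;
}
```

Each entry in the returned array has a status of "fulfilled" or "rejected", so failed URLs can be collected and retried afterwards.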