
Theo Vasilis for Apify

Originally published at blog.apify.com

How to extract and download news articles online

Hey, we're Apify. You can build, deploy, share, and monitor any data extraction tools on the Apify platform. Check us out.

Why do I need a tool to download articles?

If you're thinking of gathering many articles about one or several topics (say, the latest news on the economy) and then building a corpus from them, running a Google search on a selected website every time is highly impractical.

A faster and more efficient way to extract content from a website for analysis and research is with a tool designed to collect and download news articles. One such tool is Smart Article Extractor, and in this article, we'll show you how to use it.

💡 Smart Article Extractor has been used for data journalism:

➡️ Czech media and their word choices

➡️ Terror or clickbait?

Is extracting news articles legal?

It is perfectly legal to extract publicly available texts from the web, but remember that many of them are protected by copyright law. That means you should not publish articles you have collected without prior permission. If you're simply web scraping for research and citations for a dissertation, you're unlikely to run into problems, but make sure you don't republish intellectual property without consent.

A step-by-step guide to downloading articles from a website

We'll show you how to use text scraping to download articles from a website with Smart Article Extractor.

You can test the scraper by using the default inputs. The default setting is configured this way:

{
    "enqueueFromArticles": false,
    "extendOutputFunction": "($) => {\n const result = {};\n // Uncomment to add a title to the output\n // result.pageTitle = $('title').text().trim();\n\n return result;\n}",
    "isUrlArticleDefinition": {
        "minDashes": 4,
        "hasDate": true,
        "linkIncludes": [
            "article",
            "storyid",
            "?p=",
            "id=",
            "/fpss/track",
            ".html",
            "/content/"
        ]
    },
    "mustHaveDate": true,
    "onlyInsideArticles": true,
    "onlyNewArticles": false,
    "onlyNewArticlesPerDomain": false,
    "proxyConfiguration": {
        "useApifyProxy": true
    },


First, we'll take you through the configuration options of the extractor, and then we'll show you a real-world example of Smart Article Extractor scraping and downloading data from a website. So, let's go through the different options step by step:

1. Choose start URLs or article URLs

You can configure the scraper by choosing start URLs in the website/category URLs input field. Article pages are detected and crawled from these, and they can be any category or subpage URL, for example, https://www.bbc.com/

Text scraping with start URLs or article URLs

Alternatively, you can insert article URLs in the second input field. These are direct URLs for the articles to be extracted, for example, https://www.bbc.com/uk-62836057. No extra pages are crawled from article pages.

Use the advanced options to select the HTTP method to request the URLs and the payload sent with the HTTP request. You also have header and data user options where you can insert a JSON object.

Text scraping HTTP request

2. Select optional Booleans

Text scraping booleans

You have two options for scraping only new articles: one for small runs and a saved per domain option for using the extractor on a large scale. With either option, the extractor will only scrape articles it has not seen in earlier runs. For small runs, scraped URLs are saved in a dataset, while the per domain option saves scraped articles in a dataset and compares them with new ones.
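To make the idea of scraping only new articles concrete, here is a minimal sketch of the bookkeeping involved. The filterNewArticles helper and the in-memory Set are our own illustration and an assumption, not the extractor's actual storage logic.

```javascript
// Hypothetical helper: keep only URLs that were not scraped in earlier runs.
// `seenUrls` stands in for whatever persistent storage the extractor really uses.
function filterNewArticles(candidateUrls, seenUrls) {
  const fresh = [];
  for (const url of candidateUrls) {
    if (!seenUrls.has(url)) {
      seenUrls.add(url); // remember it so the next run skips it
      fresh.push(url);
    }
  }
  return fresh;
}

// Example: article "a" was scraped in a previous run.
const seen = new Set(['https://example.com/news/a']);
const firstRun = filterNewArticles(
  ['https://example.com/news/a', 'https://example.com/news/b'],
  seen,
);
const secondRun = filterNewArticles(['https://example.com/news/b'], seen);

console.log(firstRun); // only the URL ending in /b
console.log(secondRun); // empty, because /b is now known
```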

If you go with the default only inside domain articles option, the extractor will only scrape articles on the domain from where they are linked. If the domain presents links to articles on different domains, e.g., https://www.bbc.com/ vs. https://www.bbc.co.uk, the extractor will not scrape them.
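A same-domain check like the one described above could look roughly like this. It is a sketch that compares exact hostnames, which is an assumption on our part; the extractor may use a different rule.

```javascript
// Sketch: treat two URLs as "inside the same domain" when their hostnames match exactly.
// The isSameDomain name and the exact-hostname rule are assumptions for illustration.
function isSameDomain(sourceUrl, linkUrl) {
  return new URL(sourceUrl).hostname === new URL(linkUrl).hostname;
}

const sameSite = isSameDomain('https://www.bbc.com/', 'https://www.bbc.com/news');
const otherSite = isSameDomain('https://www.bbc.com/', 'https://www.bbc.co.uk/news');

console.log(sameSite); // true
console.log(otherSite); // false, so this link would be skipped
```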

The enqueue articles from articles option allows the scraper to extract articles linked within articles. Otherwise, it will only scrape articles from category pages.

The extractor will scan different sitemaps from the initial article URL with the find articles in sitemaps option. Because this can lead to loading a vast amount of data, including old articles, the time and cost of the scraper will increase. Instead, we recommend using the optional array, sitemap URLs, below.

If you're not sure what a sitemap URL is, it points to an XML file that lists the URLs for a site. On many sites, you can get the sitemap URL by appending /sitemap.xml to the domain URL.

With the sitemap URLs option, you can provide selected sitemap URLs that include the articles you need to scrape. Let's say you want the sitemap URL for apify.com. Just insert https://apify.com/sitemap.xml.
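If you need sitemap URLs for several sites, you can derive them in code. This sketch assumes the conventional /sitemap.xml location, which not every site uses; sitemapUrlFor is a helper name we made up.

```javascript
// Sketch: build the conventional sitemap URL for any page of a site.
// Resolving an absolute path against a base URL drops the base's path and query.
function sitemapUrlFor(siteUrl) {
  return new URL('/sitemap.xml', siteUrl).href;
}

const fromDomain = sitemapUrlFor('https://apify.com');
const fromSubpage = sitemapUrlFor('https://apify.com/store?x=1');

console.log(fromDomain); // https://apify.com/sitemap.xml
console.log(fromSubpage); // https://apify.com/sitemap.xml
```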

Smart Article Extractor Sitemap URLs

You can choose to save the full HTML of the article page, but keep in mind that this will make the data less readable. The use Googlebot headers option allows you to bypass protection and paywalls on some websites, but this increases your chances of getting blocked, so use it with caution.

3. Choose what articles you want to extract

Choose what articles to extract to scrape text from a website

The default minimum word value is 150. This is typically sufficient for article recognition.

You can also use the date option to command the scraper to extract articles from a specific day. Otherwise, it will scrape all articles. You can use two formats for this option: an absolute date in YYYY-MM-DD format, e.g., 2019-12-31, or a relative period, e.g., 1 week or 20 days.
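To illustrate the two date formats, here is a sketch of how such a value could be turned into a concrete date. The parseDateFrom helper is hypothetical and only understands days and weeks; the extractor's own parser may accept more.

```javascript
// Sketch: convert "2019-12-31" or a relative value like "20 days" into a Date.
// Only "day(s)" and "week(s)" are handled here; that limit is our assumption.
function parseDateFrom(value, now = new Date()) {
  const absolute = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (absolute) {
    return new Date(Date.UTC(Number(absolute[1]), Number(absolute[2]) - 1, Number(absolute[3])));
  }
  const relative = /^(\d+)\s+(day|week)s?$/i.exec(value.trim());
  if (relative) {
    const days = Number(relative[1]) * (relative[2].toLowerCase() === 'week' ? 7 : 1);
    return new Date(now.getTime() - days * 24 * 60 * 60 * 1000);
  }
  throw new Error(`Unsupported date value: ${value}`);
}

// A fixed "now" keeps the example reproducible.
const fixedNow = new Date('2022-09-19T00:00:00.000Z');
const absoluteDate = parseDateFrom('2019-12-31').toISOString();
const oneWeekAgo = parseDateFrom('1 week', fixedNow).toISOString();
const twentyDaysAgo = parseDateFrom('20 days', fixedNow).toISOString();

console.log(absoluteDate); // 2019-12-31T00:00:00.000Z
console.log(oneWeekAgo); // 2022-09-12T00:00:00.000Z
console.log(twentyDaysAgo); // 2022-08-30T00:00:00.000Z
```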

The default must have date value lets the extractor know that it should only scrape articles with publication dates.

In the is the URL an article? option, you can input JSON settings that define which URLs the scraper should consider articles. If a URL meets any of the conditions, the scraper will open the link and extract the article.
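Here is a rough sketch of how a definition like the default isUrlArticleDefinition could be evaluated against a URL. The looksLikeArticle function and the date pattern are our own assumptions; the extractor's real matching rules may differ.

```javascript
// The default definition from the input shown earlier.
const definition = {
  minDashes: 4,
  hasDate: true,
  linkIncludes: ['article', 'storyid', '?p=', 'id=', '/fpss/track', '.html', '/content/'],
};

// Sketch: a URL counts as an article if ANY of the conditions holds.
function looksLikeArticle(url, def) {
  const dashCount = (url.match(/-/g) || []).length;
  if (def.minDashes && dashCount >= def.minDashes) return true;
  // Assumed date pattern: /2022/09/19 or /2022-09-19 somewhere in the URL.
  if (def.hasDate && /\/\d{4}[\/-]\d{1,2}[\/-]\d{1,2}/.test(url)) return true;
  return (def.linkIncludes || []).some((part) => url.includes(part));
}

const byDashes = looksLikeArticle(
  'https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098',
  definition,
);
const byDate = looksLikeArticle('https://example.com/2022/09/19/story', definition);
const categoryPage = looksLikeArticle('https://theconversation.com/global', definition);

console.log(byDashes); // true
console.log(byDate); // true
console.log(categoryPage); // false
```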

4. Custom enqueuing and pseudo URLs

You can use the pseudo URLs function in the custom enqueuing box to include more links like pagination or categories. Read more about pseudo URLs here.
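As we understand it, a pseudo URL is a URL in which the parts in square brackets are regular expressions. The sketch below shows one way to turn such a pattern into a RegExp. pseudoUrlToRegExp is our own simplified helper, and it does not support a ] inside the bracketed part.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Sketch: literal text is escaped, text inside [...] is used as a regex as-is.
function pseudoUrlToRegExp(pseudoUrl) {
  let pattern = '^';
  let rest = pseudoUrl;
  while (rest.length > 0) {
    const open = rest.indexOf('[');
    if (open === -1) {
      pattern += escapeRegExp(rest);
      break;
    }
    const close = rest.indexOf(']', open);
    if (close === -1) throw new Error('Unclosed [ in pseudo URL');
    pattern += escapeRegExp(rest.slice(0, open)) + rest.slice(open + 1, close);
    rest = rest.slice(close + 1);
  }
  return new RegExp(`${pattern}$`);
}

// Hypothetical pagination pattern: /news/page/ followed by digits.
const pagination = pseudoUrlToRegExp('https://example.com/news/page/[\\d+]');

console.log(pagination.test('https://example.com/news/page/2')); // true
console.log(pagination.test('https://example.com/news/page/two')); // false
```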

Scrape text from websites with pseudo URLs

Use the link selector option to limit which tags will be enqueued. To activate this option, enter a CSS selector such as a.some-class.

The max depth input is for the depth of crawling, i.e., how many times the scraper picks up a link to other web pages. If you input a number of total pages to be crawled in the max pages per crawl box, the extractor will stop automatically after reaching that number. The maximum number of total pages crawled includes the home page, pagination pages, and invalid articles.

The max articles per crawl option sets the maximum number of valid articles the extractor will scrape. The extractor stops automatically after reaching that number.
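The interplay of the depth, page, and article limits is easier to see in a toy crawl. This sketch runs on a made-up in-memory site. The option names maxDepth, maxPagesPerCrawl, and maxArticlesPerCrawl are assumed for illustration, and the logic is a simplification, not the extractor's implementation.

```javascript
// Made-up site: each page maps to the links found on it.
const site = {
  '/': ['/world', '/tech'],
  '/world': ['/world/article-1', '/world/article-2'],
  '/tech': ['/tech/article-3'],
  '/world/article-1': [],
  '/world/article-2': [],
  '/tech/article-3': [],
};

// Breadth-first crawl that stops at whichever limit is reached first.
function crawl(startPage, { maxDepth, maxPagesPerCrawl, maxArticlesPerCrawl }) {
  const queue = [{ page: startPage, depth: 0 }];
  const visited = new Set([startPage]);
  const articles = [];
  let pagesCrawled = 0;

  while (queue.length > 0) {
    if (pagesCrawled >= maxPagesPerCrawl) break;
    if (articles.length >= maxArticlesPerCrawl) break;
    const { page, depth } = queue.shift();
    pagesCrawled += 1; // every opened page counts, article or not
    if (page.includes('article')) articles.push(page);
    if (depth >= maxDepth) continue; // do not follow links any deeper
    for (const link of site[page] || []) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ page: link, depth: depth + 1 });
      }
    }
  }
  return { pagesCrawled, articles };
}

const limitedByArticles = crawl('/', { maxDepth: 2, maxPagesPerCrawl: 100, maxArticlesPerCrawl: 2 });
const limitedByDepth = crawl('/', { maxDepth: 1, maxPagesPerCrawl: 100, maxArticlesPerCrawl: 100 });

console.log(limitedByArticles); // 5 pages crawled, 2 articles
console.log(limitedByDepth); // 3 pages crawled, 0 articles
```

Note how the page count in the first run is higher than the article count: the home page and category pages count toward the page limit even though they are not articles.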

Use the max concurrency option to limit the speed of the scraper to avoid getting blocked.

5. Proxy configuration

Use a proxy for text scraping

The default input is automatic proxy. If you want to use your own proxies, use the ProxyConfigurationOptions.proxyUrls option, and the configuration will rotate your list of proxy URLs.
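Rotating a list of proxy URLs can be pictured as a simple round-robin. The sketch below is our own illustration with placeholder proxy addresses; the actual rotation strategy behind proxyUrls may differ.

```javascript
// Sketch: hand out proxy URLs one after another, wrapping around at the end.
function createProxyRotator(proxyUrls) {
  if (proxyUrls.length === 0) throw new Error('proxyUrls must not be empty');
  let index = 0;
  return function nextProxyUrl() {
    const url = proxyUrls[index];
    index = (index + 1) % proxyUrls.length;
    return url;
  };
}

// Placeholder proxy addresses, not real servers.
const nextProxyUrl = createProxyRotator([
  'http://user:pass@proxy-1.example.com:8000',
  'http://user:pass@proxy-2.example.com:8000',
]);

const picks = [nextProxyUrl(), nextProxyUrl(), nextProxyUrl()];
console.log(picks); // proxy-1, proxy-2, then proxy-1 again
```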

6. Browser options

Use browser (Puppeteer) for text scraping

The use browser (Puppeteer) option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.

The wait on each page (ms) value is the number of milliseconds the extractor will wait on each page before scraping data. Wait for selector on each page is an optional string that tells the extractor which selector to wait for on each page before scraping the data.

7. Extend output function

Extend output function for text scraping

This function allows you to merge your custom extraction with the default one. You can only return an object from this function. This object will be merged/overwritten with the default output for each article.
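To show what merged/overwritten means in practice, here is a small sketch. The function has the same shape as the default extendOutputFunction with the title line uncommented, plus an illustrative lang override. The fake$ stand-in and the object-spread merge are assumptions that mimic the behavior described above, not the extractor's internals.

```javascript
// Custom extraction: add a new field and overwrite an existing one.
const extendOutputFunction = ($) => {
  const result = {};
  result.pageTitle = $('title').text().trim();
  result.lang = 'en-GB'; // illustrative override of a default field
  return result;
};

// Tiny stand-in for the jQuery-style `$`; it only supports $('title').text().
const fake$ = (selector) => ({
  text: () => (selector === 'title' ? '  Electric planes are coming  ' : ''),
});

// Shortened default output for one article.
const defaultOutput = { title: 'Short-hop regional flights', lang: 'en' };

// Later keys win, so custom fields overwrite defaults with the same name.
const merged = { ...defaultOutput, ...extendOutputFunction(fake$) };

console.log(merged);
// { title: 'Short-hop regional flights', lang: 'en-GB', pageTitle: 'Electric planes are coming' }
```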

8. Compute units and notifications

Choose compute units and notifications options for text scraping

With the above options, you can command the scraper to stop running after reaching a certain number of compute units and to send notifications to specified email addresses when the number of CUs is reached.

9. Options

Build, timeout, and memory options for text scraping

Finally, the last box of options lets you set the tag or number of the build you want to run (this can be something like latest, beta, or 1.2.34), the number of seconds after which the scraper should time out (a value of zero means it will run until completion or forever), and the RAM allocated for the extractor in megabytes.
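The same build, timeout, and memory options can also be passed when starting a run through the Apify API. This sketch only builds the request URL. The endpoint and query parameter names reflect our understanding of the API, so check the API reference before relying on them; username~actor-name and MY_TOKEN are placeholders.

```javascript
// Sketch: build the URL for starting an Actor run with run options as query parameters.
function buildRunUrl(actorId, { token, build, timeoutSecs, memoryMbytes }) {
  const url = new URL(`https://api.apify.com/v2/acts/${encodeURIComponent(actorId)}/runs`);
  url.searchParams.set('token', token);
  if (build) url.searchParams.set('build', build);
  if (timeoutSecs !== undefined) url.searchParams.set('timeout', String(timeoutSecs));
  if (memoryMbytes !== undefined) url.searchParams.set('memory', String(memoryMbytes));
  return url.href;
}

// Placeholder actor ID and token.
const runUrl = buildRunUrl('username~actor-name', {
  token: 'MY_TOKEN',
  build: 'latest',
  timeoutSecs: 3600,
  memoryMbytes: 4096,
});

console.log(runUrl);
```

You would send a POST request with the input JSON as the body to this URL.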

How to download news articles (example)

Now you've seen how Smart Article Extractor works, let's do a quick and simple step-by-step demonstration of the tool in action.

Step 1. Go to Smart Article Extractor on the Apify platform

First, go to Smart Article Extractor on the Apify platform and click Try for free.

Scrape text from websites and try Smart Article Extractor for free

You'll be redirected to sign up first if you don't have an account (you don't need a credit card, and there's no time limit on your free subscription). Otherwise, you can get started right away.

Scrape URLs to extract text data

Step 2. Add URLs for the articles you want to download

We'll scrape the start URL https://theconversation.com/global. If you keep the remaining default values, all you need to do is click Save & Start (step 4).

Step 3. Choose your settings (optional)

You can configure the tool for your specific case by following the step-by-step guide we covered earlier. Here are three of the most important options to keep in mind:

  1. You can select the publication dates from which you want articles to be extracted

  2. You can extend the search to pseudo URLs. Read more about pseudo URLs here.

  3. You can choose the minimum word count per article (the default 150 is the recommended minimum for article recognition)

Step 4. Start downloading articles

Once you're happy with your configuration, or if you're using the default settings, just click the Save & Start button. The extractor will now begin collecting articles. You will see the data in the log while the tool is running, but wait until the status has changed to succeeded before you try to download the information.

Text scraping run with Smart Article Extractor

Step 5. Export and download the article data

Once the article extractor has finished, click on the Storage tab to download the information in any of the available formats.
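If you prefer to fetch the results programmatically instead of clicking through the Storage tab, the dataset items can be requested over the Apify API. This sketch only builds the download URL. The endpoint and format list reflect our understanding of the API, and MY_DATASET_ID is a placeholder.

```javascript
// Formats we believe the dataset items endpoint accepts (an assumption to verify).
const SUPPORTED_FORMATS = ['json', 'jsonl', 'csv', 'xlsx', 'xml', 'html', 'rss'];

// Sketch: build the URL for downloading dataset items in a given format.
function buildDatasetItemsUrl(datasetId, format = 'json') {
  if (!SUPPORTED_FORMATS.includes(format)) {
    throw new Error(`Unsupported format: ${format}`);
  }
  const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
  url.searchParams.set('format', format);
  return url.href;
}

const jsonUrl = buildDatasetItemsUrl('MY_DATASET_ID');
const csvUrl = buildDatasetItemsUrl('MY_DATASET_ID', 'csv');

console.log(jsonUrl); // https://api.apify.com/v2/datasets/MY_DATASET_ID/items?format=json
console.log(csvUrl); // https://api.apify.com/v2/datasets/MY_DATASET_ID/items?format=csv
```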

Scrape and download articles from websites in JSON and other formats

Let's go with JSON. Here's the data for the first of the 45 results we got:

{
  "url": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "loadedUrl": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "loadedDomain": "theconversation.com",
  "title": "Short-hop regional flights could be running on batteries in a few years",
  "softTitle": "Electric planes are coming: Short-hop regional flights could be running on batteries in a few years",
  "date": "2022-09-19T12:21:03.000Z",
  "author": [
    "Gökçin Çınar"
  ],
  "publisher": "The Conversation",
  "copyright": "2010-2022",
  "favicon": "https://cdn.theconversation.com/static/tc/@theconversation/ui/dist/esm/logos/favicon-cdcdc0dd51ffe5238483c3f27fd2eb57.ico",
  "description": "Air Canada and United Airlines both have orders for hybrid electric 30-seaters. An aerospace engineer explains where electrification, hydrogen and sustainable aviation fuels are headed.",
  "lang": "en",
  "canonicalLink": "https://theconversation.com/electric-planes-are-coming-short-hop-regional-flights-could-be-running-on-batteries-in-a-few-years-190098",
  "tags": [],
  "image": "https://images.theconversation.com/files/484695/original/file-20220914-9158-ybu2z4.jpg?ixlib=rb-1.1.0&rect=0%2C528%2C5043%2C2521&q=45&auto=format&w=1356&h=668&fit=crop",
  "videos": [
    {
      "height": "400",
      "width": "100%"
    },
    {
      "src": "https://datawrapper.dwcdn.net/5mb3z/6/",
      "height": "400px",
      "width": "100%"
    }
  ],
  "links": [
    {
      "text": "quietly buzzing around Europe",
      "href": "https://investor.textron.com/news/news-releases/press-release-details/2022/Textron-Completes-Acquisition-of-Pipistrel/default.aspx"
    },
    {
      "text": "electric sea planes",
      "href": "https://harbourair.com/harbour-air-and-magnix-announce-successful-flight-of-worlds-first-commercial-electric-airplane/"
    },
    {
      "text": "Air Canada",
      "href": "http://heartaerospace.com/heart-aerospace-unveils-new-airplane-design-confirms-air-canada-and-saab-as-new-shareholders/"
    },
    {
      "text": "first hybrid electric 50- to 70-seat",
      "href": "https://www.nrel.gov/docs/fy22osti/80220.pdf"
    },
    {
      "text": "could be ready",
      "href": "https://www.electricaviationgroup.com/electric-flight/"
    },
    {
      "text": "three to five times more",
      "href": "https://www.nrel.gov/docs/fy22osti/80220.pdf"
    },
    {
      "text": "Gökçin Çınar",
      "href": "https://scholar.google.com/citations?user=KIbLE10AAAAJ&hl=en"
    },
    {
      "text": "electric alternative",
      "href": "https://www.mdpi.com/2071-1050/14/10/5880"
    },
    {
      "text": "cut fuel use by about 10%",
      "href": "https://arc.aiaa.org/doi/10.2514/1.C036919"
    },
    {
      "text": "make more use of regional airports",
      "href": "https://sacd.larc.nasa.gov/sacd/wp-content/uploads/sites/167/2021/04/2021-04-20-RAM.pdf"
    },
    {
      "text": "corn, oilseeds",
      "href": "https://www.energy.gov/eere/bioenergy/2016-billion-ton-report"
    },
    {
      "text": "algae",
      "href": "https://biomassmagazine.com/articles/18484/honeywell-technology-enables-jet-flights-with-saf-from-algal-oil"
    },
    {
      "text": "by around 80%",
      "href": "https://www.iata.org/en/programs/environment/sustainable-aviation-fuels/"
    },
    {
      "text": "route planning",
      "href": "https://theconversation.com/why-the-aviation-industry-must-look-beyond-carbon-to-get-serious-about-climate-change-186947"
    },
    {
      "text": "green hydrogen",
      "href": "https://www.energy.gov/eere/fuelcells/hydrogen-production-electrolysis"
    },
    {
      "text": "still takes up more space",
      "href": "https://www.iata.org/contentassets/d13875e9ed784f75bac90f000760e998/fact_sheet7-hydrogen-fact-sheet_072020.pdf"
    },
    {
      "text": "aiming to have mature technology by 2025",
      "href": "https://www.airbus.com/en/innovation/zero-emission/hydrogen/zeroe"
    },
    {
      "text": "testing a 34-seat, hydrogen-electric airplane",
      "href": "https://australianaviation.com.au/2022/07/rex-to-trial-electric-planes-on-short-routes-in-2024/"
    },
    {
      "text": "International Civil Aviation Organization",
      "href": "https://www.icao.int/about-icao/Pages/default.aspx"
    },
    {
      "text": "cut net carbon dioxide emissions 50%",
      "href": "https://www.icao.int/Meetings/2022-ICAO-LTAG-GLADS/Pages/default.aspx"
    }
  ],
  "text": "Electric planes might seem futuristic, but they aren't that far off, at least for short hops.\n\nTwo-seater Velis Electros are already quietly buzzing around Europe, electric sea planes are being tested in British Columbia, and larger planes are coming. Air Canada announced on Sept. 15, 2022, that it would buy 30 electric-hybrid regional aircraft from Sweden's Heart Aerospace, which expects to have its 30-seat plane in service by 2028. Analysts at the U. S. National Renewable Energy Lab note that the first hybrid electric 50- to 70-seat commuter plane could be ready not long after that. In the 2030s, they say, electric aviation could really take off.\n\nThat matters for managing climate change. About 3% of global emissions come from aviation today, and with more passengers and flights expected as the population expands, aviation could be producing three to five times more carbon dioxide emissions by 2050 than it did before the COVID-19 pandemic.\n\nAerospace engineer and assistant professor Gökçin Çınar develops sustainable aviation concepts, including hybrid-electric planes and hydrogen fuel alternatives, at the University of Michigan. We asked her about the key ways to cut aviation emissions today and where technologies like electrification and hydrogen are headed.\n\nAircraft are some of the most complex vehicles out there, but the biggest problem for electrifying them is the battery weight.\n\nIf you tried to fully electrify a 737 with today's batteries, you would have to take out all the passengers and cargo and fill that space with batteries just to fly for under an hour.\n\nJet fuel can hold about 50 times more energy compared to batteries per unit mass. So, you can have 1 pound of jet fuel or 50 pounds of batteries. To close that gap, we need to either make lithium-ion batteries lighter or develop new batteries that hold more energy. New batteries are being developed, but they aren't yet ready for aircraft.\n\nEven though we might not be able to fully electrify a 737, we can get some fuel burn benefits from batteries in the larger jets by using hybrid propulsion systems. We are trying to make that happen in the short term, with a 2030-2035 target for smaller regional planes. The less fuel burned during flight, the fewer greenhouse gas emissions.\n\nHybrid electric aircraft are similar to hybrid electric cars in that they use a combination of batteries and aviation fuels. The problem is that no other industry has the weight limitations that we do in the aerospace industry.\n\nThat's why we have to be very smart about how and how much we are hybridizing the propulsion system.\n\nUsing batteries as a power assist during takeoff and climb are very promising options. Taxiing to the runway using just electric power could also save a significant amount of fuel and reduce the local emissions at airports. There is a sweet spot between the added weight of the battery and how much electricity you can use to get net fuel benefits. This optimization problem is at the center of my research.\n\nHybrids would still burn fuel during flight, but it could be considerably less than just relying entirely on jet fuel.\n\nI see hybridization as a mid-term option for larger jets, but a near-term solution for regional aircraft.\n\nFor 2030 to 2035, we're focused on hybrid turboprops, typically regional aircraft with 50-80 passengers or used for freight. These hybrids could cut fuel use by about 10%.\n\nWith electric hybrids, airlines could also make more use of regional airports, reducing congestion and time larger planes spend idling on the runway.\n\nShorter term we'll see more use of sustainable aviation fuels, or SAF. With today's engines, you can dump sustainable aviation fuel into the same fuel tank and burn it. Fuels made from corn, oilseeds, algae and other fats are already being used.\n\nSustainable aviation fuels can reduce an aircraft's net carbon dioxide emissions by around 80%, but supply is limited, and using more biomass for fuel could compete with food production and lead to deforestation.\n\nA second option is using synthetic sustainable aviation fuels, which involves capturing carbon from the air or other industrial processes and synthesizing it with hydrogen. But that's a complex and costly process and does not have a high production scale yet.\n\nAirlines can also optimize their operations in the short term, such as route planning to avoid flying nearly empty planes. That can also reduce emissions.\n\nHydrogen fuel has been around a very long time, and when it's green hydrogen - produced with water and electrolysis powered by renewable energy - it doesn't produce carbon dioxide. It can also hold more energy per unit of mass than batteries.\n\nThere are two ways to use hydrogen in an airplane: either in place of regular jet fuel in an engine, or combined with oxygen to power hydrogen fuel cells, which then generate electricity to power the aircraft.\n\nThe problem is volume - hydrogen gas takes up a lot of space. That's why engineers are looking at methods like keeping it very cool so it can be stored as liquid until it's burned as a gas. It still takes up more space than jet fuel, and the storage tanks are heavy, so how to store, handle or distribute it on aircraft is still being worked out.\n\nAirbus is doing a lot of research on hydrogen combustion using modified gas turbine engines with an A380 platform, and aiming to have mature technology by 2025. Australia's Rex airline expects to start testing a 34-seat, hydrogen-electric airplane for short hops in the next few years.\n\nDue to the variety of options, I see hydrogen as one of the key technologies for sustainable aviation.\n\nThe problem with aviation emissions isn't their current levels - it's the fear that their emissions will increase rapidly as demand increases. By 2050, we could see three to five times more carbon dioxide emissions from aviation than before the pandemic.\n\nThe International Civil Aviation Organization, a United Nations agency, generally defines the industry's goals, looking at what's feasible and how aviation can push the boundaries.\n\nIts long-term goal is to cut net carbon dioxide emissions 50% by 2050 compared with 2005 levels. Getting there will require a mix of different technologies and optimization. I don't know if we're going to be able to reach it by 2050, but I believe we must do everything we can to make future aviation environmentally sustainable."
}
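Once you have the JSON export, building a corpus is mostly a matter of picking the fields you need. Here is a minimal sketch that works on items shaped like the result above (title, date, text). The summarizeArticles helper and the two sample items are made up for illustration.

```javascript
// Sketch: reduce exported items to a compact overview and drop very short texts.
function summarizeArticles(items, minWords = 150) {
  return items
    .map((item) => ({
      title: item.title,
      date: item.date ? item.date.slice(0, 10) : null, // keep only YYYY-MM-DD
      wordCount: item.text.split(/\s+/).filter(Boolean).length,
    }))
    .filter((row) => row.wordCount >= minWords);
}

// Two shortened, made-up items in the same shape as the export.
const exportedItems = [
  {
    title: 'Short-hop regional flights could be running on batteries in a few years',
    date: '2022-09-19T12:21:03.000Z',
    text: 'Electric planes might seem futuristic, but they are not that far off.',
  },
  { title: 'Too short', date: null, text: 'Just a stub.' },
];

// A low threshold so the shortened sample text passes the filter.
const overview = summarizeArticles(exportedItems, 5);
console.log(overview);
// One row: the first title, date '2022-09-19', wordCount 12
```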


Start downloading articles

We barely scratched the surface with that example, so we suggest you get started with your own text-scraping tasks and enjoy discovering what else Smart Article Extractor can do. If you have any trouble, you can reach out to us, and we'll be happy to help.
