DEV Community

sivakon
sivakon

Posted on • Updated on

Easy concurrent scraping with Scala with Akka streams

Step 1: Download ammonite

# Download Ammonite
$ curl -L https://github.com/lihaoyi/ammonite/releases/download/2.0.4/2.12-2.0.4-boostrap > amm212 && chmod +x amm212

Step 2. Start ammonite

$ amm212 --class-based

Step 3. We need akka-streams

import $ivy.`com.typesafe.akka::akka-stream:2.6.3`
// standard imports
import akka.stream._
import akka.stream.scaladsl._
import akka.{ Done, NotUsed }
import akka.actor.ActorSystem
import akka.util.ByteString
import scala.concurrent._
import scala.concurrent.duration._

// Json parsing
import ujson._

// Handle files
import java.nio.file.Paths


implicit val system = ActorSystem("QuickStart")

val hotelids = scala.io.Source.fromFile("source.txt") // we need to scrape these urls or some ids for an endpoint

def parser(i: Int): (Int, String) = {
      val url = s"ENDPOINT/{i}"
      val urlResponse = Try(ujson.read(requests.get(url).data.toString))
      val result = urlResponse match {
        case Success(res) => "PARSED_RESULT"
        case Failure(e) => "null"
      }
      return (i, result)
    }

Step 4. We create file sink to store the data (you can print to console as well)

def lineSink(filename: String): Sink[(Int, String), Future[IOResult]] =
    Flow[(Int, String)].map(s => ByteString(s._1 + "," + s._2 + "\n")).toMat(FileIO.toPath(Paths.get(filename)))(Keep.right)

Step 5. Introduce throttling behavior because some websites disallow too many requests. Here, we make 25 requests in 10 seconds (5 requests in 2 seconds)

val resultFile = source.map(_.toInt).throttle(25, 10.second).map(parser).runWith(lineSink("output.txt"))

All done from console with Ammonite shell!!

Top comments (0)