<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Irio Musskopf</title>
    <description>The latest articles on DEV Community by Irio Musskopf (@irio).</description>
    <link>https://dev.to/irio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F242962%2F8915fc56-fd36-47c7-9aec-4a19acb2cb7a.jpeg</url>
      <title>DEV Community: Irio Musskopf</title>
      <link>https://dev.to/irio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/irio"/>
    <language>en</language>
    <item>
      <title>Running a scraping platform at Google Cloud for as little as US$ 0.05/month</title>
      <dc:creator>Irio Musskopf</dc:creator>
      <pubDate>Fri, 04 Oct 2019 20:43:11 +0000</pubDate>
      <link>https://dev.to/irio/running-a-scraping-platform-at-google-cloud-for-as-little-as-us-0-05-month-3hic</link>
      <guid>https://dev.to/irio/running-a-scraping-platform-at-google-cloud-for-as-little-as-us-0-05-month-3hic</guid>
      <description>&lt;p&gt;I was recently faced with the problem of finding an apartment in Berlin. Having gone through this search before, I decided to automate the task and write software that alerts me to the best deals. In this article, I explain how I built the foundations of this platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The platform I've written is a Go application deployed to Google Cloud using Terraform. Also, it has Continuous Deployment from a private GitHub repository.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;After some quick research, I came up with the following list of platforms to monitor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;eBay Kleinanzeigen&lt;/li&gt;
&lt;li&gt;ImmobilienScout24&lt;/li&gt;
&lt;li&gt;Immowelt&lt;/li&gt;
&lt;li&gt;Nestpick&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A few hours later, I had a Go binary that did everything I needed to run the application locally. It uses a web scraping framework called &lt;a href="https://github.com/gocolly/colly" rel="noopener noreferrer"&gt;Colly&lt;/a&gt; to browse all the platforms' listings, extract basic attributes, and export them to CSV files in the local filesystem.&lt;/p&gt;

&lt;p&gt;Since I didn't want to maintain the application running locally, my first choice would be to get a cheap instance at &lt;a href="https://cloud.google.com/" rel="noopener noreferrer"&gt;Google Cloud&lt;/a&gt;. Once I had this rented virtual machine, I could write a startup script to compile the app from GitHub, and set up a crontab to scrape the platforms on a daily basis.&lt;/p&gt;

&lt;p&gt;That would probably have been the best decision for this specific project, but could I use this personal problem as an opportunity to explore how Google Cloud services integrate with each other?&lt;/p&gt;

&lt;p&gt;Since I had already been involved in multiple projects that needed some sort of scraping application, I believed it was worth the effort. I could easily reuse this setup in the future.&lt;/p&gt;

&lt;p&gt;My architecture started with a few premises:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It should use Google Cloud services.&lt;/li&gt;
&lt;li&gt;It should support data collection every few minutes, even though I would start collecting only once a day.&lt;/li&gt;
&lt;li&gt;It should be as cost-effective as a cheap droplet at DigitalOcean (US$ 5/month).&lt;/li&gt;
&lt;li&gt;It should be easy to deploy. Ideally, it should implement Continuous Deployment.&lt;/li&gt;
&lt;li&gt;It should support triggering a data collection run on demand, e.g., with an HTTP POST request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My hypothesis was that I didn't need a virtual machine running 24/7; thus, the platform should not cost as much as a full month of one. In fact, my application was able to download all the properties I was interested in in under 3 minutes, so I expected something significantly lower.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejklpucgin5lktt23pk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fejklpucgin5lktt23pk1.png" alt="Architecture diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My exploration of the latest Google Cloud services led me to &lt;a href="https://cloud.google.com/run/" rel="noopener noreferrer"&gt;Cloud Run&lt;/a&gt;, a service that "run(s) stateless containers on a fully managed environment or in your own GKE cluster." Still classified as a beta product by Google Cloud, it is built on top of &lt;a href="https://knative.dev/" rel="noopener noreferrer"&gt;Knative&lt;/a&gt; and &lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;. Its key selling point is the pricing model: it charges in increments of 100 milliseconds rather than hours of runtime.&lt;/p&gt;

&lt;p&gt;With a few tweaks, my Go application was wrapped in a Docker container so that Cloud Run could run it. Once it gets an HTTP POST request, it collects attributes from all the advertised properties and publishes them as CSV files to a Google Cloud Storage bucket. For my use case, I created two ways to hit this endpoint: an Internet-accessible address, so I can trigger it whenever I want, and Cloud Scheduler, which is configured to hit it once a day.&lt;/p&gt;

&lt;h2&gt;
  
  
  The application
&lt;/h2&gt;

&lt;p&gt;The application is fairly simple: it consists of an HTTP server with a single endpoint. On every hit, it scrapes all the platforms and saves results in CSVs inside a Storage bucket.&lt;/p&gt;
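
&lt;p&gt;The shape described above could be sketched roughly as follows. This is not the author's main.go: &lt;strong&gt;scrapeAll&lt;/strong&gt; and &lt;strong&gt;handleScrape&lt;/strong&gt; are hypothetical names, the scraping and upload logic is reduced to a placeholder, and the only platform convention assumed is that Cloud Run passes the listening port in the &lt;strong&gt;PORT&lt;/strong&gt; environment variable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
)

// scrapeAll is a hypothetical placeholder for the real work:
// visiting each platform and uploading the resulting CSV files.
func scrapeAll() error {
	return nil
}

// handleScrape runs one collection per POST request and rejects
// every other HTTP method.
func handleScrape(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "use POST", http.StatusMethodNotAllowed)
		return
	}
	if err := scrapeAll(); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	// Cloud Run provides the port to listen on; 8080 is a local fallback.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	http.HandleFunc("/", handleScrape)
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the whole job behind a single handler means the same endpoint can serve both a manual trigger and a scheduled one.&lt;/p&gt;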

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh3nyvdur0l99arjvfr00.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fh3nyvdur0l99arjvfr00.png" alt="Directory tree of the project"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ./main.go
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygrspdvcbbo4ekpzez8v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fygrspdvcbbo4ekpzez8v.png" alt="Screenshot of main.go"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ./Dockerfile
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnwa4efbxsdwcct53916f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fnwa4efbxsdwcct53916f.png" alt="Screenshot of Dockerfile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Other application files can be found in &lt;a href="https://gist.github.com/Irio/3da6ee4dea8cad6613c1337a15044f09" rel="noopener noreferrer"&gt;this Gist&lt;/a&gt;. All feedback is appreciated, as this is one of my first Go projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deployment
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;Install Terraform&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cloud.google.com/sdk/docs/quickstarts" rel="noopener noreferrer"&gt;Install Google Cloud CLI&lt;/a&gt; and sign in to your account with

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$ gcloud auth login&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.cloud.google.com/projectcreate" rel="noopener noreferrer"&gt;Create a Google Cloud project&lt;/a&gt; and configure the CLI to use it with

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;$ gcloud config set project PROJECT_NAME&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.cloud.google.com/iam-admin/serviceaccounts" rel="noopener noreferrer"&gt;Create a Google Cloud Service Account&lt;/a&gt; for use with Terraform, giving it the "Owner" role.&lt;/li&gt;
&lt;li&gt;Create and download a JSON key for this new service account. Place it in &lt;strong&gt;deployment/credentials.json&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.developers.google.com/apis/library" rel="noopener noreferrer"&gt;Enable the following Cloud APIs&lt;/a&gt;:

&lt;ul&gt;
&lt;li&gt;App Engine Admin API&lt;/li&gt;
&lt;li&gt;Cloud Build API&lt;/li&gt;
&lt;li&gt;Cloud Run API&lt;/li&gt;
&lt;li&gt;Cloud Scheduler API&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://console.cloud.google.com/iam-admin/iam" rel="noopener noreferrer"&gt;Give appropriate roles&lt;/a&gt; to the API service account ending with &lt;strong&gt;@cloudbuild.gserviceaccount.com&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Cloud Run Admin&lt;/li&gt;
&lt;li&gt;Cloud Run Service Agent&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Create a &lt;a href="https://source.cloud.google.com/repo/new" rel="noopener noreferrer"&gt;Cloud Source Repository&lt;/a&gt; based on your GitHub repository.&lt;/li&gt;
&lt;li&gt;Set appropriate variable values in &lt;strong&gt;terraform.tfvars&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the permissions in place, use Terraform to set up the rest of the infrastructure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd deployment
$ terraform init
$ terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The initial deployment may take about five minutes since Terraform waits for Cloud Run to build and start before configuring Cloud Scheduler.&lt;/p&gt;

&lt;p&gt;Since Cloud Run is still in beta, with API endpoints in alpha stage, I was not able to declare all the infrastructure in Terraform files. As a temporary workaround, I've written a couple of auxiliary bash scripts that call the Cloud API through its CLI. Fortunately, all this happens in the background when a developer runs &lt;strong&gt;terraform apply&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;Every day, without any human interaction, Cloud Scheduler triggers the scraper, which creates a new folder of CSV files listing the most recently available apartments in my city.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feqyazcfi2zbhbl4if6ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feqyazcfi2zbhbl4if6ln.png" alt="Google Cloud bucket with newly created csv files"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The costs
&lt;/h2&gt;

&lt;p&gt;Not all the services in use are available in the &lt;a href="https://cloud.google.com/products/calculator/#id=2e6bb472-7ce9-4dd2-a2b9-15502f810fb9" rel="noopener noreferrer"&gt;official calculator&lt;/a&gt;. Either way, I've made a rough estimate for my personal use, assuming an unrealistically high rate of one deployment per day.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Storage - US$ 0.02/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Location: US&lt;/li&gt;
&lt;li&gt;Class A operations: 4*30 = 120&lt;/li&gt;
&lt;li&gt;1st month

&lt;ul&gt;
&lt;li&gt;Storage: 2MB*30 = 60MB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;12th month

&lt;ul&gt;
&lt;li&gt;Storage: 2MB*365 = 730MB&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud Run - US$ 0.00/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Location: us-east1&lt;/li&gt;
&lt;li&gt;CPU allocated: 1&lt;/li&gt;
&lt;li&gt;Memory allocated: 1GB&lt;/li&gt;
&lt;li&gt;Concurrent requests per container instance: 1&lt;/li&gt;
&lt;li&gt;Execution Time per Request (ms): 5000&lt;/li&gt;
&lt;li&gt;Outbound Network Bandwidth per request execution (KB): 1000&lt;/li&gt;
&lt;li&gt;Requests per Month: 30&lt;/li&gt;
&lt;/ul&gt;
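
&lt;p&gt;To see why these settings round to US$ 0.00, it helps to convert them into the units Cloud Run bills for. The sketch below does that conversion. The free-tier allowances in it (180,000 vCPU-seconds and 360,000 GiB-seconds per month) are an assumption taken from the public pricing page and may change, and &lt;strong&gt;billableSeconds&lt;/strong&gt; is a hypothetical helper name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// Inputs copied from the calculator settings listed above.
const (
	requestsPerMonth = 30
	msPerRequest     = 5000
	vCPUs            = 1
	memoryGiB        = 1
)

// Assumed monthly free-tier allowances; check the current pricing page.
const (
	freeVCPUSeconds = 180000
	freeGiBSeconds  = 360000
)

// billableSeconds returns the total request time per month in seconds.
func billableSeconds(requests, msEach int) int {
	return requests * msEach / 1000
}

func main() {
	secs := billableSeconds(requestsPerMonth, msPerRequest)
	fmt.Printf("vCPU-seconds used: %d of %d free\n", secs*vCPUs, freeVCPUSeconds)
	fmt.Printf("GiB-seconds used: %d of %d free\n", secs*memoryGiB, freeGiBSeconds)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With these inputs, the program reports 150 vCPU-seconds and 150 GiB-seconds per month. Even at 3 minutes per run, 30 runs would add up to 5,400 seconds, still far below the assumed allowances.&lt;/p&gt;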

&lt;h3&gt;
  
  
  Cloud Build - US$ 0.00/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;free quota of 120 build-minutes/day&lt;/li&gt;
&lt;li&gt;4 build-minutes/day&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Container Registry - US$ 0.02–0.19/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;US$ 0.026/GB per month&lt;/li&gt;
&lt;li&gt;1st month

&lt;ul&gt;
&lt;li&gt;Storage: 20MB*30 = 600MB&lt;/li&gt;
&lt;li&gt;600/1024 * 0.026 = 0.02&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;12th month

&lt;ul&gt;
&lt;li&gt;Storage: 20MB*365 = 7300MB&lt;/li&gt;
&lt;li&gt;7300/1024 * 0.026 = 0.19&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
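
&lt;p&gt;The arithmetic above can be reproduced in a few lines of Go. This is only a sketch: it assumes every daily deployment leaves a 20MB image in the registry and that no image is ever deleted, and &lt;strong&gt;monthlyStorageCost&lt;/strong&gt; is a hypothetical helper name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// pricePerGB is the storage price per GB and month quoted in the list above.
const pricePerGB = 0.026

// monthlyStorageCost returns the monthly bill in US$ once a daily image of
// imageMB megabytes has been accumulating for the given number of days.
func monthlyStorageCost(imageMB, days float64) float64 {
	return imageMB * days / 1024 * pricePerGB
}

func main() {
	fmt.Printf("1st month: US$ %.2f\n", monthlyStorageCost(20, 30))
	fmt.Printf("12th month: US$ %.2f\n", monthlyStorageCost(20, 365))
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The two printed values, US$ 0.02 and US$ 0.19, match the range in the heading. Deleting old images would keep the bill near the lower end.&lt;/p&gt;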

&lt;h3&gt;
  
  
  Cloud Source Repositories - US$ 0.00/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;free quota of 5 project-users&lt;/li&gt;
&lt;li&gt;1 project&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler - US$ 0.00/month
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;free quota of 3 jobs/month&lt;/li&gt;
&lt;li&gt;1 job&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;For comparison, an f1-micro instance (0.6GB of RAM) running for a full month on Google Cloud is included in the free tier; a g1-small instance, with 1.7GB, would cost US$ 13.80 per month. The actual cost could also end up lower or higher, depending on how accurate my initial assumptions were and on further optimizations.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://medium.com/@irio" rel="noopener noreferrer"&gt;https://medium.com/@irio&lt;/a&gt; on September 27, 2019.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>devops</category>
      <category>serverless</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
