<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Coleman</title>
    <description>The latest articles on DEV Community by Kevin Coleman (@kevincolemaninc).</description>
    <link>https://dev.to/kevincolemaninc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F98543%2F8c15c754-6eaf-4054-ad18-8c7bb4ba8daf.png</url>
      <title>DEV Community: Kevin Coleman</title>
      <link>https://dev.to/kevincolemaninc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevincolemaninc"/>
    <language>en</language>
    <item>
      <title>How To: NSFW Image detection on Digital Ocean Apps</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Tue, 08 Jun 2021 16:16:40 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/how-to-nsfw-image-detection-on-digital-ocean-apps-28gi</link>
      <guid>https://dev.to/kevincolemaninc/how-to-nsfw-image-detection-on-digital-ocean-apps-28gi</guid>
      <description>&lt;h2&gt;
  
  
  NSFW porn detection microservice
&lt;/h2&gt;

&lt;p&gt;I built a low-cost NSFW API hosted on &lt;a href="https://docs.digitalocean.com/products/app-platform/"&gt;Digital Ocean's new App Platform&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do image tagging ML models work?
&lt;/h3&gt;

&lt;p&gt;Making predictions based on images involves two basic steps: training the model and then running the prediction. Instructions for training the ML model can be found in the GitHub repo: &lt;a href="https://github.com/GantMan/nsfw_model#training-folder-contents"&gt;GantMan/nsfw_model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The prediction API first fetches the remote image and saves the bytes to disk. Persisting to disk simplifies communicating with the ML library since the library accepts a file path, not a byte stream.&lt;/p&gt;

&lt;p&gt;Then the image is resized to fit the dimensions the ML model expects. The algorithm needs to compare apples to apples, so resizing to match the dimensions of the training images is critical for a valid comparison.&lt;/p&gt;

&lt;p&gt;The resized image is categorized using the &lt;a href="https://github.com/KevinColemanInc/NSFW-FLASK/tree/master/mobilenet_v2_140_2240"&gt;attached model&lt;/a&gt;. This produces a float score for each of the categories: drawings, hentai, neutral, porn, and sexy. The higher the score, the more likely the image belongs to that category.&lt;/p&gt;

&lt;p&gt;Once the prediction is made, we clean up after ourselves by deleting the image from disk and returning the response.&lt;/p&gt;
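&lt;p&gt;The whole fetch, persist, predict, clean-up flow fits in a few lines. Below is a minimal sketch of that flow, not the service's actual code: &lt;code&gt;fetch_image&lt;/code&gt; and &lt;code&gt;predict_from_path&lt;/code&gt; are hypothetical stubs standing in for the HTTP fetch and the ML library call.&lt;/p&gt;

```python
import os
import tempfile

def fetch_image(url):
    # Hypothetical stub for the real HTTP fetch of the remote image.
    return b"\x89PNG fake image bytes"

def predict_from_path(path):
    # Hypothetical stub for the ML library call, which takes a file
    # path rather than a byte stream.
    return {"drawings": 0.12, "hentai": 0.02, "neutral": 0.80,
            "porn": 0.02, "sexy": 0.04}

def predict(url):
    data = fetch_image(url)
    # Persist to disk because the ML library wants a file path.
    tmp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)
    try:
        tmp.write(data)
        tmp.close()
        return predict_from_path(tmp.name)
    finally:
        os.unlink(tmp.name)  # clean up the temp file after predicting

scores = predict("https://www.kcoleman.me/images/hills.jpg")
```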

&lt;p&gt;On the client, these scores are converted to 3 states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Definitely Adult Content&lt;/li&gt;
&lt;li&gt;Unknown&lt;/li&gt;
&lt;li&gt;Definitely Safe Content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unknown state will need to be human-reviewed and bucketed into one of the "definite" categories. For my first pass, I use a combination of the "sexy" and "porn" scores to determine if an image is "Definitely Adult Content," and I look at the "neutral" score to decide if it is "Definitely Safe Content."&lt;/p&gt;
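&lt;p&gt;The client-side bucketing can be written as a tiny function. This is a hypothetical sketch: the threshold values below are stand-ins, not the app's exact cutoffs.&lt;/p&gt;

```python
def moderation_state(scores, adult_threshold=0.7, safe_threshold=0.7):
    # Combine "sexy" and "porn" for the adult check, as described above.
    # Thresholds are illustrative assumptions, not the real app's values.
    if scores.get("porn", 0.0) + scores.get("sexy", 0.0) >= adult_threshold:
        return "definitely_adult"
    # A high "neutral" score marks the image as definitely safe.
    if scores.get("neutral", 0.0) >= safe_threshold:
        return "definitely_safe"
    return "unknown"  # needs human review

print(moderation_state({"neutral": 0.80, "porn": 0.02, "sexy": 0.04}))  # definitely_safe
```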

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Self-hosting this service only takes a couple of hours, since the API is so simple and Digital Ocean's App Platform allows for Heroku-like deployment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flask API
&lt;/h4&gt;

&lt;p&gt;You will need to develop your own client, but there are only two HTTP endpoints to implement against: POST &lt;code&gt;/predict&lt;/code&gt; and GET &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  POST /predict
&lt;/h4&gt;

&lt;p&gt;The service accepts a URL of an image to fetch and process. Passing a URL instead of the image bytes reduces the workload on the client and avoids the overhead of base64-encoding images for the transfer (&lt;a href="https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/"&gt;base64 adds ~33% space overhead&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$curl -XPOST 'http://localhost:8080/predict?url=https://www.kcoleman.me/images/hills.jpg'

{"drawings":0.11510543525218964,"hentai":0.024719053879380226,"neutral":0.803202748298645,"porn":0.0172234196215868,"sexy":0.039749305695295334}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  GET /health
&lt;/h4&gt;

&lt;p&gt;The health endpoint helps you monitor if the service is running without needing to process an image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl 'http://localhost:8080/health'

{"status":"ok"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hosting ML microservices
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Heroku
&lt;/h4&gt;

&lt;p&gt;Unfortunately, Heroku &lt;a href="https://devcenter.heroku.com/changelog-items/1145"&gt;limits the slug size to 500MB&lt;/a&gt;. After compilation, the Flask app is 635MB (it has to bundle the 250MB ML model plus the ML framework), which makes it impossible to host this service on Heroku.&lt;/p&gt;

&lt;h4&gt;
  
  
  Digital Ocean
&lt;/h4&gt;

&lt;p&gt;&lt;a href="/images/digitalocean-nsfw-flask.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/digitalocean-nsfw-flask.png" alt="row of hot dogs with various sauces and condiments" title="row of hot dogs with various sauces and condiments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The $10/mo Digital Ocean 1GB/1vCPU &lt;a href="https://docs.digitalocean.com/products/app-platform/"&gt;App Platform&lt;/a&gt; instance hosts this project perfectly. The first deployment takes 20+ minutes, but it will eventually start up. You can verify the service is running via the health check endpoint at &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This machine takes about 600ms per request and runs 2 workers, so it can handle about 0.8 requests per second, or roughly 72,000 images per day. Not too shabby for a $10/mo ML microservice.&lt;/p&gt;

&lt;p&gt;Sample App Platform config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nsfw-flask&lt;/span&gt;
&lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nyc&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;environment_slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python&lt;/span&gt;
  &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
    &lt;span class="na"&gt;deploy_on_push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KevinColemanInc/NSFW-FLASK&lt;/span&gt;
  &lt;span class="na"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;http_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
  &lt;span class="na"&gt;http_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;instance_count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;instance_size_slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;basic-s&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nsfw-flask&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
  &lt;span class="na"&gt;run_command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gunicorn --worker-tmp-dir /dev/shm app:app&lt;/span&gt;
  &lt;span class="na"&gt;source_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Special Thanks
&lt;/h3&gt;

&lt;p&gt;The Flask service is a wrapper around &lt;a href="https://github.com/GantMan/nsfw_model"&gt;GantMan/nsfw_model&lt;/a&gt;, which did the heavy lifting of developing the ML model and the prediction code.&lt;/p&gt;

&lt;p&gt;You can play with a web-hosted version of the model on &lt;a href="https://nsfwjs.com"&gt;nsfwjs.com&lt;/a&gt;, since it uses the same model.&lt;/p&gt;

</description>
      <category>howto</category>
      <category>digitalocean</category>
      <category>machinelearning</category>
      <category>nsfw</category>
    </item>
    <item>
      <title>Designing a distributed web crawler</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Sun, 05 Jul 2020 21:49:57 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/designing-a-distributed-web-crawler-2dp2</link>
      <guid>https://dev.to/kevincolemaninc/designing-a-distributed-web-crawler-2dp2</guid>
      <description>&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Design a web crawler that fetches every page on en.wikipedia.org exactly once. You have 10,000 servers you can use, and you are not allowed to fetch a URL more than once. If a URL fails to be fetched (because of a timeout or server failure), it can be discarded.&lt;/p&gt;

&lt;h3&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Related Companies
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Facebook.com (Interview question)&lt;/li&gt;
&lt;li&gt;Wikipedia.org (Example website to crawl)&lt;/li&gt;
&lt;li&gt;Archive.org&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Topics Discussed
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hashing&lt;/li&gt;
&lt;li&gt;Distributed Systems&lt;/li&gt;
&lt;li&gt;Consistent Hashing&lt;/li&gt;
&lt;li&gt;Bloom Filter&lt;/li&gt;
&lt;li&gt;Trie Data Structures&lt;/li&gt;
&lt;li&gt;Consumer Groups (Kafka)&lt;/li&gt;
&lt;li&gt;Paxos&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Functional
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Download all (6.5m) URLs from en.wikipedia.org&lt;/li&gt;
&lt;li&gt;  Only download each URL &lt;strong&gt;&lt;span&gt;once&lt;/span&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Using 10k 2-core servers&lt;/li&gt;
&lt;li&gt;  The only processing of the content is extracting URLs; otherwise, persist the content to local storage&lt;/li&gt;
&lt;li&gt;  Don't crawl images&lt;/li&gt;
&lt;li&gt;  Only crawl English Wikipedia&lt;/li&gt;
&lt;li&gt;  Minimize network traffic between each server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Non-functional
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  No need to self-rate limit&lt;/li&gt;
&lt;li&gt;  Fetch the content as fast as possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  High-level Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much storage will we need to store 6,500,000 URLs and their HTML documents?
&lt;/h3&gt;

&lt;p&gt;The average internet URL length is 66 characters. Since we don't need to track the domain name or HTTPS prefix, we will round down to 60 characters.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;60 characters = 60 bytes

60 bytes * 6,500,000 URLs = 390,000,000 bytes, or 390 Megabytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The average HTML page is about 100kb.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 kilobytes * 6,500,000 documents = 650 Gigabytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  How do we want to store the URLs?
&lt;/h3&gt;

&lt;p&gt;Similar to URL-shortening system design problems, the most practical way of storing the URLs we have visited is a set, which gives O(1) lookups. While the hash-based set will consume more memory than a &lt;a href="https://en.wikipedia.org/wiki/Trie"&gt;Trie Data Structure&lt;/a&gt; or binary search tree, since we store each full URL, the lookups will be much faster (O(1) vs O(L), where L is the length of the URL), and the additional memory cost is manageable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where can we store the URLs?
&lt;/h3&gt;

&lt;p&gt;390 Megabytes for all URLs can easily be stored in RAM, meaning we can get away with using an in-memory RAM solution for managing which URLs we have tracked.&lt;/p&gt;

&lt;p&gt;650 Gigabytes is more than we can cost-effectively store in RAM on a single server. If we had to keep all the documents on one machine, we would need to write to the local hard drive. But because we have 10,000 servers, we can evenly distribute the documents, so each server only needs to store about 65 MB of HTML (650 GB / 10,000), which easily fits in RAM at a reasonable price.&lt;/p&gt;
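&lt;p&gt;We can sanity-check the storage math:&lt;/p&gt;

```python
docs = 6_500_000
avg_doc_bytes = 100 * 1_000   # ~100 kb per HTML page
num_servers = 10_000

total_bytes = docs * avg_doc_bytes             # 650 GB in total
per_server_bytes = total_bytes // num_servers  # 65 MB per server
```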
&lt;h3&gt;
  
  
  Where will be the limitations? CPU? Disk? Network?
&lt;/h3&gt;

&lt;p&gt;CPU: The most expensive task for the CPU will be extracting URLs from the HTML documents we have crawled. This should take less than 1ms per document.&lt;/p&gt;

&lt;p&gt;Disk: As mentioned above, we probably don't need to be writing to disk at all since the documents, when distributed across the 10k servers, will fit into memory.&lt;/p&gt;

&lt;p&gt;Network: Round trip to wikipedia.org for a single document may take ~200ms depending on their load and the distance our servers will be from theirs.&lt;/p&gt;

&lt;p&gt;This will be a network-bound task, with the opportunity to use our CPUs to parse the already-fetched HTML documents for URLs while we wait for network responses.&lt;/p&gt;
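&lt;p&gt;One way to exploit that overlap is a thread pool: threads blocked on network I/O release the interpreter, letting other threads keep working. A sketch with hypothetical stubbed-out &lt;code&gt;download&lt;/code&gt; and &lt;code&gt;extract_urls&lt;/code&gt; functions:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # Stub for the ~200ms network round trip to wikipedia.org.
    return "page for " + url

def extract_urls(page):
    # Stub for the CPU-bound URL extraction step.
    return []

urls = ["/wiki/page_%d" % i for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # While some workers wait on the network, others keep the CPU busy.
    pages = list(pool.map(download, urls))

found = [u for page in pages for u in extract_urls(page)]
```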
&lt;h2&gt;
  
  
  Design Options
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option 1: Single server
&lt;/h3&gt;

&lt;p&gt;We will start simple and expand to maximize the resources of the problem.&lt;/p&gt;

&lt;p&gt;A naive approach would be for a single server to fetch a URL, extract the URLs from the document, and then fetch the next URL.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"https://wikipedia.org"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;URLs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;URLs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Follow up questions they may ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do we know how many URLs we can safely fetch from one server at a time?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this, we will need to experiment with timeouts to determine when we are rate limited. Systems typically throttle too many connections coming from a single IP address to prevent abuse of the service.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Does not utilize the 10k servers&lt;/li&gt;
&lt;li&gt;Wastes CPU cycles waiting for the web request to complete&lt;/li&gt;
&lt;li&gt;A server failure results in complete data loss&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Option 2: Distributed Systems
&lt;/h3&gt;

&lt;p&gt;Assigning each URL to a specific server lets each server manage which URLs need to be fetched or have already been fetched. Each server gets its own id number, from 0 to 9,999. Hashing each URL and taking the hash modulo 10,000 determines the id of the server responsible for fetching that URL.&lt;/p&gt;

&lt;p&gt;In a master/slave design, a single master server could map the server ids to specific IP addresses. Since the problem asks us to reduce network traffic, we can either pre-configure each server with an id-to-IP-address mapping or rely on a DNS server to map hostnames to IP addresses (e.g. 123.crawler.company.com points to server 123). If there is a failure and a server id needs to be assigned to a new server, the DNS record would be updated to point to the new healthy server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;server_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/wiki/The_Entire_History_of_You"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;

&lt;span class="c1"&gt;##  Directly talk to the server
&lt;/span&gt;&lt;span class="n"&gt;server_ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_to_ip_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;server_num&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;##  Using a DNS server
&lt;/span&gt;&lt;span class="n"&gt;server_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;f'http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.crawler.company.com'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Since every URL will be uniquely assigned to a single server number, each server will internally track which URLs it has already crawled, just like the single server design. The single server design uses a &lt;code&gt;set&lt;/code&gt;, but we could also use a Bloom Filter.&lt;/p&gt;
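&lt;p&gt;A Bloom filter answers "definitely not seen" or "probably seen", trading a small false-positive rate for far less memory than a set of full URLs. A minimal sketch (the sizes and hash count are illustrative, not tuned for 6.5m URLs):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    # Minimal sketch; m_bits and k are illustrative choices. A real
    # implementation packs the positions into a bit array instead of
    # a Python set, so memory stays fixed at m_bits regardless of load.
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.positions = set()

    def _hashes(self, item):
        # k independent hash positions derived via salted blake2b.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode())
            yield int.from_bytes(h.digest()[:8], "big") % self.m

    def add(self, item):
        self.positions.update(self._hashes(item))

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(p in self.positions for p in self._hashes(item))

crawled = BloomFilter()
crawled.add("/wiki/Consistent_hashing")
```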

&lt;h4&gt;
  
  
  APIs
&lt;/h4&gt;

&lt;p&gt;Each server will need to implement an API to receive a set of URLs that the other servers find in their pages. We can use JSON and REST to route these requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /fetch_URLs HTTP/1.1
Host: 2.crawler.company.com
Body:
{
  URLs: ["/wiki/Consistent_hashing", "/wiki/Hash_table"]
}

Response: 202
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The URLs attribute should be a unique list of URLs found in the HTML documents that the sending server fetched. We should avoid sending one web request per discovered URL, because each network request has overhead; instead, we can collect URLs and send them to the other machines in batches of 100. If we sent URLs to the other machines as we extracted them, each document could trigger a separate network request for every URL it contains.&lt;/p&gt;
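&lt;p&gt;A sketch of the routing and batching logic described above. The function names are my own, and I use md5 rather than Python's built-in &lt;code&gt;hash()&lt;/code&gt;, since the built-in is randomized per process and every server must agree on the URL-to-server mapping:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_SERVERS = 10_000
BATCH_SIZE = 100

def server_for(url, num_servers=NUM_SERVERS):
    # Stable hash so every machine agrees which server owns a URL.
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def route(urls):
    # Group the extracted URLs by the server responsible for them.
    by_server = defaultdict(list)
    for url in urls:
        by_server[server_for(url)].append(url)
    return by_server

def batched(urls, size=BATCH_SIZE):
    # Yield fixed-size batches to amortize per-request overhead.
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```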

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hXzmIOm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/distributed-web-crawler.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hXzmIOm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/distributed-web-crawler.png" alt="Distributed Web Crawler Design Flowchart" title="Distributed Web Crawler Design Flowchart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://docs.google.com/presentation/d/1BOWsk4L68pDx7u1GpXF-NqOTejrAffsJmdolQVCnO3Q/edit#slide=id.p"&gt;Source&lt;/a&gt;]&lt;/p&gt;

&lt;h4&gt;
  
  
  Follow up questions:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;How can you distribute the URLs if a portion of the 10k servers lose power while the crawl is happening?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Borrowing from &lt;a href="https://kafka.apache.org/documentation/#design"&gt;Kafka's system design&lt;/a&gt;, we can use the concept of "consumer groups". Instead of sending the URLs to be fetched to a single machine, we could divide the 10,000 servers into groups that are collectively responsible for managing the URLs assigned to their group. One machine would receive the URLs to be fetched, and a consensus algorithm like Paxos would decide which machine within the group fetches each URL.&lt;/p&gt;

&lt;p&gt;If an entire group fails, we can use the technique called "consistent hashing" to evenly redistribute the failed group's load across the remaining groups. When a URL is found, it is hashed onto the same ring as the groups and assigned to the nearest group on the ring.&lt;/p&gt;
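&lt;p&gt;The useful property of consistent hashing here is that when a group disappears, only the URLs that belonged to that group get remapped. A minimal ring sketch (the virtual-node count is an illustrative choice):&lt;/p&gt;

```python
import bisect
import hashlib

def _h(key):
    # Stable 64-bit position on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, groups, vnodes=64):
        # Each group gets several virtual nodes to smooth out the load.
        self.ring = sorted((_h("%s:%d" % (g, v)), g)
                           for g in groups for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def group_for(self, url):
        # Walk clockwise to the first virtual node at or past the URL's hash.
        i = bisect.bisect(self.keys, _h(url)) % len(self.ring)
        return self.ring[i][1]
```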

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V2roJG80--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/consistent-hashing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V2roJG80--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/consistent-hashing.png" alt="Example of how Consistent Hashing" title="Example of how Consistent Hashing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://docs.google.com/presentation/d/1BOWsk4L68pDx7u1GpXF-NqOTejrAffsJmdolQVCnO3Q/edit#slide=id.p"&gt;Source&lt;/a&gt;]&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Fully utilizes all 10,000 machines&lt;/li&gt;
&lt;li&gt;  Minimizes network activity&lt;/li&gt;
&lt;li&gt;  Fetches each URL once :)&lt;/li&gt;
&lt;li&gt;  Handles distributed system failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Despite randomly assigning URLs to each group, a single group may get unlucky and be assigned either a disproportionately large number of URLs or URLs that are larger and take longer to parse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h4&gt;

&lt;h2&gt;
  
  
  Additional Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://dzone.com/articles/decoded-examples-of-how-hashing-algorithms-work"&gt;Decoded: Examples of How Hashing Algorithms Work&lt;/a&gt; (blog)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=zaRkONvyGr8"&gt;What is Consistent Hashing and Where is it used?&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=bEmBh1HtYrw"&gt;Bloom Filters&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.geeksforgeeks.org/trie-insert-and-search/"&gt;Trie Data Structures&lt;/a&gt; (GeeksForGeeks)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://blog.cloudera.com/scalability-of-kafka-messaging-using-consumer-groups/"&gt;Scalability of Kafka Messaging using Consumer Groups&lt;/a&gt; (blog)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=d7nAGI_NZPk"&gt;The Paxos Algorithm&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>career</category>
      <category>python</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How and when to add foreign key constraints</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Sun, 14 Apr 2019 15:33:12 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/how-and-when-to-add-foreign-key-constraints-1aji</link>
      <guid>https://dev.to/kevincolemaninc/how-and-when-to-add-foreign-key-constraints-1aji</guid>
      <description>&lt;p&gt;Many rails projects rely on application validation to ensure data integrity. With rails &lt;code&gt;presence&lt;/code&gt; validation, you can require associations exist in order for an object to be saved. This works until it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not having foreign key constraints
&lt;/h2&gt;

&lt;p&gt;If a developer forgets to define a dependency option on a Rails association (e.g. &lt;code&gt;has_many :users, dependent: :nullify&lt;/code&gt;), uses &lt;code&gt;#delete&lt;/code&gt; instead of &lt;code&gt;#destroy&lt;/code&gt;, or manually deletes a record via a query, the associated rows will point to missing records. This isn't ideal, because now you can't reliably test whether an association exists by checking whether the id is present.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# good&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"company exists!"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company_id&lt;/span&gt;

&lt;span class="c1"&gt;# bad - N+1 to load the company&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"company exists!"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the company is deleted, but the users are not deleted with the company, you might be accidentally invalidating your models, thus preventing you from saving any attribute changes!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;required: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s1"&gt;'kevin'&lt;/span&gt; &lt;span class="c1"&gt;# false - company is missing&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;wat. I can't save the name if the company isn't there?&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding foreign key constraints
&lt;/h2&gt;

&lt;p&gt;In Rails 5, support for foreign key constraints was added so the database can protect the integrity of associated data. Once a foreign key constraint is defined, your database will not allow you to remove records that are required by other tables.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="n"&gt;add_foreign_key&lt;/span&gt; &lt;span class="ss"&gt;:users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:companies&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
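&lt;p&gt;To see the database-level behavior outside of Rails, here is a self-contained sketch using Python's sqlite3 (the table names mirror the example above; note that SQLite only enforces foreign keys when switched on per connection):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite requires opting in to foreign key enforcement per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE users ("
    "  id INTEGER PRIMARY KEY,"
    "  company_id INTEGER NOT NULL REFERENCES companies(id))"
)
conn.execute("INSERT INTO companies (id) VALUES (1)")
conn.execute("INSERT INTO users (id, company_id) VALUES (1, 1)")

try:
    conn.execute("DELETE FROM companies WHERE id = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False  # the database refused: a user still references it
```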

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FKevinColemanInc%2Fyeet_dba%2Fmaster%2Fyeet_dba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FKevinColemanInc%2Fyeet_dba%2Fmaster%2Fyeet_dba.png" alt="logo of the yeet_dba gem" title="logo of the yeet_dba gem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick way to add foreign key constraints to your entire Rails schema is to use the &lt;a href="https://github.com/kevincolemaninc/yeet_dba" rel="noopener noreferrer"&gt;yeet_dba&lt;/a&gt; gem. &lt;code&gt;yeet_dba&lt;/code&gt; includes rake tasks and generators that scan your entire database for columns missing foreign key constraints. If the data is valid, it adds the foreign key constraint; if the data is invalid, it helps you resolve the problem.&lt;/p&gt;

&lt;p&gt;By adding foreign key constraints to your database, you reduce N+1 queries, improve join and where-query performance, and prevent unexpected failures caused by missing associations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/plentz/lol_dba" rel="noopener noreferrer"&gt;lol_dba&lt;/a&gt; - This gem helps find indexes that are missing, but not foreign keys.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.postgresqltutorial.com/postgresql-foreign-key/" rel="noopener noreferrer"&gt;Postgres foreign key guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kcoleman.me/2019/03/14/how-to-add-foreign-key-constraints-rails.html" rel="noopener noreferrer"&gt;Rails foreign key constraints&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rails</category>
      <category>showdev</category>
      <category>tutorial</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Fraud Detection with Ruby on Rails</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Thu, 07 Feb 2019 11:11:01 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/fraud-detection-with-ruby-on-rails-4751</link>
      <guid>https://dev.to/kevincolemaninc/fraud-detection-with-ruby-on-rails-4751</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Fscammer.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Fscammer.jpg" alt="Man in a mask" title="Man in a mask"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a scammer, please don't read this. Everyone else, keep going. :)&lt;/p&gt;

&lt;p&gt;I created AvoVietnam, a React Native dating app &lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;to connect Vietnamese women with foreign men&lt;/a&gt;. I have had an influx of registrations from scammers trying to defraud my female Vietnamese users. &lt;a href="https://vietnamnews.vn/society/372275/scams-break-womens-hearts-bank-accounts.html" rel="noopener noreferrer"&gt;Their basic strategy&lt;/a&gt; is to get a girl to trust them with promises of love and marriage. They tell their victim they want to protect and take care of her. The scammer offers to send her some money to buy a safe car or even a house, but of course, there is a transfer fee the girl must pay.&lt;/p&gt;

&lt;p&gt;Most of these scammers are located in North and West Africa. They upload attractive photos of Western men with jobs like airline pilot or military captain, often posing with a cute puppy, but their GPS and IP addresses say they live in a shack in Nigeria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Ffake-profile.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Ffake-profile.png" alt="Fake profile on AvoVietnam" title="Fake profile on AvoVietnam"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am going to go over a few of my techniques for stopping low-tech scammers from reaching my users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow banning
&lt;/h2&gt;

&lt;p&gt;When a user is marked as a scammer, rather than kicking them off the platform and signaling that they have been caught, I shadow ban their account. A shadow-banned account's profile and messages are hidden from all other users, and the account can only see a static list of fake profiles. The scammer will think everyone is ignoring them, or that there simply are not many active users on the app.&lt;/p&gt;
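&lt;p&gt;The visibility rule behind shadow banning can be sketched in a few lines. This is only an illustration with a hypothetical &lt;code&gt;Profile&lt;/code&gt; stand-in, not my actual model; in Rails it would typically be an ActiveRecord scope.&lt;/p&gt;

```ruby
# Minimal sketch of the shadow-ban visibility rule (illustration only).
# In a Rails app this would be a scope on the real model, e.g.
#   scope :visible, -> { where(shadow_banned_reason: nil) }
Profile = Struct.new(:name, :shadow_banned_reason)

def visible_profiles(profiles)
  # Shadow-banned accounts are silently filtered out of everyone else's results.
  profiles.reject { |p| p.shadow_banned_reason }
end

profiles = [Profile.new('alice', nil), Profile.new('scammer', :proxy)]
visible_profiles(profiles).map { |p| p.name }  # ['alice']
```

&lt;p&gt;The banned account itself still receives normal-looking responses (the static list of fake profiles), so nothing signals the ban.&lt;/p&gt;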

&lt;p&gt;Stopping these scammers is like playing whack-a-mole, so I want to slow them down as much as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  App store location
&lt;/h2&gt;

&lt;p&gt;Both Apple and Google run a separate store for each country in order to abide by local regulations and laws. Since most of my scammers seem to come from North and West Africa, I removed my app from essentially every African store (I left South Africa listed), so they need an account configured for a different country's store just to download my app.&lt;/p&gt;

&lt;h2&gt;
  
  
  IP address location
&lt;/h2&gt;

&lt;p&gt;When an account accesses the API, I save the IP address to a separate table and fire off two &lt;a href="https://github.com/mperham/sidekiq" rel="noopener noreferrer"&gt;Sidekiq workers&lt;/a&gt; to collect information about it. The first worker looks up the country of the IP address. Using the &lt;a href="https://github.com/hexorx/countries" rel="noopener noreferrer"&gt;countries gem&lt;/a&gt;, I can easily identify which countries belong to Africa and shadow ban accounts from them.&lt;/p&gt;
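&lt;p&gt;The country check itself is tiny. The sketch below hardcodes a deliberately partial set of ISO codes purely for illustration; the countries gem can supply the full region data (every country whose region is "Africa") instead of a hand-maintained list.&lt;/p&gt;

```ruby
require 'set'

# Partial, illustrative list of ISO 3166 codes; a real app would derive
# the full set from the countries gem rather than hardcoding it.
SHADOW_BAN_CODES = %w[NG GH SN CI CM BJ TG ML].to_set

def shadow_ban_country?(country_code)
  SHADOW_BAN_CODES.include?(country_code)
end

shadow_ban_country?('NG')  # true
shadow_ban_country?('ZA')  # false: South Africa stays allowed
```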

&lt;p&gt;I do not use a geocoder gem because I want to keep my Ruby on Rails application as small as possible. You can easily call the &lt;a href="https://ipstack.com" rel="noopener noreferrer"&gt;ipstack&lt;/a&gt; API with a &lt;code&gt;Net::HTTP&lt;/code&gt; request in a few lines of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GeocodeIpWorker&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationJob&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;IpAddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;present?&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reverse_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'country_name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'country_code'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"http://api.ipstack.com/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;?access_key=xxxx"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'200'&lt;/span&gt;
    &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few weeks, I noticed the scammers started using USA-based proxy addresses to fake their locations, thus avoiding my automatic detection. Unfortunately for them, there are many free services that will tell you whether someone is accessing your service through a proxy. I push that check to a Sidekiq worker as well. If a user is trying to hide their location, bye-bye. Again, no new gems needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CheckForProxyJob&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationJob&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;IpAddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nil?&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proxy_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt; &lt;span class="ss"&gt;shadow_banned_reason: :proxy&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;proxy_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"http://v2.api.iphub.info/ip/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'X-Key'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'200'&lt;/span&gt;
    &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GPS location
&lt;/h2&gt;

&lt;p&gt;Since my &lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;dating app for Vietnamese girls and foreigners&lt;/a&gt; is a mobile app, I sometimes have access to the phone's GPS location. I don't require it to use the app, but I do ask for it for fraud detection and better location matching. Most people, including the scammers, are comfortable sharing their GPS location with a dating app. On Android it is easy to fake your GPS location using "Developer mode," but if you do reveal your location and you are in an African country, you will automatically be shadow banned.&lt;/p&gt;

&lt;p&gt;Looking up the country for a lat-long pair with an API was simple to run in another Sidekiq worker. I won't include that code here; it looks very similar to the previous workers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Banning WhatsApp numbers
&lt;/h2&gt;

&lt;p&gt;The scammers always try to move the conversation off the platform to prevent administrators from seeing their malicious activity. When a profile is shadow banned, I scan through every message it has ever sent to find any WhatsApp or Zalo phone number the scammer might be using to message the women.&lt;/p&gt;

&lt;p&gt;If I see a user sharing a banned number with another user, I automatically shadow ban their account. Once caught, the scammer needs to create a completely new WhatsApp account.&lt;/p&gt;
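&lt;p&gt;The message scan boils down to pulling candidate phone numbers out of free text and comparing them against the banned set. The regex below is a loose illustration, not my production pattern.&lt;/p&gt;

```ruby
# Loose, illustrative phone-number pattern (not the production regex):
# an optional +, then 9-16 digits possibly broken up by spaces or punctuation.
PHONE_PATTERN = /\+?\d[\d\s().-]{7,14}\d/

# Normalize matches down to bare digit strings for comparison.
def extract_phone_numbers(text)
  text.scan(PHONE_PATTERN).map { |n| n.gsub(/\D/, '') }
end

def shares_banned_number?(text, banned_numbers)
  extract_phone_numbers(text).any? { |n| banned_numbers.include?(n) }
end

shares_banned_number?('add me on zalo +84 90 123 4567', ['84901234567'])  # true
```

&lt;p&gt;Normalizing to bare digits means spacing or punctuation tricks don't help the scammer evade the ban list.&lt;/p&gt;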

&lt;h2&gt;
  
  
  Banning the device
&lt;/h2&gt;

&lt;p&gt;To prevent scammers from re-registering through an undetectable proxy, I generate a UUID and store it on the device's filesystem. When a user tries to register twice, I receive the same device UUID as in the first registration. They would need to clear the app's data or reinstall the application to get a new device ID. Apple and Google used to give you access to the device's MAC address, which is impossible to change, but due to recent privacy concerns they no longer consistently expose that API.&lt;/p&gt;
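&lt;p&gt;Server-side, the device check is just a uniqueness test on the UUID the client generates once and persists. The sketch below uses an in-memory set in place of a real devices table, purely as an illustration.&lt;/p&gt;

```ruby
require 'securerandom'
require 'set'

# In-memory stand-in for a devices table (illustration only).
class DeviceRegistry
  def initialize
    @known_device_ids = Set.new
  end

  # Returns true on a first-time device, false when the same UUID shows up
  # again, i.e. a re-registration attempt from a known device.
  def register(device_uuid)
    return false if @known_device_ids.include?(device_uuid)
    @known_device_ids.add(device_uuid)
    true
  end
end

registry = DeviceRegistry.new
uuid = SecureRandom.uuid   # generated once on the device, then persisted
registry.register(uuid)    # true: first registration
registry.register(uuid)    # false: same device trying again
```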

&lt;p&gt;&lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;&lt;br&gt;
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Favovietnam-feature.png" alt="AvoVietnam banner" title="AvoVietnam - serious relationships with Vietnamese"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;With this auto-shadow banning enabled, scammers will see an app full of fake users. Hopefully, they will continue on their merry way and stop making new accounts. AvoVietnam is free to chat between users, but if you're interested in free &lt;a href="https://www.avovietnam.com/faq" rel="noopener noreferrer"&gt;AvoVietnam Gold&lt;/a&gt;, which lets you send photos and appear at the top of the message inbox, shoot an email to &lt;a href="mailto:marketing@avovietnam.com"&gt;marketing@avovietnam.com&lt;/a&gt; with your account's email and we will hook you up with one free week.&lt;/p&gt;

&lt;p&gt;If you have more suggestions on how to detect malicious users, send an email to &lt;a href="mailto:dev@avovietnam.com"&gt;dev@avovietnam.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rails</category>
      <category>security</category>
      <category>geoip</category>
      <category>ruby</category>
    </item>
  </channel>
</rss>
