<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kevin Coleman</title>
    <description>The latest articles on DEV Community by Kevin Coleman (@kevincolemaninc).</description>
    <link>https://dev.to/kevincolemaninc</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F98543%2F8c15c754-6eaf-4054-ad18-8c7bb4ba8daf.png</url>
      <title>DEV Community: Kevin Coleman</title>
      <link>https://dev.to/kevincolemaninc</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kevincolemaninc"/>
    <language>en</language>
    <item>
      <title>How To: NSFW Image detection on Digital Ocean Apps</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Tue, 08 Jun 2021 16:16:40 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/how-to-nsfw-image-detection-on-digital-ocean-apps-28gi</link>
      <guid>https://dev.to/kevincolemaninc/how-to-nsfw-image-detection-on-digital-ocean-apps-28gi</guid>
      <description>&lt;h2&gt;
  
  
  NSFW porn detection microservice
&lt;/h2&gt;

&lt;p&gt;I built a low-cost NSFW API hosted on &lt;a href="https://docs.digitalocean.com/products/app-platform/"&gt;Digital Ocean's new App Platform&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do image tagging ML models work?
&lt;/h3&gt;

&lt;p&gt;Making predictions based on images involves two basic steps: training the model and then running the prediction. Instructions for training the ML model can be found in the GitHub repo: &lt;a href="https://github.com/GantMan/nsfw_model#training-folder-contents"&gt;GantMan/nsfw_model&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The prediction API first fetches the remote image and saves the bytes to disk. Persisting to disk simplifies communicating with the ML library since the library accepts a file path, not a byte stream.&lt;/p&gt;

&lt;p&gt;Then the image is resized to fit the dimensions the ML model expects. The algorithm needs to compare apples to apples, so resizing to match the dimensions of the training images is critical for a valid comparison.&lt;/p&gt;

&lt;p&gt;The resized image is categorized using the &lt;a href="https://github.com/KevinColemanInc/NSFW-FLASK/tree/master/mobilenet_v2_140_2240"&gt;attached model&lt;/a&gt;. This produces a float score for each of the categories: drawings, hentai, neutral, porn, and sexy. The higher the score, the more likely the image belongs to that category.&lt;/p&gt;

&lt;p&gt;Once the prediction is made, we clean up after ourselves by deleting the image from disk and returning the response.&lt;/p&gt;
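&lt;p&gt;The whole fetch, persist, predict, clean-up flow fits in a few lines. Below is a minimal sketch of that flow, not the service's actual code: &lt;code&gt;fetch_image&lt;/code&gt; and &lt;code&gt;predict_from_path&lt;/code&gt; are hypothetical stubs standing in for the HTTP fetch and the ML library call.&lt;/p&gt;

```python
import os
import tempfile

def fetch_image(url):
    # Hypothetical stub for the real HTTP fetch of the remote image.
    return b"\x89PNG fake image bytes"

def predict_from_path(path):
    # Hypothetical stub for the ML library call, which takes a file
    # path rather than a byte stream.
    return {"drawings": 0.12, "hentai": 0.02, "neutral": 0.80,
            "porn": 0.02, "sexy": 0.04}

def predict(url):
    data = fetch_image(url)
    # Persist to disk because the ML library wants a file path.
    tmp = tempfile.NamedTemporaryFile(suffix=".jpg", delete=False)
    try:
        tmp.write(data)
        tmp.close()
        return predict_from_path(tmp.name)
    finally:
        os.unlink(tmp.name)  # clean up the temp file after predicting

scores = predict("https://www.kcoleman.me/images/hills.jpg")
```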

&lt;p&gt;On the client, these scores are converted to 3 states:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Definitely Adult Content&lt;/li&gt;
&lt;li&gt;Unknown&lt;/li&gt;
&lt;li&gt;Definitely Safe Content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The unknown state will need to be human-reviewed and bucketed into one of the "definite" categories. For my first pass, I use a combination of the "sexy" and "porn" scores to determine if an image is "Definitely Adult Content," and I look at the "neutral" score to decide if it is "Definitely Safe Content."&lt;/p&gt;
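&lt;p&gt;The client-side bucketing can be written as a tiny function. This is a hypothetical sketch: the threshold values below are stand-ins, not the app's exact cutoffs.&lt;/p&gt;

```python
def moderation_state(scores, adult_threshold=0.7, safe_threshold=0.7):
    # Combine "sexy" and "porn" for the adult check, as described above.
    # Thresholds are illustrative assumptions, not the real app's values.
    if scores.get("porn", 0.0) + scores.get("sexy", 0.0) >= adult_threshold:
        return "definitely_adult"
    # A high "neutral" score marks the image as definitely safe.
    if scores.get("neutral", 0.0) >= safe_threshold:
        return "definitely_safe"
    return "unknown"  # needs human review

print(moderation_state({"neutral": 0.80, "porn": 0.02, "sexy": 0.04}))  # definitely_safe
```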

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;

&lt;p&gt;Self-hosting this service only takes a couple of hours, since the API is so simple and Digital Ocean's App Platform allows for Heroku-like deployment.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flask API
&lt;/h4&gt;

&lt;p&gt;You will need to develop your own client, but there are only two HTTP endpoints to implement against: POST &lt;code&gt;/predict&lt;/code&gt; and GET &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  POST /predict
&lt;/h4&gt;

&lt;p&gt;The service accepts a URL of an image to fetch and process. Passing a URL instead of the image bytes reduces the workload on the client and avoids the overhead of base64-encoding images for the transfer (&lt;a href="https://lemire.me/blog/2019/01/30/what-is-the-space-overhead-of-base64-encoding/"&gt;base64 adds ~33% space overhead&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$curl -XPOST 'http://localhost:8080/predict?url=https://www.kcoleman.me/images/hills.jpg'

{"drawings":0.11510543525218964,"hentai":0.024719053879380226,"neutral":0.803202748298645,"porn":0.0172234196215868,"sexy":0.039749305695295334}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  GET /health
&lt;/h4&gt;

&lt;p&gt;The health endpoint helps you monitor if the service is running without needing to process an image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ curl 'http://localhost:8080/health'

{"status":"ok"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hosting ML microservices
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Heroku
&lt;/h4&gt;

&lt;p&gt;Unfortunately, Heroku &lt;a href="https://devcenter.heroku.com/changelog-items/1145"&gt;limits the slug size to 500MB&lt;/a&gt;. After compilation, the Flask app is 635MB (it has to bundle the 250MB ML model plus the ML framework), which makes it impossible to host this service on Heroku.&lt;/p&gt;

&lt;h4&gt;
  
  
  Digital Ocean
&lt;/h4&gt;

&lt;p&gt;&lt;a href="/images/digitalocean-nsfw-flask.png" class="article-body-image-wrapper"&gt;&lt;img src="/images/digitalocean-nsfw-flask.png" alt="row of hot dogs with various sauces and condiments" title="row of hot dogs with various sauces and condiments"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The $10/mo Digital Ocean 1GB/1vCPU &lt;a href="https://docs.digitalocean.com/products/app-platform/"&gt;App Platform&lt;/a&gt; instance hosts this project perfectly. The first deployment takes 20+ minutes, but it will eventually start up. You can verify the service is running via the health check endpoint at &lt;code&gt;/health&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This machine takes about 600ms per request and runs 2 workers, so it can handle about 0.8 requests per second, or roughly 72,000 images per day. Not too shabby for a $10/mo ML microservice.&lt;/p&gt;

&lt;p&gt;Sample App Platform config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nsfw-flask&lt;/span&gt;
&lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nyc&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;environment_slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python&lt;/span&gt;
  &lt;span class="na"&gt;github&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
    &lt;span class="na"&gt;deploy_on_push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KevinColemanInc/NSFW-FLASK&lt;/span&gt;
  &lt;span class="na"&gt;health_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;http_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
  &lt;span class="na"&gt;http_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;instance_count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;instance_size_slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;basic-s&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nsfw-flask&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
  &lt;span class="na"&gt;run_command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gunicorn --worker-tmp-dir /dev/shm app:app&lt;/span&gt;
  &lt;span class="na"&gt;source_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Special Thanks
&lt;/h3&gt;

&lt;p&gt;The Flask service is a wrapper around &lt;a href="https://github.com/GantMan/nsfw_model"&gt;GantMan/nsfw_model&lt;/a&gt;, which did the heavy lifting of developing the ML model and the prediction code.&lt;/p&gt;

&lt;p&gt;You can play with a web-hosted version of the model on &lt;a href="https://nsfwjs.com"&gt;nsfwjs.com&lt;/a&gt;, since it uses the same model.&lt;/p&gt;

</description>
      <category>howto</category>
      <category>digitalocean</category>
      <category>machinelearning</category>
      <category>nsfw</category>
    </item>
    <item>
      <title>Designing a distributed web crawler</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Sun, 05 Jul 2020 21:49:57 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/designing-a-distributed-web-crawler-2dp2</link>
      <guid>https://dev.to/kevincolemaninc/designing-a-distributed-web-crawler-2dp2</guid>
      <description>&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Design a web crawler that fetches every page on en.wikipedia.org exactly once. You have 10,000 servers you can use, and you are not allowed to fetch a URL more than once. If a URL fails to be fetched (because of a timeout or server failure), it can be discarded.&lt;/p&gt;

&lt;h3&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Related Companies
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Facebook.com (Interview question)&lt;/li&gt;
&lt;li&gt;Wikipedia.org (Example website to crawl)&lt;/li&gt;
&lt;li&gt;Archive.org&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Topics Discussed
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hashing&lt;/li&gt;
&lt;li&gt;Distributed Systems&lt;/li&gt;
&lt;li&gt;Consistent Hashing&lt;/li&gt;
&lt;li&gt;Bloom Filter&lt;/li&gt;
&lt;li&gt;Trie Data Structures&lt;/li&gt;
&lt;li&gt;Consumer Groups (Kafka)&lt;/li&gt;
&lt;li&gt;Paxos&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Functional
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  Download all (6.5m) URLs from en.wikipedia.org&lt;/li&gt;
&lt;li&gt;  Only download each URL &lt;strong&gt;&lt;span&gt;once&lt;/span&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;  Using 10k 2-core servers&lt;/li&gt;
&lt;li&gt;  The only processing of the content is extracting URLs; otherwise, persist the content to local storage&lt;/li&gt;
&lt;li&gt;  Don't crawl images&lt;/li&gt;
&lt;li&gt;  Only crawl English Wikipedia&lt;/li&gt;
&lt;li&gt;  Minimize network traffic between each server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Non-functional
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  No need to self-rate limit&lt;/li&gt;
&lt;li&gt;  Fetch the content as fast as possible&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  High-level Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How much storage will we need to store 6,500,000 URLs and their HTML documents?
&lt;/h3&gt;

&lt;p&gt;The average internet URL length is 66 characters. Since we don't need to track the domain name or HTTPS prefix, we will round down to 60 characters.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;60 characters = 60 bytes

60 bytes * 6,500,000 URLs = 390,000,000 bytes, or 390 Megabytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The average HTML page is about 100kb.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 kilobytes * 6,500,000 documents = 650 Gigabytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  How do we want to store the URLs?
&lt;/h3&gt;

&lt;p&gt;Similar to URL-shortening system design problems, the most practical way of storing the URLs we have visited is a set, which gives O(1) lookups. While the hash-based set will consume more memory than a &lt;a href="https://en.wikipedia.org/wiki/Trie"&gt;Trie Data Structure&lt;/a&gt; or binary search tree, since we store each full URL, the lookups will be much faster (O(1) vs O(L), where L is the length of the URL), and the additional memory cost is manageable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Where can we store the URLs?
&lt;/h3&gt;

&lt;p&gt;390 Megabytes for all URLs can easily be stored in RAM, meaning we can get away with using an in-memory RAM solution for managing which URLs we have tracked.&lt;/p&gt;

&lt;p&gt;650 Gigabytes is more than we can cost-effectively store in RAM on a single server. If we had to keep all the documents on one machine, we would need to write to the local hard drive. But because we have 10,000 servers, we can evenly distribute the documents, so each server only needs to store about 65 MB of HTML (650 GB / 10,000), which easily fits in RAM at a reasonable price.&lt;/p&gt;
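&lt;p&gt;We can sanity-check the storage math:&lt;/p&gt;

```python
docs = 6_500_000
avg_doc_bytes = 100 * 1_000   # ~100 kb per HTML page
num_servers = 10_000

total_bytes = docs * avg_doc_bytes             # 650 GB in total
per_server_bytes = total_bytes // num_servers  # 65 MB per server
```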
&lt;h3&gt;
  
  
  Where will be the limitations? CPU? Disk? Network?
&lt;/h3&gt;

&lt;p&gt;CPU: The most expensive task for the CPU will be extracting URLs from the HTML documents we have crawled. This should take less than 1ms per document.&lt;/p&gt;

&lt;p&gt;Disk: As mentioned above, we probably don't need to be writing to disk at all since the documents, when distributed across the 10k servers, will fit into memory.&lt;/p&gt;

&lt;p&gt;Network: Round trip to wikipedia.org for a single document may take ~200ms depending on their load and the distance our servers will be from theirs.&lt;/p&gt;

&lt;p&gt;This will be a network-bound task, with the opportunity to use our CPUs to parse the already-fetched HTML documents for URLs while we wait for network responses.&lt;/p&gt;
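&lt;p&gt;One way to exploit that overlap is a thread pool: threads blocked on network I/O release the interpreter, letting other threads keep working. A sketch with hypothetical stubbed-out &lt;code&gt;download&lt;/code&gt; and &lt;code&gt;extract_urls&lt;/code&gt; functions:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # Stub for the ~200ms network round trip to wikipedia.org.
    return "page for " + url

def extract_urls(page):
    # Stub for the CPU-bound URL extraction step.
    return []

urls = ["/wiki/page_%d" % i for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    # While some workers wait on the network, others keep the CPU busy.
    pages = list(pool.map(download, urls))

found = [u for page in pages for u in extract_urls(page)]
```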
&lt;h2&gt;
  
  
  Design Options
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Option 1: Single server
&lt;/h3&gt;

&lt;p&gt;We will start simple and expand to maximize the resources of the problem.&lt;/p&gt;

&lt;p&gt;A naive approach would be for a single server to fetch a URL, extract the URLs from the document, and then fetch the next URL.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"https://wikipedia.org"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;seen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;URLs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;extract_URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;URLs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Follow up questions they may ask:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How do we know how many URLs we can safely fetch from one server at a time?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this, we will need to experiment with timeouts to determine when we are rate limited. Systems typically throttle too many connections coming from a single IP address to prevent abuse of the service.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Simple&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Does not utilize the 10k servers&lt;/li&gt;
&lt;li&gt;Wastes CPU cycles waiting for the web request to complete&lt;/li&gt;
&lt;li&gt;A server failure results in complete data loss&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Option 2: Distributed Systems
&lt;/h3&gt;

&lt;p&gt;Assigning each URL to a specific server lets each server manage which URLs need to be fetched or have already been fetched. Each server gets its own id number, from 0 to 9,999. Hashing each URL and taking the hash modulo 10,000 determines the id of the server responsible for fetching that URL.&lt;/p&gt;

&lt;p&gt;In a master/slave design, a single master server could map the server ids to specific IP addresses. Since the problem asks us to reduce network traffic, we can either pre-configure each server with an id-to-IP-address mapping or rely on a DNS server to map hostnames to IP addresses (e.g. 123.crawler.company.com points to server 123). If there is a failure and a server id needs to be assigned to a new server, the DNS record would be updated to point to the new healthy server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;server_num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/wiki/The_Entire_History_of_You"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;

&lt;span class="c1"&gt;##  Directly talk to the server
&lt;/span&gt;&lt;span class="n"&gt;server_ip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_to_ip_dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;server_num&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;##  Using a DNS server
&lt;/span&gt;&lt;span class="n"&gt;server_host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;f'http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.crawler.company.com'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Since every URL will be uniquely assigned to a single server number, each server will internally track which URLs it has already crawled, just like the single server design. The single server design uses a &lt;code&gt;set&lt;/code&gt;, but we could also use a Bloom Filter.&lt;/p&gt;
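&lt;p&gt;A Bloom filter answers "definitely not seen" or "probably seen", trading a small false-positive rate for far less memory than a set of full URLs. A minimal sketch (the sizes and hash count are illustrative, not tuned for 6.5m URLs):&lt;/p&gt;

```python
import hashlib

class BloomFilter:
    # Minimal sketch; m_bits and k are illustrative choices. A real
    # implementation packs the positions into a bit array instead of
    # a Python set, so memory stays fixed at m_bits regardless of load.
    def __init__(self, m_bits=8192, k=4):
        self.m, self.k = m_bits, k
        self.positions = set()

    def _hashes(self, item):
        # k independent hash positions derived via salted blake2b.
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=str(i).encode())
            yield int.from_bytes(h.digest()[:8], "big") % self.m

    def add(self, item):
        self.positions.update(self._hashes(item))

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(p in self.positions for p in self._hashes(item))

crawled = BloomFilter()
crawled.add("/wiki/Consistent_hashing")
```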

&lt;h4&gt;
  
  
  APIs
&lt;/h4&gt;

&lt;p&gt;Each server will need to implement an API to receive a set of URLs that the other servers find in their pages. We can use JSON and REST to route these requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /fetch_URLs HTTP/1.1
Host: 2.crawler.company.com
Body:
{
  URLs: ["/wiki/Consistent_hashing", "/wiki/Hash_table"]
}

Response: 202
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The URLs attribute should be a unique list of URLs found in the HTML documents that the sending server fetched. We should avoid sending one web request per discovered URL, because each network request has overhead; instead, we can collect URLs and send them to the other machines in batches of 100. If we sent URLs to the other machines as we extracted them, each document could trigger a separate network request for every URL it contains.&lt;/p&gt;
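&lt;p&gt;A sketch of the routing and batching logic described above. The function names are my own, and I use md5 rather than Python's built-in &lt;code&gt;hash()&lt;/code&gt;, since the built-in is randomized per process and every server must agree on the URL-to-server mapping:&lt;/p&gt;

```python
import hashlib
from collections import defaultdict

NUM_SERVERS = 10_000
BATCH_SIZE = 100

def server_for(url, num_servers=NUM_SERVERS):
    # Stable hash so every machine agrees which server owns a URL.
    digest = hashlib.md5(url.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers

def route(urls):
    # Group the extracted URLs by the server responsible for them.
    by_server = defaultdict(list)
    for url in urls:
        by_server[server_for(url)].append(url)
    return by_server

def batched(urls, size=BATCH_SIZE):
    # Yield fixed-size batches to amortize per-request overhead.
    for i in range(0, len(urls), size):
        yield urls[i:i + size]
```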

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hXzmIOm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/distributed-web-crawler.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hXzmIOm_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/distributed-web-crawler.png" alt="Distributed Web Crawler Design Flowchart" title="Distributed Web Crawler Design Flowchart"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://docs.google.com/presentation/d/1BOWsk4L68pDx7u1GpXF-NqOTejrAffsJmdolQVCnO3Q/edit#slide=id.p"&gt;Source&lt;/a&gt;]&lt;/p&gt;

&lt;h4&gt;
  
  
  Follow up questions:
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;How can you distribute the URLs if a portion of the 10k servers lose power while the crawl is happening?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Borrowing from &lt;a href="https://kafka.apache.org/documentation/#design"&gt;Kafka's system design&lt;/a&gt;, we can use the concept of "consumer groups". Instead of sending the URLs to be fetched to a single machine, we could divide the 10,000 servers into groups that are collectively responsible for managing the URLs assigned to their group. One machine would receive the URLs to be fetched, and a consensus algorithm like Paxos would decide which machine within the group fetches each URL.&lt;/p&gt;

&lt;p&gt;If an entire group fails, we can use the technique called "consistent hashing" to evenly redistribute the failed group's load across the remaining groups. When a URL is found, it is hashed onto the same ring as the groups and assigned to the nearest group on the ring.&lt;/p&gt;
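&lt;p&gt;The useful property of consistent hashing here is that when a group disappears, only the URLs that belonged to that group get remapped. A minimal ring sketch (the virtual-node count is an illustrative choice):&lt;/p&gt;

```python
import bisect
import hashlib

def _h(key):
    # Stable 64-bit position on the ring.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, groups, vnodes=64):
        # Each group gets several virtual nodes to smooth out the load.
        self.ring = sorted((_h("%s:%d" % (g, v)), g)
                           for g in groups for v in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def group_for(self, url):
        # Walk clockwise to the first virtual node at or past the URL's hash.
        i = bisect.bisect(self.keys, _h(url)) % len(self.ring)
        return self.ring[i][1]
```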

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--V2roJG80--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/consistent-hashing.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--V2roJG80--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/consistent-hashing.png" alt="Example of how Consistent Hashing" title="Example of how Consistent Hashing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[&lt;a href="https://docs.google.com/presentation/d/1BOWsk4L68pDx7u1GpXF-NqOTejrAffsJmdolQVCnO3Q/edit#slide=id.p"&gt;Source&lt;/a&gt;]&lt;/p&gt;

&lt;h4&gt;
  
  
  Pros
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Fully utilizes all 10,000 machines&lt;/li&gt;
&lt;li&gt;  Minimizes network activity&lt;/li&gt;
&lt;li&gt;  Fetches each URL once :)&lt;/li&gt;
&lt;li&gt;  Handles distributed system failures&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Cons
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  Despite randomly assigning URLs to each group, a single group may get unlucky and be assigned either a disproportionately large number of URLs or URLs that are larger and take longer to parse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  👉👉Have a question? Join our FB group: &lt;a href="https://www.facebook.com/groups/3331243270259787"&gt;System Designers&lt;/a&gt;
&lt;/h4&gt;

&lt;h2&gt;
  
  
  Additional Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://dzone.com/articles/decoded-examples-of-how-hashing-algorithms-work"&gt;Decoded: Examples of How Hashing Algorithms Work&lt;/a&gt; (blog)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=zaRkONvyGr8"&gt;What is Consistent Hashing and Where is it used?&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=bEmBh1HtYrw"&gt;Bloom Filters&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.geeksforgeeks.org/trie-insert-and-search/"&gt;Trie Data Structures&lt;/a&gt; (GeeksForGeeks)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://blog.cloudera.com/scalability-of-kafka-messaging-using-consumer-groups/"&gt;Scalability of Kafka Messaging using Consumer Groups&lt;/a&gt; (blog)&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=d7nAGI_NZPk"&gt;The Paxos Algorithm&lt;/a&gt; (youtube)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>career</category>
      <category>python</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How and when to add foreign key constraints</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Sun, 14 Apr 2019 15:33:12 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/how-and-when-to-add-foreign-key-constraints-1aji</link>
      <guid>https://dev.to/kevincolemaninc/how-and-when-to-add-foreign-key-constraints-1aji</guid>
      <description>&lt;p&gt;Many rails projects rely on application validation to ensure data integrity. With rails &lt;code&gt;presence&lt;/code&gt; validation, you can require associations exist in order for an object to be saved. This works until it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not having foreign key constraints
&lt;/h2&gt;

&lt;p&gt;If a developer forgets to define a dependency option on a Rails association (e.g. &lt;code&gt;has_many :users, dependent: :nullify&lt;/code&gt;), uses &lt;code&gt;#delete&lt;/code&gt; instead of &lt;code&gt;#destroy&lt;/code&gt;, or manually deletes a record via a query, the associated rows will point to missing records. This isn't ideal, because now you can't reliably test whether an association exists by checking whether the id is present.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="c1"&gt;# good&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"company exists!"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company_id&lt;/span&gt;

&lt;span class="c1"&gt;# bad - N+1 to load the company&lt;/span&gt;
&lt;span class="nb"&gt;puts&lt;/span&gt; &lt;span class="s2"&gt;"company exists!"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the company is deleted, but the users are not deleted with the company, you might be accidentally invalidating your models, thus preventing you from saving any attribute changes!&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationRecord&lt;/span&gt;
  &lt;span class="n"&gt;belongs_to&lt;/span&gt; &lt;span class="ss"&gt;:company&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;required: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;company&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;
&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt; &lt;span class="ss"&gt;name: &lt;/span&gt;&lt;span class="s1"&gt;'kevin'&lt;/span&gt; &lt;span class="c1"&gt;# false - company is missing&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;wat. I can't save the name if the company isn't there?&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding foreign key constraints
&lt;/h2&gt;

&lt;p&gt;In Rails 5, support for foreign key constraints was added so the database can protect the integrity of associated data. Once a foreign key constraint is defined, your database will not allow you to remove records that are required by other tables.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;

&lt;span class="n"&gt;add_foreign_key&lt;/span&gt; &lt;span class="ss"&gt;:users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:companies&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
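&lt;p&gt;To see the database-level behavior outside of Rails, here is a self-contained sketch using Python's sqlite3 (the table names mirror the example above; note that SQLite only enforces foreign keys when switched on per connection):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite requires opting in to foreign key enforcement per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE users ("
    "  id INTEGER PRIMARY KEY,"
    "  company_id INTEGER NOT NULL REFERENCES companies(id))"
)
conn.execute("INSERT INTO companies (id) VALUES (1)")
conn.execute("INSERT INTO users (id, company_id) VALUES (1, 1)")

try:
    conn.execute("DELETE FROM companies WHERE id = 1")
    deleted = True
except sqlite3.IntegrityError:
    deleted = False  # the database refused: a user still references it
```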

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FKevinColemanInc%2Fyeet_dba%2Fmaster%2Fyeet_dba.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FKevinColemanInc%2Fyeet_dba%2Fmaster%2Fyeet_dba.png" alt="logo of the yeet_dba gem" title="logo of the yeet_dba gem"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick way to add foreign key constraints to your entire Rails schema is to use the &lt;a href="https://github.com/kevincolemaninc/yeet_dba" rel="noopener noreferrer"&gt;yeet_dba&lt;/a&gt; gem. &lt;code&gt;yeet_dba&lt;/code&gt; includes rake tasks and generators that scan your entire database for columns missing foreign key constraints. If the data is valid, it adds the foreign key constraint; if the data is invalid, it helps you resolve the problem.&lt;/p&gt;

&lt;p&gt;By adding foreign key constraints to your database, you reduce N+1 queries, improve join and where-query performance, and prevent unexpected failures caused by missing associations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/plentz/lol_dba" rel="noopener noreferrer"&gt;lol_dba&lt;/a&gt; - This gem helps find indexes that are missing, but not foreign keys.&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.postgresqltutorial.com/postgresql-foreign-key/" rel="noopener noreferrer"&gt;Postgres foreign key guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kcoleman.me/2019/03/14/how-to-add-foreign-key-constraints-rails.html" rel="noopener noreferrer"&gt;Rails foreign key constraints&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rails</category>
      <category>showdev</category>
      <category>tutorial</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Fraud Detection with Ruby on Rails</title>
      <dc:creator>Kevin Coleman</dc:creator>
      <pubDate>Thu, 07 Feb 2019 11:11:01 +0000</pubDate>
      <link>https://dev.to/kevincolemaninc/fraud-detection-with-ruby-on-rails-4751</link>
      <guid>https://dev.to/kevincolemaninc/fraud-detection-with-ruby-on-rails-4751</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Fscammer.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Fscammer.jpg" alt="Man in a mask" title="Man in a mask"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a scammer, please don't read this. Everyone else, keep going. :)&lt;/p&gt;

&lt;p&gt;I created AvoVietnam, a React Native dating app &lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;to connect Vietnamese women with foreign men&lt;/a&gt;. I have had an influx of registrations from scammers trying to defraud my female Vietnamese users. &lt;a href="https://vietnamnews.vn/society/372275/scams-break-womens-hearts-bank-accounts.html" rel="noopener noreferrer"&gt;Their basic strategy&lt;/a&gt; is to get a girl to trust them with promises of love and marriage. They tell their victim they want to protect and take care of her. The scammer offers to send her some money to buy a safe car or even a house, but of course, there is a transfer fee the girl must pay.&lt;/p&gt;

&lt;p&gt;Most of these scammers are located in North and West Africa. They upload attractive photos of Western men with jobs like airline pilot or military captain, often posing with a cute puppy, but their GPS and IP addresses say they live in a shack in Nigeria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Ffake-profile.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Ffake-profile.png" alt="Fake profile on AvoVietnam" title="Fake profile on AvoVietnam"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I am going to go over a few of my techniques for stopping low-tech scammers from reaching my users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shadow banning
&lt;/h2&gt;

&lt;p&gt;When a user is marked as a scammer, rather than kicking them off the platform and signaling that they have been caught, I shadow ban their account. A shadow-banned account's profile and messages are hidden from all other users, and the account can only see a static list of fake profiles. The scammer will think everyone is ignoring them, or that there simply are not many active users on the app.&lt;/p&gt;
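&lt;p&gt;The visibility rule behind shadow banning can be sketched in a few lines. This is only an illustration with a hypothetical &lt;code&gt;Profile&lt;/code&gt; stand-in, not my actual model; in Rails it would typically be an ActiveRecord scope.&lt;/p&gt;

```ruby
# Minimal sketch of the shadow-ban visibility rule (illustration only).
# In a Rails app this would be a scope on the real model, e.g.
#   scope :visible, -> { where(shadow_banned_reason: nil) }
Profile = Struct.new(:name, :shadow_banned_reason)

def visible_profiles(profiles)
  # Shadow-banned accounts are silently filtered out of everyone else's results.
  profiles.reject { |p| p.shadow_banned_reason }
end

profiles = [Profile.new('alice', nil), Profile.new('scammer', :proxy)]
visible_profiles(profiles).map { |p| p.name }  # ['alice']
```

&lt;p&gt;The banned account itself still receives normal-looking responses (the static list of fake profiles), so nothing signals the ban.&lt;/p&gt;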

&lt;p&gt;Stopping these scammers is like playing whack-a-mole, so I want to slow them down as much as possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  App store location
&lt;/h2&gt;

&lt;p&gt;Both Apple and Google run a separate store for each country in order to abide by local regulations and laws. Since most of my scammers seem to come from North and West Africa, I removed my app from essentially every African store (I left South Africa listed), so they need an account configured for a different country's store just to download my app.&lt;/p&gt;

&lt;h2&gt;
  
  
  IP address location
&lt;/h2&gt;

&lt;p&gt;When an account accesses the API, I save the IP address to a separate table and fire off two &lt;a href="https://github.com/mperham/sidekiq" rel="noopener noreferrer"&gt;Sidekiq workers&lt;/a&gt; to collect information about it. The first worker looks up the country of the IP address. Using the &lt;a href="https://github.com/hexorx/countries" rel="noopener noreferrer"&gt;countries gem&lt;/a&gt;, I can easily identify which countries belong to Africa and shadow ban accounts from them.&lt;/p&gt;
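&lt;p&gt;The country check itself is tiny. The sketch below hardcodes a deliberately partial set of ISO codes purely for illustration; the countries gem can supply the full region data (every country whose region is "Africa") instead of a hand-maintained list.&lt;/p&gt;

```ruby
require 'set'

# Partial, illustrative list of ISO 3166 codes; a real app would derive
# the full set from the countries gem rather than hardcoding it.
SHADOW_BAN_CODES = %w[NG GH SN CI CM BJ TG ML].to_set

def shadow_ban_country?(country_code)
  SHADOW_BAN_CODES.include?(country_code)
end

shadow_ban_country?('NG')  # true
shadow_ban_country?('ZA')  # false: South Africa stays allowed
```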

&lt;p&gt;I do not use a geocoder gem because I want to keep my Ruby on Rails application as small as possible. You can easily call the &lt;a href="https://ipstack.com" rel="noopener noreferrer"&gt;ipstack&lt;/a&gt; API with a &lt;code&gt;Net::HTTP&lt;/code&gt; request in a few lines of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GeocodeIpWorker&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationJob&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;IpAddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;present?&lt;/span&gt;
    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reverse_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'country_name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'city'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;country_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'country_code'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reverse_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"http://api.ipstack.com/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;?access_key=xxxx"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'200'&lt;/span&gt;
    &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a few weeks, I noticed the scammers started using USA-based proxy addresses to fake their locations, thus avoiding my automatic detection. Unfortunately for them, there are many free services that will tell you whether someone is accessing your service through a proxy. I push that check to a Sidekiq worker as well. If a user is trying to hide their location, bye-bye. Again, no new gems needed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CheckForProxyJob&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;ApplicationJob&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;IpAddress&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nil?&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proxy_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt; &lt;span class="ss"&gt;shadow_banned_reason: :proxy&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;proxy&lt;/span&gt;
    &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save!&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;

  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;proxy_test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;URI&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"http://v2.api.iphub.info/ip/&lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;ip&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'X-Key'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"..."&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;port&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kp"&gt;nil&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'200'&lt;/span&gt;
    &lt;span class="no"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;body&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ss"&gt;:block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  GPS location
&lt;/h2&gt;

&lt;p&gt;Since my &lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;dating app for Vietnamese girls and foreigners&lt;/a&gt; is a mobile app, I sometimes have access to the phone's GPS location. I don't require it to use the app, but I do ask for it for fraud detection and better location matching. Most people, including the scammers, are comfortable sharing their GPS location with a dating app. On Android it is easy to fake your GPS location using "Developer mode," but if you do reveal your location and you are in an African country, you will automatically be shadow banned.&lt;/p&gt;

&lt;p&gt;Looking up the country for a lat-long pair with an API was simple to run in another Sidekiq worker. I won't include that code here; it looks very similar to the previous workers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Banning WhatsApp numbers
&lt;/h2&gt;

&lt;p&gt;The scammers always try to move the conversation off the platform to prevent administrators from seeing their malicious activity. When a profile is shadow banned, I scan through every message it has ever sent to find any WhatsApp or Zalo phone number the scammer might be using to message the women.&lt;/p&gt;

&lt;p&gt;If I see a user sharing a banned number with another user, I automatically shadow ban their account. Once caught, the scammer needs to create a completely new WhatsApp account.&lt;/p&gt;
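&lt;p&gt;The message scan boils down to pulling candidate phone numbers out of free text and comparing them against the banned set. The regex below is a loose illustration, not my production pattern.&lt;/p&gt;

```ruby
# Loose, illustrative phone-number pattern (not the production regex):
# an optional +, then 9-16 digits possibly broken up by spaces or punctuation.
PHONE_PATTERN = /\+?\d[\d\s().-]{7,14}\d/

# Normalize matches down to bare digit strings for comparison.
def extract_phone_numbers(text)
  text.scan(PHONE_PATTERN).map { |n| n.gsub(/\D/, '') }
end

def shares_banned_number?(text, banned_numbers)
  extract_phone_numbers(text).any? { |n| banned_numbers.include?(n) }
end

shares_banned_number?('add me on zalo +84 90 123 4567', ['84901234567'])  # true
```

&lt;p&gt;Normalizing to bare digits means spacing or punctuation tricks don't help the scammer evade the ban list.&lt;/p&gt;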

&lt;h2&gt;
  
  
  Banning the device
&lt;/h2&gt;

&lt;p&gt;To prevent scammers from re-registering through an undetectable proxy, I generate a UUID and store it on the device's filesystem. When a user tries to register twice, I receive the same device UUID as in the first registration. They would need to clear the app's data or reinstall the application to get a new device ID. Apple and Google used to give you access to the device's MAC address, which is impossible to change, but due to recent privacy concerns they no longer consistently expose that API.&lt;/p&gt;
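&lt;p&gt;Server-side, the device check is just a uniqueness test on the UUID the client generates once and persists. The sketch below uses an in-memory set in place of a real devices table, purely as an illustration.&lt;/p&gt;

```ruby
require 'securerandom'
require 'set'

# In-memory stand-in for a devices table (illustration only).
class DeviceRegistry
  def initialize
    @known_device_ids = Set.new
  end

  # Returns true on a first-time device, false when the same UUID shows up
  # again, i.e. a re-registration attempt from a known device.
  def register(device_uuid)
    return false if @known_device_ids.include?(device_uuid)
    @known_device_ids.add(device_uuid)
    true
  end
end

registry = DeviceRegistry.new
uuid = SecureRandom.uuid   # generated once on the device, then persisted
registry.register(uuid)    # true: first registration
registry.register(uuid)    # false: same device trying again
```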

&lt;p&gt;&lt;a href="https://www.avovietnam.com" rel="noopener noreferrer"&gt;&lt;br&gt;
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.kcoleman.me%2Fimages%2Favovietnam-feature.png" alt="AvoVietnam banner" title="AvoVietnam - serious relationships with Vietnamese"&gt;&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;With this auto-shadow banning enabled, scammers will see an app full of fake users. Hopefully, they will continue on their merry way and stop making new accounts. AvoVietnam is free to chat between users, but if you're interested in free &lt;a href="https://www.avovietnam.com/faq" rel="noopener noreferrer"&gt;AvoVietnam Gold&lt;/a&gt;, which lets you send photos and appear at the top of the message inbox, shoot an email to &lt;a href="mailto:marketing@avovietnam.com"&gt;marketing@avovietnam.com&lt;/a&gt; with your account's email and we will hook you up with one free week.&lt;/p&gt;

&lt;p&gt;If you have more suggestions on how to detect malicious users, send an email to &lt;a href="mailto:dev@avovietnam.com"&gt;dev@avovietnam.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rails</category>
      <category>security</category>
      <category>geoip</category>
      <category>ruby</category>
    </item>
  </channel>
</rss>
