DEV Community: Matthew Schwartz

Jack of All Trades, Master of Some

Matthew Schwartz — Thu, 02 Jul 2020 15:30:55 +0000

I've been a team lead and software development manager for many years. I've coded, architected, and managed many web development projects, mostly large scale SaaS applications. There are obviously many things I look for when interviewing candidates. There's one particular quality I'd like to talk about today: jack of all trades while being a master at some.

Full Stack vs Single Tier

Organizations often debate whether it's better to hire full stack developers or people specialized in each tier. There are pros and cons to both approaches.

Someone dedicating themselves to one or two technologies is able to spend more time learning them. We all know how time consuming it is to learn the wide array of web technologies.

But having a good understanding of the whole architecture means when problems come up that span tiers, a full stack developer is more capable of solving them. They can be involved in more of the conversations that drive the product. For example, there are times a back-end team needs direction from the front-end team to satisfy UI requirements. Those front-end developers who can speak with understanding of the complete stack will be able to contribute best.

Also consider the pure managerial problem of assigning developers to projects and teams. A full stack developer will have more places they can contribute across the entire product. This opens career paths for the developer and makes it easy to move to another area when they get bored or want to learn something new.

That said, my experience has shown me the best developers are those who are true experts in a set of technologies while having a good understanding of the entire stack. They are generally the best problem solvers and also the most successful as they move up in their careers.

Library vs Language

Similarly, the most successful developers I've known have a deep understanding of the programming languages they focus on. They grow expertise in various libraries and frameworks but know the fundamentals of the platform they're running on.

Anyone remember the Prototype JS framework? Then script.aculo.us which was built on it? Most readers here probably don't because jQuery became more popular when it came out. And now we have additions to JS and frameworks like React which are replacing jQuery.

Libraries and frameworks come and go. Most programming languages stick around far, far longer. And the fundamental concepts that span programming languages have existed for decades. Become an expert in the foundation and everything built on top of it is both easier to learn and quicker to master.

Conclusion

There's a balance to being a great developer and a value to your employer. Don't just be a generalist or you'll never be the go-to person that people can rely on. And don't be so focused on one thing to the exclusion of everything else or your contributions will be limited and you'll spend more time catching up with the next big technology.

This is just one person's opinion. YMMV.

Check out my current project, SocialSentiment.io, an application which performs social media sentiment analysis of stocks.

Machine Learning: Staying on Topic

Matthew Schwartz — Tue, 30 Jun 2020 12:57:11 +0000

I started SocialSentiment.io with a somewhat simplistic machine learning algorithm. I defined a recurrent neural network to perform sentiment analysis of short texts from social media. Its purpose, and therefore its training set, is focused on the topic of stocks, companies, and their products. It returned only one floating point value representing a single prediction for each string. I quickly fell into a common ML natural language processing trap: new text which is off-topic returns unpredictable and unhelpful results. Worse yet, the results don't directly indicate the text it analyzed is off-topic.

Examples of the Problem

Search a social network such as Twitter for references to the Intel Corporation. Many posts, probably most, refer to the company as "Intel" and not "Intel Corporation" or "Intel Corp". Therefore you're going to search for the word "Intel".

Here are some posts that recently came back which are on-topic:

    What does Apple dumping Intel mean for Mac users? 

    So turns out, I have been running all the games from the Intel gpu instead of the nvidia...

    Amazon buys self-driving car company run by former Intel Oregon exec

Along with them come posts that aren't related to the company or its stock at all:

    It is common for different intel agencies to attach different degrees of confidence based on the manner on underlying intel...

    Goddammit man what action are YOU gonna take? You’re the chairman of the intel committee!!!

Another example is Google / Alphabet. YouTube is owned by Alphabet, so it is included in our social media searches. Search social media for YouTube and the most popular posts are about music and music videos on the site.

    [Song name] Officially Sets YouTube Record For Most Views In First 24 Hours

    [Band name] Smashes YouTube Record As [song name] Soars Past 100 Million Views

While these refer to YouTube, they aren't on-topic for the kinds of posts we're interested in analyzing.

Since our NN is trained on posts involving general business and stock opinions, plus specific industry sentiment like computing, it naturally returns widely varying results for these off-topic texts. These posts aren't useful to our analysis at all, so how do we ignore them?

Garbage In, Garbage Out

An old coworker of mine used to respond to bug reports in his software with "Garbage in, garbage out!"

Ideally we would filter these posts out before our RNN processes them. So we started with this approach by adding a negative filter to social media searches: ignore "house intel" and "senate intel", for example. This of course helped.
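A negative filter of this kind can be sketched in a few lines. This is only an illustration: the function name, the phrase list, and the second sample post are made up for the example and are not taken from the SocialSentiment.io code.

```python
import re

# Hypothetical phrases that mark a post as off-topic for the search term "Intel".
NEGATIVE_PHRASES = ("house intel", "senate intel")


def passes_negative_filter(text, negative_phrases=NEGATIVE_PHRASES):
    """Return False when the text contains any known off-topic phrase."""
    # Collapse whitespace and lower-case so "Senate  Intel" still matches.
    normalized = re.sub(r"\s+", " ", text).lower()
    return not any(phrase in normalized for phrase in negative_phrases)


posts = [
    "What does Apple dumping Intel mean for Mac users?",
    "The Senate Intel committee released its report today",  # invented sample
]
kept = [p for p in posts if passes_negative_filter(p)]
```

Only the first post survives the filter. Plain substring matching is cheap, but it can't tell the two meanings of an ambiguous phrase apart.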

But there are more difficult filters. "Intel community", for example, may refer to the company or the government. "Intel chairman" might be the board chairman or a member of the US Congress. We don't want to ignore these posts and lose valuable information.

Multi-Label Classification & Off-Topic Training

We added another approach to solve this problem. We changed our ML algorithm to perform multi-label text classification. Instead of a binary classification, returning a single floating point number between 0 and 1, we redesigned and retrained it to label texts as positive, negative, neutral, and off-topic.

Our original binary classification took the typical approach of its last dense layer having a unit size of 1 with a sigmoid activation to bound the result between 0 and 1. The redesigned model with multi-label classification ended with a dense layer the size of the number of labels. By keeping the sigmoid activation we now get a prediction of each individual label.
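To see why keeping the sigmoid matters, here is a small standard-library sketch comparing it with softmax, the other common choice for a final layer. The logits are made-up numbers, not output from our model. Sigmoid scores each label independently, so every label can come back low. Softmax forces the scores to sum to 1, so some label always looks comparatively strong.

```python
import math

LABELS = ("positive", "negative", "neutral", "off-topic")


def sigmoid(x):
    """Squash one raw score into the range 0..1, independent of the others."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    """Squash all raw scores together so they sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up raw outputs (logits) of the final dense layer for one unfamiliar text.
logits = [-2.0, -1.5, -2.5, -1.0]

independent = [sigmoid(z) for z in logits]  # one prediction per label
competing = softmax(logits)                 # labels compete for a total of 1
```

With these logits every sigmoid score stays below 0.3, while softmax pushes the off-topic label up to roughly 0.45 simply because the four scores have to sum to 1.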

If the prediction of every label is low for a text, meaning below some threshold we choose to rely on, then we know the model is not well trained for this particular text. We can choose to ignore it or hang onto it for later training.
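The threshold check can be sketched like this. The 0.5 threshold, the function name, and the score vectors are assumptions chosen for illustration; the right cut-off depends on your model and data.

```python
LABELS = ("positive", "negative", "neutral", "off-topic")


def interpret(scores, threshold=0.5):
    """Return the labels whose score clears the threshold, or None if none do."""
    chosen = [label for label, score in zip(LABELS, scores) if score >= threshold]
    return chosen or None


confident = interpret([0.91, 0.03, 0.10, 0.02])  # one label stands out
unsure = interpret([0.12, 0.18, 0.08, 0.27])     # nothing clears the bar
```

A result of None is the signal that the text is unfamiliar to the model. Those texts can be skipped or set aside as candidates for the next round of training.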

We can also proactively train it on many off-topic texts which it would otherwise have given a sentiment label. If the prediction for the off-topic label is high, we must have previously trained it on something similar.

Conclusion

Since switching to this multi-label text classification model for our machine learning algorithm we have much more accurate results. We still catch and predict the sentiment of too many off-topic posts. With more training and fine tuning it'll improve over time.

How to Add Subscription Based Throttling to a Django API

Matthew Schwartz — Sun, 14 Jun 2020 15:55:03 +0000

Python was a natural choice when I started SocialSentiment.io. It let me use the same language for both the machine learning algorithms and web development. And I had used Django previously for other projects. The Django Rest Framework (DRF) is a great package to quickly and easily extend a Django project to offer APIs. Today we'll look at how to extend its capabilities to support custom throttling based on user subscriptions.

Subscription Model

First let's define our application's subscription model and throttling requirements:

A free tier allowing a few hundred API requests per day
A low cost paid tier offering a few thousand requests per day
A higher cost tier offering unlimited requests
All tiers limited to 5 requests per second

This is a very common use case for a modern SaaS application.

Custom Throttling Class

One great thing about Django Rest Framework is it includes many built-in options for authentication and throttling. Each can be applied globally or to specific endpoints. If you desire any type of dynamic throttling options you'll need to extend it. Fortunately the architecture of DRF lets you override just about any part of it.

Let's start by writing a custom class that overrides DRF's UserRateThrottle:

from rest_framework.throttling import UserRateThrottle

class SubscriptionRateThrottle(UserRateThrottle):
    # Define a custom scope name to be referenced by DRF in settings.py
    scope = "subscription"

    def __init__(self):
        super().__init__()

    def allow_request(self, request, view):
        """
        Override rest_framework.throttling.SimpleRateThrottle.allow_request

        Check to see if the request should be throttled.

        On success calls `throttle_success`.
        On failure calls `throttle_failure`.
        """
        if request.user.is_staff:
            # No throttling
            return True

        if request.user.is_authenticated:
            user_daily_limit = get_user_daily_limit(request.user)
            if user_daily_limit:
                # Override the default from settings.py
                self.duration = 86400
                self.num_requests = user_daily_limit
            else:
                # No limit == unlimited plan
                return True

        # Original logic from the parent method...

        if self.rate is None:
            return True

        self.key = self.get_cache_key(request, view)
        if self.key is None:
            return True

        self.history = self.cache.get(self.key, [])
        self.now = self.timer()

        # Drop any requests from the history which have now passed the
        # throttle duration
        while self.history and self.history[-1] <= self.now - self.duration:
            self.history.pop()
        if len(self.history) >= self.num_requests:
            return self.throttle_failure()
        return self.throttle_success()

What we're doing is dynamically looking up the user-specific throttle at the key moment to override the default DRF picks up from your settings file. Define a function get_user_daily_limit to look up the value. If the value is stored in a database, I highly recommend using Django's cache framework so the lookup stays fast.
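Here is one way get_user_daily_limit might look. It's a sketch under stated assumptions, not the code running on SocialSentiment.io. To keep it runnable on its own it uses a plain dictionary as the cache and a stand-in fetch_limit_from_database function; both are hypothetical. In a real Django project you would replace the dictionary with cache.get and cache.set from django.core.cache and the stand-in with a query against your own models.

```python
import time
from types import SimpleNamespace

CACHE_TTL_SECONDS = 300  # assumed: how long a looked-up limit is reused
_limit_cache = {}        # user id -> (limit, expiry time)


def fetch_limit_from_database(user):
    """Hypothetical stand-in for a query against your subscription tables."""
    return getattr(user, "daily_limit", None)


def get_user_daily_limit(user, now=None):
    """Return the user's daily request limit, or None for an unlimited plan."""
    now = time.monotonic() if now is None else now
    entry = _limit_cache.get(user.pk)
    if entry is not None and entry[1] > now:
        return entry[0]  # still fresh, skip the database
    limit = fetch_limit_from_database(user)
    _limit_cache[user.pk] = (limit, now + CACHE_TTL_SECONDS)
    return limit


user = SimpleNamespace(pk=1, daily_limit=5000)
first = get_user_daily_limit(user, now=0)
user.daily_limit = 200  # the plan changes in the "database"
still_cached = get_user_daily_limit(user, now=10)
refreshed = get_user_daily_limit(user, now=301)
```

The trade-off is the usual one with caching: a plan change takes up to the TTL to show up unless you also clear the cached entry when the subscription is updated.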

Settings

Next let's see what's required in settings.py:

REST_FRAMEWORK = {
    'DEFAULT_AUTHENTICATION_CLASSES': [...],
    'DEFAULT_PERMISSION_CLASSES': [
        'rest_framework.permissions.IsAuthenticated'
    ],
    'DEFAULT_THROTTLE_CLASSES': [
        'rest_framework.throttling.UserRateThrottle',
        'app.throttling.SubscriptionRateThrottle'
    ],
    'DEFAULT_THROTTLE_RATES': {
        'user': '5/second',
        'subscription': '200/day'
    }
}

Here we set up two types of throttling. The built-in UserRateThrottle will handle the global 5 requests per second limit. It finds that setting in DEFAULT_THROTTLE_RATES with key user. Our custom throttle class is also enabled and defaults to the subscription value if a user subscription isn't found. Of course the application should be written so this never happens, but it's good to have a fallback plan if a user isn't configured properly.
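The rate strings are shorthand for a request count and a window in seconds, which is why the custom class sets duration to 86400 for a daily limit. The function below is a simplified re-implementation written for illustration, not DRF's own source, showing roughly how such a string breaks down.

```python
def parse_rate(rate):
    """Turn a string like '200/day' into (num_requests, duration_in_seconds)."""
    num, period = rate.split("/")
    # Only the first letter of the period matters: second, minute, hour, day.
    seconds_per_unit = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(num), seconds_per_unit[period[0]]


per_second = parse_rate("5/second")
per_day = parse_rate("200/day")
```

So '5/second' means at most 5 requests in any 1 second window and '200/day' means at most 200 requests in any 86400 second window.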

Subscriptions

How you code and model your subscriptions is up to you. In my case I wrote static classes that define the details of each subscription tier. A Subscription model links the user to a specific plan with details such as start time, payment details, etc.
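As a rough idea of what static tier definitions could look like, here is a sketch using dataclasses. The class name, tier names, and the 5000 figure are assumptions for illustration; only the 200 per day default appears in the settings above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    """Static description of one subscription tier."""
    name: str
    daily_limit: Optional[int]  # None means unlimited requests


FREE = Plan("free", 200)        # a few hundred requests per day
BASIC = Plan("basic", 5000)     # assumed figure for the low cost tier
PREMIUM = Plan("premium", None)

PLANS = {plan.name: plan for plan in (FREE, BASIC, PREMIUM)}


def daily_limit_for(plan_name):
    """Look up a tier's limit, falling back to the free tier for unknown names."""
    return PLANS.get(plan_name, FREE).daily_limit
```

Keeping the tiers in code means a limit change ships with a deploy, while the database only needs to record which plan each user is on.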

The nice thing is Django and DRF don't dictate how you design your user subscriptions. Any way you choose to model it they'll handle because you can customize every aspect of authorization and throttling.

Conclusion

So far I only have good things to say about the flexibility of Django and DRF and the customizations they allow. They took the right approach in offering a wide variety of built-in capabilities while allowing developers the opportunity to easily extend or override them. It's been working great for SocialSentiment.io and our APIs. I'd like to hear how others have added their own features to Django Rest Framework in the comments below.

Quickly find common phrases in a large list of strings

Matthew Schwartz — Sat, 18 Jan 2020 18:47:03 +0000

Python is very good at efficiently iterating over sets of data and gathering useful information. This is often accomplished with a surprisingly short amount of code.

I recently came across a use case within a Python application where I wanted to find repeated phrases in sets of social media posts. This is an easily managed problem because these posts are relatively short, typically under 300 characters, and therefore we can process thousands directly in memory.

NLTK includes text correlation utilities to solve this problem. In my case I didn't get the results I required, so I found a simple solution that requires no libraries.

Ins and Outs

First let's define the input and output. Our function will take an iterable of strings, a maximum phrase length (default 3), and a minimum repeat count (default 2). As we'll soon see, the choice of phrase length will have a huge impact on performance. Minimum repeat count will have a smaller but still significant impact.

The output will be a dictionary. Each key will be a tuple of words which make up a found phrase. Words returned will be all lower case. Each value in the dictionary will be the number of times it's found.

In many cases you'll want to ignore stop words. Useful lists can be found in a gist and the comments under it on GitHub.

stopwords = ["a", "i", "some", "which", "where", ... ]

The Algorithm

Let's start our function:

import re

def get_common_phrases(texts, maximum_length=3, minimum_repeat=2) -> dict:
    phrases = {}

First let's break down the texts into phrases. These will be tuples between 1 and maximum_length in size. For each phrase we'll count how often we find it.

    for text in texts:
        # Replace separators and punctuation with spaces
        text = re.sub(r'[.!?,:;/\-\s]', ' ', text)
        # Remove extraneous chars
        text = re.sub(r'[\\|@#$&~%\(\)*\"]', '', text)

        words = text.split(' ')
        # Remove stop words and empty strings
        words = [w for w in words if len(w) and w.lower() not in stopwords]
        length = len(words)
        # Look at phrases no longer than maximum_length words long
        size = min(length, maximum_length)
        while size > 0:
            pos = 0
            # Walk over all sets of words
            while pos + size <= length:
                phrase = words[pos:pos+size]
                phrase = tuple(w.lower() for w in phrase)
                if phrase in phrases:
                    phrases[phrase] += 1
                else:
                    phrases[phrase] = 1
                pos += 1
            size -= 1

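You'll notice that as we increase maximum_length this will trigger many more loop iterations. Each extra word of phrase length adds roughly one more pass over every text and many more distinct phrases to store, and the sub-phrase cleanup at the end of the function compares phrases against each other, so processing time climbs quickly. So set the maximum to as small a number as reasonably possible. In my case I found 3 to be a good value.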
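To put numbers on the extra work, this small helper counts how many phrases the nested loops generate for a single text. The helper name and the 40 word post are hypothetical, chosen only to show the growth.

```python
def windows_per_text(word_count, maximum_length):
    """Number of phrases the nested loops generate for one text."""
    largest = min(word_count, maximum_length)
    # For each phrase size there are (word_count - size + 1) starting positions.
    return sum(word_count - size + 1 for size in range(1, largest + 1))


counts = [windows_per_text(40, m) for m in (1, 2, 3, 4)]
```

For 40 remaining words this gives 40, 79, 117, and 154 phrases for maximum lengths 1 through 4. Multiply that by thousands of posts and remember that every distinct phrase kept also has to pass through the later filtering steps.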

Next we'll remove phrases found less than the minimum required number of times.

    phrases = {k: v for k, v in phrases.items() if v >= minimum_repeat}
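For comparison, the counting and filtering steps so far can be written more compactly with collections.Counter. This is an alternative sketch, not the version we run: it skips the removal of extraneous characters, uses an abbreviated stop word set, and the function name is made up.

```python
import re
from collections import Counter

STOPWORDS = {"a", "i", "some", "which", "where"}  # abbreviated, illustrative


def count_phrases(texts, maximum_length=3, minimum_repeat=2):
    """Count every phrase of 1..maximum_length words, keeping the repeated ones."""
    counts = Counter()
    for text in texts:
        # Split on the same separators and punctuation, lower-casing as we go
        words = [w.lower() for w in re.split(r"[.!?,:;/\-\s]+", text)]
        words = [w for w in words if w and w not in STOPWORDS]
        for size in range(1, min(len(words), maximum_length) + 1):
            for pos in range(len(words) - size + 1):
                counts[tuple(words[pos:pos + size])] += 1
    return {k: v for k, v in counts.items() if v >= minimum_repeat}


result = count_phrases(["catch some common phrases", "Catch common phrases again"])
```

Both inputs reduce to the words catch, common, phrases once the stop word is removed, so every phrase built from those three words is counted twice and anything containing "again" is filtered out.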

And last we remove sub-phrases unless they are found much more frequently than their longer counterparts. I found this to be the most interesting and useful problem to solve to get quality results. I set a threshold of 25% deviation in count, meaning if the shorter sub-phrase is found often outside the longer phrase, we'll include both in the output.

    longest_phrases = {}
    keys = list(phrases.keys())
    keys.sort(key=len, reverse=True)
    for phrase in keys:
        found = False
        for l_phrase in longest_phrases:
            # If the entire phrase is found in a longer tuple...
            intersection = set(l_phrase).intersection(phrase)
            if len(intersection) == len(phrase):
                # ... and it isn't at least 25% more frequent than the
                # longer phrase, we'll drop it
                difference = (phrases[phrase] - longest_phrases[l_phrase]) / longest_phrases[l_phrase]
                if difference < 0.25:
                    found = True
                    break
        if not found:
            longest_phrases[phrase] = phrases[phrase]

    return longest_phrases

Let's test the output. Here's sample input:

texts = (
    "This is the first text where I want to catch some common phrases",
    "This is a second text where I hope to catch some common phrases",
    "This is a third text which should catch some common phrases",
    "I'm a unique string",
    "A post with text"
)

The output of this function will be

{
    ('catch', 'common', 'phrases'): 3,
    ('text',): 4,
    ('text', 'catch', 'common'): 2
}
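The 25% threshold can be checked by hand against this output. The helper below is written only for this walk-through and isolates the comparison from the function. The count of 3 for ('catch',) is not in the output above; it comes from counting the word in the sample input.

```python
def should_drop(sub_count, longer_count, threshold=0.25):
    """True when the sub-phrase is not sufficiently more frequent than the longer phrase."""
    return (sub_count - longer_count) / longer_count < threshold


# ('text',) appears 4 times, ('text', 'catch', 'common') only 2 times
text_vs_longer = should_drop(4, 2)
# ('catch',) appears 3 times, exactly as often as ('catch', 'common', 'phrases')
catch_vs_longer = should_drop(3, 3)
```

The first comparison gives a difference of 1.0, so ('text',) is kept as a phrase in its own right. The second gives 0, so ('catch',) adds nothing beyond the longer phrase and is dropped.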

We've been running this algorithm on SocialSentiment.io for a few weeks with very positive results. We track the sentiment of social media posts which reference publicly traded companies. This function helps us find and display frequently found phrases in those posts.