Warren Parad for Standup & Prosper

Posted on Aug 22, 2021 • Originally published at dev.to on Aug 21, 2021

The Punishment of Building a Slack App

#bots #development #slack #microservices

Three years ago, we set out at Rhosys to build the perfect tool for team performance and growth. It turns out lots of companies were interested in saying they wanted to help their teams and team members grow, but were much less interested in doing anything about it. What they were interested in, however, was an app that would let them force status tracking via standups. We’ve gotten a lot of requests to make standup questions mandatory in our Slack bot, Standup & Prosper, but since that isn’t core to standups or productive teams, it’s one feature that will never be implemented.

With COVID and a huge switch to remote working environments, teams, and more specifically managers, took for granted how they were tracking performance. Frequently, we saw companies using “engineer in chair” as the effective metric for “working” and “sync online standups” as the mechanism for reporting “I did my work that was assigned.” This was terrible for so many reasons. However, in an attempt to regain control and micromanage their teams, these managers wanted to do status tracking over Slack. Slack doesn’t remotely offer the ability to do status tracking in this way, but there is no shortage of tools available in the Slack App Marketplace for it.

At Rhosys, being a remote-first company, we were also interested in tools to enable high productivity, and because Discord and Microsoft Teams were unusable (one of them still is today, but I won’t say which) we invested in Slack. Multiple times a week we would hold async standups, and we wanted an automated solution for that. We thought to ourselves: this is simple and easy, it ought to be free. And it was, for a time. We experimented with five different standup apps, looking for the one that best fit our small and simple use case. We found one, and then used it effectively for six months.

One day, they said: Pay us or we’ll shut down your account. Well, that was the end of it. An app that does one thing, and doesn’t even need to do it that well, shouldn’t cost anything. As usual, as an engineering team we thought we could do it better. There are some important qualifiers here:

At the time we were focused on building team tools, so this was our core competence.
We calculated the TCO and it was huge, and we used that to evaluate doing this in the first place.
We wanted to provide the community with a free standup app, because it was a great marketing opportunity for our main product at the time: Teaminator. (We now solely focus on user security SaaS with Authress.)

Our app

We set out and started building our standup app: one that was simple, cleverly named Standup & Prosper, and, in line with our core values, provided a conversational experience with an AI to record your async standup.

Like all great microservices, it was done in under a month, but took another month to get released to the Slack marketplace. We were able to accomplish it quickly because we had already invested in building and integrating with:

User Management and Permission — Authress
External Messaging & Notifications — Our notification scheduler (almost open source)
UI experience for Slack integration — Requires some OAuth flow setup
Platform for deploying microservices — Gitlab + AWS Architect

You start out small and make it easy to handle integration events from Slack and post messages to users and channels.

Handle events from Slack
Send messages as DMs to users
Send messages to standup channels

And you’re done?

Well, almost. Now I’ll share all the issues we’ve found and how we went about dealing with them.

1. Slack Status API Lacks CORS support

Frequently, and I mean a lot, Slack is down, or otherwise unreliable, messages take too long to be sent:

(Don’t trust the 100% number, there are frequently reports that don’t make it into the calculation.) It’s a terrible user experience, but it’s understandable for a realtime messaging app. However, we still aim to make it better. When Slack is having a specific problem, such as the Slack Conversation APIs not working, everything else in Slack may seem fine to your users, but it still looks like your app is broken.

So at the top of our portal, now you see a link to the status:

How do we know Slack is having an incident? Well, there is the Slack Status API. Great! Except that it doesn’t support CORS. The reason given was:

However, I’m afraid that we can’t enable CORS headers for our status.slack.com API.

Can’t seems a bit strong for this, of course it should be possible, but it would help me if you shared why you would choose to not enable this feature. There shouldn’t be any security concerns given the nature of this app.

We actually don’t support CORS for any of our API endpoints. Whilst it may not seem security related, we do have policy concerns about how this data is being retrieved.

Great, so there’s an API, but I have to write a one line wrapper to fetch this data myself to expose to our UI.
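A minimal sketch of what such a wrapper could look like, assuming a small server-side handler that fetches the status JSON and re-serves it with a CORS header. The URL, the portal origin, and the function names are assumptions for illustration, not the code we actually run; check the Slack Status API documentation for the current endpoint.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

# Assumed endpoint for the Slack Status API; verify against Slack's docs.
SLACK_STATUS_URL = "https://status.slack.com/api/v2.0.0/current"
# Hypothetical origin of the portal that needs to read the status.
ALLOWED_ORIGIN = "https://portal.example.com"


def fetch_slack_status(url: str = SLACK_STATUS_URL) -> str:
    """Fetch the raw status JSON server-side, where CORS does not apply."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def status_proxy_handler(fetch: Callable[[], str] = fetch_slack_status) -> Dict[str, Any]:
    """Return an HTTP-style response that a browser on ALLOWED_ORIGIN may read."""
    headers = {
        "Content-Type": "application/json",
        "Access-Control-Allow-Origin": ALLOWED_ORIGIN,
    }
    try:
        body = fetch()
        json.loads(body)  # only forward bodies that are valid JSON
    except Exception:
        # The status API itself may be unreachable; report that instead of crashing.
        error_body = json.dumps({"error": "slack_status_unavailable"})
        return {"statusCode": 502, "headers": headers, "body": error_body}
    return {"statusCode": 200, "headers": headers, "body": body}
```

The fetch function is passed in so the handler can be exercised without touching the network, for example `status_proxy_handler(lambda: '{"status": "ok"}')`.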

2. Non-Standard API and Inconsistent Payloads

Everything is a status code 200 returned from Slack except when it isn’t. It also frequently doesn’t even accept JSON requests, but will return application/json, except when it doesn’t and the whole body is re-encoded into another string. But then the Slack API will also fail, so that’s great. You have to explicitly check for ok: true in the response, but sometimes it’s ok: false because you queried for a list and the list was empty… Oops.
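One way to cope is to funnel every response through a single normalizing function. The sketch below is an assumption about how that could be structured, not Slack's SDK or our production code: it treats non-200 statuses, undecodable bodies, string-wrapped JSON, and ok: false all as one error type. `SlackApiError` and `parse_slack_response` are hypothetical names.

```python
import json
from typing import Any, Dict


class SlackApiError(Exception):
    """Raised for any response that is not a clean ok: true payload."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def parse_slack_response(http_status: int, raw_body: str) -> Dict[str, Any]:
    """Normalize a raw Slack Web API response into a dict or raise SlackApiError."""
    if http_status != 200:
        raise SlackApiError(f"http_{http_status}")
    try:
        payload = json.loads(raw_body)
        # Handle a body that is a JSON string which itself contains JSON.
        if isinstance(payload, str):
            payload = json.loads(payload)
    except json.JSONDecodeError:
        raise SlackApiError("invalid_json")
    if not isinstance(payload, dict):
        raise SlackApiError("unexpected_payload")
    if payload.get("ok") is not True:
        # Slack reports failures in the body, even with HTTP 200.
        raise SlackApiError(str(payload.get("error", "unknown_error")))
    return payload
```

Callers that expect an ok: false for an empty list can catch `SlackApiError`, inspect `code`, and decide per endpoint whether that code really means "nothing found".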

You will also be getting lots of different types of events, and it turns out you will run into all of the following problems:

Events with undocumented fields
Events with three fields all called team_id but with different IDs
Events of different types have the same property in different named fields
Events for messages don’t have a timestamp
Events arriving over 10 seconds late
Events from workspaces that you don’t have an app token for any more

3. Undocumented Errors

Welcome to the land of undocumented errors. Slack has tons of error codes documented for every single endpoint, things like user_not_found or inactive_account, and in those cases you know you did something wrong. However, what will you do when you get back an error such as admin_only or org_login_required? It turns out you get these when other ridiculous things are happening, such as a Slack account migration or Enterprise Grid conversion. Why you can’t use the API at that time is beyond me, but you can’t.

Oh yeah, turns out the issue was there was a bug in the Slack source:

For this admin_only error, that was the result of an error in the Slack core code where we were supposed to show a proper error to the installer who happens to be a guest user.

4. User Login returns incomplete information

Welcome to the world of confusion. You want to log the user in with Slack Login. There are tons of pages dedicated to it. But it turns out that there are two login options:

Log user in — identity.user
Install App — oauth.exchange

When a user installs the app, you have no idea which workspace they installed it in or who the user is, among other missing details, including the EnterpriseId (if it is an enterprise). Further, if the user installs the app but your site fails to save the token, too bad. Slack doesn’t give your app permissions; it generates a unique token for the workspace. Even though you already have a clientId and secret, now you have to save additional information for every workspace.

Further, since you can’t know if they have installed the app yet, you always want to send them to the app install flow. Send them to the user login instead, and you’ll end up with a user in your app, but no ability to call any of the Slack APIs.

When the user logs in, you don’t get to know simple information about the workspace, even if the app is already installed, and that information was available during installation. Why? I don’t know.

This means if you want to grant a great experience to your users, they have to log in twice. Same flow, once to install the app, once to just log in.

5. Lack of attention to critical features

Slack apps are lacking in security (until recently there was no support even for refresh tokens). They force the unnecessary generation of tokens for the app to save and then reuse later. If you accidentally expose one of these tokens, there is no way to rotate it. Slack was working on the improvement, and even released it, then told everyone it was the new thing called Workspace Apps. But after telling everyone to use it, they shut it down:

We had a prior app framework called “workspace apps” which was a separate project that required the creation of a new type of app. However, due to the fact that it didn’t take advantage of the current app framework, we decided to deprecate that in favor of porting the features into the existing app framework. We had announced the deprecation of workspace apps in this blog article: https://medium.com/@SlackAPI/aabc9e42a98b

6. Chat.postMessage is not idempotent!

You would think for a messaging service there would be a way to guarantee delivery of messages, every single message, without repeating messages to a user. There isn’t. That’s really the end of the story. It would be so trivial on Slack’s side to store idempotency tokens and then ensure the message isn’t delivered to the user twice, or even do the delivery but hide duplicate messages in the UI, either by code or by hash. There is no such thing. Since Slack apps need to handle multiple retries from Slack on new message events (more on this in a second), idempotent message sending needs to exist in every single Slack app.

So we have a database table specifically to track and handle it. But as everyone knows, that isn’t the right way to handle idempotency (you can’t do it as effectively on the client side), so it is sub-optimal. For us there is one edge case that doesn’t work: message hasn’t been sent yet, failed to deliver message to Slack, failed to set up async processing in AWS, failed to delete the message indicator from our database. It’s 4 failures, and at that point we are confident there is probably something wrong with the world, so failing here results in a CRITICAL exception sent to AWS CloudWatch and then forwarded to our on-call handlers. Not totally sure what we would do in that situation, other than /sigh in our Slack workspace.
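As an illustration of the client-side approach described above, here is a minimal sketch that uses an in-memory dictionary as a stand-in for the database table. The class and function names are hypothetical, and a real table would need an atomic conditional write for the claim step.

```python
from typing import Callable, Dict, Optional


class SentMessageStore:
    """In-memory stand-in for the tracking table: idempotency key -> Slack message ts."""

    def __init__(self) -> None:
        self._rows: Dict[str, Optional[str]] = {}

    def claim(self, key: str) -> bool:
        """Record the intent to send. Returns False if the key was already claimed."""
        if key in self._rows:
            return False
        self._rows[key] = None
        return True

    def complete(self, key: str, ts: str) -> None:
        """Store the timestamp Slack returned for the delivered message."""
        self._rows[key] = ts

    def release(self, key: str) -> None:
        """Remove the indicator so a later retry is allowed to send."""
        self._rows.pop(key, None)


def send_once(store: SentMessageStore, key: str, send: Callable[[], str]) -> bool:
    """Send a message at most once per key. Returns True only if it was sent now."""
    if not store.claim(key):
        return False  # a previous attempt already owns this message
    try:
        ts = send()
    except Exception:
        # If this release also fails, the key stays stuck: the edge case above.
        store.release(key)
        raise
    store.complete(key, ts)
    return True
```

The `send` callable would wrap the actual chat.postMessage call and return the message timestamp from the response.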

7. Fetching messages in a DM does not work with UserId

You can send all the messages you want to a user, using the user’s ID, but you cannot get the list of messages sent there using that same ID. You have to first convert it to a channel ID, using conversations.open, and then pass the result as the parameter. However, even if you manage to get this working, don’t. It turns out the conversations.history endpoint isn’t well supported:

Slack API is unreliable — 75-second peaks on response time

I wouldn’t recommend using this endpoint for anything critical in your app. We made the mistake of doing that, and while most calls are around ~100ms, every 10th or 20th one can be as high as ~5000ms. When you are making a realtime conversation app on top of Slack to handle seamless AI conversations, that’s not a good story. This is where all the performance goes (well, here and GCP DialogFlow, which is a different sort of trainwreck).
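For reference, the two-step lookup described above can be sketched as follows. The `call` parameter is a hypothetical helper that takes a Web API method name and parameters and returns the parsed response; conversations.open and conversations.history are the real method names.

```python
from typing import Any, Callable, Dict, List

# Hypothetical signature: (method name, params) -> parsed JSON response.
SlackCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def fetch_dm_history(call: SlackCall, user_id: str, limit: int = 20) -> List[Dict[str, Any]]:
    """Resolve a user ID to its DM channel ID, then read that channel's history."""
    # Step 1: a user ID is not accepted by conversations.history, so open the DM first.
    opened = call("conversations.open", {"users": user_id})
    channel_id = opened["channel"]["id"]
    # Step 2: fetch messages using the DM channel ID.
    history = call("conversations.history", {"channel": channel_id, "limit": limit})
    return history.get("messages", [])
```

Given the latency spikes mentioned above, a call like this is better kept out of any request path that has to answer within a few seconds.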

8. The unfortunate iOS experience

It’s no secret that I’m not a fan of Apple products, but it seems that Slack might also not be (although I would have ventured a guess they would be, as they are a lot alike). Frequently messages sent by bots are not displayable on iOS devices, or they do display but incorrectly. The experience is inconsistent. Instead of Slack acknowledging “Oh, there are messages displaying wrong, that’s a critical problem”, it’s been much quicker for us to change the type of message we are sending. Why? Because we can move much, much faster than Slack can, with our small team. It’s also something we’ve focused a lot on because we are in the security business, so when we find potential security vulnerabilities, it’s critical to assess and resolve them as soon as possible.

Slack iOS doesn't work — iOS missing messages

You can see where it says (edited), there should be the team’s standup responses there.

9. Unexpected behavior

Ever wonder what happens if you do something not documented? The spoiler is: only ever do documented things. Standup & Prosper APIs (and Authress’s APIs) are fully documented via an OpenAPI specification. If you do something not explicitly handled there, you get an error. Do something not explicitly handled by Slack, and you get weird results 6 months later when something hidden changes.

If your app has the legacy old style bot permissions and not the new hotness granular permission scopes, when you send a message to the user with the flag as_user=true, you get what you want. But if you have the new permissions, then you get an error. So start checking your permissions before posting to APIs, because the word deprecated doesn’t mean what Slack thinks it means. Deprecated to Slack means removed/disabled/deleted.

10. Slack message event delivery retries

Similar to almost all systems that offer webhooks for message delivery, Slack provides 3 retries for every message: one immediately, one at 1 minute, and finally one 5 minutes later. I don’t know the situation where our bot would still care 1 minute later, let alone 5. But you only have 3 seconds to respond to Slack with a “success”, because Slack automatically times out and retries. Given that some of Slack’s APIs take on average more than 3 seconds, you are going to be fighting with this problem until the end of time. Further, 3 seconds isn’t even enough for our lambdas to cold start in some rare cases, but usually the cold start is in 1 second, and then we wait for 3 seconds for Slack to respond.
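A common way to live with the 3-second window is to acknowledge the event immediately, push the real work onto a queue, and drop redeliveries of an event that was already accepted. The sketch below assumes the `event_id` field that Slack includes in the outer event payload; the in-memory set and the `enqueue` callable are stand-ins for a real store and queue, and the function name is hypothetical.

```python
from typing import Any, Callable, Dict, Set


def handle_slack_event(
    body: Dict[str, Any],
    seen_event_ids: Set[str],
    enqueue: Callable[[Dict[str, Any]], None],
) -> Dict[str, Any]:
    """Acknowledge a Slack event right away and defer the actual processing."""
    event_id = body.get("event_id")
    if event_id is not None:
        if event_id in seen_event_ids:
            # A retry of something already accepted: acknowledge and do nothing.
            return {"statusCode": 200, "body": ""}
        seen_event_ids.add(event_id)
    # Hand the work to something asynchronous so the response is not delayed.
    enqueue(body)
    return {"statusCode": 200, "body": ""}
```

Slack also marks redeliveries with an X-Slack-Retry-Num request header, which can be logged to see how often the timeout is being hit.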

That’s not the worst. The worst is that for action events, such as button pushes and response modals, the request isn’t retried and a trigger token is passed with the event. This token is MAGIC. You must have the token to create a dialog with the user, and you want to create dialogs lots of times, because realtime messaging, as I’ve shared above, is really challenging to get right. But that token also only lasts 3 seconds. And here’s the kicker: the timer threshold is when Slack responds OK to your request. So even if you manage to deliver the request to open a dialog in 100ms, if Slack has trouble processing it for another 3 seconds, sorry, it’s expired.

The solution we’ve added is to inform the user that they should try again. Why not provide the retry? No idea. Why not just extend the time to 5 seconds, with the requirement that p99 of all requests has to be under 5 seconds? Why not let the users of the app decide for themselves if they like the experience? Getting a message at 5 seconds is much better than getting an error message at 3 seconds that says “Oops something went wrong”.

11. Challenges with shared channels

Did you know there could be users talking to your bot, or mentioned by other users in messages, that your bot isn’t allowed to know about? For a long time, Standup & Prosper received a confusing error message: user_not_found. The user should have been found; someone in some channel added that user to a standup. The reason is that users could be in a channel from a different workspace, or be guest users. These users could be in the same enterprise account, or totally different accounts. All I can say here is: Good Luck.

Alternatively you might get something like message_not_found, and not because the user deleted it, but because the user that posted the message is no longer in a channel that Standup & Prosper has access to. Even though the bot is still in the shared channel, the user is not, and they are from a different workspace, so messages in the shared DM with the user are no longer accessible.

You were able to follow that, right?

I’ll say that again, you can’t access DM messages from a user a bot has a conversation with if:

That user is from another workspace
That user was in a shared channel, that the bot was also in
And now the user is no longer in the channel
Even if that other workspace has your bot installed (these are separate workspaces with different user IDs after all)

12. Global Uniqueness of user IDs

The introduction of reusable user IDs across workspaces in the same enterprise Slack account is necessary. So thank you Slack! However, these user IDs are not unique across workspaces. It’s impossible to uniquely identify users in Slack, end of story. You can’t join the user ID with the team ID, because a user has the same user ID but different team IDs for different workspaces in the same enterprise account, and different user IDs in different workspaces. The only solution we’ve found is joining the user ID with the DM channel ID; these seem consistent, but I’m sure that will bite us later.

This is something that Discord and every other messaging platform has figured out: you only need one user, but need multiple identities, one for each workspace. Instead of doing the sane thing, Slack has different users for each workspace, unless those workspaces just happen to be in the same enterprise account, in which case here’s a magic exception.
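The workaround of pairing the user ID with the DM channel ID can be expressed as a small composite key. This is only a sketch of the idea; the type name and the example IDs are made up for illustration.

```python
from typing import Dict, NamedTuple


class SlackUserKey(NamedTuple):
    """Composite identity: neither field alone is treated as unique."""

    user_id: str
    dm_channel_id: str


# Hypothetical IDs: the same user ID seen through two different DM channels.
first = SlackUserKey(user_id="U0001", dm_channel_id="D0AAA")
second = SlackUserKey(user_id="U0001", dm_channel_id="D0BBB")

# Tuples are hashable, so the key can index per-user state directly.
standup_state: Dict[SlackUserKey, str] = {
    first: "waiting_for_answer",
    second: "completed",
}
```

Because the key is a plain tuple, it also serializes naturally into a two-part database key.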

13. Breaking changes

The list of breaking changes just goes on and on. Let’s say you have a small bug in your configuration and you want to start listening to a new event type. It’s a critical problem now, with a simple fix, but you can’t easily deploy this. You have to wait for Slack to review your app in its entirety. Here’s the kicker: if Slack changed something in the meantime, now you are also required to make those updates as well. The UI for app management will have changed, and so have the requirements. One time for us, an OAuth configuration update, the “deprecation” of bot scope permissions, and security compliance were all blocking us from a production fix.

When migrations are necessary, Slack assumes you can just do them, have done them, did them, and never offers ways to help migrate that keep everything working before fully removing the deprecated functionality. You are either on the new version or you aren’t; features aren’t backwards compatible. There is such a thing as a migration path, but not with Slack apps.

14. You will hit Slack’s outages before they know about them

I’ve already mentioned a bit about the outages, but what’s even more fundamental is, as an app developer you will hit these before Slack’s users do, because your users are doing more ridiculous things with your app. For example, this was clearly a bug, but we got it:

Then it became a 503, although that was supposedly a different problem.

Our team has been investigating and from our logs it seems like calls are being successful, and we not quite sure why you’re getting a 404. We’re gonna need yet another entry to double-check something on our end.

Why they don’t have logging in place, why they need logs from us, is beyond me. Our support requests involve a lot of time giving Slack their own data back.

15. Incomplete APIs

You will soon find out that it isn't just one or two of the APIs that are a bit problematic, but most are missing something critical from their interfaces. APIs that should have better paging, endpoints returning lists instead of objects, or just generally missing a critical query parameter to make them usable. For instance, want to get a list of messages sent to a channel? No problem, but you have to be in the channel. Makes sense, right? But you can't even get messages that your bot sent. That means you need to duplicate the info about all message data on the bot side, and do that after you sent the message by reading the response (because remember, the Slack message API doesn't support idempotency). Want to know if a user is in a channel? Welcome to paging through the list of all the users in that channel, even if there are 20k, and your pages only return 1k at a time. At small scale these things can work, but when you have a product available for an extended time, small little missing features cause big problems. Are you going to be able to get your 20 API calls in before 3 seconds are over? Definitely not.
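To make the membership check concrete, here is a sketch of the paging loop, assuming the cursor-based pagination of conversations.members (a `members` list plus `response_metadata.next_cursor`). The `call` helper is hypothetical and stands for whatever function performs the Web API request and returns the parsed response.

```python
from typing import Any, Callable, Dict

# Hypothetical signature: (method name, params) -> parsed JSON response.
SlackCall = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def is_user_in_channel(
    call: SlackCall, channel_id: str, user_id: str, page_size: int = 1000
) -> bool:
    """Page through conversations.members until the user is found or pages run out."""
    cursor = ""
    while True:
        params: Dict[str, Any] = {"channel": channel_id, "limit": page_size}
        if cursor:
            params["cursor"] = cursor
        page = call("conversations.members", params)
        if user_id in page.get("members", []):
            return True  # stop early, no need to fetch the remaining pages
        cursor = page.get("response_metadata", {}).get("next_cursor", "")
        if not cursor:
            return False  # an empty cursor means this was the last page
```

With 20k members and 1k per page, the worst case is 20 sequential calls, which is why a check like this cannot sit inside a 3-second response window.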

16. Misguided technical advice

Throughout all your mess in trying to figure out what is going on, you will end up contacting support. If you are lucky enough that they provide you with non-canned responses, you'll want to start questioning if the solutions to working with their APIs are the correct ones. In situations where events from Slack are taking over 20 seconds to be delivered, you are probably wondering why the reliability of their delivery queue wouldn't be improved. But instead of taking that as a point of possible improvement, you'll get advice like:

You'd call the conversations.history API for a specific time-frame at a rate of one request every 2 seconds, and compare the message details returned in the response with what you already collected from the event callbacks that you received/collected.

Sure, that's reasonable on the surface: let's use the conversations.history endpoint, which was already untrustworthy, and call it 30 times a minute per user per workspace. For us that amounts to 10 times a second. However, you'll notice that this is definitely higher than the allowed Tier 3 rate limit of 50+/min; instead it is 600+.
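The arithmetic behind those numbers, written out. The only inputs are the figures quoted above (one request every 2 seconds, 10 calls a second for us, and the 50 per minute Tier 3 limit); the number of concurrently polled conversations is derived from them and is not a measured value.

```python
POLL_INTERVAL_SECONDS = 2        # "one request every 2 seconds" from the advice
OBSERVED_CALLS_PER_SECOND = 10   # what that polling amounts to for us
TIER_3_LIMIT_PER_MINUTE = 50     # the rate limit quoted above

# Polling a single conversation every 2 seconds.
calls_per_minute_per_conversation = 60 // POLL_INTERVAL_SECONDS  # 30

# Total polling volume per minute.
total_calls_per_minute = OBSERVED_CALLS_PER_SECOND * 60  # 600

# Derived: how many conversations would be polled at the same time.
implied_conversations = total_calls_per_minute // calls_per_minute_per_conversation  # 20

# How far over the quoted limit that volume is.
times_over_limit = total_calls_per_minute / TIER_3_LIMIT_PER_MINUTE  # 12.0
```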

This does show that the support is much better than at tech companies with outsourced customer support, and much better than at companies like Google, where the first response is "No you're wrong, read the manual first". But it comes nowhere close to the support for highly technical products such as AWS.

17. Tokens and configuration will overwhelm you

Even if Slack doesn’t make any changes, and they will, you will make mistakes because of confusion and complexity. The App UI and API documentation are so poor that when there is a change, you won’t be sure if clicking one small button in the UI will cause you to lose all your users.

Your code will become confusing to maintain. There are some critical design choices that needed to be fixed as soon as possible, but instead of providing a migration, Slack leaves those problems in place for the next app developer.

For example, assume you lost a bot token or, worse, the app workspace or Slack breaks your token for that workspace. There is no way to force uninstall the app nor get a new token. That's right, your installation stays stuck in broken mode until the workspace admin uninstalls and then reinstalls your app. And no, Slack support doesn't really help resolve this problem.

Wrapping up

I hope these sorts of issues help future engineers of bots and services better realize what Total Cost of Ownership really means when they build technology themselves. It isn’t just the features and feedback from users; it is just as important that the platform you build on is simple and polished. It’s a critical part of what you are building, how you are building, and what you spend your time on.

These sorts of problems all exist in the world of the developer experience. And experiences like these for our team at Rhosys have led us to not only avoid them but ensure that the developer experience in security tools such as Authress works out of the box. We know if it isn’t simple, straightforward, and easy, then users will be unhappy, developers will be unhappy, and personally I will be unhappy. Although I don’t always need an excuse for that last one.
