The Bike Shed

396: Build vs. Buy

Aug 8 '23

Joël has been fighting a frustrating bug where he's integrating with a third-party database, and some queries just crash. Stephanie shares her own debugging story about a leaky stub that caused flaky tests.

Additionally, they discuss the build vs. buy decision when integrating with third-party systems. They consider the time and cost implications of building their own integration versus using off-the-shelf components and conclude that the decision often depends on the specific needs and priorities of the project, including how quickly a solution is needed and whether the integration is core to the business's value proposition.

Transcript:

STEPHANIE: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I'm Stephanie Minn.

JOËL: And I'm Joël Quenneville. And together, we're here to share a bit of what we've learned along the way.

STEPHANIE: So, Joël, what's new in your world?

JOËL: My world has been kind of frustrating recently. I've been fighting a really frustrating bug where I'm integrating with a third-party database. And there are queries that just straight-up crash. Any query that instantiates an instance of an ActiveRecord object will just straight-up fail. And that's because before, we make the actual query, almost like a preflight query that fetches the schema of the database, particularly the list of tables that the database has, and there's something in this schema that the code doesn't like, and everything just crashes.

Specifically, I'm using an ODBC connection. I forget exactly what the acronym stands for, Open Database connection, maybe? Which is a standard put up by Microsoft. The way I'm integrating it via Ruby is there's a gem that's a C extension. And somewhere deep in the C extension, this whole thing is crashing. So, I've had to sort of dust off some C a little bit to look through. And it's not super clear exactly why things are crashing. So, I've spent several days trying to figure out what's going on there. And it's been really cryptic.

STEPHANIE: Yeah, that does sound frustrating. And it seems like maybe you are a little bit out of your depth in terms of your usual tools for figuring out a bug are not so helpful here.

JOËL: Yeah, yeah. It's a lot harder to just go through and put in a print or a debug statement because now I have to recompile some C. And, you know, you can mess around with some things by passing different flags. But it is a lot more difficult than just doing, like, a bundle open and binding to RB in the code.

My ultimate solution was asking for help. So, I got another thoughtboter to help me, and we paired on it. We got to a solution that worked. And then, right before I went to deploy this change, because this was breaking on the staging website, I refreshed the website just to make sure that everything was breaking before I pushed the fix to see that everything is working. This is a habit I've picked up from test-driven development. You always want to see your test break before you see it succeed.

And this is a situation where this habit paid off because the website was just working. My changes were not deployed. It just started working again. Now it's gotten me just completely questioning whether my solution fixes anything. The difficulty is because I am integrating [inaudible 03:20] third-party database; it's non-deterministic. The schema on there is changing rather frequently.

I think the reason things are crashing is because there's some kind of bad data or data that the ODBC adapter doesn't like in this third-party system. But it just got introduced one day; everything started breaking, and then somehow it got removed, and everything is working again without any input or code changes on my end. So, now I don't trust my fix.

STEPHANIE: Oh no. Yeah, I would struggle with that because your reality has come crashing down, [laughs] or how you understood reality. That's tough. Where do you think you'll go from here? If it's no longer really an issue in this current state of the schema, is it worth pursuing further at this time?

JOËL: So, that's interesting because it turns into a prioritization problem. And for this particular project, with the deadlines that we have, we've decided it's not worth it. I've opened up a PR with my fix, with some pretty in-depth documentation for why I thought that was the fix and what I think the underlying problem is. If this shows up again in the next few days, I'll have that PR that I can pull in and see if it fixes things, and if it doesn't, I'll probably just close that PR, but it'll be available for us if we ever run into this again.

I've also looked at a few potential mitigating situations. Part of the problem is that this is a, like, massive system. The Rails app that I'm using really doesn't need to deal with this massive database. I think there's, you know, almost 1,000 tables, and I really only care about a subset of tables in, like, one underlying schema. And so, I think by reducing the permissions of my database user to only those tables that I care about, there's a lower chance of me triggering something like this.

STEPHANIE: Interesting. What you mentioned about, you know, having that PR continue to exist will be really helpful for future folks who might come across the same problem, right? Because then they can see, like, all of the research and investigation you've already done. And you may have already done this, but if you do think it's a schema issue, I'm curious about whether the snapshot of the schema could be captured from when it was failing to when it has magically gotten fixed. And I wonder if there may be some clues there for some future investigator.

JOËL: Yeah. I'm not sure what our backup situation is because this is a third-party system, so I'd have to figure out what things are like in the admin interface there. But yeah, if there is some kind of auditing, or snapshots, or backups, or something there, and I have rough, you know, if I know it's within a 24-hour period, maybe there's something there that would tell me what's happening.

My best guess is that there's some string that is longer than expected or maybe being marked as a CHAR when it should be a VARCHAR, or maybe something that's not a non-UTF-8 encoded character, or something weird like that. So, I never know exactly what was wrong in the schema. There's some weird string thing happening that's causing the Ruby adapter to blow up.

STEPHANIE: That also feels so unsatisfying [laughs] for you. I could imagine.

JOËL: Yeah, there's no, like, clean resolution, right? It's a, well, the bug is gone for now. We're trying to make it less likely for it to pop up again in the future. I'm trying to leave some documentation for the next person who's going to come along, and I'm moving forward, fingers crossed. Is that something you've ever had to do on one of your projects?

STEPHANIE: Given up? Yes. [laughter] I think I have definitely had to learn how to timebox debugging and have some action items for when I just can't figure it out. And, you know, like we mentioned, leaving some documentation for the next person to pick up, adding some additional logging so that maybe we can get more clues next time. But, you know, realizing that I do have to move on and that's the best that I can do is really challenging.

JOËL: So, you used two words here to describe the situation: one was giving up, and the other one was timebox. I think I really like the idea of describing this as timeboxing. Giving up feels kind of like, defeatist. You know, there's so many things that we can do with our time, and we really have to be strategic with how we prioritize. So, I like the idea of describing this as a timeboxing situation.

STEPHANIE: Yeah, I agree. Maybe I should celebrate every time that I successfully timebox something [laughs] according to how I planned to. [laughs]

JOËL: There's always room to extend the timebox, right?

STEPHANIE: [laughs] It's funny you bring up a debugging mystery because I have one of my own to share today. And I do have to say that it ended up being resolved, [chuckles] so it was a win in my book. But I will call this the case of the leaky stub.

JOËL: That sounds slightly scary.

STEPHANIE: It really was. The premise of what we were trying to figure out here was that we were having some flaky tests that were failing with a runtime error, so that was already kind of interesting. But it was quickly determined it was flaky because of the tests running in a certain order, so--

JOËL: Classic.

STEPHANIE: Right. So, I knew something was happening, and any tests that came after it were running into this error. And I was taking a look, and I figured out how to recreate it. And we even isolated to the test itself that was running before everything else, that would then cause some problems. And so, looking into this test, I saw that it was stubbing the find method on an ActiveRecord model.

JOËL: Interesting.

STEPHANIE: Yeah. And the stubbed value that we were choosing to return ended up being referenced in the tests that followed. So, that was really strange to me because it went against everything I understood about how RSpec cleans up stubs between tests, right?

JOËL: Yeah, that is really strange.

STEPHANIE: Yeah, and I knew that it was referencing the stub value because we had set a really custom, like, ID value to it. So, when I was seeing this exact ID value showing up in a test that seemed totally unrelated, that was kind of a clue that there was some leakage happening.

JOËL: So, what did you do next?

STEPHANIE: The next discovery was that the error was actually raised in the factory setup for the failing tests and not even getting to running the examples at all. So, that was really strange. And digging into the factories was also its own adventure because there was a lot of complexity in the factories. A lot of them used hooks as well that then called some application code. And it was a wild goose chase.

But ultimately, I realized that in the factory setup, we were calling some application code for that model where we had stubbed the find, and it had used the find method to memoize a class instance variable.

JOËL: Oh no. I can see where this is going.

STEPHANIE: Yeah. So, at some point, our model.find() returned our, you know, stub value that we had wanted in the previous test. And it got cached and just continued to leak into everything else that eventually would try to call that memoized method when it really should have tried to do that look-up for a separate record.

JOËL: And class instance variables will persist between tests as long as they're on the same thread, right?

STEPHANIE: Yeah, as far as I understand it.

JOËL: That sounds like a really frustrating journey. And then that moment when you see the class instance variable, and you're like, oh no, I can't believe this is happening.

STEPHANIE: Right? It was a real recipe for disaster, I think, where we had some, you know, really complicated factories. We had some sneaky caching issues, and this, you know, totally seemingly random runtime error that was being raised. And it was a real wild goose chase because there was not a lot of directness in going down the debugging path. I feel like I went around all over the codebase to get to the root of it. And, in the end, you know, we were trying to come up with some takeaways.

And what was unfortunate was that you know, like, normally, stubbing find can be okay if you are, you know, really wanting to make sure that you are returning your mocked value that you may have, like, stubbed some other stuff on in your test. But because of all this, we were like, well, should we just not stub find on this really particular model? And that didn't seem particularly sustainable to make as a takeaway for other developers who want to avoid this problem.

So, in the end, I think we scoped the stub to be a little more specific with the arguments that we wanted to target. And that was the way that we went forward with the particular flaky test at hand.

JOËL: It sounds like the root cause of the problem was not so much the stub as it was the fact that this value is getting cached at the class level. Is that right?

STEPHANIE: Yes and no. It seems like a real pain for running the tests. But I'm assuming that it was done for a good reason in production, maybe, maybe not. To be fair, I think we didn't need to cache it at all because it's calling a find, which is, you know, should be pretty quick and doesn't need to be cached. But who knows? It's hard to tell. It was really old code. And I think we were feeling also a little nervous to adjust something that we weren't sure what the impact would be.

JOËL: I'm always really skeptical of caching. Caching has its place. But I think a lot of developers are a little too happy to introduce one, especially doing it preemptively that, oh well, we might need a cache here, so why not? Let's add that. Or even sometimes, just as a blind solution to any kind of slowness, oh, the site is slow; let's throw a cache here and hope for the best.

And the, like, bedrock, like, rule zero of any kind of performance tuning is you've got to measure before and after and make sure that the change that you introduce actually makes things better. And then, also, is it better enough speed-wise that you're willing to pay any kind of costs associated to maintaining the code now that it's more complex? And a lot of caches can have some higher carrying costs.

STEPHANIE: Yeah, that's a great point. This debugging mystery an example of one of them.

JOËL: How long did it take you to figure out the solution here?

STEPHANIE: So, like you, I actually was on a bit of the incorrect path for a little while. And it was only because this issue affected a different flaky test that someone else was investigating that they were able to connect the dots and be like, I think these, you know, two issues are related. And they were the ones who ultimately were able to point us out to the offending test if you will. So, you know, it took me a few days. And I imagine it took the other developer a few days. So, our combined effort was, like, over a week.

JOËL: Yep. So, for all our listeners out there, you just heard that Stephanie and I [laughs] both went on multi-day debugging journeys. That happens to everyone. Just because we've been doing this job for years doesn't mean that every bug is, like, a thing that we figure out immediately.

So, separately from this bug that I've been working on, a big issue that's been front of mind for me on this project has been the classic build versus buy decision. Because we're integrating with a third-party system, we have to look at either building our own integration or trying to use some off-the-shelf components. And there's a few different levels of this.

There are some parts where you can actually, like, literally buy an integration and think through some of the decisions there. And then there's some situations where maybe there's an open-source component that we can use. And there's always trade-offs with both the commercial and the open-source situation. And we have to decide, are we willing to use this, or do we want to build our own? And those have been some really interesting discussions to have.

STEPHANIE: Yeah. I think you actually expanded this decision-making problem into a build versus buy versus open source because they are kind of, you know, really different solutions with different outcomes in terms of, you know, maintenance and dependencies, right? And that all have, like, a little bit of a different way to engage with them.

JOËL: Interesting. I think I tend to think of the buy category, including both like commercial off-the-shelf software and also open-source off-the-shelf software, things that we wouldn't build custom for ourselves but that are third-party components that we can pull in.

STEPHANIE: Yeah, that's interesting because I had a bit of a different mental model because, in my head, when you're buying a commercial solution, you, you know, are maybe losing out on some opportunities for customization or even, like, forking it on your own. So, with an open-source solution, there could be an aspect of making it work for you. Whereas for a commercial solution, you really become dependent on that other company and whether they are willing to cater [laughs] to your needs or not.

JOËL: That's fair. For something that's closed-source where you don't actually have access to the code, say it's more of a software as a service situation, then, yeah, you're kind of locked in and hoping that they can provide the needs that you have. On the flip side, you are generally paying for some level of support. The quality of that varies sometimes from one vendor to another. But if something goes wrong, usually, there's someone you can email, someone you can call, and they will tell you how to fix the problem, or they will fix it on their end.

STEPHANIE: For the purposes of this conversation, should we talk about the differences, you know, building yourself or leaning on an existing built-out solution for you?

JOËL: The project I'm working on is integrating with a Snowflake data warehouse, which is an external place that stores data accessible through something SQL-like. And one of the things that's attractive about this is that you can pull in data from a variety of different sources, transform it, and have it all stored in a kind of standardized structure that you can then integrate with. So, for pulling data in, you can build your own sort of ingestion pipeline, if you want, with code, and their APIs, and things.

But there are also third-party vendors that will give you kind of off-the-shelf components that you can use for a lot of popular other data sources that you might want to pull. So, you're saying; I want to pull from this external service. They've probably got a pre-built connector for it. They can also do things like pull from an arbitrary Postgres database on some other server if that's something you have access to.

It becomes really attractive because all you need to do is create an account on this website, plug in a few, like, API keys and URLs. And, all of a sudden, data is just flowing from one third-party system into your Snowflake data Warehouse, and it all just kind of works. And you don't have to bother with APIs, or ODBC, or any of that kind of stuff.

STEPHANIE: Got it. Yeah, that does sound convenient. As you were talking about this, I was thinking about how if I were in the position of trying to decide how to make that integration happen, the idea of building it would seem kind of scary, especially if it's something that I don't have a lot of expertise in.

JOËL: Yeah, so this was really interesting. In the beginning of the project, I looked into a little bit of what goes into building these, and it's fairly simple in terms of the architecture. You just need something that writes data files to typically something like an S3 bucket. And then you can point Snowflake to periodically pull from that bucket, and you write an import script to, you know, parse the columns and write them to the right tables in the structure that you want inside Snowflake.

Where things get tricky is the actual integration on the other end. So, you have some sort of third-party service. And now, how do you sort of, on a timer maybe, pull data from that? And if there are data changes that you're synchronizing, is it just all append-only data? Or are you allowing the third-party service to say, "Hey, I deleted this record, and you should reflect that in Snowflake?" Or maybe dealing with an update. So, all of these things you have to think about, as well as synchronization.

What you end up having to do is you probably boot up some kind of small service and, you know, maybe this is a small Ruby app that you have on Heroku, maybe this is, like, an AWS Lambda kind of thing. And you probably end up running this every so many seconds or so many minutes, do some work, potentially write some files to S3. And there's a lot of edge cases you have to think about to do it properly. And so, not having to think about all of those edge cases becomes really enticing when you're looking to potentially pay a third party to do this for you.

STEPHANIE: Yeah, when you used the words new service, I bristled a little bit [laughs] because I've definitely seen this happen maybe on a bit of a bigger scale for a tool or solution for some need, right? Where some team is formed, or maybe we kind of add some more responsibilities to an existing team to spin up a new service with a new repo with its own pipeline, and it becomes yet another thing to maintain. And I have definitely seen issues with the longevity of that kind of approach.

JOËL: The idea of maintaining a fleet of little services for each of our integrations seemed very unappealing to me, especially given that setting something like this up using the commercial approach probably takes 30 minutes per third-party service. There's no way I'm standing up an app and doing this whole querying every so many minutes, and getting data, and transforming it, and writing it to S3, and addressing all the edge cases in 30 minutes. And it's building something that's robust.

And, you know, maybe if I want to go, like, really low tech, there's something fun I could do with, like, a Zapier hook and just, like, duct tape a few services together and make this, like, a no-code solution. I still don't know that it would have the robustness of the vendor. And I don't think that I could do it in the same amount of time.

STEPHANIE: Yeah. I like the keyword robustness here because, at first, you were saying, like, you know, this looked relatively small in scope, right? The code that you had to write. But introducing all of the variables of things that could go wrong [laughs] beyond the custom part that you actually care about seemed quite cumbersome.

JOËL: I think there's also, at this point, a lot of really interesting prioritization questions. There are money questions, but there are also time questions you have to think about. So, how much dev time do we want to devote upfront to building out these integrations? And if you're trying to move fast and get a proof of concept out, or even get, like, an MVP out in front of customers, it might be worth paying more money upfront to a third-party vendor because it allows you to ship something this week rather than next month.

STEPHANIE: Yeah. The "How soon do you need it?" is a very good question to ask. Another one that I have learned to include in my arsenal of, you know, evaluating this kind of stuff comes from a thoughtbot blog by Josh Clayton, where he, you know, talks about the build versus buy problem. And his takeaway is that you should buy when your business is not dependent on it.

JOËL: When it's not part of, like, the core, like, value-add that your business is doing. Why spend developer time on something that's not, like, the core thing that your product is when you can pay someone else to do it for you? And like we said earlier, a lot of that time ends up being sunk into edge cases and robustness and things like that to the point where now you have to build an expertise in a, like, secondary thing that your business doesn't really care about.

STEPHANIE: Yeah, absolutely. I think this is also perhaps where very clear business goals or a vision would come in handy as well. Because if you're considering building something that doesn't quite support that vision, then it will likely end up continuing to be deprioritized over the long term until it becomes this thing that no one is accountable for maintaining and caring for.

And just causes a lot of, honestly, morale issues is what I've seen when some service that was spun up to try to solve a particular problem is kind of on its last legs and has been really neglected, and no one wants to work on it. But it ends up causing issues for the rest of the development team. But then they're also really focused on initiatives that actually do provide the business value. That is a really hard balancing act that I've seen teams struggle with.

JOËL: Earlier this year, we were talking about the book Sustainable Rails. And it really hammers home the idea of a carrying cost for the code, and I think that's exactly what we're talking about here. And that carrying cost can be time and money. But I like that you also mentioned the morale effects. You know, that's a carrying cost that just sort of depresses the productivity of your team when morale is low.

STEPHANIE: Yeah, absolutely. I'm curious if we could discuss some of the carrying costs of buying a solution and where you've seen that become tricky.

JOËL: The first thing to look at is the literal cost, the money aspect of things. And I think it's a really interesting situation for the business models for these types of Snowflake connectors because they typically charge by the amount of data that you're transmitting, so per row of data that you're transmitting. And so, that cost will fluctuate depending on whether the third-party service you're integrating with is, like, really chatty or not.

When you contrast that to building, building typically has a relatively fixed cost. It's a big upfront cost, and then there's some maintenance cost to go with it. So, if I'm building some kind of integration for, let's say, Shopify, then there's the cost I need to build up front to integrate that. And if that takes me, I don't know, a week or two weeks, or however long it is, you know, that's a pretty big chunk of time. And my time is money.

And so, you can actually do the math and say, "Well, if we know that we're getting so many rows per day at this rate from the commercial vendor, how many weeks do we have to pay for the commercial one before we break even and it becomes more expensive than building it upfront, just in terms of my time?" And sometimes you do that math, and you're like, wow, you know, we could be going on this commercial thing for, like, two years before we break even. In that case, from a purely financial point of view, it's probably worth paying for that connector.

And so, now it becomes really interesting. You say, okay, well, which are the connectors that we have that are low volume, and which are the ones that are high volume? Because each of them is going to have a different break-even point. The ones that you break even after, you know, three or four weeks might be the ones that become more interesting to have a conversation about building. Whereas some of the others, it's clearly not worth our time to build it ourselves.

STEPHANIE: The way you described this problem was really interesting to me because it almost sounds like you found the solution somewhere in the middle, potentially, where, you know, you may try building the ones that are highest priority, and you end up learning a lot from that experience, right? That could make it easier or at least, like, set you up to consider doing that moving forward in the future if you find, like, that is what is valuable.

But it's interesting to me that you kind of have the best of both worlds of, like, getting the commercial solutions now for the things that are lower value and then doing what you can to get the most out of building a solution.

JOËL: Yeah. So, my final recommendation ended up being, let's go all commercial for now. And then, once we've built out something, and because speed is also an issue here, once we've built out something and it's out with customers, and we're starting to see value from this, then we can start looking at how much are we paying per week for each of these connectors? And is it worth maybe going back and building our own for some of these higher-volume connectors? But starting with the commercial one for everything.

STEPHANIE: Yeah, I actually think that's generally a pretty good path forward because then you are also learning about how you use the commercial solution and, you know, which features of it are critical so that if you do eventually find yourselves, like, maybe considering a shift to building in-house, like, you could start with a more clear MVP, right? Because you know how your team is using an existing product and can focus on the parts that your business are dependent on.

JOËL: Yeah, it's that classic iterative development style. I think here it's also kind of inspired by a strategy I typically use for performance, which is make it work before you try to make it fast. And, actually, make it work, then profile, then measure, find the hotspots, and then focus on making those things fast. So, in this case, instead of speed, we're talking about money. So, it's make it work, then profile, find the parts that are expensive, and make the trade-off of, like, okay, is it worth investing into making that part less expensive in terms of resources?

STEPHANIE: I like that as a framework a lot.

JOËL: A lot of what we do as programmers is optimization, right? And sometimes, we're optimizing for execution time. Sometimes we're optimizing for memory cost, and sometimes we're optimizing for dollars.

STEPHANIE: Yeah, that's really interesting because, with the buy solution, you know very clearly, like, how much the thing will cost. Whereas I've definitely seen teams go down the building route, and it always takes longer than expected [laughs], and that is money, right? In terms of the developer's time, for sure.

JOËL: Yeah, definitely, like, add some kind of multiplier when you're budgeting out that build alternative because, quite likely, there are some edge cases that you haven't thought about that the commercial partner has, and you will have to spend more time on that than you expected.

STEPHANIE: Yeah, in addition to whatever opportunity cost of not working on something that is driving revenue for the business right now.

JOËL: Exactly.

STEPHANIE: So, the direction of this conversation ended up going kind of towards, like, what is best for the team at, like, a product and company level. But I think that we make these decisions a lot more frequently, even when it comes to whether we pull in a gem or, you know, use an open-source tool or not. And I would be really interested in discussing more of that in another episode.

JOËL: Yeah. That gets into some controversial takes, right? It's the evergreen topic of: do we build it ourselves, or do we pull in some kind of third-party package?

STEPHANIE: Something for the future to look forward to. On that note, shall we wrap up?

JOËL: Let's wrap up.

STEPHANIE: Show notes for this episode can be found at bikeshed.fm.

JOËL: This show has been produced and edited by Mandy Moore.

STEPHANIE: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review in iTunes. It really helps other folks find the show.

JOËL: If you have any feedback for this or any of our other episodes, you can reach us @_bikeshed, or you can reach me @joelquen on Twitter.

STEPHANIE: Or reach both of us at hosts@bikeshed.fm via email.

JOËL: Thanks so much for listening to The Bike Shed, and we'll see you next week.

ALL: Byeeeeeeeee!!!!!

ANNOUNCER: This podcast is brought to you by thoughtbot, your expert strategy, design, development, and product management partner. We bring digital products from idea to success and teach you how because we care. Learn more at thoughtbot.com.

Support The Bike Shed

Episode source

The Bike Shed Follow

396: Build vs. Buy

The Bike Shed