
Dane Hillard for ITHAKA


Choose Boring Releases

Academic research increasingly relies on diverse content types, including gray literature and primary source materials, alongside traditional peer-reviewed works. ITHAKA supports this evolving research landscape through initiatives that foster cross-content connections on JSTOR, our digital library that supports research, teaching, and learning. These include infrastructure services that enable institutions to make their digital archives and special collections discoverable on the platform, and a years-long effort to integrate Artstor — a vast collection of images and multimedia for educational use — onto the JSTOR platform.


Albrecht Dürer (German, Nuremberg 1471–1528 Nuremberg). “Alberti Dvreri Pictoris et Architecti Praestantissimi De Vrbibvs…” 1535. Illustrated book, 78 pp.; H: 13 3/4 in. (35 cm). The Metropolitan Museum of Art. https://jstor.org/stable/community.34718827. Just one of the millions of items now available on JSTOR.

The Artstor migration to the JSTOR platform culminated on August 1, 2024, when we needed to smoothly guide users from the Artstor Image Workspace (AIW) at library.artstor.org to their counterparts on www.jstor.org. The amount of work this required was daunting: the AIW platform was as large and complex as JSTOR, and it covered many product areas whose ownership was spread across several teams. To achieve this transition without drawing focus away from other objectives, we had to adopt a strategic approach.

The challenge

Our two main constituents for this transition were our users — primarily art history researchers and faculty — and search engine web crawlers. Like JSTOR users, many Artstor users start their searches outside of our platforms and need to be able to find things no matter what platform changes we make.

The first piece of the challenge was technical: the AIW platform relied heavily on client-side routing. That routing wasn’t built with web crawlers in mind, so many modern web crawlers couldn’t handle it well. Worse, because it used hash-based routing rather than the history API, those client-side routes were generally invisible to our servers. Some pages did support server-side rendering so web crawlers could index them, which gave us an additional scenario to support.
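To make that wrinkle concrete: everything after the “#” in a URL stays in the browser and is never sent in the HTTP request, so a server on its own can never see or redirect a hash route. One way to bridge that gap is a small client-side shim on the old origin. Here is a minimal sketch; the route patterns and destination URLs are illustrative, not our actual mapping:

```typescript
// Hypothetical shim served from the old AIW origin. Only code running in the
// browser can see location.hash, so this translation has to happen client-side.
const hashRedirects: Array<[RegExp, (m: RegExpMatchArray) => string]> = [
  // e.g. https://library.artstor.org/#/asset/34718827 -> an item page on JSTOR (illustrative path)
  [/^#\/asset\/(\d+)$/, (m) => `https://www.jstor.org/stable/community.${m[1]}`],
  // e.g. https://library.artstor.org/#/collection/87654 -> a collection page (illustrative path)
  [/^#\/collection\/(\d+)$/, (m) => `https://www.jstor.org/site/collection/${m[1]}`],
];

for (const [pattern, toUrl] of hashRedirects) {
  const match = window.location.hash.match(pattern);
  if (match) {
    // replace() keeps the pre-redirect URL out of the browser history.
    window.location.replace(toUrl(match));
    break;
  }
}
```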

The second challenge was human: Domain knowledge about the details of all this routing and the user value of various pages was spread throughout the organization, and the people with the domain knowledge didn’t always have the technical knowledge to deal with the redirects.

Finally, there was the sheer scope of the work to be done. AIW itself was a huge platform, but on top of that we also had to redirect supporting areas like the marketing and support sites. Very early on, we had to set some parameters to rein in the scope:

  • We wouldn’t guarantee or aim for 100% of traffic to be redirected. Some pages and requests simply couldn’t be redirected anywhere reasonable, and redirecting to generic informational pages can lead to search engine ranking penalties. More importantly for our user focus, it would be disorienting to click a bookmark or a search engine result and get redirected to something irrelevant.
  • We would create a central, extensible service designed for loose coupling of the URL patterns to match and the behavior when such a match occurred. This would reduce unnecessary coupling and dependencies between teams as well as the context those teams had to hold onto during the migration project.
  • We would enable teams to implement redirects for the broadest-used page types (such as individual items), but would equally enable implementing redirects for the long tail of irregular page types. This would create a familiar implementation pattern for teams that they wouldn’t have to abandon for more exotic use cases.
  • We would enable visibility into the behavior of the system as early as possible, so that we could observe the trend over time. This fostered an experimentation mindset: We could develop hypotheses about how the traffic would be handled, implement a change, and quickly confirm or disconfirm the desired effect after deploying the new change. This also gave us opportunities to leverage parts of our deployment platform we hadn’t exercised much previously, adding some greenfield work to an otherwise brownfield kind of activity.

So instead of perfection, we defined our scope as removing as much risk as possible and building in as much confidence as we could ahead of our launch date. Our goal was to coast through the finish line instead of scrambling.

The approach

So how did we do that? A few key tactics served us well throughout this project.

Practice product-minded engineering

We approached the project using product-minded engineering principles, which pair user-focused agile techniques with strong cross-disciplinary skills to identify problems worth solving. By considering user needs and architectural needs together, we arrived at a solution closer to a global optimum, without oversimplifying or overengineering.

Use traffic shadowing

Traffic shadowing allows inbound requests to be sent to two or more destinations, each able to take some action. Although one system is still responsible for sending a response back to the user, other systems can also make observations about or take action on requests. Our Capstan platform made this easy to try, and it worked very well. We continued serving AIW pages to users while also sending requests to our new redirect service, so we could see how it would behave and perform.
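Capstan handled the mechanics for us, but the core idea fits in a few lines. Here is a rough sketch of request mirroring as Express-style middleware, with a hypothetical shadow hostname rather than our actual configuration:

```typescript
import express from "express";

const SHADOW_TARGET = "https://aiw-redirect-service.internal.example"; // hypothetical shadow destination
const app = express();

// Send a fire-and-forget copy of every inbound request to the shadow service.
// The primary (AIW) handlers still produce the response users actually see,
// so shadow latency or failures never affect them.
app.use((req, _res, next) => {
  fetch(`${SHADOW_TARGET}${req.originalUrl}`, {
    method: req.method,
    headers: { "x-shadowed-from": "library.artstor.org" },
    // Request bodies are omitted for brevity; GET traffic dominates here anyway.
  }).catch(() => {
    // A failed shadow request is an observability gap, not a user-facing error.
  });
  next();
});

// ...the existing AIW routes register below as usual.
```

That asymmetry is the whole point: the shadow side can be wrong, slow, or half-built, and users never notice, which is exactly what you want while the redirect rules are still being written.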

Measure early and often

We built metrics into the redirect service so that, as we implemented and deployed redirect behaviors, we could assess the total traffic volume and understand what portion of it was being handled by a known redirect. That was a huge confidence builder: we could demonstrate not only that the service should redirect the implemented URLs once turned on, but that it was in fact already doing so for the shadow traffic.
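The measurement itself doesn’t have to be fancy. A single counter labeled by outcome, something like the sketch below (using prom-client as a stand-in for whatever metrics client you already run; the metric name is made up), is enough to chart handled versus unhandled traffic over time:

```typescript
import { Counter } from "prom-client";

// One counter, labeled by what the redirect service decided to do with a request.
const redirectDecisions = new Counter({
  name: "aiw_redirect_decisions_total", // hypothetical metric name
  help: "Inbound AIW requests, labeled by whether a redirect rule matched",
  labelNames: ["outcome"],
});

export function recordDecision(matchedRule: string | null): void {
  // "matched" traffic is already covered by an implemented redirect;
  // "unmatched" traffic is the long tail still waiting on a decision.
  redirectDecisions.inc({ outcome: matchedRule ? "matched" : "unmatched" });
}
```

Plotted over time, that one ratio is the “portion of traffic handled by a known redirect” figure described above.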

Use feature flags

Where we couldn’t implement traffic shadowing for some supporting pages, we used feature flagging. With feature flags, we can ship two different behaviors and then allow staff and automated tests to toggle those behaviors on or off before exposing that new behavior to users. Even under normal circumstances we have many teams developing across many areas of the platform, and a feature flagging strategy brings speed and safety by ensuring we don’t have to perform a “big bang” release at the very end of a project.
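The shape of a flag-gated redirect is the familiar one; the sketch below uses a hypothetical flag client, flag name, legacy handler, and destination. The important part is that staff and automated tests can exercise the new path long before users see it:

```typescript
import express from "express";
// Both of these imports are hypothetical: substitute your own flag client and legacy handler.
import { isEnabled } from "./featureFlags";
import { renderLegacySupportPage } from "./legacySupport";

const app = express();

app.get("/support/*", async (req, res) => {
  if (await isEnabled("aiw-support-redirects", req)) {
    // New behavior: send traffic to the corresponding page on the new support site.
    // 302 (temporary) while the flag can still be flipped back; browsers cache 301s aggressively.
    res.redirect(302, `https://support.jstor.org${req.path.replace(/^\/support/, "")}`);
  } else {
    // Old behavior: keep serving the legacy page until the flag is on for everyone.
    renderLegacySupportPage(req, res);
  }
});
```

One practical wrinkle worth noting: permanent (301) redirects are cached aggressively by browsers and crawlers, so a temporary (302) redirect is the safer default while a behavior is still reversible behind a flag.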

Think in transition architectures

In effect, all of the redirect architecture we built to support this migration was a transition architecture. But we also built some smaller transition systems to move the control of traffic from external vendors into our own platform, making it easier to flip a switch on release day instead of having to handle several moving parts in a row.

Easing the transition architecture burden made a big difference on several of our support sites. For example, the DNS records for these sites initially pointed directly at third-party vendors; before we could control the traffic, the traffic would have to come to us. If we waited until release day to make that change, DNS propagation could take longer than expected, unforeseen consequences could surface, or the logic we implemented could simply be wrong. Instead, we designed these risks away: we brought the DNS under our control earlier, and although it initially continued to point to the third-party vendors, we could now change its behavior immediately.

Create a loosely coupled design

Because we had to collect and implement so much widely scattered domain knowledge into these redirects, we had to build a system that made it easy to contribute and hard to cross wires or step on other people’s toes. The system we created allowed teams to come in with nothing more than a URL pattern to match on AIW and a destination URL to send the traffic to. If necessary, they could also contact downstream services for data to help them decide where to redirect that traffic.
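In practice, that contract can be as small as a pattern plus a destination, with an optional data lookup in between. Here is a sketch of what a team-contributed rule might look like; the interface, URL shapes, and lookup endpoint are illustrative rather than our actual service code:

```typescript
// The contract a contributing team fills in: match a legacy AIW URL,
// optionally consult a downstream service, and name the destination.
interface RedirectRule {
  /** Pattern to match against the incoming AIW path. */
  pattern: RegExp;
  /** Compute the destination, possibly after fetching supporting data; null means "don't redirect". */
  destination: (match: RegExpMatchArray) => Promise<string | null>;
}

// Example rule a team might contribute for image-group pages.
// The URL shapes and the lookup endpoint are hypothetical.
const imageGroupRule: RedirectRule = {
  pattern: /^\/group\/([a-f0-9-]+)/,
  destination: async ([, groupId]) => {
    const res = await fetch(`https://lookup.internal.example/groups/${groupId}`);
    if (!res.ok) return null; // no reasonable destination: let the request fall through
    const { jstorId } = await res.json();
    return `https://www.jstor.org/site/collection/${jstorId}`; // illustrative destination
  },
};

export const rules: RedirectRule[] = [imageGroupRule];
```

Returning null keeps the earlier scope decision explicit: requests with no reasonable destination fall through rather than being forced onto a generic page, and they show up in the metrics as unhandled traffic.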

Basically, we decided to meet teams where they were, with their domain knowledge, so they didn’t have to think too hard about the technical aspects of how redirection works. It also let the team building the redirect service do that work without having to absorb all of that domain knowledge themselves. That separation of responsibilities led to better outcomes for us and our users.

Have regular, valuable trade-off discussions

Finally, we engaged throughout this whole project in high-quality trade-off discussions. We talked about the level of effort a particular change required compared to the expected user impact. We talked about specific pages and how to handle them. And because we had such a rich cross-section of technical and domain knowledge in those discussions, we made better decisions about how to simplify things or adjust to reduce our overall risk and costs.

This is really a reprise of product-minded engineering, but it highlights one of the practice’s most impactful outcomes: shared context about what’s possible, where the best return on effort lies, and how we might deliver better outcomes for ourselves and our users.

The result

We worked hard leading up to our launch date, August 1. But release day itself was one of the most boring of all time. We did about 30 minutes of real work to flip on some of those feature flags and deploy a couple of applications, all of which we were highly confident in because of our previous metrics and testing. In that short time we started redirecting over 95% of all AIW traffic to known locations on JSTOR and its supporting sites. The remaining requests were for things we had intentionally decided not to handle.

We only shipped one bug that we know of, which we spotted quickly. Because of our system design we were able to revert immediately to the old system while we worked on a short-term fix, and within a couple of weeks put a long-term fix in place so we could sunset the old system.

Meanwhile, we’re listening for user experience feedback in case anyone is confused or notices broken redirects, and we’re watching how our search engine crawling and ranking responds. As Google and Bing and others recrawl our content, they’ll start sending people directly to JSTOR. NISO recommends leaving these redirects in place for at least a year, so we’ve committed to that. Next year we’ll evaluate whether this architecture has run its course and act accordingly.

The takeaways

While the technical tools and architecture we used were obviously important, adopting a loosely coupled, extensible design was key to our success. By bringing teams solutions that let them focus on their domain knowledge, we played to everyone’s strengths and kept teams from colliding with each other too often. It may not be possible in every project, but for this one it was among our best and most productive decisions.

By thinking in transition architectures and building key visibility into the system at the outset, we created significant confidence in our progress and readiness for release day, making the finale just a blip on the radar. That had one unexpected consequence: because release day was so uneventful, it would have been easy to say, “Look, we did the thing,” and part ways. Choose boring releases, but be sure to keep the afterparty exciting.

Interested in learning more about working at ITHAKA? Contact recruiting to learn more about ITHAKA tech jobs.

