<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rob</title>
    <description>The latest articles on DEV Community by Rob (@newtorob).</description>
    <link>https://dev.to/newtorob</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F8356%2F59523132-eb85-458e-9db5-b57cb8ee59b2.jpeg</url>
      <title>DEV Community: Rob</title>
      <link>https://dev.to/newtorob</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/newtorob"/>
    <language>en</language>
    <item>
      <title>You're Migrating Off Opsgenie. Here's What You Should Actually Fix.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:35:11 +0000</pubDate>
      <link>https://dev.to/newtorob/youre-migrating-off-opsgenie-heres-what-you-should-actually-fix-5fln</link>
      <guid>https://dev.to/newtorob/youre-migrating-off-opsgenie-heres-what-you-should-actually-fix-5fln</guid>
      <description>&lt;h1&gt;
  
  
  You're Migrating Off Opsgenie. Here's What You Should Actually Fix.
&lt;/h1&gt;

&lt;p&gt;Opsgenie's end-of-support is April 2027. If you're on a small engineering team, you're probably mid-migration right now — comparing PagerDuty pricing tiers, reading incident.io vs. BetterStack threads, maybe resigning yourself to Jira Service Management because you're already deep in the Atlassian ecosystem.&lt;/p&gt;

&lt;p&gt;I want to suggest something uncomfortable before you pick your next tool: alerting was never your actual problem.&lt;/p&gt;

&lt;p&gt;I managed Opsgenie rotations at three different companies over the past eight years. FreightWaves, TextNow, Pilot Flying J. Different industries, different stacks, different team sizes. The pattern was always the same.&lt;/p&gt;

&lt;p&gt;Someone would deploy a change. Something would break. Opsgenie would page the on-call engineer. That engineer would open a Notion doc titled "Runbook — Service X" that hadn't been updated since 2022. They'd mostly ignore it and Slack the person who wrote the service. That person would fix it. Everyone would move on. Two weeks later, something similar would happen again.&lt;/p&gt;

&lt;p&gt;Opsgenie did its job perfectly. It routed the alert to the right person. The problem was everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question nobody was asking
&lt;/h2&gt;

&lt;p&gt;At none of those companies — not one — did anyone ask the obvious question before deploying: is it safe to push right now?&lt;/p&gt;

&lt;p&gt;Not "did CI pass." Not "did someone approve the PR." I mean: is the system healthy enough to absorb a change right now? Are we burning through error budget? Is there already an active incident? Did someone just deploy 20 minutes ago and we haven't seen the impact yet?&lt;/p&gt;

&lt;p&gt;Nobody asked because there was no way to answer it. The information existed — scattered across Datadog, PagerDuty, GitHub, Slack — but nobody had assembled it into a single decision. So engineers deployed based on gut feel. "Seems fine." "I don't see anything in Slack." "The dashboards look okay I guess."&lt;/p&gt;

&lt;p&gt;43% of incidents are preceded by a recent deploy. That number didn't surprise me at all when I first saw it. It matched what I'd lived through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runbook problem is worse than you think
&lt;/h2&gt;

&lt;p&gt;Here's the thing about the Opsgenie migration conversation that nobody is having: most teams using Opsgenie didn't just use it for alerting. It was their entire incident process. Alert comes in, Opsgenie pages someone, that person figures it out. There was no structure beyond that.&lt;/p&gt;

&lt;p&gt;The runbooks — if they existed — lived in Confluence or Notion. I wrote about this in &lt;a href="https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb"&gt;incident management without a dedicated SRE&lt;/a&gt;, and the core problem hasn't changed: a runbook that's three clicks away from the alert that triggered it is a runbook that doesn't get opened at 3am.&lt;/p&gt;

&lt;p&gt;I've seen this enough times to have a visceral reaction to it. The on-call engineer gets paged, opens Slack, asks "has anyone seen this before?" and waits. Meanwhile the customer is staring at a broken login page. The runbook that would have told them to check the config deployment and roll back the last change is sitting in a Confluence space that the engineer didn't even know existed.&lt;/p&gt;

&lt;p&gt;Teams that connect their runbooks directly to their alerts — so the runbook opens automatically when the relevant alert fires — cut their mean time to resolution from 67 minutes to 23. That's not a marginal improvement. That's the difference between an incident that costs you a customer and one that costs you 20 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "migrating off Opsgenie" should actually mean
&lt;/h2&gt;

&lt;p&gt;If you're going to rip out a core piece of your incident workflow, this is the moment to ask harder questions than "which alerting tool has the best Slack integration."&lt;/p&gt;

&lt;p&gt;The questions I'd ask:&lt;/p&gt;

&lt;p&gt;Do you know whether it's safe to deploy right now? Not in a gut-feel way. In a "here's your error budget status, here's your active incident count, here's your deploy velocity over the last 24 hours" way. If you don't have that, you're going to keep causing the incidents that your new shiny alerting tool routes to your team.&lt;/p&gt;
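&lt;p&gt;To make that concrete, here's a hypothetical sketch of what a deploy gate could look like. The signal names and thresholds are illustrative assumptions, not a real API — the point is that the go/no-go decision becomes explicit instead of gut feel:&lt;/p&gt;

```python
# Hypothetical deploy-gate sketch. Signal names and thresholds are
# illustrative; wire them to your real error-budget and incident sources.
from dataclasses import dataclass

@dataclass
class SystemState:
    error_budget_remaining: float    # fraction of SLO budget left, 0.0 to 1.0
    active_incidents: int            # incidents currently open
    minutes_since_last_deploy: float # time since the previous deploy landed

def safe_to_deploy(state, budget_floor=0.2, cooldown_minutes=30):
    """Return (ok, reasons) so the caller can explain a 'no', not just block."""
    reasons = []
    if budget_floor > state.error_budget_remaining:
        reasons.append("error budget nearly exhausted")
    if state.active_incidents > 0:
        reasons.append("active incident in progress")
    if cooldown_minutes > state.minutes_since_last_deploy:
        reasons.append("previous deploy has not settled yet")
    return len(reasons) == 0, reasons
```

&lt;p&gt;Even a check this crude beats "the dashboards look okay I guess."&lt;/p&gt;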

&lt;p&gt;When someone gets paged, do they know what to do? Not "figure it out." Actually know — because the runbook showed up in front of them automatically, with the steps they need and the context about what changed. If your runbooks are still in a wiki, your migration isn't going to fix the thing that actually hurts.&lt;/p&gt;

&lt;p&gt;Are you learning anything from your incidents? Not in a blameless-postmortem-Google-Doc way. I mean: does your system know that this service broke last Tuesday after a similar deploy? Does your deploy process incorporate the history of what's gone wrong before? Most teams I've worked with have zero institutional memory. Every incident is treated as a surprise, even when it's the third time it's happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't just swap your paging tool
&lt;/h2&gt;

&lt;p&gt;The Opsgenie shutdown is a forcing function. Use it.&lt;/p&gt;

&lt;p&gt;If you just swap Opsgenie for PagerDuty or BetterStack, you'll have the same problem in a different UI. Engineers deploying blind. Runbooks gathering dust. Your on-call rotation burning people out because every incident starts from scratch. I wrote about the monitoring version of this trap in &lt;a href="https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf"&gt;your startup doesn't need better monitoring&lt;/a&gt; — the tooling isn't the bottleneck. The process is.&lt;/p&gt;

&lt;p&gt;The actual fix is a layer that sits before your alerting tool. A deploy gate that tells your team whether it's safe to push. Connected runbooks that show up when things break. Incident data that compounds into institutional knowledge so your team stops relearning the same failure every quarter.&lt;/p&gt;

&lt;p&gt;That's what I'm building at Strake. It's in private beta right now and it works alongside whatever alerting tool you pick — PagerDuty, BetterStack, Grafana OnCall, whatever. The deploy gate and the runbook layer are the parts that were always missing, regardless of who was routing the page.&lt;/p&gt;

&lt;p&gt;If you're mid-migration and want to talk through how your team handles deploy safety, I'm happy to jump on a call. Not a pitch — I'm genuinely trying to learn from teams going through this right now.&lt;/p&gt;

&lt;p&gt;Rob is building Strake — a deploy gate and incident workflow platform for engineering teams without dedicated SRE coverage. If your current incident process is "someone posts in Slack and we figure it out from there," come take a look at strake.dev.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>automation</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Incident Management for Teams Without a Dedicated SRE: A Practical Guide</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 23 Mar 2026 19:32:10 +0000</pubDate>
      <link>https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb</link>
      <guid>https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb</guid>
      <description>&lt;p&gt;Most incident management advice assumes you have a real SRE function already in place. Dedicated rotations, formal roles, long severity docs, postmortem templates with twelve sections. That advice is useful in the right environment. It just doesn't map especially well to a smaller team where the CTO, the senior backend engineer, and the person who shipped the last deploy are all effectively part of the incident process.&lt;/p&gt;

&lt;p&gt;If you're running with a lean engineering team and no dedicated SRE, the goal isn't sophistication. The goal is clarity. When something breaks, you want three things to be true: you notice quickly, the right person knows what to do next, and the team fixes the underlying issue often enough that the same incident doesn't keep resurfacing.&lt;/p&gt;

&lt;p&gt;That's the version of incident management that actually helps when your current process is still "someone posts in Slack and we figure it out from there."&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Need (vs. What SRE Content Tells You You Need)
&lt;/h2&gt;

&lt;p&gt;For a small team, the list is shorter than people make it sound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need to know something is broken before a customer tells you.&lt;/strong&gt; That means basic monitoring and alerting. Nothing fancy, just reliable enough that you are not learning about outages from support tickets. I wrote more about that in &lt;a href="https://strake.dev/blog/your-startup-doesnt-need-better-monitoring" rel="noopener noreferrer"&gt;your startup doesn't need better monitoring&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need a simple response path.&lt;/strong&gt; Who gets paged, what they check first, where the incident lives, and when they pull in help. That can fit on one page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need a lightweight habit of learning from incidents.&lt;/strong&gt; Not a heavy postmortem ceremony. Just enough follow-through that the same issue doesn't bite you for the fourth time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else is secondary. SLOs, error budgets, review boards, chaos exercises, and the rest can be useful later. They are not the first thing standing between you and a workable incident process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Incident Response Process From Scratch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with three severity levels, not five
&lt;/h3&gt;

&lt;p&gt;I've found that three severity levels are enough for most small teams. More than that usually creates debate without improving the response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P1 — the product is down or a core workflow is broken for everyone.&lt;/strong&gt; Someone gets paged immediately. You stay on it until service is back, and if customers are clearly affected, you communicate early instead of waiting for a perfect explanation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P2 — something important is degraded, but the product still basically works.&lt;/strong&gt; Maybe a major feature is unstable, or a subset of users is having a bad time. This should get attention quickly, but it usually does not justify waking someone up overnight unless the business impact is unusually high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P3 — something is wrong, but it can wait.&lt;/strong&gt; A background job is failing, a dashboard is stale, or a non-critical dependency is acting up. This becomes a ticket, not a page.&lt;/p&gt;

&lt;p&gt;The real value here is not the wording. It's the discipline behind it. A P1 should mean "wake someone up." A P3 should mean "nobody loses sleep." Once teams blur those lines, alert fatigue shows up fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set up your on-call rotation
&lt;/h3&gt;

&lt;p&gt;Three engineers is the minimum rotation I've seen hold up for more than a few weeks. With two people, someone is on-call every other week and starts to dread the whole thing. With three, it's still not luxurious, but it's survivable.&lt;/p&gt;

&lt;p&gt;On tooling, this is one place where I would spend the money. Use PagerDuty or another managed paging service (Opsgenie works too, though its support ends in April 2027, so don't start there). Don't build a homemade paging system around Slack, calendars, and someone's phone settings. Alert routing at 3am is a solved problem, and solved problems are worth buying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create your war room protocol
&lt;/h3&gt;

&lt;p&gt;When a real incident starts, you need a predictable place for it to live.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The on-call engineer opens a Slack channel like &lt;code&gt;#inc-2026-03-23-api-errors&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;They post the current state right away, even if the update is just "seeing elevated 500s, investigating."&lt;/li&gt;
&lt;li&gt;If they are still stuck after 10-15 minutes, they pull in the person closest to the affected system.&lt;/li&gt;
&lt;li&gt;They keep posting short updates on a fixed rhythm. Fifteen minutes is usually enough.&lt;/li&gt;
&lt;li&gt;When the incident is over, they leave behind a short summary of what happened, what fixed it, and what follow-up work is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is usually enough. Small teams do not need to invent every formal incident role they have seen in enterprise playbooks. If three people are involved, one of them can keep the channel updated while working the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build your first 10 runbooks
&lt;/h3&gt;

&lt;p&gt;"Runbook" makes this sound heavier than it is. For a small team, a runbook is just a checklist for a known failure mode.&lt;/p&gt;

&lt;p&gt;Start with the ten things that have already hurt you. For most startups, the list looks roughly like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API returning 5xx errors&lt;/li&gt;
&lt;li&gt;Database connection failures&lt;/li&gt;
&lt;li&gt;High response latency&lt;/li&gt;
&lt;li&gt;Background job queue backed up&lt;/li&gt;
&lt;li&gt;Third-party API dependency down&lt;/li&gt;
&lt;li&gt;SSL certificate expired (yes, this still happens)&lt;/li&gt;
&lt;li&gt;Disk full&lt;/li&gt;
&lt;li&gt;Deploy broke something&lt;/li&gt;
&lt;li&gt;DNS issues&lt;/li&gt;
&lt;li&gt;Authentication/login broken&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each runbook only needs three things: &lt;strong&gt;what to check first&lt;/strong&gt;, &lt;strong&gt;how to mitigate&lt;/strong&gt;, and &lt;strong&gt;who to pull in if the first pass doesn't work&lt;/strong&gt;. In practice that means a few dashboards, a few commands, a rollback or restart path, and a clear escalation point.&lt;/p&gt;

&lt;p&gt;The standard I like is simple: could a reasonably capable engineer follow this at 3am while half-awake and either stabilize the system or know exactly who to call next? If yes, the runbook is doing its job.&lt;/p&gt;

&lt;h2&gt;
  
  
  The On-Call Rotation Reality for Small Teams
&lt;/h2&gt;

&lt;p&gt;On-call at a startup is never going to feel glamorous, and pretending otherwise usually makes it worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compensation matters.&lt;/strong&gt; That can mean extra PTO, comp time, a monthly stipend, or some combination. The exact mechanism matters less than the signal that on-call work is real work. If someone gets dragged out of bed at 3am and is still expected to operate like nothing happened at 9am, resentment builds quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set a sane page budget.&lt;/strong&gt; As a rule of thumb, outside-business-hours pages should be rare. If people are getting woken up multiple times a week, either the alerts are too noisy or the system is genuinely unstable. Both are fixable engineering problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ease people into the rotation.&lt;/strong&gt; New engineers should shadow first, then serve as backup, then take primary. On-call is stressful enough without making someone learn your systems and your incident process at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runbooks lower the emotional cost.&lt;/strong&gt; Most people can handle being paged occasionally. What really spikes the stress is waking up and feeling like there is no map. A decent runbook doesn't remove the pressure, but it changes the experience from "solve a mystery in the dark" to "work through a checklist and escalate if needed."&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Track and Why
&lt;/h2&gt;

&lt;p&gt;You do not need a massive reliability dashboard. Four numbers will tell you most of what you need to know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to detect (TTD).&lt;/strong&gt; How long does it take from breakage to awareness? If customers usually tell you first, your alerting is not doing its job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to resolve (TTR).&lt;/strong&gt; How long does it take from the first alert to a verified fix in production? This is the number that tells you whether incidents are annoying or truly expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident frequency by service.&lt;/strong&gt; Which part of the system keeps paging you? That is where your reliability work should go first, even if another problem feels more interesting technically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat incidents.&lt;/strong&gt; What keeps coming back? This one is painful, but useful. Recurring incidents usually mean you only treated the symptom last time.&lt;/p&gt;

&lt;p&gt;You can track all of this in a spreadsheet. The tool doesn't matter much. The habit does.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Hire a Dedicated SRE
&lt;/h2&gt;

&lt;p&gt;Usually later than you think.&lt;/p&gt;

&lt;p&gt;Here are the signals that it may actually be time to bring in dedicated SRE help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A big chunk of engineering time is disappearing into operational work.&lt;/strong&gt; If incident response, infra maintenance, deploy babysitting, and general firefighting are eating the team alive, the opportunity cost becomes real.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The on-call burden is consistently high.&lt;/strong&gt; If engineers are getting paged constantly and the causes are infra-heavy rather than straightforward application bugs, that is often a sign that reliability needs more dedicated ownership.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You now have contractual uptime expectations.&lt;/strong&gt; Once you are selling into larger customers with SLA language, uptime reporting, and incident expectations, someone needs to own that discipline full-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The system has outgrown shared context.&lt;/strong&gt; When no one person can explain the major moving parts with confidence, the risk profile changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The team is large enough that coordination itself is becoming the problem.&lt;/strong&gt; At some point, the process needs an owner even if the tech stack is still manageable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until then, the more immediate win is usually better operational visibility. The on-call engineer should not have to bounce between PagerDuty, Slack, GitHub, cloud dashboards, and three monitoring tabs just to answer the basic question of "what changed, what is broken, and who owns it?"&lt;/p&gt;

&lt;p&gt;That's the problem we're focused on at Strake. Not replacing an SRE team, but giving smaller teams enough context to respond faster, understand what is failing, and stop relearning the same incident twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake is in beta and it's free to try.&lt;/a&gt;&lt;/strong&gt; If you're a small team managing incidents with Slack threads and tribal knowledge, come take a look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob is building &lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake&lt;/a&gt; — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If that's the problem you're living with, follow along or reach out on &lt;a href="https://x.com/strakedev" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>devops</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Startup Doesn't Need Better Monitoring. It Needs Less of It.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 18 Mar 2026 18:56:37 +0000</pubDate>
      <link>https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf</link>
      <guid>https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf</guid>
      <description>&lt;p&gt;I'm going to say something that will annoy every SRE who's ever given a conference talk: most of what they tell you about observability is wrong for your stage.&lt;/p&gt;

&lt;p&gt;Not wrong in general. Wrong for you. A founding team of six people shipping a B2B SaaS product does not have the same operational needs as Google. I know this sounds obvious written down. But I watch founders set up Datadog with 47 custom dashboards before they have 47 customers, and nobody's telling them to stop.&lt;/p&gt;

&lt;p&gt;I did exactly this. About two years into my first startup, I spent an entire weekend building what I genuinely believed was a world-class monitoring stack. Prometheus, Grafana, custom exporters, alert rules for CPU, memory, disk, network throughput, request latency at p50, p95, p99, p99.9 — the works. I felt like a real engineer. Professional. Prepared.&lt;/p&gt;

&lt;p&gt;Then I got paged at 3am on a Tuesday because CPU hit 80% on a box that was completely fine. The alert was technically correct. The threshold was just wrong. I silenced it, went back to sleep, got paged again at 4am for a memory warning that also didn't matter. By morning I'd silenced four alerts and missed the one email from a customer saying they couldn't log in.&lt;/p&gt;

&lt;p&gt;The login bug had nothing to do with CPU or memory. A config file got borked during a deploy. None of my beautiful dashboards caught it because I was monitoring infrastructure when I should have been monitoring the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's what I think you need at the early stage. Not what the monitoring vendor's blog post says. Not what the "complete observability guide" on Medium recommends. What actually keeps your customers happy and lets you sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One: know if your thing is up.&lt;/strong&gt; That's it. A simple HTTP check against your most important endpoint, every 30 seconds. If it fails three times in a row, text yourself. I don't care if you use UptimeRobot, Pingdom, or a cron job that curls your health check — it doesn't matter. The fancy tool doesn't help if you're checking the wrong thing. Hit the endpoint your customers actually use. Not &lt;code&gt;/health&lt;/code&gt;. Not &lt;code&gt;/ping&lt;/code&gt;. The actual login page, or the API call that matters most. If that works, you're probably fine. If it doesn't, you need to know immediately.&lt;/p&gt;
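&lt;p&gt;A minimal version of that check fits in a few lines of Python. The URL is a placeholder for whatever endpoint your customers actually hit, and the "three failures in a row" rule is the same one described above:&lt;/p&gt;

```python
# Minimal uptime check: hit the endpoint customers actually use and only
# alert after three consecutive failures. The URL is a placeholder.
import urllib.request

def check_once(url, timeout=5):
    """True if the endpoint answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 400 > resp.status
    except Exception:
        return False

def should_page(history, threshold=3):
    """history: recent check results (booleans), newest last."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

&lt;p&gt;Run it from cron every 30 seconds and you've covered most of what an uptime product covers.&lt;/p&gt;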

&lt;p&gt;&lt;strong&gt;Two: know if your customers are getting errors.&lt;/strong&gt; This means tracking your HTTP 5xx rate. You can do this in CloudWatch, in your application logs, in whatever. The point is: if more than, say, 1% of requests are returning server errors, something is wrong and you should look at it. During business hours. Not at 3am. Unless it's way above 1%, in which case yes, wake up.&lt;/p&gt;
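&lt;p&gt;Computing that rate from a window of recent status codes is trivial. The 1% threshold below is the same rule of thumb from above, not a magic number — tune it for your traffic:&lt;/p&gt;

```python
# Rough 5xx error-rate check over a sliding window of recent status codes.
def error_rate(status_codes):
    if not status_codes:
        return 0.0
    errors = sum(1 for s in status_codes if s >= 500)
    return errors / len(status_codes)

def needs_attention(status_codes, threshold=0.01):
    """True when the 5xx rate exceeds the threshold."""
    return error_rate(status_codes) > threshold
```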

&lt;p&gt;&lt;strong&gt;Three: know if things are slow.&lt;/strong&gt; Response time matters, but you don't need seven percentile buckets. Track p95 latency. If your p95 is under 500ms for a typical API call, you're fine. If it's climbing, investigate when you're awake. If it suddenly spikes to 5 seconds, that's worth waking up for.&lt;/p&gt;
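&lt;p&gt;If you're rolling your own, the Python standard library already gets you p95: &lt;code&gt;statistics.quantiles&lt;/code&gt; with &lt;code&gt;n=20&lt;/code&gt; returns 19 cut points, and the last one is the 95th percentile:&lt;/p&gt;

```python
# p95 over a window of latency samples (milliseconds), stdlib only.
import statistics

def p95(samples_ms):
    # quantiles with n=20 yields 19 cut points; index 18 is the 95th percentile
    return statistics.quantiles(samples_ms, n=20)[18]
```

&lt;p&gt;Feed it the last few minutes of request latencies and alert only on the big jumps, not the slow drift.&lt;/p&gt;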

&lt;p&gt;That's the list. Three things. Everything else is noise at your stage.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Alert Hygiene Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's a rule I wish someone had tattooed on my forearm before I started: every alert that wakes you up must require you to do something right now. Not "hmm, interesting." Not "I should look at this tomorrow." Right now, tonight, in your underwear, something needs to be done or it wasn't worth waking you up.&lt;/p&gt;

&lt;p&gt;If you get paged and the correct response is "I'll check this in the morning," that alert is broken. Downgrade it. Make it a Slack notification. Make it an email. Make it a dashboard you glance at with your coffee. But do not let it wake you up.&lt;/p&gt;

&lt;p&gt;I know this sounds aggressive. You're thinking "but what if I miss something?" You might. And that's okay. Because the alternative is alert fatigue, which is when you've been woken up by false alarms so many times that you start sleeping through the real ones. Alert fatigue has caused more outages than missing alerts ever has. I'd bet money on it.&lt;/p&gt;

&lt;p&gt;At one point I had 30+ alert rules configured. I was getting maybe 4-5 notifications a day. I started ignoring all of them. It took a customer emailing our support address (which was my personal Gmail) to tell me the payment flow had been broken for six hours. Six hours. While my monitoring stack was happily telling me that CPU utilization was nominal.&lt;/p&gt;

&lt;p&gt;I deleted 25 of those alerts in one commit. Kept five. Slept better. Caught more real problems. Go figure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools Question
&lt;/h2&gt;

&lt;p&gt;People ask me what monitoring tools to use and I think the honest answer is: it barely matters, and spending a week evaluating tools is a week you didn't spend building your product.&lt;/p&gt;

&lt;p&gt;If you're on AWS, CloudWatch is already there and it's fine. The UI is ugly and the query language is annoying but it works. If you want something nicer, Grafana Cloud has a free tier that's generous enough for a small startup. If you have money to spend and want things to just work out of the box, Datadog is great — but you will be shocked by the bill once you grow. Their pricing model is designed to be cheap when you're small and extremely expensive when you're not. Just know what you're signing up for.&lt;/p&gt;

&lt;p&gt;The one tool I'd say is genuinely worth paying for early: an error tracking service. Sentry, Bugsnag, something like that. It catches unhandled exceptions in your application code, groups them, shows you the stack trace, tells you which deploy introduced it. This is the stuff that actually breaks your product for users, and application-level error tracking catches it way faster than infrastructure monitoring ever will.&lt;/p&gt;

&lt;h2&gt;
  
  
  What To Add Later (Not Now)
&lt;/h2&gt;

&lt;p&gt;When you have paying customers with SLAs, or when you've got 10+ services talking to each other, or when you're waking up more than twice a month for real incidents — that's when you start thinking about distributed tracing, log aggregation, SLOs, error budgets, and all the other stuff that makes the SRE Twitter crowd excited.&lt;/p&gt;

&lt;p&gt;Not before. I promise you, nobody churned because you didn't have distributed tracing. They churned because your app was down and you didn't notice for two hours because you were drowning in alerts about disk utilization.&lt;/p&gt;

&lt;p&gt;The bigger unlock at the early stage is getting all your operational context — deploys, errors, customer signals, team activity — in one place so you're not switching between eight tools to figure out what's happening. That's a different problem than monitoring, and it's the one that actually slows founders down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Hard Part
&lt;/h2&gt;

&lt;p&gt;The real operational skill at the early stage isn't monitoring. It's deploy discipline. Can you ship a change and roll it back in under five minutes if something goes wrong? Do you know what changed between "it was working" and "it's not working"? Can you look at your deploy history and your error rate on the same timeline?&lt;/p&gt;

&lt;p&gt;If you can do that, you can fix almost anything fast enough that your customers won't care. And at the early stage, fast recovery beats prevention every single time. You don't have the team or the time to prevent every problem. But you can damn sure get good at fixing them quickly.&lt;/p&gt;

&lt;p&gt;Build the smallest monitoring setup that tells you when customers are hurting. Delete everything else. Ship your product.&lt;/p&gt;

&lt;p&gt;The operational layer of a startup — knowing what's happening, what needs attention, what can wait — should take minutes a day, not hours. That's the problem worth solving.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob is building &lt;a href="https://strake.dev/" rel="noopener noreferrer"&gt;Strake&lt;/a&gt; — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If that's the problem you're living with, follow along or reach out on &lt;a href="https://x.com/strakedev" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>startup</category>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>What are some of your favorite live coders?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 23 Jan 2019 17:02:07 +0000</pubDate>
      <link>https://dev.to/newtorob/what-are-some-of-your-favorite-live-coders-3fm6</link>
      <guid>https://dev.to/newtorob/what-are-some-of-your-favorite-live-coders-3fm6</guid>
      <description>&lt;p&gt;Hey all,&lt;/p&gt;

&lt;p&gt;I am always looking for people who code live to watch. I love seeing how other people think through problems and fix issues. Who are some of your favorite people to watch live code? &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>How to deal with being laid off?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Tue, 19 Jun 2018 15:30:43 +0000</pubDate>
      <link>https://dev.to/newtorob/how-to-deal-with-being-laid-off-2b0b</link>
      <guid>https://dev.to/newtorob/how-to-deal-with-being-laid-off-2b0b</guid>
      <description>&lt;p&gt;I am about to be laid off, it is 100% certain to happen. This was my first dev job and this will also be my first time being laid off. I have always had another job lined up and ready in the event that something was to happen or if I was planning on leaving. But, I was enjoying myself so much, learning day in and day out what being a developer is like and honing my craft that I felt like it would never end. This job allowed me to work remotely if I wanted or go into an office, though being fully remote would be awesome. I mainly worked in front-end QA, working with selenium and protractor and even automated our whole testing setup using Jenkins. I have some DevOps experience as well, I have a computer science degree and a certification as a Certified Developer - Associate through AWS. &lt;/p&gt;

&lt;p&gt;This was a kick in the pants and a complete surprise; I have never been caught off guard like this at a company, and I would love some advice on what to do next. If anyone has connections in the Chattanooga, TN area or knows of remote positions, let me know. &lt;/p&gt;

&lt;p&gt;I feel like I threw a lot out here; I am just panicking. I love being a developer, I love learning something new every day, and I really enjoy my work. &lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Rob&lt;/p&gt;

</description>
      <category>work</category>
      <category>job</category>
      <category>development</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What is your least favorite part of software development?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 09 May 2018 12:55:49 +0000</pubDate>
      <link>https://dev.to/newtorob/what-is-your-least-favorite-part-of-software-development-3hjm</link>
      <guid>https://dev.to/newtorob/what-is-your-least-favorite-part-of-software-development-3hjm</guid>
      <description></description>
      <category>devops</category>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Staying Focused</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 07 May 2018 21:27:26 +0000</pubDate>
      <link>https://dev.to/newtorob/staying-focused-57hn</link>
      <guid>https://dev.to/newtorob/staying-focused-57hn</guid>
      <description>&lt;p&gt;Hello, if you're anything like me... you're probably here because there's something else you should be doing right now but aren't. Yes, procrastination is a fickle bitch, however, it led me to think about issues I have had lately with my own focus. I'm not a productivity guru or anything, just someone trying to get my thoughts out of my head and onto paper, so take what I say here with a grain of salt.&lt;/p&gt;

&lt;p&gt;Let's begin...&lt;/p&gt;

&lt;p&gt;I recently picked up a book called "The Shallows: What the Internet Is Doing to Our Brains" (link &lt;a href="https://smile.amazon.com/Shallows-What-Internet-Doing-Brains/dp/0393339750?sa-no-redirect=1" rel="noopener noreferrer"&gt;here&lt;/a&gt;; also, you should be buying things through smile.amazon so you donate to a good cause). Anyway, this book brings up a lot of good points about how we learn in this digital age, mainly that we try to take in many things at once and can't seem to focus on one thing at a time. &lt;/p&gt;

&lt;p&gt;I had been feeling like I had some kind of 'brain fog'; I felt like I couldn't really process any new information. I was taking courses, watching how-to videos on YouTube, reading books, etc., but none of it was sticking. I panicked, because it's an awful feeling when your brain suddenly seems like an impenetrable wall. Someone on Reddit recommended this book to me because it helped them with the same issues. &lt;/p&gt;

&lt;p&gt;What it boils down to, and the point of my post here, is that we are in an age like no other. You used to have to sit down with twenty-some-odd books to find the right passage for one line in a history term paper; now that line is a simple Google search away. Your brain is going 1,000 mph and taking in so much information that it can't process one page at a time. That's the issue: sit down and try to read a single long document or news article, and a lot of the time you'll resort to skimming the text and calling it a day. Our brains are changing in front of our eyes, quicker than we can keep up. &lt;/p&gt;

&lt;p&gt;So the point of this post is: slow down. Don't feel like you have to learn it all at once. Sure, everyone wants to be that 'hacker guy' who knows every answer at the flip of a switch, but reality doesn't work like that, at least not at first. There will come a day when you know a lot about software development; it's inevitable. For now, though, I implore you to focus and hone one thing at a time. Don't feel like you have to learn Java and JavaScript, or C and PHP, or 'devopsify' all of your projects right away. Get the basics down and focus on one thing; otherwise, you'll end up knowing a little about a lot of things that don't really help you much in your developer job. &lt;/p&gt;

&lt;p&gt;Thanks for reading. &lt;/p&gt;

&lt;p&gt;Rob&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>development</category>
      <category>focus</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How do you take notes?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sun, 06 May 2018 04:02:16 +0000</pubDate>
      <link>https://dev.to/newtorob/how-do-you-take-notes-4fkb</link>
      <guid>https://dev.to/newtorob/how-do-you-take-notes-4fkb</guid>
      <description>&lt;p&gt;I am wondering how everyone takes notes? This can be while you're testing or doing some courses or you're just putting down a to-do. Do you have a specific notes taking application you use? Are you more of a pen and paper person (like me), do you use vim or emacs with org-mode? &lt;/p&gt;

&lt;p&gt;I am interested in what everyone uses in case there is something better than what I am currently doing. &lt;/p&gt;

&lt;p&gt;Thanks!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>notes</category>
      <category>development</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What kind of music do you listen to while working?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sun, 29 Apr 2018 06:17:03 +0000</pubDate>
      <link>https://dev.to/newtorob/what-kind-of-music-do-you-listen-to-while-working-f2o</link>
      <guid>https://dev.to/newtorob/what-kind-of-music-do-you-listen-to-while-working-f2o</guid>
      <description>&lt;p&gt;Looking for what others find good for their productivity or code writing. What kind of music do you listen to that gets you in the zone?&lt;/p&gt;

&lt;p&gt;For me, it can be anything from classical to metal, lo-fi to an entire playlist of Hans Zimmer. &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
    <item>
      <title>5 Key Parts to Problem Solving in the Software Development World.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sun, 29 Apr 2018 06:04:55 +0000</pubDate>
      <link>https://dev.to/newtorob/5-key-parts-to-problem-solving-in-the-software-development-world-2a4n</link>
      <guid>https://dev.to/newtorob/5-key-parts-to-problem-solving-in-the-software-development-world-2a4n</guid>
      <description>&lt;p&gt;How do you get from a set of requirements to a working program? There is no cut-and-dried recipe to follow.&lt;/p&gt;

&lt;p&gt;If there were, someone would write a program to do it, and computer programmers would be obsolete.&lt;/p&gt;

&lt;p&gt;However, there are general principles you can follow that will help guide you toward a solution. (Based on &lt;em&gt;How to Solve It&lt;/em&gt; by George Pólya.)&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Make sure you have a clear understanding of all the requirements.
&lt;/h1&gt;

&lt;p&gt;In the real world, this is sometimes the most challenging part. Without it, the product you create (i.e., the code you write, the design you produce, etc.) will be completely off from what an instructor or client is expecting.&lt;/p&gt;

&lt;p&gt;However, this is probably the number one thing people know they need to do when given an assignment or job. How can you be expected to write hundreds, possibly thousands, of lines of code without fully understanding what you are doing?&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Plan tests you can use to check potential solutions for correctness.
&lt;/h1&gt;

&lt;p&gt;If you don’t have a way to test your work against the requirements set by your instructor or boss, you will move forward ignorant of potential issues.&lt;br&gt;
That can obviously cost you letter grades or, more importantly, money &amp;amp; time!&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Design the approach.
&lt;/h1&gt;

&lt;p&gt;One of the most commonly used methods for this is stepwise refinement (a.k.a. top-down design): the process of breaking the original problem into a number of smaller sub-problems.&lt;/p&gt;

&lt;p&gt;Next, you take each sub-problem and break it up into even smaller sub-problems. This is repeated until you get sub-problems that are small enough to be translated into code.&lt;/p&gt;

&lt;p&gt;This approach is extremely helpful not only to beginners trying something new or difficult, but also to people who have been doing this for years. Getting the idea onto paper and then working through the problem helps you understand the issue more quickly and easily.&lt;/p&gt;
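
&lt;p&gt;To make stepwise refinement concrete, here is a minimal sketch in Python. The example problem ("average word length") and the function names are my own illustration, not something from Pólya's book:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Top-level problem: compute the average word length of a text.
# Stepwise refinement: state the problem in terms of sub-problems,
# then break each sub-problem down until it is trivial to code.

def average_word_length(text):
    # The original problem, expressed as two sub-problems.
    words = split_into_words(text)
    return total_length(words) / len(words)

def split_into_words(text):
    # Sub-problem 1: small enough to translate directly into code.
    return text.split()

def total_length(words):
    # Sub-problem 2: also small enough to code directly.
    return sum(len(word) for word in words)

print(average_word_length("stepwise refinement in practice"))  # 7.0
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each function corresponds to one box in the top-down design, so you can reason about (and test) each piece independently.&lt;/p&gt;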

&lt;h1&gt;
  
  
  4. Translate your design into computer code (checking it with the tests from #2).
&lt;/h1&gt;

&lt;p&gt;OK, this is probably the most anticipated part of the whole process. Seems simple, right? It should be, after your thoughtful planning and design work. You can see why the process doesn’t work if you jump straight from step one to step four.&lt;/p&gt;

&lt;p&gt;A side note on how much time to spend planning and designing: it’s like losing weight, where the workout or cardio is only a small part and the diet makes up the biggest percentage. Coding is the same. Planning and design are the majority of the work, while the actual coding is only a small part.&lt;/p&gt;

&lt;h1&gt;
  
  
  5. Reflect on how you arrived at your solution.
&lt;/h1&gt;

&lt;p&gt;Last but certainly not least: reflection. I usually write down issues I had while writing the actual code, or anything better I found that I didn’t know about before. If you keep a notebook nearby to jot down notes, you can go back to them the next time you are implementing something.&lt;/p&gt;

&lt;p&gt;Reflection also gives you time to go back through, add tests if need be, and ask questions like: what worked and what didn’t? How can this experience help you solve future problems? That, of course, makes you a better developer in the long run.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>productivity</category>
      <category>problem</category>
      <category>solving</category>
    </item>
    <item>
      <title>Hi, I'm Robert Newton</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Fri, 09 Jun 2017 18:09:58 +0000</pubDate>
      <link>https://dev.to/newtorob/hi-im-robert-newton</link>
      <guid>https://dev.to/newtorob/hi-im-robert-newton</guid>
      <description>&lt;p&gt;I have been coding for 3 years.&lt;/p&gt;

&lt;p&gt;You can find me on GitHub as &lt;a href="https://github.com/newtorob" rel="noopener noreferrer"&gt;newtorob&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I live in Chattanooga.&lt;/p&gt;

&lt;p&gt;I work for a consulting company.&lt;/p&gt;

&lt;p&gt;I mostly program in these languages: Java, JavaScript, C++, Go.&lt;/p&gt;

&lt;p&gt;I am currently learning more about Go and machine learning.&lt;/p&gt;

&lt;p&gt;Nice to meet you.&lt;/p&gt;

</description>
      <category>introduction</category>
    </item>
  </channel>
</rss>
