<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Steven Richards</title>
    <description>The latest articles on DEV Community by Steven Richards (@captainkrtek).</description>
    <link>https://dev.to/captainkrtek</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F6933%2Fv_jwEP5S.jpeg</url>
      <title>DEV Community: Steven Richards</title>
      <link>https://dev.to/captainkrtek</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/captainkrtek"/>
    <language>en</language>
    <item>
      <title>Engineering for your customers</title>
      <dc:creator>Steven Richards</dc:creator>
      <pubDate>Thu, 18 Jan 2018 18:01:49 +0000</pubDate>
      <link>https://dev.to/captainkrtek/engineering-for-your-customers-4pkl</link>
      <guid>https://dev.to/captainkrtek/engineering-for-your-customers-4pkl</guid>
      <description>&lt;p&gt;The tech industry has a misguided tenet of 'move fast, break things', along with a habit of always reaching for the 'hottest' stack/language/etc. that you see on Hacker News. Unfortunately, that rarely translates to customer happiness with your product or service.&lt;/p&gt;

&lt;p&gt;Over the years I've come across some key concepts while engineering systems for customer happiness and reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be able to identify who your customers are. &lt;/li&gt;
&lt;li&gt;Customers are part of your system.&lt;/li&gt;
&lt;li&gt;Understand what 'customer impact' means in an outage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many of these points are illustrated as questions to ask yourself and your team.&lt;/p&gt;

&lt;h3&gt;
  
  
  Customers drive your tenets
&lt;/h3&gt;

&lt;p&gt;What do we build?   &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do we build a website with public consumers as customers?
&lt;/li&gt;
&lt;li&gt;Do we build a platform that businesses build on?
&lt;/li&gt;
&lt;li&gt;Do we build internal tools for our colleagues?
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You need to put yourself in the shoes of your customers (and perhaps their customers too!). Once you do, you can start to shape your engineering tenets around how customers use what you build.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the tenets of your service or product?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you build a platform that businesses use, you'll discover that reliability and robustness are likely more important than 'moving fast and breaking things'. This doesn't just apply to large enterprises; if you're a startup, you probably don't have a user base large enough that you can afford to lose many customers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can our customers tolerate outages, errors, delays? Do we need to build for robustness, speed, uptime, or all of the above?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't points to define once; they should evolve with your product, much like features and designs do. As your product evolves, customers will use it differently, and you should adapt as necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your customers are part of your system
&lt;/h3&gt;

&lt;p&gt;In most block diagrams of a system I typically see some cloud labeled 'internet', but I rarely see 'customers'. &lt;/p&gt;

&lt;p&gt;You know very well how your application reacts to latency, but how do your customers react to latency, errors, and retries?&lt;/p&gt;

&lt;p&gt;I've seen many outages caused by failing to account for how customers interact with a service. While you might have great retry behavior and timeouts throughout your system, it's easy to overlook how your customers deal with an unreliable service (see: &lt;a href="https://en.wikipedia.org/wiki/Thundering_herd_problem"&gt;thundering herd&lt;/a&gt;).  &lt;/p&gt;

&lt;p&gt;Make sure you have customer-appropriate throttles, monitoring, error pages, and documentation.&lt;/p&gt;
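
&lt;p&gt;As one illustration of customer-friendly retry behavior, here is a minimal Python sketch of capped exponential backoff with full jitter, a common way to keep many clients from retrying in lockstep after an outage. The names &lt;code&gt;backoff_delays&lt;/code&gt;, &lt;code&gt;base&lt;/code&gt;, and &lt;code&gt;cap&lt;/code&gt; are hypothetical and the default values are arbitrary; this is not taken from any particular client library.&lt;/p&gt;

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so clients that failed at the same moment spread their retries out.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Hypothetical usage: a client would sleep for each delay between retries.
    for attempt, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {attempt}: wait {delay:.2f}s")
```

&lt;p&gt;If you publish something like this in your client libraries or documentation, your customers' retries are less likely to arrive as one synchronized wave.&lt;/p&gt;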

&lt;h3&gt;
  
  
  Defining customer impact
&lt;/h3&gt;

&lt;p&gt;In many post-mortems you'll hear statements like 'our webservers served 65,000 HTTP 500 errors from 10:04 to 10:30', which fails to tell the story of the customers who received those errors. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did they have to retry a purchase, and did they end up giving up?&lt;/li&gt;
&lt;li&gt;Did we lose customers?&lt;/li&gt;
&lt;li&gt;Did that break our customers' applications and businesses?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can't answer those questions then there are gaps in your system's monitoring.&lt;/p&gt;

&lt;p&gt;It's fundamental to have system-level metrics for all of your components, but going beyond that, you need to be able to measure customer impact. &lt;/p&gt;
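
&lt;p&gt;To make the difference concrete, here is a small Python sketch that summarizes an error window per customer rather than per request. The log layout (&lt;code&gt;customer_id&lt;/code&gt;, HTTP status) and the sample entries are hypothetical, and treating "the last request still failed" as "gave up" is only a rough assumption.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical request log: (customer_id, HTTP status), in time order.
SAMPLE_LOG = [
    ("alice", 500), ("alice", 200),   # retried and succeeded
    ("bob", 500), ("bob", 500),       # never succeeded
    ("carol", 200),                   # unaffected
]


def customer_impact(log):
    """Summarize an error window per customer instead of per request."""
    by_customer = defaultdict(list)
    for customer_id, status in log:
        by_customer[customer_id].append(status)

    affected = {c for c, codes in by_customer.items() if 500 in codes}
    # Assumption: a customer whose last request still failed likely gave up.
    gave_up = {c for c in affected if by_customer[c][-1] == 500}
    return {
        "error_responses": sum(codes.count(500) for codes in by_customer.values()),
        "customers_affected": len(affected),
        "customers_gave_up": len(gave_up),
    }


if __name__ == "__main__":
    print(customer_impact(SAMPLE_LOG))
```

&lt;p&gt;The raw error count (3 here) says little on its own; the per-customer numbers are the ones that start to answer the questions above.&lt;/p&gt;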

&lt;p&gt;If you build a platform that your customers build their business on, consider being proactive in helping them. It's meaningful to reach out to customers while they're experiencing an outage, even if it's not your fault. You know your system the best and perhaps there is advice you can share so they can leverage your system better during their outage.&lt;/p&gt;

&lt;p&gt;If you notice a customer of yours had a bad outage, are there things you can build that they could utilize in order to prevent outages?&lt;/p&gt;

&lt;h3&gt;
  
  
  Final thought
&lt;/h3&gt;

&lt;p&gt;I've used the word customer 30 times in this post because at the end of the day most businesses require paying customers to exist. Understanding and engineering for your customers is incredibly important.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"The single most important thing is to make people happy. If you are making people happy, as a side effect, they will be happy to open up their wallets and pay you."&lt;/em&gt; - Derek Sivers&lt;/p&gt;

</description>
      <category>customers</category>
      <category>engineering</category>
      <category>reliability</category>
    </item>
    <item>
      <title>The Walking Dead (but with processes)</title>
      <dc:creator>Steven Richards</dc:creator>
      <pubDate>Tue, 13 Jun 2017 16:47:10 +0000</pubDate>
      <link>https://dev.to/captainkrtek/the-walking-dead-but-with-processes</link>
      <guid>https://dev.to/captainkrtek/the-walking-dead-but-with-processes</guid>
      <description>&lt;h2&gt;
  
  
  What's a zombie?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fturnoff.us%2Fimage%2Fen%2Fzombies.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fturnoff.us%2Fimage%2Fen%2Fzombies.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When interviewing systems engineers, a common answer to "what is a zombie process?" is "a process which is dead and doesn't have a parent" or something like "you can kill it with &lt;code&gt;kill -9 &amp;lt;pid&amp;gt;&lt;/code&gt;". Both answers misunderstand zombie processes. Let's delve into what exactly a zombie is, and how zombies &lt;em&gt;should&lt;/em&gt; be handled by Linux.&lt;/p&gt;

&lt;p&gt;A zombie process is a process which has &lt;code&gt;exit()&lt;/code&gt;'d and whose parent has not yet made a &lt;code&gt;wait()/waitpid()&lt;/code&gt; &lt;a href="https://linux.die.net/man/2/waitpid" rel="noopener noreferrer"&gt;syscall&lt;/a&gt; for its process ID (PID). In other words, the process exited and left a status code for the parent to read, but the parent has yet to read it from the process table. Every process that terminates is briefly a zombie; zombies only become an issue when they stick around too long.&lt;/p&gt;

&lt;p&gt;Some key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They cannot be "killed": SIGKILL and SIGTERM are ineffective because the process is already dead.&lt;/li&gt;
&lt;li&gt;They are not orphan processes.&lt;/li&gt;
&lt;li&gt;They can be removed by killing their parent, which orphans the zombie so that init adopts it and reaps it.&lt;/li&gt;
&lt;li&gt;They use almost no resources; all that remains is a process table entry (and the PID it holds).&lt;/li&gt;
&lt;li&gt;Every process, upon termination, is a zombie until its parent process calls &lt;code&gt;wait()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This little snippet demonstrates how a zombie can be created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
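&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;stdlib.h&amp;gt;&lt;/span&gt;    &lt;span class="c1"&gt;// exit()&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/types.h&amp;gt;&lt;/span&gt; &lt;span class="c1"&gt;// pid_t&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/wait.h&amp;gt;&lt;/span&gt;  &lt;span class="c1"&gt;// wait()&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;unistd.h&amp;gt;&lt;/span&gt;    &lt;span class="c1"&gt;// fork(), sleep()&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;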
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;pid_t&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fork&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Failed to fork&lt;/span&gt;
        &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Both processes continue from here; fork() returned 0 in the child&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;// All the child does is exit&lt;/span&gt;
        &lt;span class="n"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Parent continues from here. A zombie roams for 100 seconds..brains..&lt;/span&gt;
    &lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// After 100 seconds, the parent removes the zombie by calling wait()&lt;/span&gt;
    &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A zombie shows up as &lt;code&gt;&amp;lt;defunct&amp;gt;&lt;/code&gt; in the output of &lt;code&gt;ps&lt;/code&gt;, when running the above snippet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;129274  \_ ./zombie 
129275    \_ [zombie] &amp;lt;defunct&amp;gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the parent (PID 129274) hasn't called &lt;code&gt;wait()/waitpid()&lt;/code&gt; on PID 129275. Until it does, 129275 will occupy a process table entry. Since zombies hold almost no other resources, a handful of them typically isn't an issue.&lt;/p&gt;
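
&lt;p&gt;The same behavior can be observed programmatically. The following is a minimal Python sketch (Python rather than C, purely for brevity) that assumes a Linux &lt;code&gt;/proc&lt;/code&gt; filesystem; the helper names &lt;code&gt;proc_state&lt;/code&gt; and &lt;code&gt;observe_zombie&lt;/code&gt; are made up for this example. It creates a zombie, checks that its state is &lt;code&gt;Z&lt;/code&gt;, sends it SIGKILL to show that nothing changes, and then reaps it.&lt;/p&gt;

```python
import os
import signal


def proc_state(pid):
    """Return the one-letter state from /proc/PID/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ")" before reading the state field.
    return data.rsplit(")", 1)[1].split()[0]


def observe_zombie():
    """Create a zombie, try to SIGKILL it, then reap it with waitpid()."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before = proc_state(pid)

    os.kill(pid, signal.SIGKILL)  # accepted, but there is nothing left to kill
    state_after_kill = proc_state(pid)

    _reaped_pid, status = os.waitpid(pid, 0)  # this is what removes the zombie
    return {
        "state_before": state_before,
        "state_after_kill": state_after_kill,
        "exit_code": os.WEXITSTATUS(status),
        "still_in_proc": os.path.exists(f"/proc/{pid}"),
    }


if __name__ == "__main__":
    print(observe_zombie())
```

&lt;p&gt;On a typical Linux machine the state should read &lt;code&gt;Z&lt;/code&gt; both before and after the SIGKILL, and the &lt;code&gt;/proc&lt;/code&gt; entry should only disappear once &lt;code&gt;waitpid()&lt;/code&gt; has been called.&lt;/p&gt;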

&lt;p&gt;The problem emerges when too many exist and you run out of PIDs, meaning no new processes can be created (e.g., you can't &lt;code&gt;ssh&lt;/code&gt; into a server that can't spawn a shell for you).&lt;/p&gt;

&lt;p&gt;You can see the max PIDs with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cat /proc/sys/kernel/pid_max
131072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  It's a good idea to monitor process count on your machines. If a long-running process is leaking zombies, it will eventually fill the process table, which could take down your service, prevent remote access, etc.
&lt;/h4&gt;
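
&lt;p&gt;As a rough idea of what such monitoring could look like, here is a Python sketch that counts processes and zombies by scanning &lt;code&gt;/proc&lt;/code&gt; and reads &lt;code&gt;pid_max&lt;/code&gt; for comparison. It assumes Linux, and the function names are hypothetical; a real setup would more likely export these numbers to whatever monitoring system you already run.&lt;/p&gt;

```python
import os


def proc_state(pid):
    """Return the one-letter state from /proc/PID/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as stat_file:
        data = stat_file.read()
    # The command name may contain spaces, so split on the last ")".
    return data.rsplit(")", 1)[1].split()[0]


def count_processes():
    """Return (total processes, zombie processes) by scanning /proc."""
    total = 0
    zombies = 0
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            state = proc_state(entry)
        except OSError:
            continue  # the process vanished (or is unreadable) mid-scan
        total += 1
        if state == "Z":
            zombies += 1
    return total, zombies


def pid_max():
    """Return the kernel's PID limit from /proc/sys/kernel/pid_max."""
    with open("/proc/sys/kernel/pid_max") as limit_file:
        return int(limit_file.read())


if __name__ == "__main__":
    total, zombies = count_processes()
    print(f"{total} processes ({zombies} zombies), pid_max={pid_max()}")
```

&lt;p&gt;What counts as "too many" is up to you; alerting on a steadily growing zombie count is one reasonable starting point.&lt;/p&gt;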

&lt;p&gt;One way to remove a zombie is to kill its parent. The zombie is then re-parented to init (PID 1), which &lt;em&gt;should&lt;/em&gt; periodically call &lt;code&gt;wait()&lt;/code&gt; on its child processes.&lt;/p&gt;

&lt;p&gt;Zombies usually appear when a program has a flaw where it doesn't call &lt;code&gt;wait()&lt;/code&gt; on its children. But what if the process got stuck in an infinite loop and never called &lt;code&gt;wait()&lt;/code&gt;? And what if that process was named init?&lt;/p&gt;

&lt;h2&gt;
  
  
  When init doesn't do its job...
&lt;/h2&gt;

&lt;p&gt;A while ago I came across a number of servers that had thousands of zombie processes. My team had simply been rebooting these boxes to clear them up, but I became curious after noticing something odd: &lt;/p&gt;

&lt;h4&gt;
  
  
  All the zombie processes had init as a parent!
&lt;/h4&gt;

&lt;p&gt;This means that processes were exiting, and init (as their parent) was never calling &lt;code&gt;wait()&lt;/code&gt;. Additionally, if you were to kill a zombie's parent to make init the parent, init would do nothing to help you.&lt;/p&gt;

&lt;p&gt;init, as one of its core functions, should be calling &lt;code&gt;wait()&lt;/code&gt; on its child processes to clear out any zombies, so what was happening?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;% sudo strace -p 1 -r -s 500
Process 1 attached - interrupt to quit
     0.000000 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
   273.565117 --- SIGCHLD (Child exited) @ 0 (0) ---
     0.000062 write(4, "\0", 1)         = 1
     0.000066 rt_sigreturn(0x4)         = 1
     0.000046 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
     0.003033 --- SIGCHLD (Child exited) @ 0 (0) ---
     0.000026 write(4, "\0", 1)         = 1
     0.000048 rt_sigreturn(0x4)         = 1
     0.000039 write(8, "init: serial-ttyS main process ended, respawning\r\n", 50) = ? ERESTARTSYS (To be restarted)
     7.825845 --- SIGCHLD (Child exited) @ 0 (0) ---
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;strace&lt;/code&gt; showed init stuck in a loop trying to write to a tty device, with each write getting &lt;code&gt;ERESTARTSYS&lt;/code&gt; back (meaning: please re-attempt that write). Init had no mechanism to handle that error, so it got stuck in an infinite loop.&lt;/p&gt;

&lt;p&gt;As for why it was getting &lt;code&gt;ERESTARTSYS&lt;/code&gt;: the tty it was writing to in this case was a serial tty. Grepping the serial tty driver code for &lt;code&gt;ERESTARTSYS&lt;/code&gt; found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 *  tty_send_xchar  -   send priority character
 *
 *  Send a high priority character to the tty even if stopped
 *
 *  Locking: none for xchar method, write ordering for write method.
 */&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;tty_send_xchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;tty_struct&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;was_stopped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;stopped&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send_xchar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send_xchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty_write_lock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;ERESTARTSYS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;was_stopped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;start_tty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;was_stopped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;stop_tty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;tty_write_unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tty&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Writing to a serial tty takes a lock in the form of &lt;code&gt;atomic_write_lock&lt;/code&gt; (acquired via the &lt;code&gt;tty_write_lock()&lt;/code&gt; call in the code above). Searching for info regarding this lock, I found a &lt;a href="http://bit.ly/2rOkSAE" rel="noopener noreferrer"&gt;bug&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;  Possible unsafe locking scenario:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;        CPU0                    CPU1
&amp;gt;&amp;gt;        ----                    ----
&amp;gt;&amp;gt;   lock(&amp;amp;tty-&amp;gt;termios_rwsem);
&amp;gt;&amp;gt;                                lock(&amp;amp;tty-&amp;gt;atomic_write_lock);
&amp;gt;&amp;gt;                                lock(&amp;amp;tty-&amp;gt;termios_rwsem);
&amp;gt;&amp;gt;   lock(&amp;amp;tty-&amp;gt;atomic_write_lock);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Putting this all together, one bug (the tty deadlock) exposed another (init's handling of it).&lt;/p&gt;

&lt;p&gt;If the serial tty gets into a deadlock, any logging that init tries to perform against it will get init stuck in a loop of &lt;code&gt;write -&amp;gt; ERESTARTSYS -&amp;gt; write&lt;/code&gt; ...&lt;/p&gt;

&lt;p&gt;If init stays in this state long enough, without the tty resetting, then zombies will pile up and you'll run out of PIDs.&lt;/p&gt;

&lt;h2&gt;
  
  
  How should init handle child processes?
&lt;/h2&gt;

&lt;p&gt;Looking at the sysvinit source in src/init.c, we see a couple of different ways init handles its child processes. Here we see init establishing a signal handler for &lt;code&gt;SIGCHLD&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;SETSIG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SIGCHLD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;chld_handler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SA_RESTART&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;chld_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;CHILD&lt;/span&gt;           &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt;             &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt;             &lt;span class="n"&gt;saved_errno&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;errno&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="cm"&gt;/*
         *      Find out which process(es) this was (were)
         */&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;waitpid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;WNOHANG&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;errno&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ECHILD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;RUNNING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                &lt;span class="n"&gt;INITDBG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L_VB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="s"&gt;"chld_handler: marked %d as zombie"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                        &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                                &lt;span class="n"&gt;ADDSET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;got_signals&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SIGCHLD&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                                &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;exstat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;ZOMBIE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                                        &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;exstat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                        &lt;span class="n"&gt;ch&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;ZOMBIE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                                &lt;span class="p"&gt;}&lt;/span&gt;
                                &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                        &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="n"&gt;INITDBG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;L_VB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chld_handler: unknown child %d exited."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;chld_handler&lt;/code&gt; executes when init gets a &lt;code&gt;SIGCHLD&lt;/code&gt; signal, which is sent to a parent when one of its children dies. Init handles zombies in a push model here, calling &lt;code&gt;waitpid()&lt;/code&gt; and flagging the process as a zombie for later cleanup:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ch-&amp;gt;flags |= ZOMBIE;&lt;/code&gt;&lt;/p&gt;
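
&lt;p&gt;For readers less familiar with this pattern, here is a stripped-down Python sketch of the same idea: a &lt;code&gt;SIGCHLD&lt;/code&gt; handler that loops over &lt;code&gt;waitpid(-1, WNOHANG)&lt;/code&gt; until nothing is left to reap. It is an illustration only, not how sysvinit is implemented; the &lt;code&gt;reaped&lt;/code&gt; list and &lt;code&gt;spawn_and_reap&lt;/code&gt; helper are invented for the example.&lt;/p&gt;

```python
import os
import signal
import time

reaped = []  # PIDs collected by the handler


def chld_handler(signum, frame):
    """Reap every child that has exited, without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # ECHILD: no children left at all
        if pid == 0:
            return  # children remain, but none has exited yet
        reaped.append(pid)


def spawn_and_reap(count):
    """Fork `count` short-lived children and let the handler reap them."""
    previous = signal.signal(signal.SIGCHLD, chld_handler)
    if previous is None:
        previous = signal.SIG_DFL  # handler was not installed from Python
    try:
        pids = []
        for _ in range(count):
            pid = os.fork()
            if pid == 0:
                os._exit(0)  # child: exit immediately
            pids.append(pid)
        # Give the handler up to roughly 5 seconds to collect everything.
        for _ in range(500):
            if set(pids).issubset(reaped):
                break
            time.sleep(0.01)
        return pids
    finally:
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    children = spawn_and_reap(3)
    print("forked:", children, "reaped:", reaped)
```

&lt;p&gt;The loop matters because standard signals are not queued: several children exiting close together may produce a single &lt;code&gt;SIGCHLD&lt;/code&gt;, so one handler run has to reap all of them.&lt;/p&gt;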

&lt;p&gt;Earlier, I said that when init gets stuck in this state it is unable to reap any of its own children that exit. But it has a signal handler for exactly this, so what's going on?&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;initlog()&lt;/code&gt; function blocks all signals while logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/*
 *      Re-establish connection with syslogd every time.
 *      Block signals while talking to syslog.
 */
sigfillset(&amp;amp;nmask);
sigprocmask(SIG_BLOCK, &amp;amp;nmask, &amp;amp;omask);
openlog("init", 0, LOG_DAEMON);
syslog(LOG_INFO, "%s", buf);
closelog();
sigprocmask(SIG_SETMASK, &amp;amp;omask, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So whenever &lt;code&gt;initlog()&lt;/code&gt; is called from the main loop and reaches &lt;code&gt;syslog(LOG_INFO, "%s", buf);&lt;/code&gt;, we hit our earlier bug. &lt;code&gt;syslog()&lt;/code&gt; respects &lt;code&gt;ERESTARTSYS&lt;/code&gt; from &lt;code&gt;write()&lt;/code&gt; and keeps retrying, so init gets stuck in here while all signals (including &lt;code&gt;SIGCHLD&lt;/code&gt;) are blocked, and &lt;code&gt;chld_handler&lt;/code&gt; never gets a chance to run.&lt;/p&gt;
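
&lt;p&gt;The effect of a blocked &lt;code&gt;SIGCHLD&lt;/code&gt; can be reproduced in a few lines. The Python sketch below is only an analogy for what happens inside init, with invented names (&lt;code&gt;on_sigchld&lt;/code&gt;, &lt;code&gt;blocked_sigchld_demo&lt;/code&gt;): while the signal is blocked the handler never runs, and it only fires once the old mask is restored.&lt;/p&gt;

```python
import os
import signal
import time

handled = []  # one entry per SIGCHLD delivered to the handler


def on_sigchld(signum, frame):
    handled.append(signum)


def blocked_sigchld_demo():
    """Return (handler runs while blocked, handler runs after unblocking)."""
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    if previous is None:
        previous = signal.SIG_DFL  # handler was not installed from Python
    start = len(handled)
    # initlog() blocks every signal; blocking SIGCHLD is enough for the demo.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        # Wait for the child to exit without reaping it (WNOWAIT).
        os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
        time.sleep(0.1)  # plenty of time for a handler to run, if it could
        while_blocked = len(handled) - start
    finally:
        # Restore the old mask, as initlog() does once syslog() returns.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

    # The pending SIGCHLD is delivered now; allow the handler a moment.
    for _ in range(100):
        if len(handled) != start:
            break
        time.sleep(0.01)
    after_unblock = len(handled) - start

    os.waitpid(pid, 0)  # clean up our zombie
    signal.signal(signal.SIGCHLD, previous)
    return while_blocked, after_unblock


if __name__ == "__main__":
    print(blocked_sigchld_demo())
```

&lt;p&gt;In init's case the mask is never restored, because &lt;code&gt;syslog()&lt;/code&gt; never returns, so the equivalent of the second number stays at zero forever.&lt;/p&gt;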

&lt;p&gt;Outside of the &lt;code&gt;chld_handler&lt;/code&gt;'s mechanism for reaping zombies, the rest of the main loop handles a variable called &lt;code&gt;family&lt;/code&gt; which stores all the child processes of init. It loops over these looking for processes to kill and/or reap from the process table if they have died.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>systems</category>
      <category>init</category>
      <category>kernel</category>
    </item>
  </channel>
</rss>
