Ages ago I wrote myself some notes on setting up Windows Server Essentials on Intel NUC hardware. Recently I did an upgrade on one of these machines, and ended up in a world of pain. Google was very thin on information to help me try and sort this out, so I figured this needed writing up…
My server had been running quietly and happily for some years, but I had recently realised I had two issues. The obvious one was that I was running out of disk space on the 1TB hard drive installed in the server, so I figured it was time to upgrade it to a 2TB drive. The second issue was that I had started seeing quite a few event log messages I didn’t like the look of. The most worrying ones were raised by the “ESENT” subsystem and included error IDs 510, 508 and 533. My event log was filling up with them.
The error message attached to those looks a bit scary:
A request to write to the file "[some file name]" at offset 229376 (0x0000000000038000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (28 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
The file in question varied a bit, but was usually something that looked related to Windows itself. It was noticeable that the computer would be generally unresponsive for a short period of time around each of these errors.
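If you want to tally these up rather than eyeball them, here is a minimal Python sketch that pulls the file name, offset, size and delay out of a message shaped like the one above. The names `SlowIo` and `parse_slow_io` are my own inventions for illustration, and the “read from” variant of the wording is an assumption; only the “write to” form is what I actually saw. How you get the message text out of the event log is left to you.

```python
import re
from typing import NamedTuple, Optional


class SlowIo(NamedTuple):
    """Fields extracted from one ESENT 'abnormally long time' message."""
    operation: str  # "read" or "write"
    path: str       # file name exactly as quoted in the message
    offset: int     # byte offset of the request
    size: int       # number of bytes requested
    seconds: int    # how long the OS took to service the request


# Matches the wording quoted above. The "read from" alternative is an
# assumption; the post only shows the "write to" form.
_PATTERN = re.compile(
    r'A request to (?P<op>read from|write to) the file "(?P<path>[^"]*)" '
    r'at offset (?P<offset>\d+) \(0x[0-9a-fA-F]+\) '
    r'for (?P<size>\d+) \(0x[0-9a-fA-F]+\) bytes succeeded, '
    r'but took an abnormally long time \((?P<seconds>\d+) seconds\)'
)


def parse_slow_io(message: str) -> Optional[SlowIo]:
    """Return the parsed fields, or None if the message has a different shape."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return SlowIo(
        operation="read" if match["op"].startswith("read") else "write",
        path=match["path"],
        offset=int(match["offset"]),
        size=int(match["size"]),
        seconds=int(match["seconds"]),
    )
```

Feeding it a batch of exported messages and sorting by `seconds` gives a quick feel for how bad the stalls are and whether they cluster around particular files.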
So given Windows was saying “faulty hardware” and I knew I needed more disk space, I decided it was definitely time for a new hard drive.
So after a quick click around some websites selling hard drives, I ordered myself a 2TB drive. When it turned up I used some disk imaging software to clone the data from the old drive to the new one, hoping to avoid the need to reinstall Windows. Then I put the new one into the machine and booted up again.
And the errors were still being generated…
I tried re-seating all the cables, and rebooted again. Errors still occurring.
I tried replacing the drive cables. Errors still occurring.
I tried another hard drive. Errors still occurring.
I tried another motherboard of the same model. Errors still occurring.
I tried updating the server’s BIOS. Errors still occurring.
I tried a fresh install of the latest release of Windows Server Essentials onto the new drive. Errors still occurring.
I tried a newer model of NUC. Errors still occurring.
I may have used some fairly strong language at this point. Having changed every aspect of the machine, I was still seeing the same error. Not necessarily instantly, but it was noticeable that as soon as I tried to run Windows Update, I would start seeing ESENT errors in the logs, and Windows Update would never successfully install updates, even when I left it running overnight, which seemed like a long enough wait.
I’ll save you the description of about two weeks of further googling, rebooting, re-fitting heatsinks, reinstalling Windows and banging my head on the desk at this point.
But eventually I came across some blog posts that talked about PCI power management causing hangs and potential disk issues. While these were not talking about my specific hardware or issues, they made me think that maybe I should check how the power management features were configured, both on this hardware and in a default install of Windows Server 2016. In most cases they seemed to be set to “try and save power”. So I tried adjusting the following settings:
- I changed the BIOS power management settings to prevent it throttling performance to save energy.
- I changed the Windows Power Options “AHCI Link Power Management” settings to turn them off, and prevent Windows from powering down the connection to these devices.
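I made the Windows change through the Power Options dialog, but the same thing can probably be scripted with `powercfg`. The sketch below only builds (and by default just prints) the commands. The two GUIDs are the ones commonly quoted for the disk subgroup and the “AHCI Link Power Management - HIPM/DIPM” setting, and the value 0 is commonly documented as “Active” (no link power saving); treat all three as assumptions and check them against the output of `powercfg /query` on your own machine before running anything. The function names are hypothetical.

```python
import subprocess
from typing import List

# Assumed GUIDs: verify with `powercfg /query` before relying on them.
SUB_DISK = "0012ee47-9041-4b5d-9b77-535fba8b1442"   # "Hard disk" subgroup
AHCI_LPM = "0b2d69d7-a2a1-449c-9680-f91c70521c60"   # AHCI Link Power Management
ACTIVE = 0  # assumed index for "Active", i.e. link power management off


def build_powercfg_commands(value: int = ACTIVE) -> List[List[str]]:
    """Build the powercfg invocations without running them."""
    return [
        # The setting may be hidden from the Power Options dialog; unhide it.
        ["powercfg", "-attributes", SUB_DISK, AHCI_LPM, "-ATTRIB_HIDE"],
        # Set the value for the "plugged in" (AC) case on the current plan.
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", SUB_DISK, AHCI_LPM, str(value)],
        # Set the value for the "on battery" (DC) case as well.
        ["powercfg", "/setdcvalueindex", "SCHEME_CURRENT", SUB_DISK, AHCI_LPM, str(value)],
        # Re-apply the current plan so the new values take effect.
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, or run it when dry_run is False (needs an elevated prompt)."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    apply(build_powercfg_commands(), dry_run=True)
```

Keeping the dry run as the default means you can read exactly what would be executed first, which seems wise given how much grief power settings had already caused me.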
And with that done, the ESENT errors stopped reappearing in my logs.
I ran the “SFC /scannow” command from an elevated command prompt, and allowed it to fix up some system files that were unhappy. And with this done I found I was finally able to get a successful run of Windows Update as well.
I do still see occasional instances of Error ID 153 from the “disk” subsystem in the System log, with a message like “The IO operation at logical block address 0x5a9088 for Disk 0 (PDO name: \Device\00000036) was retried.” But as far as I can tell from my searching, this is not actually a problem. It’s just that recent builds of Windows are more aggressive about logging some stuff that used to be treated as unimportant and quietly ignored.
All the messing about leads me to believe that there was never anything wrong with my original hard drive, and all this was caused by something changing in Windows that caused it to either be more sensitive to these disk power management issues, or to just start logging errors it had actually been encountering all along.
All the rebuilding involved in this pointed out that a few other things have changed for installing this software onto NUCs since I last tried:
- With the newer hardware, Intel have finally provided Windows Server drivers for their Ethernet cards, so you no longer have to hack about with the driver setup files to get the wired LAN to work. That is a good thing.
- The default install seems to fail to configure the time synchronisation correctly. You may see (and need to correct) log errors about this.
- If you have older machines using RDP to connect to a recently patched server, you may need to work around a security setting related to a crypto bug that throws up a warning about “CredSSP” and “Oracle remediation” when you try to connect.
- You’ll probably see lots of log messages about increasing the security of LDAP using signing. If you’re not supporting legacy LDAP clients, enable signing to get rid of the messages.
So getting my server back up to speed was a long and painful process. But I am now running a nice new NUC and I’ve also taken the opportunity to install the “My Media for Alexa” server so I can now enjoy the frustration of trying to get my Amazon Echo to understand me asking it to play specific tracks from my MP3 collection.
Fingers crossed I don’t have to reinstall again for a good long time…