Publication note (anonymization): Shell prompts, drive serials, LU WWNs, and wall-clock timestamps in pasted output are redacted or generalized so the article stays instructive without tying public text to a specific vendor shipment. Counts (ECC-corrected totals, GB processed), power-on time, firmware revision, and manufacture week/year match the real capture—the technical claims rest on those fields, not on identifiers. Use a cropped or blurred label photo in production if the sticker serial is still readable in pixels.
You ordered new enterprise disks. The PO says new. Logistics signs for new. Then you slide them into a ProLiant Gen9 class machine running Proxmox, and nothing about the story matches how “new” is supposed to feel.
This post is a field narrative with commands you can reuse: how a Smart Array can make lsblk look “empty,” how to reach SMART anyway, why HBA mode matters for ZFS, and what to do when power-on hours look pristine but the rest of the telemetry does not.
The actual problem we were solving
Disks failed. Replacements arrived. The job was not “install and hope”—it was verify the supply chain:
- Are these genuinely low-cycle parts?
- Are we about to bake gray-market refurbs into a RAID 5 or a ZFS pool that we will swear by for years?
The uncomfortable truth: SMART does not replace procurement judgment—it is a reporting channel. If you cannot read it at the right layer, or if you only read one field, you will fool yourself.
Procurement context (anonymized on purpose)
This write-up does not name the seller, contract vehicle, or price. That is intentional: the goal is a reusable methodology, not a public flame. Still, you should calibrate your priors with facts you do have internally—authorized distributor vs opportunistic channel, warranty terms, whether the SKU was “too cheap to be true,” and how much traceability the paperwork provides. None of that appears here; treat absent commercial context as uncertainty you fold into risk, not as proof either way.
Layer cake: where SMART “lives”
Rough mental model:
- Platter/flash + firmware stores vendor logs (hours, wear, internal defects, manufacture metadata—when exposed).
- A RAID controller may aggregate, delay, abstract, or gate access.
- The Linux block layer shows what the OS is allowed to see as /dev/sdX, /dev/nvme*, multipath devices, etc.
When people say “I ran smartctl and it looks fine,” the first question is: fine at which layer?
Act 1: lsblk — what you see depends on the controller
I inserted spares into a RAID 5 world. On the host:
lsblk
With the hardware RAID still presenting a single logical drive, lsblk showed essentially one large block device—the logical volume the controller exported—not three new naked disks sitting at /dev/sdb, /dev/sdc, …
That is normal for HPE Smart Array (e.g. P440ar class): the controller is a traffic cop. It hides unassigned physical drives from the OS until they join a logical drive or you change how the controller exposes devices.
Contrast — same question, different topology: once the machine was in a state where each physical path was visible to Linux (e.g. HBA mode / disks not hidden behind a single LD), lsblk finally showed one row per disk plus the Proxmox/LVM stack on the OS install device. Real capture from the audit host:
root@audit-host:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
sda 8:0 0 931.5G 0 disk
sdb 8:16 0 931.5G 0 disk
sdc 8:32 0 931.5G 0 disk
sdd 8:48 0 931.5G 0 disk
├─sdd1 8:49 0 1007K 0 part
├─sdd2 8:50 0 1G 0 part /boot/efi
└─sdd3 8:51 0 930G 0 part
├─pve-swap 252:0 0 8G 0 lvm [SWAP]
├─pve-root 252:1 0 96G 0 lvm /
├─pve-data_tmeta 252:2 0 8.1G 0 lvm
│ └─pve-data 252:4 0 793.8G 0 lvm
└─pve-data_tdata 252:3 0 793.8G 0 lvm
└─pve-data 252:4 0 793.8G 0 lvm
So the lesson is blunt: lsblk is not a universal physical inventory tool. On hardware RAID it is an inventory of what the OS can address as block storage right now—sometimes one LD, sometimes N bare /dev/sdX nodes once the “wall” is gone.
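One way to make that lesson mechanical is to count what the kernel can address and compare it with what you physically racked. A minimal sketch — count_disks is a hypothetical helper, and EXPECTED is whatever your bay count actually is:

```shell
# Compare kernel-visible whole disks against the physically racked count.
# A shortfall usually means a RAID controller is hiding unassigned drives
# behind a logical drive.
count_disks() { grep -c '^disk$' || true; }   # counts TYPE=disk rows

EXPECTED=4   # how many physical disks you believe are in the bays
if command -v lsblk >/dev/null 2>&1; then
  seen=$(lsblk -dn -o TYPE | count_disks)
  echo "kernel sees $seen whole disks, expected $EXPECTED"
  [ "$seen" -eq "$EXPECTED" ] || echo "WARN: check 'ssacli ctrl all show config' for unassigned PDs"
fi
```

`lsblk -dn -o TYPE` prints one TYPE row per top-level device (no partitions, no header), so the grep count is exactly the number of whole disks the OS can address right now.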
Act 2: ssacli—inventory the controller, not the kernel
HPE’s ssacli is how you ask the controller what it thinks exists: bays, PDs, LDs, rebuild status, unassigned drives, etc.
Proxmox (Debian underneath) does not ship that vendor CLI. You install it yourself—usually via:
- the HPE MCP repository with a modern signed-by= keyring entry (avoid legacy apt-key patterns in new builds), or
- a manual .deb.
The boring failure mode: pinned .deb URLs rot
If you wget a specific ssacli_x.y-z_amd64.deb URL you found in a blog post, expect 404 eventually. HPE rotates pool filenames. Prefer:
- repo install, or
- browse the vendor directory / use a small script to pick latest name, or
- download on a workstation and scp the package to the host (sadly practical when corporate networks/VPNs get in the way).
Example repo-style skeleton (validate codename against HPE docs for your Debian/Proxmox major):
curl -fsSL https://downloads.linux.hpe.com/SDR/hpePublicKey2048_key1.pub \
| gpg --dearmor -o /usr/share/keyrings/hpe-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hpe-archive-keyring.gpg] \
http://downloads.linux.hpe.com/SDR/repo/mcp bullseye/current non-free" \
> /etc/apt/sources.list.d/hpe-mcp.list
apt update && apt install ssacli -y
Then:
ssacli ctrl all show config
What I always grep mentally for:
- Logical drives (what the OS sees as “the array disk”)
- Unassigned drives (your spares, still invisible to lsblk)
- Recovering / degraded states—if parity is rebuilding, do not make rash controller changes until you understand the blast radius.
Act 3: smartctl through Smart Array (cciss)
Before flipping the whole machine to HBA, you can often still interrogate individual drives through the controller using smartctl’s HP-style device mapping.
Discovery:
smartctl --scan
You may see lines like /dev/sda -d cciss,0, /dev/sda -d cciss,1, …
Spot check:
smartctl -A -d cciss,N /dev/sda
SAS reality: grepping is a sport
SAS/SCSI logs are not always polite about naming. Power_On_Hours might not appear the way SATA textbooks promise. When in doubt, dump wider and read like a debugger:
for i in $(seq 0 15); do
echo "==== cciss,$i ===="
smartctl -a -d cciss,$i /dev/sda | egrep -i 'serial|hour|power|time|manufactured|ecc|error|health'
done
This is ugly, but it buys you coverage when bay order and cciss indices do not match your intuition.
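Once the wide dump has shown you which fields exist, you can condense each bay into a single audit line. A sketch — audit_line is a hypothetical helper, and the parsing keys ("Serial number", "Accumulated power on time", "Manufactured in") match the SAS capture later in this post; verify them against your smartctl build:

```shell
# Condense a smartctl -a capture into one line per drive: serial, hours,
# manufacture window -- the three fields this whole audit keeps cross-checking.
audit_line() {
  awk -F': *' '
    /^Serial number/            { serial = $2 }
    /Accumulated power on time/ { n = split($0, a, " "); hours = a[n] }
    /^Manufactured in/          { mfg = $0; sub(/^Manufactured in /, "", mfg) }
    END { printf "serial=%s hours=%s mfg=%s\n", serial, hours, mfg }
  '
}

# Per-bay usage while still behind the Smart Array:
#   smartctl -a -d cciss,3 /dev/sda | audit_line
# Demo with lines taken from the capture in this post:
audit_line <<'EOF'
Serial number: 9XG52****KY8W
Accumulated power on time, hours:minutes 1:04
Manufactured in week 28 of year 2013
EOF
# -> serial=9XG52****KY8W hours=1:04 mfg=week 28 of year 2013
```

One line per cciss index gives you a table you can paste straight into the ticket, so evidence photos can reference a serial instead of "slot 3, I think."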
Act 4: HBA mode—why ZFS people keep insisting
What I did on the Gen9 (on purpose): I removed every drive from the chassis, left a single disk installed, switched the Smart Array to HBA mode, then did a clean Proxmox install on that lone device. Only after that baseline did I rotate the rest of the batch through the machine and run the full audit pass (lsblk, smartctl, photos, notes) against each candidate. That workflow cost time, but it eliminated two headaches at once: no legacy logical drive hiding spares, and no guessing whether the OS view was “polluted” by an old array layout—I always knew I was reading raw paths on a known-fresh hypervisor root.
If your end state is ZFS on Proxmox, a very common “boring-hard” posture is:
- Controller in HBA mode (pass-through-ish visibility; exact wording varies by generation/features)
- Redundancy and scrubbing owned by ZFS (mirror, raidz1, raidz2, …)
Why bother?
- Per-disk telemetry becomes honest enough to operationalize (scrubs, zpool status, SMART monitoring tools).
- You stop fighting RAID abstraction every time a disk misbehaves slightly.
How you flip modes depends on policy and hardware support: System Options / Smart Storage at POST, and/or vendor CLI patterns like:
ssacli ctrl slot=0 modify hbamode=on
Hard stop warning: changing modes / deleting logical drives is data destructive if you do not have backups and a rebuild plan. Treat this as a change-managed migration, not a blog-copy-paste stunt.
After HBA, this often becomes pleasantly boring:
smartctl -a /dev/sdX
No cciss,N roulette—assuming the OS now sees each physical path cleanly.
When HBA + ZFS is not the automatic answer
Plenty of shops must stay on hardware RAID: corporate standard, support contract language, legacy monitoring that expects logical drives, boot disks already tied to an array, or a storage team that simply will not own filesystem-level redundancy. In those worlds, you can still audit with ssacli and smartctl -d cciss,N—you just may not get the same “naked disk” clarity ZFS operators chase. HBA + ZFS here is the path I chose because I control the full stack and wanted per-disk truth for a Proxmox build—not a universal mandate for every Gen9 still under a storage silo’s policy.
Act 5: When “almost zero hours” is the least interesting field
We eventually read a batch that looked “unused” by hours—single-digit power-on hours, some barely past one—yet the story fell apart on cross-checks:
Lot scope (so this is not a single bad apple): the procurement batch was ten drives. One drive was operationally unusable—the extreme ECC case detailed below. Across all ten, cross-checks showed telemetry and/or physical labels inconsistent with genuinely new, factory-sealed stock (low reported hours paired with old manufacture windows, label vs firmware/DOM mismatches, or corrupted manufacture strings). That is a systemic documentation smell for the lot, not just one marginal unit—while still not proving who altered what upstream.
1) Manufacture hints vs “baby disk” narrative
If the drive cheerfully reports very low power time but also exposes manufacture windows from a decade-ish ago, you should pause. That combination is compatible with counter hygiene in the supply chain—not with assumptions of brand-new factory stock.
2) Broken manufacture strings
Some units showed corrupt / truncated manufacture week/year fields in SMART text. Treat that as a process smell: something touched this device beyond “factory sealed happy path.”
3) ECC counters vs headline “OK”
One unit still presented as broadly “OK” in the headline sense, but the error counter log was obscene: massive corrected read/verify activity relative to a tiny amount of data processed. That pattern screams media fatigue or marginal heads—not a drive I want learning parity for my pool.
Rule: for SAS/SCSI logs, read the tables, not only the one-line health summary.
Act 6: Labels vs silicon—how to win an argument without starting one
We also found label vs firmware inconsistencies: DOM/firmware printed stories that did not match inquiry/SMART, plus model generations that did not match “recent assembly” claims.
Photo evidence (label vs silicon)
Same physical drive throughout: physical label vs smartctl (serial redacted in this public copy; full identifiers stay in internal evidence only). On this spare, the sticker story (e.g. DOM 03/2019, FW HPD9) did not match what the drive reported internally (manufacture week in 2013, FW HPD5)—the kind of mismatch you only catch if you photograph the label and capture SMART in one ticket.
Figure: label-side evidence for one problematic unit.
Same unit — smartctl excerpt (smartctl -a on the drive). Note Revision HPD5, Manufactured … 2013, and the error counter log vs only ~5 GB read — headline SMART Health Status: OK is misleadingly optimistic here.
smartctl 7.4 [x86_64-linux-*-pve] (build/version line trimmed for publication)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HP
Product: MM1000FBFVR
Revision: HPD5
Compliance: SPC-3
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Logical block size: 512 bytes
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c500[WWN redacted]
Serial number: 9XG52****KY8W
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: [redacted]
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 36 C
Drive Trip Temperature: 65 C
Accumulated power on time, hours:minutes 1:04
Manufactured in week 28 of year 2013
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 5
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 5
Elements in grown defect list: 0
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 13382752 0 5.039 0
write: 0 0 0 0 0 0.837 0
verify: 0 0 0 3229737 0 0.759 0
Non-medium error count: 6
No Self-tests have been logged
The ECC trap (why “SMART OK” was a trap)
Do not put this class of finding into a production ZFS pool. The summary line can still say SMART Health Status: OK and Elements in grown defect list: 0 while the real story sits in the error counter log—read that table line by line, not only the headline.
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 0 0 13382752 0 5.039 0
verify: 0 0 0 3229737 0 0.759 0
Against only ~5 GB read and ~0.76 GB verified, the drive had already accumulated on the order of 13.3 million read-side corrections and 3.2 million verify-side corrections in about one hour of logged power-on time. That is not “new disk noise”; it is media fatigue (or a marginal head/platter situation) being masked because the firmware is still winning the ECC battle—no uncorrected errors yet, so lazy dashboards stay green.
The magnetic surface is failing quietly. It will likely get worse fast; in ZFS you pay for that as latency, scrub pain, and resilver drama before the headline SMART field ever admits defeat. In our ten-drive lot, HPD5 on this unit versus HPD9 on other units was another clue we were not looking at one uniform “fresh from the factory” firmware story.
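If you want this corrections-per-GB check to be mechanical rather than eyeball-driven, the ratio math above can be scripted. A sketch — check_ecc is a hypothetical filter, the column positions match the capture above (validate on your smartctl version), and the 100k-per-GB threshold is an arbitrary audit cutoff to tune per fleet, not vendor guidance:

```shell
# Flag SAS drives whose ECC-corrected totals are huge relative to data processed.
# Parses "read:"/"verify:" rows of smartctl's error counter log.
check_ecc() {
  awk '
    $1 == "read:" || $1 == "verify:" {
      corrected = $5 + 0    # "Total errors corrected" column
      gb = $7 + 0           # "Gigabytes processed [10^9 bytes]" column
      ratio = (gb > 0) ? corrected / gb : corrected
      printf "%-8s corrected=%d GB=%.3f ratio=%.0f per GB\n", $1, corrected, gb, ratio
      if (ratio > 100000) flag = 1   # assumed audit threshold -- tune per fleet
    }
    END { exit flag ? 1 : 0 }
  '
}

# Demo with the rows from the drive in this post:
if check_ecc <<'EOF'
read:       0  0  0  13382752  0  5.039  0
verify:     0  0  0  3229737   0  0.759  0
EOF
then
  echo "ECC ratios look sane"
else
  echo "FLAG: corrections per GB far beyond new-disk noise"
fi
```

On this unit the read-side ratio works out to roughly 2.7 million corrections per GB—orders of magnitude past any threshold you would defend as "new disk noise," which is exactly why the exit code flags it.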
The professional move is not to accuse a specific counterparty of fraud in writing. It is to ship an evidence bundle:
- photo of label (crop/blur serial on the public copy if needed)
- smartctl capture for the same physical drive (serials/WWN redacted in public posts)
- a short table: label claims vs silicon reports
That forces corrective action upstream without you pretending you ran a criminal investigation. Supply chains fail; your job is to measure and document.
Procurement engineering: “Gen9 is EOL” is not a magic excuse
A common vendor line: “Your server is old; new HDDs basically do not exist; refurbs are normal.”
Sometimes partially true—but Gen9 2.5" SFF is still fundamentally SAS/SATA bays. With HBA + ZFS, you are often less chained to “HP-branded spinning SKU from that exact era” than RAID-centric workflows were.
What you can push for, technically:
- Enterprise SSDs with PLP (power-loss protection) suitable for server duty—not random consumer SATA models that will lie to you about endurance under sync writes.
- Examples of categories people commonly standardize on: Samsung PM893-class, Micron 5400 PRO-class, Kingston DC600M-class, Solidigm D3-S4520-class—not product endorsements, just “this is what ‘datacenter SSD’ means” anchors.
Mechanically: reuse the metal caddies, mount SSDs, validate with the same SMART discipline.
A practical audit checklist (copy into your ticket)
- [ ] Identify whether disks are behind hardware RAID / HBA / NVMe directly
- [ ] If HPE Smart Array: install ssacli, run ssacli ctrl all show config, note unassigned PDs
- [ ] Run smartctl --scan, then smartctl -a paths (with -d cciss,N while still in RAID mode if needed)
- [ ] Record: serial, model, hours, manufacture strings (if present), ECC/error counters
- [ ] If something is “new” but smells off: label photo + smartctl for the same bay/unit (redact IDs in public write-ups)
- [ ] Only then: bake into RAID/ZFS and sleep soundly
Command cheat sheet
lsblk
ssacli ctrl all show config
smartctl --scan
smartctl -a -d cciss,N /dev/sda # while controller presents cciss mapping
smartctl -a /dev/sdX # typical after HBA / direct paths
What I would do again
- Assume lsblk lies by omission on hardware RAID.
- Treat hours as a single signal—never the whole proof.
- Prefer HBA + ZFS when you want the OS to own disk truth operationally.
- Package procurement language around evidence (captures + photos), not accusations.
Disclosure: use of AI in this article
Yes — with human oversight. The narrative structure, English prose, section flow, and several edits were produced with help from an AI assistant (large language model in an editor workflow). I remained responsible for what went in: technical facts, methodology, and conclusions come from a real on-server audit (e.g. smartctl / lsblk captures, label photograph, and internal notes/scripts from that work). Public-facing command listings add deliberate redaction of identifiers and timestamps; telemetry numbers (ECC totals, GB read, hours, firmware, manufacture year/week) match the underlying capture. The figure tracks the real unit; the repo PNG is the redacted label export used for public posts—unredacted originals stay in the internal evidence bundle only.
Rough split: AI — drafting, wording, reordering, checklist formatting; human — incident ownership, evidence selection, accuracy checks, and final sign-off before publication.
If you have your own “the array looked fine but the disks were lying” story—RAID vs HBA, Dell PERC, MegaRAID, whatever—I want to read it in the comments.
