DEV Community: Roy Keene

A DSL For Seccomp Rules

Roy Keene — Tue, 17 Dec 2019 22:45:23 +0000

Recently I started to improve my package building system. It's a pretty simple build system that downloads source files from the Internet, verifies their SHA-256 hashes against known values for that version of the package, then extracts and compiles them. Something like:

url='...'
sha256='...'
destdir='...'

function verify() { ... }

function download() {
    curl -sSL "${url}" > src.new && \
        verify src.new "${sha256}" && \
        mv src.new src && \
        return 0
}

function extract() {
    gzip -dc src | tar -xf - && \
        mv foo/* . && \
        return 0
}

function build() {
    ./configure && make install DESTDIR="${destdir}"
}

download && extract && build

However, I discovered that some packages reach out to the Internet
during the build and download additional software without any hash
being verified, which could lead to unpredictable results.

I wanted to address this but keep the basic process intact.

I decided to create a bash extension (a loadable builtin): once the
shared object is loaded into bash, no further network access is permitted
by the shell or any of its child processes.

This can easily be accomplished using seccomp, so I set about writing the rules.

seccomp works by attaching a fragment of code, written in the instruction set specified by Berkeley Packet Filter (BPF), to the path between a process and the Linux system call interface. When a new system call is made by a process, the Linux kernel starts a BPF virtual machine and runs the code to determine the result (e.g., allow the system call, kill the process, return an error).

Writing code for that instruction set by manually typing in each opcode is tedious, but there are some C macros defined to make it a bit higher-level. Even with those macros, though, it's not a great experience: it is still very close to writing in an assembly language.

Initially I hard-coded the rules using these macros, but I wanted to make something a bit more flexible, so I created a domain-specific language for creating the rules.

The language looks a bit like Tcl (and is indeed parsed by a Tcl script). The ruleset I came up with for this is:

i386 {
    if {$nr eq "socketcall"} {
        return errno ENOSYS
    }

    if {$nr eq "socket"} {
        if {$args(0) != PF_LOCAL} {
            return errno EINVAL
        }
    }

    return allow
}

x86_64 {
    if {$nr eq "socket"} {
        if {$args(0) != PF_LOCAL} {
            return errno EINVAL
        }
    }

    return allow
}

This basically says: on the i386 platform, return ENOSYS for the socketcall() system call (since there is no good way to deal with its arguments via seccomp); on both the i386 and x86_64 platforms, reject the socket() system call with EINVAL unless the first argument is PF_LOCAL. All other system calls are permitted.

The resulting rules look like:

BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))), /* Load architecture */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_I386, 2, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 10, 0),
BPF_STMT(BPF_RET, SECCOMP_RET_TRAP),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))), /* if ($nr eq "socketcall") ... */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 102, 0, 1),
BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | ENOSYS),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))), /* if ($nr eq "socket") ... */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 359, 0, 3),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, args[0]))), /* if ($args(0) != PF_LOCAL) ... */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PF_LOCAL, 1, 0),
BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | EINVAL),
BPF_STMT(BPF_RET, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))), /* if ($nr eq "socket") ... */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 41, 0, 3),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, args[0]))), /* if ($args(0) != PF_LOCAL) ... */
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, PF_LOCAL, 1, 0),
BPF_STMT(BPF_RET, SECCOMP_RET_ERRNO | EINVAL),
BPF_STMT(BPF_RET, SECCOMP_RET_ALLOW),
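The jump offsets in the generated rules are relative: a BPF_JUMP whose test is true skips the next "jt" instructions, otherwise it skips the next "jf" instructions. To make that easier to follow, here is a minimal sketch of a toy interpreter for just the three kinds of instruction used above (load, jump-if-equal, return). It is not the kernel's BPF implementation: the sim_* names and the constant values are hypothetical stand-ins chosen for illustration, and only the system call numbers and jump offsets are copied from the generated rules.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's BPF opcodes and
 * seccomp constants; the values are arbitrary and for illustration only. */
enum sim_op    { SIM_LD, SIM_JEQ, SIM_RET };
enum sim_field { SIM_ARCH, SIM_NR, SIM_ARG0 };
enum { SIM_I386 = 1, SIM_X86_64 = 2, SIM_PF_LOCAL = 1 };
enum { SIM_ALLOW = 0x100, SIM_TRAP = 0x200, SIM_ENOSYS = 0x301, SIM_EINVAL = 0x302 };

struct sim_insn { enum sim_op op; uint32_t k; uint8_t jt, jf; };
struct sim_call { uint32_t arch, nr, arg0; };

/* Same shape as the generated rules: 19 instructions, same jump offsets. */
static const struct sim_insn sim_rules[] = {
    { SIM_LD,  SIM_ARCH,      0, 0 }, /*  0: load architecture            */
    { SIM_JEQ, SIM_I386,      2, 0 }, /*  1: i386   -> instruction 4      */
    { SIM_JEQ, SIM_X86_64,   10, 0 }, /*  2: x86_64 -> instruction 13     */
    { SIM_RET, SIM_TRAP,      0, 0 }, /*  3: any other architecture       */
    { SIM_LD,  SIM_NR,        0, 0 }, /*  4: i386 block starts here       */
    { SIM_JEQ, 102,           0, 1 }, /*  5: socketcall? else -> 7        */
    { SIM_RET, SIM_ENOSYS,    0, 0 }, /*  6                               */
    { SIM_LD,  SIM_NR,        0, 0 }, /*  7                               */
    { SIM_JEQ, 359,           0, 3 }, /*  8: socket? else -> 12           */
    { SIM_LD,  SIM_ARG0,      0, 0 }, /*  9                               */
    { SIM_JEQ, SIM_PF_LOCAL,  1, 0 }, /* 10: PF_LOCAL -> 12               */
    { SIM_RET, SIM_EINVAL,    0, 0 }, /* 11                               */
    { SIM_RET, SIM_ALLOW,     0, 0 }, /* 12                               */
    { SIM_LD,  SIM_NR,        0, 0 }, /* 13: x86_64 block starts here     */
    { SIM_JEQ, 41,            0, 3 }, /* 14: socket? else -> 18           */
    { SIM_LD,  SIM_ARG0,      0, 0 }, /* 15                               */
    { SIM_JEQ, SIM_PF_LOCAL,  1, 0 }, /* 16: PF_LOCAL -> 18               */
    { SIM_RET, SIM_EINVAL,    0, 0 }, /* 17                               */
    { SIM_RET, SIM_ALLOW,     0, 0 }, /* 18                               */
};

/* Run the rules against one simulated system call and return the verdict. */
uint32_t sim_run(const struct sim_call *c) {
    const size_t len = sizeof(sim_rules) / sizeof(sim_rules[0]);
    uint32_t acc = 0; /* the accumulator that loads fill and jumps compare */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sim_insn *in = &sim_rules[pc];
        switch (in->op) {
        case SIM_LD:
            acc = in->k == SIM_ARCH ? c->arch
                : in->k == SIM_NR   ? c->nr
                : c->arg0;
            break;
        case SIM_JEQ:
            /* Skip jt or jf instructions; the loop's pc++ moves past this one. */
            pc += (acc == in->k) ? in->jt : in->jf;
            break;
        case SIM_RET:
            return in->k;
        }
    }
    return SIM_TRAP; /* fell off the end; the rules above never do */
}

/* Convenience wrapper taking the three values the rules look at. */
uint32_t sim_check(uint32_t arch, uint32_t nr, uint32_t arg0) {
    struct sim_call c = { arch, nr, arg0 };
    return sim_run(&c);
}
```

With this, sim_check(SIM_X86_64, 41, SIM_PF_LOCAL) yields SIM_ALLOW, while the same call with any other first argument yields SIM_EINVAL. It also makes visible something the DSL does not spell out: a process on an architecture other than the two listed hits the SECCOMP_RET_TRAP instruction.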

The updated build process looks something like:

...

function dropNetwork() {
    enable -f /path/to/dropnet.so dropnet
}

download && dropNetwork && extract && build

The rest of the dropnet code is pretty trivial: it really just
loads the seccomp filter as part of initialization of the
shared object. It's very effective in keeping build processes
from reaching out to the Internet, and we are one step closer to ensuring
reproducible builds.

Full source for dropnet can be found here.

Network Load Balancing with CLUSTERIP

Roy Keene — Fri, 05 Jul 2019 20:54:57 +0000

Load-balancer Less Load Balancing

There's not a lot of information on CLUSTERIP on the Internet for some reason. It's an implementation of an older technique, made easier by an IPTables target extension.

Flavio's Technotalk on CLUSTERIP
Load Sharing with IPtables and Linux-HA
Microsoft calls this technique "Network Load Balancing"
LARTC has a longer explanation of the underlying mechanism in their article "How to do simple load-balancing with Linux without a single point of failure"

The way CLUSTERIP works is fairly simple.

Every member of the cluster is attached to the same broadcast domain;
Every member of the cluster is configured with the same multicast MAC address;
Each member of the cluster then filters out incoming packets they don't think they should handle:
1. In an exclusive manner with respect to other nodes (i.e., no other member of the cluster will handle the packet);
2. And in an inclusive manner with respect to packets (i.e., when all the nodes of the cluster are up every packet will get handled by a node); also
3. Using the following criteria normally:
  1. Based on source IP; or
  2. Based on source IP and source port; or
  3. Based on source IP and source port and dest port
4. If a node is down, another node can notice and assume responsibility for its share of the incoming packets
Outgoing packets are sent with the source IP and MAC address of the cluster, but the destination IP of the target and destination MAC address of the next-hop router (gateway)

The description above reveals the major weakness of CLUSTERIP: incoming packets are replicated N times (once for every member of the cluster), so CLUSTERIP used alone cannot load-balance incoming traffic bandwidth effectively (other, higher-layer techniques can sometimes mitigate this). Outgoing traffic is unaffected and will be split as evenly as the load-balancing scheme permits.
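The filtering rule each member applies can be sketched in C. This is only a sketch under stated assumptions: the FNV-1a hash is a stand-in I picked for illustration (not necessarily what the kernel uses), nodes are numbered from 1, the cluster has at most 31 nodes, and the names packet, responsible_node, node_accepts, and count_acceptors are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The three hashing criteria listed above. */
enum hash_mode { HASH_SIP, HASH_SIP_SPT, HASH_SIP_SPT_DPT };

struct packet { uint32_t src_ip; uint16_t src_port; uint16_t dst_port; };

/* Mix one 32-bit value into an FNV-1a hash, byte by byte (stand-in hash). */
static uint32_t fnv1a(uint32_t h, uint32_t value) {
    for (int i = 0; i < 4; i++) {
        h ^= (value >> (8 * i)) & 0xffu;
        h *= 16777619u;
    }
    return h;
}

/* Every node computes the same answer: the 1-based number of the node
 * that should handle this packet. total_nodes must be at least 1. */
unsigned responsible_node(const struct packet *p, enum hash_mode mode,
                          unsigned total_nodes) {
    uint32_t h = fnv1a(2166136261u, p->src_ip);
    if (mode >= HASH_SIP_SPT)     h = fnv1a(h, p->src_port);
    if (mode >= HASH_SIP_SPT_DPT) h = fnv1a(h, p->dst_port);
    return (h % total_nodes) + 1;
}

/* owned_mask has bit N set if this host currently answers for node number N.
 * A host that takes over for a failed peer simply sets that peer's bit too. */
int node_accepts(const struct packet *p, enum hash_mode mode,
                 unsigned total_nodes, uint32_t owned_mask) {
    return (int)((owned_mask >> responsible_node(p, mode, total_nodes)) & 1u);
}

/* In a fully healthy cluster, count how many nodes accept the packet. */
unsigned count_acceptors(const struct packet *p, enum hash_mode mode,
                         unsigned total_nodes) {
    unsigned accepted = 0;
    for (unsigned node = 1; node <= total_nodes; node++) {
        accepted += (unsigned)node_accepts(p, mode, total_nodes, 1u << node);
    }
    return accepted;
}
```

Because every member sees every packet and runs the same computation, exactly one of them accepts it and the rest drop it. That is where both the exclusive and the inclusive property come from, and also why inbound bandwidth is not actually divided.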

BtrFS, ZFS, and More

Roy Keene — Wed, 03 Jul 2019 21:58:01 +0000

Butter Filesystem. Hold the toast.

I've started experimenting with BtrFS, which aims to provide an "advanced and modern filesystem" (heavily compared to ZFS) on Linux. With my new workstation I've started using BtrFS for my home directories (/home) and my build directories (/mnt/slackbuilds) to gain exposure to the filesystem and compare it to ZFS and EXT4 on LVM (all of my other data, including my root disk, is on EXT4 on LVM).

I have used ZFS heavily in the past, and using BtrFS is significantly different as many of the fundamental concepts are different. BtrFS has no concept of "pools" or "volume groups" -- instead there are "volumes." BtrFS has no concept of "datasets" or "logical volumes" -- instead there are "subvolumes".

Here's a comparison between ZFS, BtrFS, and EXT4 on LVM:

Comparison

	ZFS	BtrFS	EXT4 and LVM
Commands Involved	zpool, zfs	mkfs.btrfs, btrfs	pvcreate, vgcreate, lvcreate, mkfs.ext4
Pool of disks	"zpool"	"volume"	"volume group"
Mountable unit	"dataset"	"volume" and "subvolume"	"logical volume"
License	CDDL	GPL	GPL
Can be Boot filesystem	Yes	Yes (grub 2.00)	No
Can be Root filesystem	Yes	Yes	Yes
Can provide swapspace	Yes (zvols)	No	Yes (lvm)
OSes with Implementations	Solaris, OpenSolaris, Nexenta, FreeBSD, Mac OS X, Linux	Linux	Linux
Stability	Stable	Stable	Stable
CLI-System Integration [1]	Strong	Weak	Mild
Grow Online	Yes	Yes	Only when there are no snapshots
Shrink Pool	No	Online	Online
Shrink Filesystem	Online (reduce quota)	Online	Offline
Replace Disk (without parity)	Yes (must be compatible size disk)	Yes	Yes (copies space allocated on disk)
Filesystem Level Storage Pooling	Yes	Yes	No
Re-balance	No	Yes	Can be done manually (pvmove)
Checksumming	Yes	Yes	No
Autocorrect Checksum Errors	Yes	???	No
Compression	Yes	Yes	No
De-duplication	Yes (only synchronous)	Yes (only asynchronous)	No
Ditto Blocks	Yes	???	No
Tiered Caching	Yes	No	No
Writable Snapshots	Yes (clone)	Yes	Yes
Copy-on-Write	Fast, space-efficient	Fast, space-efficient	Slow, requires pre-allocating an LV
Redundancy	Mirroring and Parity (x1, x2, x3)	Mirroring	Mirroring, though the PVs can be RAID devices
Maximum Volume Size	16 Exabytes	16 Exabytes	1 Exabyte
Maximum File Size	16 Exabytes	16 Exabytes	16 Terabytes
Maximum Number of Snapshots	Unlimited	Unlimited	Effectively 32

[1] For lack of a better term -- how well the command line interface integrates with the system as a whole; this might be subjective.

For a more complete, but less focused, comparison see Wikipedia's Comparison of Filesystems.

The Rosetta Stone

Task: Create a pool of storage from disks /dev/A, /dev/B, and /dev/C (striped or linear concat)
1. Using ZFS:
  1. # zpool create TESTPOOL A B C
2. Using BtrFS:
  1. # mkfs.btrfs -d raid0 /dev/A /dev/B /dev/C
3. Using EXT4 on LVM:
  1. # pvcreate /dev/A /dev/B /dev/C
  2. # vgcreate TESTPOOL /dev/A /dev/B /dev/C
Task: Make storage from pool available to system
1. Using ZFS:
  1. # zfs set mountpoint=/data TESTPOOL
2. Using BtrFS:
  1. # mkdir /data
  2. # mount -t btrfs /dev/A /data
  3. Update /etc/fstab
3. Using EXT4 on LVM:
  1. # mkdir /data
  2. # lvcreate -L _SizeOfVolume_ -n DATA TESTPOOL
  3. # mkfs -t ext4 /dev/TESTPOOL/DATA
  4. # mount /dev/TESTPOOL/DATA /data
  5. Update /etc/fstab
Task: Add an additional disk to the pool
1. Using ZFS:
  1. # zpool add TESTPOOL D
2. Using BtrFS:
  1. # btrfs device add /dev/D /data
  2. # btrfs filesystem balance /data
3. Using EXT4 on LVM:
  1. # pvcreate /dev/D
  2. # vgextend TESTPOOL /dev/D
Task: Add additional space to a filesystem
1. Using ZFS:
  1. No action needed
2. Using BtrFS:
  1. No action needed
3. Using EXT4 on LVM:
  1. # lvextend -L _SizeToIncreaseTo_ /dev/TESTPOOL/DATA
  2. # resize2fs /dev/TESTPOOL/DATA
Task: Remove a disk from the pool
1. Using ZFS: N/A
2. Using BtrFS:
  1. # btrfs device delete /dev/A /data
  2. # btrfs filesystem balance /data
3. Using EXT4 on LVM:
  1. # pvmove /dev/A
  2. # vgreduce TESTPOOL /dev/A
Task: Replace operational disk
1. Using ZFS:
  1. # zpool replace TESTPOOL A D
2. Using BtrFS:
  1. # btrfs device add /dev/D /data
  2. # btrfs device delete /dev/A /data
  3. # btrfs filesystem balance /data
3. Using EXT4 on LVM:
  1. # pvcreate /dev/D
  2. # vgextend TESTPOOL /dev/D
  3. # pvmove /dev/A /dev/D
  4. # vgreduce TESTPOOL /dev/A
Task: Take a snapshot of a filesystem
1. Using ZFS:
  1. # zfs snapshot TESTPOOL@snapshot1
2. Using BtrFS:
  1. # btrfs subvolume snapshot /data /data/snapshot1
3. Using EXT4 on LVM:
  1. # lvcreate -s /dev/TESTPOOL/DATA -L _SizeToAllocate_ -n snapshot1
Task: Rollback a snapshot
1. Using ZFS:
  1. # zfs rollback TESTPOOL@snapshot1
2. Using BtrFS:
  1. Not sure...
3. Using EXT4 on LVM:
  1. # lvconvert --merge /dev/TESTPOOL/snapshot1
Task: Delete a snapshot
1. Using ZFS:
  1. # zfs destroy TESTPOOL@snapshot1
2. Using BtrFS:
  1. # btrfs subvolume delete /data/snapshot1
3. Using EXT4 on LVM:
  1. # lvremove /dev/TESTPOOL/snapshot1

The Day "/proc" Died

Roy Keene — Tue, 02 Jul 2019 18:20:28 +0000

Solaris'd...

This morning I got an email from our group manager asking about a UNIX server out at one of the districts:

Date: Thu, 27 Jan 2011 14:14:49
From: "UNIX Manager"
To: UNIX_SA_GROUP
Subject: Milky Way District UNIX Server

Someone take a look into whats going on with Milky Way District UNIX Server today.
The server is up but is acting really strange, I was not able to get a
response from ps or ptree commands.  Please note, there was not any server
consolidation work at this site last night.

--
A. Manager
Enterprise Unix & Storage Systems Manager
Spacely Sprockets, Inc.

So I logged in expecting a mundane failure:

workstation$ ssh milkyway1
milkyway1$ ps -eaf
     UID   PID  PPID   C    STIME TTY         TIME CMD
    root     0     0   0   Dec 08 ?           0:45 sched
    root     1     0   0   Dec 08 ?          72:15 /sbin/init
    root     2     0   0   Dec 08 ?           0:00 pageout
    root     3     0   0   Dec 08 ?        5514:24 fsflush
    root    59     1   0   Dec 08 ?           0:00 /lib/svc/method/iscsid
    root     7     1   0   Dec 08 ?           1:48 /lib/svc/bin/svc.startd
    root     9     1   0   Dec 08 ?           3:13 /lib/svc/bin/svc.configd
    root   149     1   0   Dec 08 ?           0:04 devfsadmd
  daemon   168     1   0   Dec 08 ?          73:17 /usr/lib/crypto/kcfd
^C
^C

~.

So far, so good. It does indeed appear to be broken. It probably can't talk to its LDAP server or something silly. I check "nsswitch.conf" and the only things configured for "ldap" are "sudoers" and "netgroup". Suspiciously normal.

workstation$ ssh milkyway1
milkyway1$ sudo -i
milkyway1# cd /proc
milkyway1# echo *
0 1 10045 10116 10162 10163 10169 10176 10240 10243 10244 10246 10263 10318 10386
10483 10489...
milkyway1# ls -ln
^C
^C

~.

Hmm. Curiouser and curiouser. Why is my "ls" hanging ? What HAS science done ? Let's take a look at what the kernel thinks is going on with that process while it is "hung" (being able to attach a debugger to the running kernel is one of the MOST USEFUL things about Solaris; however, it sometimes gets abused, which can be tragic).

workstation$ ssh milkyway1
milkyway1$ sudo -i
milkyway1# ls -ln /proc & echo $!
20023
milkyway1# mdb -k
> 0t20023::pid2proc
302765d6078
> 302765d6078::walk thread
300e3d5e100
> 300e3d5e100::findstack -v
stack pointer for thread 300e3d5e100: 2a10d548d71
[ 000002a10d548d71 cv_wait+0x38() ]
  000002a10d548e21 pr_p_lock+0x80(0, 60032d48030, 60032d58600, 300b84af968, ff000000, 18f8218)
  000002a10d548ed1 prgetattr+0x2d4(30025ce8240, 2a10d549998, 149, ffffffffffffffef, 300bd1a8da8, 0)
  000002a10d548f91 fop_getattr+0x18(30025ce8240, 2a10d549998, 0, 3025fb17d00, 2a10d549ad8, 1362e00)
  000002a10d549041 cstat64_32+0x1c(30025ce8240, ffbffb20, 0, 3025fb17d00, 3fff, 3c00)
  000002a10d549221 cstatat64_32+0x5c(ffffffffffd19553, 26578, 1000, ffbffb20, 1000, 0)
  000002a10d5492e1 syscall_trap32+0xcc(26578, ffbffb20, ffffffffffffffff, 27f68, 6c, 1b)
>

Simple enough: it seems to be waiting on a condition variable (cv_wait()) as part of acquiring a lock (pr_p_lock()) after a 32-bit process has made a system call (syscall_trap32()) for stat() with the file whose name is stored at 0x26578.

This pointer will probably point to a string that is a filename under "/proc" since that's what I was running "ls" on after all. I'll have to switch to the context of that process to interpret that memory.

> 302765d6078 $p
debugger context set to proc 302765d6078
Segmentation Fault
milkyway1#

Oh. Well, that wasn't nice. Let's try it another way...

milkyway1# mdb -p 20023 
mdb: failed to initialize /lib/libc_db.so.1: libthread_db call failed unexpectedly
mdb: warning: debugger will only be able to examine raw LWPs
Loading modules: [ ld.so.1 libc.so.1 libavl.so.1 ]
> 26578::print -i char*
0x26578 "/proc/26936"
>

So my "ls" is hanging because "stat()" is being called on "/proc/26936". What is running with the PID of 26936 ? Hmm... Well, we can't just run "ps" since that will "stat()" that file in "/proc" and hang. The modular debugger to the rescue, again:

milkyway1# echo '::ps -fz' | mdb -k | grep 26936
S    PID   PPID   PGID    SID  ZONE    UID      FLAGS             ADDR NAME
R  26936  26935  26935  26935     3   1000 0x4a014000 00000300b84af968

So that process corresponds to a command with no name (what ?) run by the UID 1000 in the zone 3. Now we know the "who", what about the "what in the world is it doing?" ?

milkyway1# mdb -k
Loading modules: [ unix genunix specfs dtrace ufs sd mpt px md ldc ip hook neti sctp
arp usba fcp fctl emlxs qlc lofs zfs ssd random crypto fcip logindmux ptm nfs ipc ]
> 00000300b84af968::walk thread
3000f654d40
> 3000f654d40::findstack -v
stack pointer for thread 3000f654d40: 2a109f38a61
[ 000002a109f38a61 cv_wait+0x38() ]
  000002a109f38b11 txg_wait_open+0x54(6003f983d20, 16533a, 0, 6003f983d64, 6003f983d66, 6003f983d18)
  000002a109f38bc1 zfs_putapage+0x1e0(6004d6fae40, 5b, 2a109f39570, 2a109f39568, 400, 10a6ac0)
  000002a109f38cb1 zfs_putpage+0x1b8(3002836dd40, b7c1a000, 0, 400, 30114c15318, 7000f9c9d80)
  000002a109f38d81 fop_putpage+0x1c(3002836dd40, 0, b7c1a000, 400, 30114c15318, 7b262c7c)
  000002a109f38e31 zfs_delmap+0x6c(3002836dd40, 0, 10a6800, 30114c15318, b7c1a000, b)
  000002a109f38ef1 zfs_shim_delmap+0x3c(3002836dd40, 30114c15318, 1, 1, f, b)
  000002a109f38fc1 fop_delmap+0x40(3002836dd40, 30114c15318, 1, f, b7c1a000, b)
  000002a109f39091 segvn_unmap+0x180(3017e8a0bd0, fffffffec0000000, b7c1a000, b7c1a000, f, 600439d6b78)
  000002a109f39181 as_unmap+0xe4(300ee744d20, 3017e8a0bd0, 300de51bd88, b7c1a000, 1, 1)
  000002a109f39231 munmap+0x78(1fff, b7c1a000, 10a6800, 300b84af968, 300b84af968, fffffffec0000000)
  000002a109f392e1 syscall_trap+0xac(fffffffec0000000, b7c1a000, 0, 8, 10012c4d0, 10012c460)
>

Hmm, well that process ALSO seems to be waiting on a condition variable, this time inside the ZFS kernel module (txg_wait_open()). Maybe they're related ?

I look at the storage available to the zone with the ID of 3 (obtained via "zoneadm list -v") and notice that the zpool that provides most of the storage to that zone (including its "/") is 100% full.

I send out an email detailing my findings and someone logs into that zone and cleans a few files up and things magically start working again.

Hooray.

RAID5 lost, RAID5 recovered

Roy Keene — Mon, 01 Jul 2019 17:45:44 +0000

lost+found

While I was on vacation a few years ago a RAID enclosure at one of our district sites died. Attempts by off-site personnel to revive it were unsuccessful. Nobody local to the site could resolve the situation either. When I got back from vacation this problem was waiting for me. I got back to the office, then went out to the district site and resolved the issue by manually reconstructing the array.

A few weeks later a coworker asked me about what the problem was and how it was resolved.

Below is my reply.

Mark,

I attached the enclosure in JBOD mode to my Linux workstation. I wrote my own (userspace) RAID5 driver to reconstruct the array (attached). I then determined the number of disks in the missing RAID5 parity group (11). I then determined the order of the 11 disks -- of which there are 39,916,800 possible arrangements. I then determined that 1 disk was missing (so I only had 10 of the 11 disks; the missing disk's location still had to be determined, so it did not reduce the number of possible combinations). I then determined that one of the members I was accessing was an outdated member. I then physically repaired one of the disks to get an up-to-date member. I then got the partitions on the local drives from Ian. I then created loopback devices with differing offsets to represent the hardware array's partitions and the slices on the disk. I then ran "file" on all the loopback devices so I could determine which filesystems they were and when they were last accessed. I used "dd" to copy the loopback devices to the NFS server to be mounted using lofiadm on the UNIX system from the NFS server.

I also mounted the filesystems up on my workstation. Here's the script I used to do that ("sdk" was an outdated member of the array -- got most of the data using this, but there was some filesystem corruption):

# ./raid5d 6149 /dev/sdc MISSING /dev/sdf /dev/sdk /dev/sde /dev/sdi /dev/sdd /dev/sdj /dev/sdb /dev/sdg /dev/sdh

Once I repaired the previously failed disk:

# ./raid5d 6149 /dev/sdc /dev/sdl /dev/sdf MISSING /dev/sde /dev/sdi /dev/sdd /dev/sdj /dev/sdb /dev/sdg /dev/sdh
# nbd-client bs=4096 127.0.0.1 6149 /dev/nbd1
# losetup -r -o 805306368 --sizelimit $[36 * 1024 * 1024 * 1024] /dev/loop0 /dev/nbd1
# losetup -r -o 39460012032 --sizelimit $[36 * 1024 * 1024 * 1024] /dev/loop1 /dev/nbd1
# losetup -r -o 78114717696 --sizelimit $[256 * 1024 * 1024 * 1024] /dev/loop2 /dev/nbd1
# losetup -r -o 352992624640 --sizelimit $[256 * 1024 * 1024 * 1024] /dev/loop3 /dev/nbd1
# losetup -r -o 627870531584 --sizelimit $[679622426624 - 627870531584] /dev/loop4 /dev/nbd1
# losetup -r -o 679622426624 --sizelimit $[773094113280 - 679622426624] /dev/loop5 /dev/nbd1
# mount -t ufs -o ro,ufstype=sun,onerror=umount /dev/loop3 /mnt/hostwork/mnt/mount3
# mount -t ufs -o ro,ufstype=sun,onerror=umount /dev/loop4 /mnt/hostwork/mnt/mount4
# mount -t ufs -o ro,ufstype=sun,onerror=umount /dev/loop5 /mnt/hostwork/mnt/mount5

Here's how I copied the data to the NFS server:

# dd if=/dev/loop0 of=loop0.img bs=128k
# dd if=/dev/loop1 of=loop1.img bs=128k
# dd if=/dev/loop3 of=loop3.img bs=128k
# dd if=/dev/loop4 of=loop4.img bs=128k
# dd if=/dev/loop5 of=loop5.img bs=128k

Here's how I determined the mountpoint and age of all the filesystems using "file":

# dd if=/dev/loop0 bs=128k count=1 2>/dev/null | file -
/dev/stdin: Unix Fast File system [v1] (big-endian), last mounted on /mount0, last written at Thu Jan 12 21:51:21 2012, clean flag 253, number of blocks 37740000, number of data blocks 37164676, number of cylinder groups 712, block size 8192, fragment size 1024, minimum percentage of free blocks 1, rotational delay 0ms, disk rotational speed 90rps, TIME optimization
# dd if=/dev/loop1 bs=128k count=1 2>/dev/null | file -
/dev/stdin: Unix Fast File system [v1] (big-endian), last mounted on /mount1, last written at Thu Jan 12 21:50:10 2012, clean flag 253, number of blocks 37740000, number of data blocks 37164676, number of cylinder groups 712, block size 8192, fragment size 1024, minimum percentage of free blocks 1, rotational delay 0ms, disk rotational speed 90rps, TIME optimization
# dd if=/dev/loop2 bs=128k count=1 2>/dev/null | file -
/dev/stdin: Unix Fast File system [v1] (big-endian), last mounted on /mount2, last written at Thu Jan 12 20:23:54 2012, clean flag 253, number of blocks 268415040, number of data blocks 264335586, number of cylinder groups 5483, block size 8192, fragment size 1024, minimum percentage of free blocks 1, rotational delay 0ms, disk rotational speed 90rps, TIME optimization
etc...

Here are the partitions on the logical drive, from Ian (only ld0 was damaged):

  (02:12:43) Ian * partitions
  LD/LV    ID-Partition        Size
  -------------------------------------
  ld0-00   00000000-00        256MB
  ld0-01   00000000-01        256MB
  ld0-02   00000000-02        256MB
  ld0-03   00000000-03      36.00GB
  ld0-04   00000000-04      36.00GB
  ld0-05   00000000-05     256.00GB
  ld0-06   00000000-06     256.00GB
  ld0-07   00000000-07      96.41GB
  ld1-00   11111111-00     682.39GB
  ld2-00   22222222-00     272.96GB
  ld2-01   22222222-01     272.96GB

Obviously I'm omitting a lot of the details, but this is a basic overview.

Let me know if you have any further questions.
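Two pieces of arithmetic carry that email: the 39,916,800 figure is simply 11! (the number of ways to order 11 disks), and a RAID5 stripe with one member "MISSING" can still be read because the missing block is the XOR of the surviving blocks of that stripe. Here is a minimal sketch of both. It is not the raid5d driver mentioned in the email; the function names are hypothetical, and it deals with a single stripe only, ignoring how data and parity are laid out across stripes (which is where the disk order matters).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of possible orderings of n distinct disks: n! */
uint64_t disk_orderings(unsigned n) {
    uint64_t result = 1;
    for (unsigned i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}

/* XOR count blocks of block_len bytes together into out.
 * Over the data blocks this yields the parity block; over all surviving
 * blocks (data and parity) it yields the one block that is missing. */
void xor_blocks(const uint8_t *const *blocks, size_t count,
                size_t block_len, uint8_t *out) {
    memset(out, 0, block_len);
    for (size_t b = 0; b < count; b++) {
        for (size_t i = 0; i < block_len; i++) {
            out[i] ^= blocks[b][i];
        }
    }
}

/* Build one tiny stripe, "lose" a member, and rebuild it.
 * Returns 1 if the rebuilt block matches the lost one. */
int rebuild_self_check(void) {
    const uint8_t d0[4] = { 0xde, 0xad, 0xbe, 0xef };
    const uint8_t d1[4] = { 0x01, 0x02, 0x03, 0x04 };
    const uint8_t d2[4] = { 0xff, 0x00, 0x55, 0xaa };
    uint8_t parity[4], rebuilt[4];

    const uint8_t *data[] = { d0, d1, d2 };
    xor_blocks(data, 3, sizeof(parity), parity);        /* parity = d0 ^ d1 ^ d2 */

    const uint8_t *survivors[] = { d0, d2, parity };    /* d1 is "MISSING" */
    xor_blocks(survivors, 3, sizeof(rebuilt), rebuilt); /* rebuilt = d1 */

    return memcmp(rebuilt, d1, sizeof(rebuilt)) == 0;
}
```

Since XOR does not care about order, the reconstruction itself works for any arrangement of the surviving disks; it is putting the blocks back in the right sequence that requires knowing the original order and which slot is the missing one.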

I told the users they should migrate their data off that controller.

They are still using it.

Hooray.

Writing Your First PAM Module

Roy Keene — Sun, 30 Jun 2019 15:36:10 +0000

Pluggable Authentication for the Masses

Recently I was talking with someone who expressed an interest in allowing their users to specify that their account can only be logged into using an SSH key, and thus not with a password. They wanted to allow this to be done by the user -- similar to how SSH's "authorized_keys" is done.

I suggested that they write a simple PAM module to accomplish this, but they indicated that they felt intimidated by that prospect. So here I am, to walk you all through this relatively simple process.

Typically I develop an application iteratively in such a way that I always have an application that compiles and doesn't do anything adverse. So I will walk through the steps I take to build any PAM module first, then add the specific functionality piece by piece.

First, we'll start with a basic PAM module that returns "ignore" for everything:

/* Define which PAM interfaces we provide */
#define PAM_SM_ACCOUNT
#define PAM_SM_AUTH
#define PAM_SM_PASSWORD
#define PAM_SM_SESSION

/* Include PAM headers */
#include <security/pam_appl.h>
#include <security/pam_modules.h>

/* PAM entry point for session creation */
int pam_sm_open_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

/* PAM entry point for session cleanup */
int pam_sm_close_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

/* PAM entry point for accounting */
int pam_sm_acct_mgmt(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

/* PAM entry point for authentication verification */
int pam_sm_authenticate(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

/*
   PAM entry point for setting user credentials (that is, to actually
   establish the authenticated user's credentials to the service provider)
 */
int pam_sm_setcred(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

/* PAM entry point for authentication token (password) changes */
int pam_sm_chauthtok(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        return(PAM_IGNORE);
}

So now we have our PAM module that just does nothing. Let's go ahead and compile it (Linux compilation specified):

user@build$ gcc -fPIC -DPIC -shared -rdynamic -o pam_ignore.so pam_ignore.c

Or, if you are running a multilib system, you will need to compile the PAM module for every architecture your system has a "libpam" for, for example for Linux/x86_64 and Linux/i386:

user@build$ gcc -m32 -fPIC -DPIC -shared -rdynamic -o pam_ignore_32.so pam_ignore.c
user@build$ gcc -m64 -fPIC -DPIC -shared -rdynamic -o pam_ignore_64.so pam_ignore.c

Next, we'll install the PAM module into the path where it should live. On most Linux systems this is "/lib/security" (or "/lib/security" for 32-bit and "/lib64/security" for 64-bit libraries on 32/64 multilib systems)

root@test# cp pam_ignore_32.so /lib/security/pam_ignore.so
root@test# cp pam_ignore_64.so /lib64/security/pam_ignore.so
root@test# chown root:root /lib/security/pam_ignore.so /lib64/security/pam_ignore.so
root@test# chmod 755 /lib/security/pam_ignore.so /lib64/security/pam_ignore.so

Finally, we should configure our PAM implementation to actually use the module. This is where we first have to start making decisions on how our system should interact with our PAM module. To determine this, we go back to our problem and back to talking about how PAM actually works.

We want SSH to allow all users to login using their SSH keys but selectively deny access using a password. Typically this kind of decision would be made by the "account" interface of a module. The "account" interface determines whether an account is valid for this login, so it would return PAM_PERM_DENIED for users who have opted out of password logins. But by the time we are processing the "account" interface the user has already authenticated, either via SSH keys or via password -- and we don't know which one has taken place !

There are several ways to resolve this dilemma, along two basic lines of reasoning. The first is that we can use the "authentication" interface of the PAM module to record which one happened and then later retrieve that information from the "account" interface of the PAM module. The second is that SSH only calls the "authentication" interface of PAM modules during non-SSH-key based authentication, so we could just return failure there.

There are benefits to each approach, but for simplicity we will use the latter approach.

So, given that, we are now ready to insert our PAM module into the PAM configuration. We have determined that we only care about its "authentication" interface, and only for SSH. How we insert this depends on your PAM configuration, but on Linux it's typically done by editing "/etc/pam.d/sshd". For our module we will insert it above all the other "auth" modules, so something like (only the top line is added, the second line is an example of what might already exist):

auth       requisite         pam_ignore.so
auth       include           system-auth

There are some important details about PAM at this point:

PAM directives are followed in-order
There is the concept of a "result" of a PAM module (Success, Failure, Ignore, Error)
The "action" (requisite, required, sufficient, etc) indicates what exactly is done with the result

A very important note about the "action" we used above: "requisite" means that if our PAM module returns failure, that failure is immediately returned to the application, which reports it to the user. We do this because we don't want something like "pam_tally" counting this as an authentication attempt and failure. If success is returned, it could be considered successful authentication if nothing else in the "PAM stack" returns any failures. We don't want that, since we don't actually check anyone's authentication token (password), so we will only ever return PAM_IGNORE or PAM_AUTH_ERR.
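How the "action" and the "result" combine can be sketched as a small model in C. This is a deliberately simplified model and not the PAM library's implementation: it only knows three actions and three results, it ignores "include" lines and the bracketed control syntax, and every name in it (entry, outcome, evaluate_stack) is hypothetical.

```c
#include <stddef.h>

enum result  { R_SUCCESS, R_FAILURE, R_IGNORE };
enum control { C_REQUIRED, C_REQUISITE, C_SUFFICIENT };

/* One line of the stack: its action, plus what the module would return. */
struct entry { enum control control; enum result result; };

/* The overall verdict, and how many modules were consulted to reach it. */
struct outcome { enum result result; size_t consulted; };

struct outcome evaluate_stack(const struct entry *stack, size_t n) {
    int failed = 0, succeeded = 0;
    size_t i = 0;

    while (i < n) {
        const struct entry *e = &stack[i++];

        if (e->result == R_IGNORE) {
            continue;              /* as if the module were not listed */
        }
        if (e->result == R_FAILURE) {
            if (e->control == C_SUFFICIENT) {
                continue;          /* a failed "sufficient" is not held against us */
            }
            failed = 1;
            if (e->control == C_REQUISITE) {
                break;             /* stop now; later modules never run */
            }
        } else {
            succeeded = 1;
            if (e->control == C_SUFFICIENT && !failed) {
                break;             /* good enough; stop now */
            }
        }
    }

    /* Success needs at least one success and no recorded failure. */
    struct outcome out = { (succeeded && !failed) ? R_SUCCESS : R_FAILURE, i };
    return out;
}
```

In this model, a stack whose first entry is a failing "requisite" module ends with one module consulted, so a counting module further down never sees the attempt. If our module returns "ignore" instead, the decision is left entirely to the modules after it.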

Alright, so now we have our PAM module in place and being used. We should ensure that we can still login to our test system at this point. We should do this with a new connection and leave our existing root session alone in case we need to undo the changes.

After we have verified that our PAM module is operable and we can still authenticate to the system we can move on to making our PAM module actually do something. Finally.

Since we are only providing an "authentication" interface we will only be modifying the pam_sm_authenticate function and I will not repeat the other functions or the C headers we included. However, your source code should be the aggregate of this modification and the original above.

The first thing we want our authentication system to do is identify the user that is attempting to authenticate. Conveniently PAM provides us with a function that does just that -- pam_get_user(). Since we will be linked with PAM at run-time by any application that uses us we do not need to indicate at compile-time that we depend on it.

pam_get_user() takes three (3) arguments:

The PAM handle (pamh);
A place to store the username (user); and
An optional prompt, if the username has not already been collected (prompt)

These are all relatively easy to satisfy. The function we are operating inside of already provides us with the PAM handle (pamh). We should already have the username, so we don't care too much about the prompt. The most difficult thing to do is provide a place to store the username. The documentation indicates that this parameter is a pointer to a C-style string (pointer to a char), and that we do not need to free() it later. This makes it easy on us.

Alright, so let's start with that:

#include <unistd.h>
int pam_sm_authenticate(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        const char *user = NULL;
        int pgu_ret;

        pgu_ret = pam_get_user(pamh, &user, NULL);
        if (pgu_ret != PAM_SUCCESS || user == NULL) {
                return(PAM_IGNORE);
        }

        return(PAM_IGNORE);
}

Now we have the username. The next step is to actually check for the file that we want to use to determine if we should succeed or fail. Let's define this as "<user_home_dir>/.ssh/nopasswd". So that means we need to find the user's home directory.

In order to get the user's home directory from what we know of the user (their username) we will need to use one of the functions in the getpwnam-family of functions. Specifically we will use getpwnam_r(), because PAM modules need to be re-entrant: they are linked into applications that may have multiple threads doing unrelated things.

So our code with getpwnam_r() added in looks like this:

#include <unistd.h>
#include <pwd.h>
int pam_sm_authenticate(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        struct passwd *pw = NULL, pw_s;
        const char *user = NULL;
        char buffer[1024];
        int pgu_ret, gpn_ret;

        pgu_ret = pam_get_user(pamh, &user, NULL);
        if (pgu_ret != PAM_SUCCESS || user == NULL) {
                return(PAM_IGNORE);
        }

        gpn_ret = getpwnam_r(user, &pw_s, buffer, sizeof(buffer), &pw);
        if (gpn_ret != 0 || pw == NULL || pw->pw_dir == NULL || pw->pw_dir[0] != '/') {
                return(PAM_IGNORE);
        }

        return(PAM_IGNORE);
}

Next, we'll construct the path to the file to check using snprintf() and actually check for it, using access():

#include <unistd.h>
#include <stdio.h>
#include <pwd.h>
int pam_sm_authenticate(pam_handle_t *pamh, int flags, int argc, const char **argv) {
        struct passwd *pw = NULL, pw_s;
        const char *user = NULL;
        char buffer[1024], checkfile[1024];
        int pgu_ret, gpn_ret, snp_ret, a_ret;

        pgu_ret = pam_get_user(pamh, &user, NULL);
        if (pgu_ret != PAM_SUCCESS || user == NULL) {
                return(PAM_IGNORE);
        }

        gpn_ret = getpwnam_r(user, &pw_s, buffer, sizeof(buffer), &pw);
        if (gpn_ret != 0 || pw == NULL || pw->pw_dir == NULL || pw->pw_dir[0] != '/') {
                return(PAM_IGNORE);
        }

        snp_ret = snprintf(checkfile, sizeof(checkfile), "%s/.ssh/nopasswd", pw->pw_dir);
        if (snp_ret < 0 || (size_t) snp_ret >= sizeof(checkfile)) {
                return(PAM_IGNORE);
        }

        a_ret = access(checkfile, F_OK);
        if (a_ret == 0) {
                /* The user's file exists, return authentication failure */
                return(PAM_AUTH_ERR);
        }

        return(PAM_IGNORE);
}

We should probably name this something other than "pam_ignore" at this point, perhaps something like "pam_ssh_denypasswd" or similar.

Then we compile, install, and test as above and declare victory.

Hooray.

Postscript:

If you wish to take the PAM module further it's a good idea to start to incorporate GNU autoconf at this point to support multiple platforms. I have a skeleton similar to "pam_ignore" called "pam_success" that can be used for this purpose.

You Can't Get There From Here (Policy Based Routing)

Roy Keene — Sat, 29 Jun 2019 17:08:58 +0000

Which way do I go ? Well, you see... the problem is... you can't get there from here

Occasionally in the course of server management it becomes desirable, or sometimes even necessary, to configure a UNIX or Linux system with network interfaces on multiple networks.

On the surface, the problem appears straight-forward. It may indeed be as simple as it sounds depending on what you do after configuring the network interface. Oftentimes, however, after configuring the network interface and reaching that early success an enterprising system administrator will then say to themselves "Excellent, I now have an address on this network, now I want to be able to reach it from other networks ! I know, I'll add a route !" and off they go.

The problem is that routing, standard routing, is based solely on the destination of the packet. Read that last sentence carefully and you will notice several things. The first, and most obvious, is that the destination address is used to determine which route to take -- this seems obvious when stated plainly but the subtlety can often be missed. The second is that it is the destination of the packet -- there are no other qualifiers that might indicate that the packet is part of some higher level stream, since indeed that may not be the case.

You see, when most people add the route they think they want they are thinking of things like TCP sessions or connections being tracked and handled that way. But that's not what is being specified by that route.

As an example, let's say we have a machine that starts out like this:

server# ifconfig eth0 192.168.5.100 netmask 255.255.255.0 broadcast 192.168.5.255
server# route add default gw 192.168.5.1

Then we add our second network interface:

server# ifconfig eth1 10.230.5.100 netmask 255.255.255.0 broadcast 10.230.5.255

Then, from a client on another network, let's say 10.44.11.19 (and presuming that we have a route through the network to 10.230.5.100), we try to reach the box (server) on the IP 10.230.5.100. The results can vary: it works, it works sometimes, or it doesn't work at all. The reason for this lies in our host's routing table.

When we send the packets for our TCP session from our client to 10.230.5.100 they take a particular path through the network and end up coming in on the server's "eth1" interface. Since the packets are associated with a TCP socket with the destination address of 10.230.5.100, the packets going back will have the source address set to 10.230.5.100. However, when the packets from that TCP session are sent back from the server to the client, only the destination (10.44.11.19) is used to select a route. Thus, the packets will leave the server on its default route and via the network interface "eth0". This can lead to them taking a different path back to the client than the packets from the client to the server.

This may be fine for example if no reverse-path/egress filtering is done by the router on that interface, and if no stateful firewalls exist along one path but not the other.

It also means that if the router 192.168.5.1 becomes unavailable, you will not be able to receive packets from the system (unless you happen to be connected to one of the broadcast domains it is on).

This is very likely undesirable behavior.
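
The lookup described above can be sketched in a few lines of C. This is only an illustration of destination-based, most-specific-match routing and not how any particular kernel implements it; the structure and function names (route, route_lookup, example_reply_dev) are made up for this example. The table holds the two connected networks and the single default route from the example server.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry in a simplified, destination-only routing table (hypothetical layout). */
struct route {
        uint32_t network;     /* network address, host byte order */
        unsigned prefix_len;  /* 0 = default route, 32 = host route */
        const char *dev;      /* interface the packet leaves on */
};

/* Build an address from dotted-quad octets. */
static uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d) {
        return(((uint32_t) a << 24) | ((uint32_t) b << 16) | ((uint32_t) c << 8) | (uint32_t) d);
}

/* Netmask for a prefix length; a shift by 32 is undefined, so 0 is handled separately. */
static uint32_t prefix_mask(unsigned prefix_len) {
        assert(prefix_len <= 32);
        return(prefix_len == 0 ? 0 : (uint32_t) 0xffffffffu << (32 - prefix_len));
}

/*
 * Pick the most specific route that contains dst.  The src argument is
 * accepted but never consulted, which is exactly the behavior at issue.
 */
static const char *route_lookup(const struct route *table, size_t count, uint32_t src, uint32_t dst) {
        const struct route *best = NULL;
        size_t idx;

        (void) src;

        for (idx = 0; idx < count; idx++) {
                if ((dst & prefix_mask(table[idx].prefix_len)) != table[idx].network) {
                        continue;
                }
                if (best == NULL || table[idx].prefix_len > best->prefix_len) {
                        best = &table[idx];
                }
        }

        return(best != NULL ? best->dev : NULL);
}

/* Which interface does a reply from 10.230.5.100 to 10.44.11.19 leave on ? */
const char *example_reply_dev(void) {
        const struct route table[] = {
                { ip4(192, 168, 5, 0), 24, "eth0" },
                { ip4(10, 230, 5, 0),  24, "eth1" },
                { 0,                    0, "eth0" }   /* default via 192.168.5.1 */
        };

        return(route_lookup(table, sizeof(table) / sizeof(table[0]), ip4(10, 230, 5, 100), ip4(10, 44, 11, 19)));
}
```

Calling example_reply_dev() yields "eth0": the only entry that contains 10.44.11.19 is the default route, and the fact that the reply carries the source address of "eth1" never enters into the decision.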

To correct this the enterprising administrator will likely do something similar to:

server# route add default gw 10.230.5.1

And think that because they have added a second default gateway via "eth1" that packets will start going out that interface if they have a source address of the IP of "eth1". But this is wrong. Again, normal routing is based only on the destination address of packets. Also, the routing entries are processed from most-specific to least-specific, and the two default routes are equally specific, so now packets may end up leaving the system via either "eth0" or "eth1" at random (this is indeed what happens on Solaris).

So how do we solve this problem ? How do we get packets to do what we want ? Well, first we have to define the problem. Up until now we've only defined the symptoms and the current behavior.

The problem is that packets are going out the "wrong" interface. How do we define which interface is the right interface ? It's easy -- the right interface is the interface with the IP of the source address of the packet. Looked at this way we can see that we want our routing to be source-based instead of destination-based. This is implemented by using the technique of Policy Based Routing.

So how does one actually implement this Policy Based Routing ("PBR") ? It depends on the platform. On Linux one would use the IP Advanced Routing features, on Solaris one would use "ipf".

To implement the above example using PBR on Linux, first we would remove the extraneous default route entry we just added, because it does not implement the policy we want:

server# route del default gw 10.230.5.1

The way Linux Advanced Routing handles policy based routing is through the use of multiple routing tables. This allows for very flexible, but very clearly defined, routes to be configured.

Just creating additional routing tables isn't sufficient, however, since we need to actually tell the Linux routing system when to use which routing table. This is done with rules.

Also, we can't simply get rid of the "default gateway" entry in the "main" routing table (the default name of the routing table which the "route" command manipulates) because it is used to determine the source IP address to use when creating sockets that have not explicitly specified a source.

Alright, so we have our two concepts: routing tables (of which we can have several) and rule entries (of which we can also have several). To actually convey the configuration changes to the system we use the "ip" command, part of the Linux IP Advanced Routing system and provided by the iproute2 package.

First we create our new routing tables. Routing tables are identified by a number (names can be associated with these numbers for convenience, but for clarity here we will just use the numbers). To do this we would do something like:

server# echo "Configuring eth0"
server# ip route add 192.168.5.0/24 dev eth0 table 100
server# ip route add default via 192.168.5.1 table 100
server# echo "Configuring eth1"
server# ip route add 10.230.5.0/24 dev eth1 table 101
server# ip route add default via 10.230.5.1 table 101

Second we create our rules. Our rules implement our policy. Our policy is to classify packets based on their source address. Our rules would thus be something like:

server# ip rule add from 192.168.5.100 table 100
server# ip rule add from 10.230.5.100 table 101
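
To make the effect of those two rules concrete, here is a rough sketch in C of the selection step they add in front of the normal destination lookup. It is a simplification under stated assumptions: real rules carry priorities and can match on more than an exact source address, and the names used here (rule, select_table, example_table_for) are hypothetical. The only fact borrowed from Linux is that the "main" table has the number 254.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_MAIN 254  /* the number Linux assigns to the "main" table */

/* A source-address rule: packets from this address use this table. */
struct rule {
        uint32_t from;  /* source address, host byte order */
        int table;      /* routing table number to consult */
};

/* Build an address from dotted-quad octets. */
static uint32_t ip4(unsigned a, unsigned b, unsigned c, unsigned d) {
        return(((uint32_t) a << 24) | ((uint32_t) b << 16) | ((uint32_t) c << 8) | (uint32_t) d);
}

/* Return the table of the first rule matching src, else the main table. */
int select_table(const struct rule *rules, size_t count, uint32_t src) {
        size_t idx;

        assert(rules != NULL || count == 0);

        for (idx = 0; idx < count; idx++) {
                if (rules[idx].from == src) {
                        return(rules[idx].table);
                }
        }

        return(TABLE_MAIN);
}

/* The two rules from the example, applied to a source address a.b.c.d */
int example_table_for(unsigned a, unsigned b, unsigned c, unsigned d) {
        const struct rule rules[] = {
                { ip4(192, 168, 5, 100), 100 },
                { ip4(10, 230, 5, 100),  101 }
        };

        return(select_table(rules, sizeof(rules) / sizeof(rules[0]), ip4(a, b, c, d)));
}
```

With these rules a packet sourced from 192.168.5.100 is looked up in table 100, one sourced from 10.230.5.100 in table 101, and anything else falls through to the main table. Only after the table has been chosen does the usual destination-based lookup happen, inside that table.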

We can then verify that our routes are being used by using the "ip route get" command:

server# ip route get 4.4.4.4 from 192.168.5.100
4.4.4.4 from 192.168.5.100 via 192.168.5.1 dev eth0
    cache   mtu 1500 advmss 1460
server# ip route get 4.4.4.4 from 10.230.5.100
4.4.4.4 from 10.230.5.100 via 10.230.5.1 dev eth1
    cache   mtu 1500 advmss 1460

It worked.

Hooray.

Why Is There Packet Loss ?

Roy Keene — Fri, 28 Jun 2019 15:33:31 +0000

Is the Internets dying ?

I work at a datacenter. Really, I work at half a datacenter. The datacenter spans two physical locations: the location I work at, and another location on the coast. Because we are so inter-dependent reliable connectivity between the two sites is a crucial element for us to actually perform our hosting duties.

Since our datacenters are so far apart we don't have anything as simple as a cable connecting us. Instead we lease bandwidth on someone else's network to provide connectivity between the two sites. We are on a budget and leasing dedicated bandwidth is expensive so while we have an OC-12 at both sites connected to our provider's network, they only guarantee that we will be able to sustain 45Mbps -- everything above that is best-effort. We didn't know about this limitation until afterwards.

Recently we started an effort to provide "Continuity of Operations" which involved making the data we store available at the opposite site in case of catastrophic failure at a site. Our continuity of operations plan required that we be able to bring the systems back to the state they were in before the failure, and as quickly as possible. To accomplish this goal we decided that the best way to get the data to each opposite site and keep it up-to-date was over the network.

We set this up and let it go one evening and began noticing increased latency and packet loss from all hosts at the two sites. We started looking at network graphs and saw that we were only doing 500Mbps across the WAN link (OC-12, 622Mbps). We began to question the network group about the packet loss and they admitted that we really only had 45Mbps of guaranteed bandwidth between the two sites. They said they were using QoS so this shouldn't be a problem.

I looked at their QoS configuration and noticed that they had not specified any sort of bandwidth limit on their Cisco QoS configuration. I mentioned that we weren't reaching our interface capacity but they seemed to think that their QoS configuration was being effective. It did set some DSCP parameters, but all of our traffic between the two sites was being set to the same value so it is unlikely to have been useful.

I attempted to explain the issue to them in the following email:

Bob,

Here's a little justification, explanation, and plan of action for
our experiment with Quality of Service.

Late last week, when the SAN team began to use unused bandwidth
(while not exceeding our link capacity) we experienced packet loss between
the datacenters. Packet loss is caused by one of two things:

A device or transit (e.g. cable or repeater) malfunctioning;
A queue being full (or nearly full):
1. Either an interface's outbound queue;
2. A device's global queue; or
3. Random Early Detection (RED) signaling that a queue (one of the above) is nearing fullness

Given the general reliability of modern network devices, and the
fact that the packet loss stopped once we reduced the amount of traffic
we were transmitting across the network I think we can eliminate a
device malfunction as a cause that we should attempt to address.

This leaves us with a queue being full or nearing fullness. This queue may be on a
device inside our network (e.g., the firewall) or within the WAN network
(e.g., a router or switch on our provider's network).

First, let me expound upon:

Why packet loss slows down a TCP stream;
Why queues being full (or nearly full, causing higher RED probabilities) leading to packet loss is our problem; and
Why queues being nearly full leading to increased latency leading to TCP connections slowing down is not our problem.

First, Transmission Control Protocol (TCP) is a network protocol that (among other characteristics) guarantees delivery and ordering of packets within a socket stream. In TCP, packets that are transmitted by one side are acknowledged by the other side. This acknowledgment is done by transmitting a packet to the sender indicating which packets were received in a given "acknowledgment window" (range of bytes). If the receiver does not receive enough packets to construct the entire range of bytes that the acknowledgment window covers it will not send this acknowledgment. The sender waits a defined amount of time for the acknowledgment (various algorithms exist, but we will just assume 2 * AverageRoundTripTime). If it does not arrive in that time, either because the acknowledgment packet itself was lost or because a lost packet left a gap in the receiver's acknowledgment window so that it was never sent, the sender will resend all of the packets in the acknowledgment window that was not acknowledged.

Because TCP guarantees ordering of packets, even over media that do not guarantee this (e.g., Ethernet), the receiving side of a TCP socket must have a buffer available to re-order incoming packets. This buffer is naturally of finite size. Thus, if packets have been lost, the receiver must wait for them to be retransmitted before it can emit any later packets it may have to the system's socket and evict them from the buffer. The size of the receiver's buffer therefore defines how many packets can be outstanding/unacknowledged at a given time, and with it the TCP window size.

Once the sender has sent enough packets to fulfill the TCP window size without a contiguous range starting from the last cleared acknowledgment window being acknowledged, it will stop transmitting until further data is acknowledged and buffer space becomes available. If we presume that AverageRoundTripTime is 100 ms, the sender will wait 200 ms for an acknowledgment; in the presence of packet loss it will then retransmit the missing segment, which will take an additional 100 ms to get there and be acknowledged. During that 300 ms interval, a link that is capable of carrying 100Mbps could have transmitted roughly 3.75MBytes of data. It is likely that the receiver's buffer will have been exhausted by then and the transmitter will have stopped transmitting. With transmitting stopped, the average throughput for the link will be decreased significantly.

In addition, in the presence of packet loss TCP decreases the window size and requires more frequent acknowledgments, which will be limited by the latency.
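
For reference, the amount of data a link could have carried during a stall like the 300 ms one above is just the link rate multiplied by the interval. A small sketch in C, assuming the rate is given in bits per second and the interval in milliseconds (the function name bytes_in_interval is made up for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes a link can carry in a given interval:
 *     (bits per second * milliseconds) / 1000 / 8
 */
uint64_t bytes_in_interval(uint64_t link_bps, uint64_t interval_ms) {
        /* Refuse inputs whose product would overflow 64 bits. */
        assert(interval_ms == 0 || link_bps <= UINT64_MAX / interval_ms);

        return((link_bps * interval_ms) / 1000 / 8);
}
```

For a 100Mbps link and a 300 ms interval, bytes_in_interval(100000000, 300) gives 3750000 bytes, a little under 4MBytes. That is the amount of receive buffer that would be needed to keep the sender busy through a single loss and retransmission at that rate.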

Second, devices which engage in "store and forward" must utilize some data structure for storing which packets remain to be sent out a given interface. This data structure is typically a queue as it will limit the amount of out of order packets transmitted out the interface, while maximizing the number of packets the device can manage. Some network devices will have multiple queues for a given interface and process the queues in a pre-determined order (e.g., when de-queuing a packet, check the highest priority queue first, if no packets, try the next, etc). These queues are naturally of finite size, and if they are being filled faster than they are being emptied eventually no more packets can be queued.

What happens when an interface's queue is full is simple: packets that would have been added to it are dropped ("tail-drop"). Most modern network devices will also implement "Random Early Detection" (RED) where packets will be dropped from a non-empty queue (at a probability that increases the more full the queue is) to prevent the queue from becoming full and forcing tail-drop (since tail-drop can lead to massive failure with TCP).

This packet loss (either from tail-drop or from RED) slows the TCP connection down significantly, allowing the packets in the queue to be processed. We were seeing increased packet loss (~15%) over the WAN link during the period when we were heavily utilizing the network path between the two datacenters.
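
As a rough illustration of the RED behavior described above, the classic form of the algorithm drops nothing below a minimum threshold, drops everything at or above a maximum threshold, and ramps the drop probability linearly in between. The sketch below assumes that textbook form; real devices typically work from a smoothed average queue depth and vendor-specific tuning, and the function and parameter names here are hypothetical.

```c
#include <assert.h>

/*
 * Drop probability for an arriving packet given the (averaged) queue depth.
 *   depth <  min_th            -> never drop
 *   min_th <= depth < max_th   -> rises linearly from 0 to max_p
 *   depth >= max_th            -> always drop (equivalent to tail-drop)
 */
double red_drop_probability(double avg_depth, double min_th, double max_th, double max_p) {
        assert(min_th >= 0.0 && min_th < max_th);
        assert(max_p >= 0.0 && max_p <= 1.0);

        if (avg_depth < min_th) {
                return(0.0);
        }

        if (avg_depth >= max_th) {
                return(1.0);
        }

        return(max_p * (avg_depth - min_th) / (max_th - min_th));
}
```

With illustrative thresholds of 20 and 60 packets and a maximum probability of 0.1, a queue averaging 40 packets deep would drop about 5% of arriving packets, and one averaging 60 or more would drop all of them.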

Third, in the process of inserting packets into a queue and waiting for their turn to de-queue, time passes. This time causes increased latency. Increased latency will cause an increase in the average utilization of the receiver's TCP receive buffer, and the TCP window. As long as the latency does not grow to the point where that buffer becomes full waiting for acknowledgments it will not significantly impact the throughput of a TCP socket. We did not note exceedingly high latency during the period where we were heavily utilizing the network path between the two datacenters.

Given that this is an issue with a queue being full somewhere
causing packet loss, it would be beneficial to us to control the queue
that was dropping packets so that we can control WHAT packets are being
dropped. Given that the packet loss may be happening on network devices
outside our control (e.g., within our provider's network) we have limited
options to attempt to control which queue fills and drops packets:

Differentiated Services Code Point (DSCP) Marking; and
Enact queues on our end that are not dequeued faster than the rate we can sustain without unacceptable packet loss between endpoints

I will expound upon the benefits and costs of the two approaches:

Differentiated Services Code Point (DSCP) Marking allows us to attempt to manipulate which queue a packet is assigned to on interfaces that support multiple queues. This is the simplest approach, however it has several drawbacks:
1. Not all network interfaces along the path may support multiple queues, or if they do, they may not support DSCP;
2. DSCP may be ignored or re-written within our provider's network; and
3. DSCP is very coarse, beyond the default there are only 13 classes that packets can be placed in

If DSCP marking is unavailable, ignored, or insufficient a more complicated approach must be taken in order to accomplish the goal of managing the priority in which packets are dropped in favor of others. One possible solution that I support exploring (again, if DSCP Marking proves ineffective) is using Quality of Service (QoS) and Traffic Shaping (TS) on a system that sits in-line to our WAN routers. This device could be set up with an arbitrary number of queues, of arbitrary depth, where we control both the queues and which packets are inserted into each. This would allow us infinite flexibility in controlling which packets are dropped as long as it was combined with traffic shaping.

Traffic shaping is required to ensure that we are not transmitting at a rate that exceeds the current de-queue rate of remote queues -- if we did not shape the traffic, the remote queues would still become full and drop packets regardless of what queue they began in on our side.

In order to shape the traffic, we need to know at what rate we currently can transmit from one system to another across the network path. Or, more precisely, we need to know whether we have exceeded the capacity of the network or not. The same device that is doing the Quality of Service and Traffic Shaping could also be used to determine whether we are exceeding the rate across the network path for a given set of queues. The device could measure the amount of packet loss passively by determining how many TCP retransmits are occurring over the link and determine whether the effective rate is too high for a given set of queues.

I propose the following plan of action:

1. Determine which hosts, subnets, or other definable characteristics can have lower priority;
2. Attempt to place these hosts and subnets into appropriate DSCP classes;
3. Perform some testing with DSCP to see if it is effective;
4. If it is effective, and we are able to classify all of our traffic into DSCP classes, declare victory.
5. If it is not effective, or we are unable to classify all of our traffic into appropriate DSCP classes we should investigate QoS/TS:
a. We should first start by coming up with a tool to passively monitor the number of retransmits to determine if we are currently sending down a network path beyond the limit of some device in that path.
b. When we are ready to test that tool, we can create a monitor port on one of the WAN routers to mirror outbound traffic from the outbound WAN link and then attempt to saturate the network path
c. Once the tool is able to determine the network path's current limit (as this may change over time, depending on what other demands are being placed on the queues on the remote network devices) we can test QoS and Traffic Shaping with the limit determined by the tool by configuring the WAN router QoS rate between the given networks (e.g., DC2 and DC1) using the determined rate manually
d. Once that is done we can implement a queuing hierarchy and define traffic shaping requirements on a box that will sit inline
e. Once we have defined our queues (5.d.), have defined which packets will be inserted into them (1.), and have a tool that continuously determines the throughput of a given network path (5.a.) we can put the pieces together and have the tool set the traffic shaping parameters on a box that will sit inline with our WAN routers and our WAN link to ensure we do not overflow remote queues.

Ultimately, we decided the best thing to do was to do nothing and hope for the best.

Hooray.