Misadventures in bringing up the same Ubuntu image on different hypervisors
I recently ran into a couple of rather nasty issues when bringing up a server on a variety of hypervisors (and cloud IaaS platforms). Now, admittedly, I was doing something "not normal" - i.e. trying to bring up, on other platforms, a VM that was a replica (a dd copy of the disk) of one that had worked on VMware ESX. What struck me was that the VM would:
- boot up normally on VMware (obviously), networking configured via DHCP
- boot up normally on AWS, networking configured, instance was SSH'able
- boot up normally on Azure, but would not get any default routes, so no SSH!
After some poking around, I learnt a couple of lessons.
1. dhclient has multiple ways in which it will set a default route
These are based on the response from the DHCP server. Now,
- some DHCP servers return the Classless Static Routes option
- some DHCP servers return the Router option
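For reference, the Classless Static Routes option (DHCP option 121, RFC 3442) packs each route as a prefix length, then only the significant octets of the destination, then the 4 octets of the gateway. The sketch below decodes such a value, assuming it arrives as space-separated decimal octets; the function name decode_classless_routes and the sample addresses are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: decode an RFC 3442 option value.
# Assumption: the value is a string of space-separated decimal octets.
# Prints one "destination/prefix via gateway" line per route.
decode_classless_routes() {
    set -- $1                            # split the octets into $1, $2, ...
    while [ "$#" -ge 5 ]; do
        prefix=$1; shift
        octets=$(( (prefix + 7) / 8 ))   # significant destination octets
        [ "$#" -ge $(( octets + 4 )) ] || return 1   # truncated route
        dest=""
        i=0
        while [ "$i" -lt 4 ]; do
            if [ "$i" -lt "$octets" ]; then
                dest="$dest$1"; shift    # octet carried in the option
            else
                dest="${dest}0"          # omitted octets are zero
            fi
            if [ "$i" -lt 3 ]; then dest="$dest."; fi
            i=$(( i + 1 ))
        done
        echo "$dest/$prefix via $1.$2.$3.$4"
        shift 4
    done
}

# A default route (prefix length 0) followed by a /24 route:
decode_classless_routes "0 10 0 0 1 24 192 168 1 10 0 0 254"
# prints:
#   0.0.0.0/0 via 10.0.0.1
#   192.168.1.0/24 via 10.0.0.254
```

Note that a default route here is simply an entry with prefix length 0, which is how a server can hand out a default gateway without using the Router option at all.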
After a lot of debugging (which is excruciatingly painful, since the only real "debug tool" available for these issues is the debug script in /etc/dhcp/dhclient-exit-hooks), I figured out that one environment worked fine because it was using the Router option, whereas on Azure the DHCP server sends back the Classless Static Routes option.
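If you want something lighter than the stock debug script, a small logging helper can record just the route-related values. This is only a sketch: the helper name dump_dhcp_routes and the default log path are invented, and it assumes dhclient exposes the lease data to hooks as the shell variables reason, interface, new_routers and new_rfc3442_classless_static_routes (the last one depends on how the option is declared in dhclient.conf).

```shell
#!/bin/sh
# Hypothetical helper: append the route-related DHCP variables to a log file.
# Assumption: the variable names below are the ones your dhclient setup exports.
dump_dhcp_routes() {
    log_file=${1:-/tmp/dhclient-routes.log}   # made-up default location
    {
        echo "reason=${reason:-} interface=${interface:-}"
        echo "new_routers=${new_routers:-}"
        echo "new_rfc3442_classless_static_routes=${new_rfc3442_classless_static_routes:-}"
    } >> "$log_file"
}
```

Called from a hook as plain `dump_dhcp_routes`, this appends three lines per DHCP event, which is enough to see which of the two options the server actually sent.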
And dhclient prefers the Classless Static Routes option (see here): per RFC 3442, a client that receives both options must ignore the Router option. On VMware and on AWS, the DHCP server actually uses the Router option, which is why it worked fine across those 2 environments.
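The precedence can be sketched as a tiny decision function. This is an illustration of the rule rather than dhclient's actual code; route_source is a made-up name, and the two variable names are assumed to be the ones dhclient hands to its scripts.

```shell
#!/bin/sh
# Hypothetical illustration of the RFC 3442 precedence rule.
# Assumption: the lease data is available in these two shell variables.
route_source() {
    if [ -n "${new_rfc3442_classless_static_routes:-}" ]; then
        # Classless Static Routes present: the Router option is ignored.
        echo "classless-static-routes"
    elif [ -n "${new_routers:-}" ]; then
        echo "router"
    else
        echo "none"
    fi
}
```

So on a platform that sends Classless Static Routes, the default route only appears if whatever handles that option actually gets to run.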
Okay, so that solves one mystery - but given that dhclient should know how to handle both options - why wasn't it working for my VM on Azure?
2. dhclient-exit-hook scripts are sourced
On further investigation, it turned out that one of my dhclient-exit-hook scripts was doing the "wrong thing" and terminating early due to a misplaced exit 0.
It is not clearly mentioned anywhere (neither man-pages, nor on the interwebz) that the dhclient-exit-hook scripts are actually sourced rather than executed.
Because the scripts are sourced, an exit in any one of them ends the whole run, not just that script. My custom script was "exiting" the execution, thus causing the remaining scripts to not run. Most notably, there is a default script on Ubuntu (called rfc3442-classless-routes) that deals with the Classless Static Routes option and actually handles it correctly. However, since the scripts are sourced in lexicographical order and my script came earlier in that order, this script would never run.
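The effect is easy to reproduce outside of dhclient. The sketch below is a simplified stand-in for the real hook runner, not a copy of it: run_hooks and the hook file names are invented, and the subshell exists only so that the demo itself survives the exit.

```shell
#!/bin/sh
# Hypothetical demo: source every file in a directory, in glob (sorted) order.
run_hooks() {
    (   # subshell, so an "exit" in a hook ends only this run
        for hook in "$1"/*; do
            . "$hook"
        done
    )
}

hooks=$(mktemp -d)
printf 'echo "20-routes ran"\n' > "$hooks/20-routes"

# A hook that ends with "exit 0" terminates the whole run.
printf 'echo "10-custom ran"\nexit 0\n' > "$hooks/10-custom"
echo "with exit 0:"
run_hooks "$hooks"      # prints only: 10-custom ran

# "return 0" ends just the sourced file, so later hooks still run.
printf 'echo "10-custom ran"\nreturn 0\n' > "$hooks/10-custom"
echo "with return 0:"
run_hooks "$hooks"      # prints: 10-custom ran, then 20-routes ran

rm -rf "$hooks"
```

In a hook script, then, the safe way to bail out early is return (or simply restructuring the script so it falls off the end), never exit.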
One last thing of note - this post on StackExchange was immensely helpful in tracking down the root cause of my problems.