Part 1: Building the VCF on VxRail management cluster
Part 2: Virtual Infrastructure Workload Domain creation
Part 3: Deploy, Configure and Test VMware HCX
Having worked with, built, torn apart and played with VxRail and VxRack for much of the last two years, I’m always up for an interesting challenge on either platform. Even better was the opportunity to work with both platforms at once, on a VxRack SDDC to Cloud Foundation on VxRail conversion and data migration project.
The scenario I’ll be working on is a very realistic one;
We’ve got a VxRack SDDC system in place with production data running on it. We can’t migrate the data off the VxRack to wipe and reinstall from scratch and we can’t upgrade. How do we turn this into a Cloud Foundation on VxRail environment and move all our production data to the new platform without massive downtime or needing extra hardware?
Also, because there’s always at least one ‘also’;
- How long is it going to take?
- How much of it can be done remotely?
- How much of it can be automated?
- How do we migrate our VMs?
Above; where I’m starting from and where I’m going to be by the end of this blog post. To start with, a somewhat poorly maintained VxRack SDDC system with 24 13G nodes. I’m running version 2.3.1 on it currently and it needs pretty much everything done to it. It’s got a four-node management domain (nodes 1 to 4) and 19 nodes in one VI workload domain (nodes 5 to 23). That’s where the production workload is running. The final node was decommissioned from SDDC manager after the VxRack was installed and is used for hosting tools (Jumpbox, VxRack imaging VMs, etc). The kind of stuff you’d have on your laptop if you were physically plugged into the rack.
Where we’re going to end up is what I’ll call ‘Stage 1’. Decommission nodes 20 to 23 from VxRack SDDC manager, convert them to VxRail, reconfigure the network, build the cluster, prep for Cloud Foundation and then deploy it.
Before all that, a little background to understand why my VxRack needs so much work. Somewhere around the 2.2 to 2.3 SDDC upgrade window (or 2.3.1 to 2.3.2 – I’m a little fuzzy on the exact versions), there were some hardware changes made to VxRack SDDC systems that are out in the wild. PowerEdge nodes that originally shipped with PERC H730 disk controllers had them swapped out for the H330 Mini, a controller which is compatible with VxRail. But our little lab system was left behind. Possibly because at the time the upgrade happened, we weren’t heavily using the VxRack SDDC platform.
Another component which we didn’t have was the now standard IDSDM, an internal module that houses two SD cards and provides a platform to either boot from or, in this case, to install a node recovery/reimage mechanism (RASR). I could get by without this of course, by flashing a number of suitably large USB sticks with the node recovery software and having them permanently plugged in to each node. But let’s try not to drift from what will be a standard VxRail build.
After a box of H330s and IDSDMs arrived at the data center and were diligently installed by our data center technicians, I moved on to the minor detail of actually converting the nodes. I should add that with the exception of the physical hardware swaps in the nodes, the entire process was run remotely. So I guess that’s one of the four questions above answered already.
Thankfully it was nothing new, as I had already run through the exact procedure on another VxRack some 8 months previously. But back then I wasn’t aware of the disk controller incompatibility, so the whole thing pretty much fell flat on its face after the initial installation of software and attempt at cluster build. I was immediately grateful to have helpful colleagues who nudged me in the right direction.
The process is straightforward. Time consuming of course, but this really isn’t the kind of thing you should be doing one at a time. Ideally, the more nodes you can convert at the same time, the better. Until you start losing track of what node has had what parts of the process completed on it. A spreadsheet or even a scrap of paper and a pen are your friend here.
- Decommission the node from SDDC manager. This is a one-by-one process as SDDC manager (or at least 2.3.1) doesn’t appear to like concurrent tasks.
- Power it down, install the hardware. Everything from now on can (and should) be run in parallel on multiple nodes.
- Power it back up, enter the BIOS and enable the IDSDM (mirroring, etc).
- Run through any required firmware updates. In my case, I needed to update BIOS, iDRAC, network and disk controller.
- From the iDRAC KVM, mount and boot from the RASR (Rapid Appliance Self Recovery) ISO file. I used VxRail version 4.7.111.
- Do all the FRU assignment tasks in the RASR support menu, then RASR reset the node. This nukes the IDSDM and copies the RASR software to it.
- Reboot and boot from the IDSDM. If the previous step was successful, you get a RASR menu.
- Run the factory reset. This wipes all disks in the node, also wiping the SATADOM which ESXi will be installed on. It copies ESXi images back to the device and preps for install.
- When the above finishes, reboot. ESXi installer kicks off and requires no intervention. After several reboots and about 60 minutes, the node is done.
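With several nodes in flight at once, the spreadsheet-or-scrap-of-paper tracking could just as easily be a few lines of Python. The following is only a sketch: the step labels paraphrase the list above, and `NodeProgress`, `summary` and the node names are hypothetical, not part of any VxRail or SDDC Manager tooling.

```python
from dataclasses import dataclass, field

# Step labels paraphrase the conversion list; they exist only for this sketch.
STEPS = [
    "decommission",
    "hardware_swap",
    "enable_idsdm",
    "firmware",
    "boot_rasr_iso",
    "rasr_reset",
    "boot_idsdm",
    "factory_reset",
    "esxi_install",
]


@dataclass
class NodeProgress:
    """Tracks how far a single node is through the conversion steps."""
    name: str
    done: list = field(default_factory=list)

    def next_step(self):
        """Return the next outstanding step, or None when the node is finished."""
        return STEPS[len(self.done)] if len(self.done) < len(STEPS) else None

    def complete(self, step: str) -> None:
        """Mark a step as done, refusing to skip ahead or repeat a step."""
        expected = self.next_step()
        if expected is None:
            raise ValueError(f"{self.name}: all steps already complete")
        if step != expected:
            raise ValueError(f"{self.name}: expected '{expected}', got '{step}'")
        self.done.append(step)


def summary(nodes):
    """Map each node name to its next outstanding step (or 'finished')."""
    return {node.name: node.next_step() or "finished" for node in nodes}
```

Calling `summary()` on the list of nodes gives a one-glance view of where each one is, and `complete()` raises an error if a step is recorded out of order.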
With the above mostly non-taxing process completed on four nodes, I’ve got enough VxRail appliances ready to build my VCF management cluster. My VxRack VI workload domain is impacted to the tune of four hosts, but there’s plenty of spare capacity. Capacity planning and knowing exactly how many nodes you can free up to move the conversion & migration process forward is going to be a running theme throughout this entire exercise.
The steps above are prime candidates for automation. So it’s good news that this automation has already been done. I automated the full VxRail reset and rebuild process some time ago to take a lot of admin overhead off the almost daily rebuilds required for using several VxRail clusters in test environments. It’ll need a little work to make it useful for this project, but that’s at least part of the way toward answering “can it be automated?”. As I move through the conversion process on the VxRack, the ability to automate and essentially forget about node conversion is going to free up some much needed cognitive capacity.
Moving on with the build, I’ve already reserved some IP addresses and added a few DNS records;
- ESXi management
- PSC x2 (1 will be deployed by VxRail, second by VCF)
- VxRail Manager
And some more I’ll need for Cloud Foundation later on;
- Cloud Builder VM
- SDDC manager
- PSC (as mentioned above)
- vRLI x4 (master, two workers and a load balancer)
- NSX manager
- NSX controllers
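Before going any further it can be worth confirming that every record actually resolves. Below is a minimal sketch, assuming both forward and reverse records are wanted for each name. The function name `check_dns` and any hostnames or addresses passed to it are placeholders, not values from this environment. The resolver functions are injectable so the check can be exercised without touching real DNS.

```python
import socket


def check_dns(records, forward=socket.gethostbyname, reverse=None):
    """Check {hostname: expected_ip} records and return a list of problems.

    `forward` maps a hostname to an IP string, `reverse` maps an IP string
    to a hostname. Both default to the system resolver.
    """
    if reverse is None:
        reverse = lambda ip: socket.gethostbyaddr(ip)[0]

    problems = []
    for host, expected_ip in records.items():
        # Forward lookup: name -> IP
        try:
            actual_ip = forward(host)
        except OSError as exc:
            problems.append(f"{host}: forward lookup failed ({exc})")
            continue
        if actual_ip != expected_ip:
            problems.append(f"{host}: resolves to {actual_ip}, expected {expected_ip}")
            continue

        # Reverse lookup: IP -> name
        try:
            ptr = reverse(expected_ip)
        except OSError as exc:
            problems.append(f"{expected_ip}: reverse lookup failed ({exc})")
            continue
        if ptr.rstrip(".").lower() != host.lower():
            problems.append(f"{expected_ip}: PTR is {ptr}, expected {host}")
    return problems
```

An empty list back means every name in the dictionary resolved both ways to what was expected.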
Let’s kick off the build. First, I need to make sure my top of rack switches are ready. As I haven’t physically moved these nodes, I’m going to be sharing the switches with the existing VxRack SDDC environment. Switch ports for decommissioned nodes are stripped back to a basic configuration by VxRack SDDC manager. Port channel is removed and previously trunked VLANs are disallowed. This is a good time to point out that I’ll also be reusing the existing IP subnets and VLANs, but I could just as easily have modified the final VxRail network design entirely. Different subnets, VLANs or even moving to a layer 3 topology using BGP or OSPF (or any other routing protocol, I just prefer either of those two).
I put a basic configuration on the ports for the four nodes I’m working with right now. Let’s call that ‘Networking Stage 1’. I’m piggybacking on the existing VxRack uplinks to the production network, but I’ll re-evaluate that once I’m further into the conversion. I’ve got some 40Gbit QSFP+ direct attach cables hanging around that are just begging to be used.
I also trunked some VLANs and enabled services required for VxRail discovery to happen. I created a few VLANs for things like vMotion and vSAN specific to the VxRail, because I’d like to have as little reuse of VxRack SDDC managed VLANs as possible. A VLAN for VXLAN VTEPs is also added, and it’s worth noting at this point that VCF requires DHCP on this VLAN to assign IP addresses during the VCF bring up.
VxRack SDDC manager is still going to be managing the configurations on the switches, so I’m not going to go too crazy with the current configuration. All existing uplinks to the core network need to remain in place until VxRack SDDC is no more. It goes without saying that if this was real production infrastructure, I’d have already been through several meetings and ever-evolving Visio diagrams to figure out what the new VxRail network is going to look like and how to safely build it alongside the existing VxRack network. As with any production environment, you really don’t want to make any spur of the moment, potentially career limiting decisions during deployment.
The management cluster goes like any other VxRail build. Except not quite for me, as I’m remote. The time-tested process I’ve adopted is to give each node a temporary IP address on the ESXi management subnet. I would then log in to the master node and also give the VxRail Manager a temporary IP address. With version 4.7.x, I also need to think about node discovery. It was moved onto a dedicated VLAN, which I’ll need to change as that VLAN doesn’t exist on the network. With that setting changed on all four nodes, I can fire up the VxRail installer UI and run a standard install process. Be sure to change the logging option to none during the install. Cloud Builder will deploy its own Log Insight instance.
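Picking the temporary addresses is easy to get wrong when the subnet is shared with a live system. Here is a small sketch of one way to choose them, assuming a list of already reserved addresses is to hand. The function name `pick_temporary_ips` is hypothetical, and "unused" here only means "not in the list supplied"; nothing is pinged or checked on the network.

```python
import ipaddress


def pick_temporary_ips(subnet, reserved, count):
    """Pick `count` host addresses from the top of `subnet` that are not reserved."""
    network = ipaddress.ip_network(subnet)
    taken = {ipaddress.ip_address(ip) for ip in reserved}

    # Walk the usable host range from the highest address downwards.
    free = [ip for ip in reversed(list(network.hosts())) if ip not in taken]
    if len(free) < count:
        raise ValueError(f"only {len(free)} free addresses in {network}, need {count}")
    return [str(ip) for ip in free[:count]]
```

Working down from the top of the range is an arbitrary choice; it simply keeps temporary addresses away from the low end of the subnet where permanent allocations often start.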
With the cluster built, I’ll log into vCenter and make sure everything looks as it should. Check out any alarms, etc. There are a couple of changes that need to be made before I can move any further.
- I need to change the management port group from static binding to ephemeral. This involves creating a temporary port group and migrating the VMkernel adapters on all hosts to it, then modifying the original management port group and migrating everything back. Don’t forget to delete the temporary port group.
- I need to ‘externalise’ the vCenter. I was a little mystified by this one initially, but it boils down to running a script on the VxRail Manager VM that essentially forces the VxRail Manager to forget about the vCenter. In a normal cluster, you’d initiate a cluster shut down from VxRail Manager and it’d take down all the VMs in an orderly fashion, shut down hosts, etc. With an externalised vCenter, the VxRail Manager no longer has control over the vCenter. This is verified by attempting a cluster shutdown and confirming that validation fails (screenshot below).
With all that done, I grabbed a copy of the 3.7.1 Cloud Builder OVA and deployed it onto the VxRail cluster, using the necessary option to identify the installation target as a VxRail. With the deploy completed, I opened up Chrome and browsed to the IP I set during the deployment and logged in with the admin password I also set during deployment.
There is a not entirely insignificant checklist to work through and make sure everything is in place, but with all that sorted out I should be in a good state to get a working Cloud Foundation install at the other end.
To give Cloud Builder everything it needs to get Cloud Foundation installed, you need to either supply a JSON answer file or download a Microsoft Excel template from the UI, complete it and upload it. I didn’t have a JSON file unfortunately, so took the long route. It’s nothing out of the ordinary in the Excel template. Details about the VxRail cluster, the network, DNS, NTP, host names and IP addresses. Hit the upload button and provide it with the completed Excel template.
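Since Cloud Builder validation can take a few attempts, a quick local sanity check of the host names and IP addresses before uploading can save a round trip. The sketch below is not the Cloud Builder JSON schema or its validation logic; the `hostname` and `ip` keys and the `preflight` function are hypothetical, and it only catches the simplest mistakes (bad addresses, addresses outside the subnet, duplicates).

```python
import ipaddress


def preflight(entries, mgmt_subnet):
    """Check a list of {"hostname": ..., "ip": ...} dicts; return a list of problems."""
    network = ipaddress.ip_network(mgmt_subnet)
    problems = []
    seen_hosts, seen_ips = set(), set()

    for entry in entries:
        host = entry.get("hostname", "").strip().lower()
        raw_ip = entry.get("ip", "").strip()

        # Host names must be present and unique.
        if not host:
            problems.append(f"missing hostname for IP {raw_ip or '?'}")
        elif host in seen_hosts:
            problems.append(f"duplicate hostname: {host}")
        seen_hosts.add(host)

        # IPs must parse, sit inside the expected subnet and be unique.
        try:
            ip = ipaddress.ip_address(raw_ip)
        except ValueError:
            problems.append(f"{host or '?'}: invalid IP {raw_ip!r}")
            continue
        if ip not in network:
            problems.append(f"{host}: {ip} is outside {network}")
        if ip in seen_ips:
            problems.append(f"duplicate IP: {ip}")
        seen_ips.add(ip)

    return problems
```

The same list of entries could be read from whatever working copy of the parameters is kept alongside the Excel template or JSON file.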
The information in the template was validated successfully after a few attempts. I hit an issue with ‘JSON Spec Validations’ and then another one with license keys I’d entered. Mostly everything else was fine. Couple of warnings that (for my environment) could be safely ignored. I could then kick off the bring up process and begin watching the clock.
A couple of cups of coffee later, SDDC dashboard!
Don’t look too closely or you’ll see I’ve been cheating. The above screenshot was taken after I’d already converted four more nodes and added a VI workload domain. Let’s call it a preview of ‘Stage 2’.
I’ve also got a whole load of new VMs in my VxRail cluster, neatly grouped within one of three new resource pools.
Yes, we’ve got a bit of a Norse mythology naming scheme going on in the environment. Aside from a shiny new dashboard, I’ve also now got NSX and Log Insight installed. So that’s pretty much stage 1 of the build completed. Four VxRack SDDC nodes decommissioned, converted to VxRail nodes, built, prepped and Cloud Foundation deployed. Going back to the questions at the start, how many can we tackle at this point?
- How long is it going to take? – Right now we’re at about a day to get to this point.
- How much of it can be done remotely? – Given that this environment is remote to me, almost all of it. Everything except physical hardware swaps. So if you’ve got 13G nodes that were previously upgraded or 14G nodes, you can probably skip that bit.
- How much of it can be automated? – As I said above, I’ve already written automation for node RASR and VxRail cluster build. There’s no reason why everything else here can’t be automated, with a little effort of course. See the note on automation below. To make your life easier, you’ll want to have JSON answer files ready to go for VxRail build and Cloud Foundation deployment.
- How do we migrate our VMs? – We’re getting to that, but it’ll be a little while.
About the automation. With some awkward parts to automate, like driving a Java KVM session, I used a tool called Eggplant. It’s usually used for test automation, but suits this job pretty well. But it’s not terribly portable (as far as I’m aware). A nice, open source and more portable alternative is Jython and the Robot Framework. There are most likely dozens or dozens of dozens of ways to automate what I’m doing. Later stages could purely use API calls or SDKs. Right now I’m automating the low-hanging fruit. I’m sure it’ll evolve over time. That’s it, speech over.
Next up, I’ll be stealing a few more nodes from VxRack SDDC and building a VI workload domain on my new Cloud Foundation on VxRail deployment.