VMware Cloud on AWS; Getting up and running

I make no bones about being a reformed server hugger. One of my more recent catchphrases is “I’d be a happy man if I never had to build another physical server spec”. So for those currently in the VMware ecosystem, I’ve been turning a lot of attention toward VMware Cloud on various hyperscalers. There is a particular emphasis on AWS and Azure here. GCP is also available, I just don’t get asked about it a lot. Sorry Google.

Some of the most frequent questions I hear cover how difficult the platform is to set up and maintain, how much knowledge of the underlying hyperscaler is required, the ubiquitous horror of “vendor lock-in”, and of course the overall cost of the solution. The latter is a whole series of articles in itself, so I’ll be tactically dodging that one here. The short answer, as it so often is, is “it depends”.

Another frequently asked question surrounds the whole idea of “why should I be in the cloud”. There are quite a few well documented use cases for VMware Cloud. The usual ones like disaster recovery, virtual desktop, datacenter extension or even full on cloud migration. One that’s sometimes overlooked is that because SDDCs are so quick to spin up and can be run with on-demand pricing, they’re great for short lived test environments. Usually, you can create an SDDC, get connectivity to it and be in a position to migrate VMs from your on-premises cluster or DR solution within a couple of hours. No servers to rack, no software to install, no network ports to configure, no week long change control process to trundle through.

For those organisations currently using native cloud services, there are a whole load of additional use cases. Interaction with cloud native services, big data, AI/ML, containers, and many many more buzzwords.

What I can do is show you how quickly you can get a VMC on AWS SDDC spun up and ready for use. The process is largely the same for Azure VMware Solution, with the notable difference that the majority of the VMC interaction in Azure is done via the Azure console itself. The video below gives you an idea of what to expect when you get access to the VMC console and what creating a new SDDC on AWS looks like.

I’ve included some detail on setting up a route based VPN tunnel back to an on-premises device. Policy based and layer 2 VPN are also available, but as my device (a Ubiquiti EdgeRouter) is BGP capable I’m taking what I consider to be the easy option. I prefer to use a route based VPN because it makes adding new networks easier and allows a great deal of control using all the usual BGP goodness any network team will be comfortable with. There is also some content covered for firewall rules, because as you might expect they are a core component of controlling access to your newly created SDDC.

Something you may immediately notice (although not explicitly demonstrated in the video) is that the only interaction I had with AWS native services throughout the SDDC creation was to link an existing AWS account to the SDDC. For those without AWS knowledge, this is an incredibly straightforward process with every step walked through in just enough detail. If you’re creating an SDDC, you need to link it to an AWS account. If you don’t already have an AWS account, it’ll take less than five minutes to create one.

The only other requirement right now is to know a little bit about your network topology. Depending on how complex you want to get with VMC on AWS in the future, you’ll need to know what your current AWS and/or on-premises networks look like so you can define a strategy for IP addressing. But of course even if you never plan on connecting the SDDC to anything, you should still have an IP addressing strategy so it doesn’t all go horribly wrong in the future when you need to connect two things you had no plan to connect when you built them.
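If you want to sanity check an addressing plan before committing to it, a few lines of Python’s standard ipaddress module will flag any overlapping ranges. This is only a minimal sketch; the CIDR blocks below are made-up placeholders (one deliberately overlapping) rather than a recommendation.

```python
# Quick overlap check between on-premises, SDDC management and VPC ranges
# before committing to an addressing plan. All CIDRs below are placeholders.
from ipaddress import ip_network
from itertools import combinations

ranges = {
    "on-prem": ip_network("10.0.0.0/16"),
    "sddc-management": ip_network("10.2.0.0/20"),
    "connected-vpc": ip_network("10.0.64.0/18"),     # deliberately overlaps on-prem
    "sddc-workload-segment": ip_network("192.168.10.0/24"),
}

for (name_a, net_a), (name_b, net_b) in combinations(ranges.items(), 2):
    if net_a.overlaps(net_b):
        print(f"CONFLICT: {name_a} ({net_a}) overlaps {name_b} ({net_b})")
    else:
        print(f"ok: {name_a} and {name_b} do not overlap")
```

Running something like this against your real ranges is a five minute job that can save you the “two things we never planned to connect” problem later.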

The one exception I’ll call out for the above is self-contained test environments. What I’m thinking here is a small SDDC hosting some VMs to be tested and potentially linked to an AWS VPC which hosts testing tools, jumpboxes, etc on EC2. If this never touches the production network, fill your boots with all the overlapping IP addresses you could possibly ever need. In this case, creating an environment as close as possible to production is critical to get accurate test results. Naturally, there’s also a pretty decent argument to be made here for lifecycling these type of environments with your infrastructure as code tool of choice.

On that subject, how do you connect a newly created SDDC to an AWS VPC?

This is the process at its most basic. Connecting a single VMC SDDC to a single AWS VPC in a linked AWS account. Great for the scenario above where you have self-contained test environments. Not so good if you have multiple production SDDCs in one or more AWS regions and several AWS accounts you want them to talk to. For that, we’ve got SDDC groups and transit gateways. That’s where the networking gets a little fancier and we’ll cover that in the next post.

I hope by now I’ve shown that a VMC on AWS SDDC is relatively easy to create and with a bit of help from your network team or ISP, very easy to connect to and start actively using. I’ve tried to keep connectivity pretty basic so far with VPN. If you are already in the AWS ecosystem and are using one or more direct connect links, VMC ties in nicely with that too.

The concept of vendor lock-in comes up more than once here. It’s also something that’s come up repeatedly since Broadcom appeared on the VMware scene. To what extent do you consider yourself locked-in to using VMware and more importantly, what are the alternatives and would you feel any less locked-in to those if you had to do a full lift & shift to a new platform? If you went through the long and expensive process to go cloud native, would going multi-cloud solve your lock-in anxiety? Are all these questions making you break out in a cold sweat?

If you take nothing else away from all the above, I hope you’ve seen that despite some minor hyperscaler platform differences, VMware Cloud is the same VMware platform you’re using on-premises so there is no cloud learning curve for your VMware administrators. You can connect it to your on-premises clusters and use it seamlessly as an extension of your existing infrastructure. You can spin it up on demand and scale up and down hosts quickly if you need disaster recovery. You can test sensitive production apps and environments in isolation.

As I brought up those ‘minor hyperscaler differences’, how would they impact your choice to go with VMC on AWS, Azure VMware Solution, GCP, or another provider? As the VMware product is largely the same on any of the above, it comes down to what relationship you currently have (if any) with the cloud provider. The various providers will have different VMC server specs and connectivity options which would need to be properly accounted for depending on what your use case is for VMC. This is another instance of it being a whole topic by itself. If you’re curious, the comment box is below for your questions.

In the next post on this subject I’ll go into more detail about some complex networking and bring in the concept of SDDC groups, transit connect and AWS transit gateways. I’ll also cover a full end-to-end HCX demo in the next post, so you’ll have what is possibly the best way (in my biased opinion anyway) of getting your VMs into VMware Cloud on AWS.

VMware HCX; Testing the limits

Introduction

I had a conversation with a colleague recently about VMware HCX, specifically around network requirements and the minimum link specifications we’d need to have between two sites so that the basic functionality would work as intended.

I threw out a figure that I knew was well in excess of what would be required for HCX (and all the other services that need to be run over the link) but it got me thinking; How bad could the link get before even basic HCX services would just stop working?

I was a little surprised to read the figures in the below table from the VMware docs site and I’ve highlighted the numbers I’m interested in for HCX vMotion. The bandwidth figure listed is quite a bit lower than I thought would be recommended. I’m even more surprised by the latency figure, quite a lot higher than I imagined HCX would endure.

With the above in mind, let’s see what it takes to break HCX.

The Lab

But first, this. My test environment consists of two VxRail clusters running on 13G nodes with version 7.x code. Networking is 10G everywhere and between the two sites I’ve got a Netropy 10G link emulator to provide all the link state related horror later on. As HCX encapsulates traffic within a tunnel between sites, the traffic inspection options in the Netropy are going to be less useful for pulling specific traffic off the link. Instead, I’ve done my best to isolate all HCX traffic to a single 10G NIC on each cluster. In all migration tests, I’m using clones of an 80GB Windows 2012 VM.

I should also point out that this is by no means a benchmark, nor should it be interpreted as such. I was merely curious at what point HCX would stop working in my existing lab environment. The test environment hasn’t been endlessly tweaked for maximum vMotion performance. It’s pretty much a straight out of the box VxRail build.

We already know that a 10G link with very little latency is going to work just fine and yes I did run a migration to test that.

The transfer speed axis is a little bit of an eye chart, but it’s within 2.3 to 2.6Gbps. The transfer did initially spike to 6.8Gbps, but it settled to its final figure pretty quickly.

Testing VMware minimums

With that out of the way, let’s get straight to the VMware quoted minimums. I modified the link characteristics of the Netropy to 100Mbps with a fixed 150ms of latency. In the screenshot below, 75ms is applied in each direction for a total of 150ms for the round trip.

At this new setting, the HCX dialogs within the vSphere client are somewhat slower to respond, taking marginally longer to refresh and gather cluster information from the remote site. Once all the information has been gathered though, the process of selecting migration options is just as quick as with a faster/lower latency link.

Once the migration kicked off, it did max out the available line-rate initially, only dropping to about 35Mbit after 60-70% of the migration had completed.

To migrate the entire 80GB VM took a little over 4 hours. While that time is far from ideal in a production environment, it becomes more acceptable depending on the frequency and quantity of migrations. Keep in mind also that it’s not exactly a tiny VM that I’m pushing around here. In a small production environment or one where mobility is restricted to disaster recovery scenarios (where the VMs would be kept in sync between sites according to RPO/RTO), it becomes almost workable.
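For a sense of how reasonable that figure is, here’s a rough back-of-envelope calculation. It assumes a plain sequential transfer and ignores anything HCX WAN optimisation might do with compression or dedupe, so treat it as a sanity check rather than a prediction.

```python
# Rough transfer time for the 80GB VM at a given sustained rate,
# ignoring any compression/dedupe from HCX WAN optimisation.
def transfer_hours(size_gb: float, rate_mbps: float) -> float:
    bits = size_gb * 8 * 1000**3            # treating GB as decimal for simplicity
    return bits / (rate_mbps * 1000**2) / 3600

print(f"80GB at the full 100Mbit: {transfer_hours(80, 100):.1f} hours")
print(f"80GB at a sustained 35Mbit: {transfer_hours(80, 35):.1f} hours")
```

That gives roughly 1.8 hours at line rate and just over 5 hours at 35Mbit, so the observed four-and-a-bit hours sits neatly between the two, which lines up with the drop to around 35Mbit part way through the migration.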

Pushing the limits

But it’s still not broken, so I’m not done. What if, instead of moving your VMs to the other side of the country, you’re moving them to the other side of the world? Let’s keep the same 100Mbit link but ramp the latency to 300ms.

The most charitable thing I can say about the beginning of the process was that the migration dialog did eventually open. I measured the time it took to do that in one cup of coffee made & consumed, and several minor household tasks completed. Let’s say 15-20 minutes. Like the previous attempt, once the dialog opened and information was pulled from the remote site, selecting all required migration options was snappy and responsive.

Unsurprisingly, further trouble was just around the corner and once I kicked off that migration, everything ground to a halt once again. The length of time the base sync at the beginning of the migration took to complete was a sign that I probably shouldn’t wait around for live stats on this one. Before shutting down for the evening, I did see peak speeds of between 50 and 60Mbps reported by the Netropy device.

Today is tomorrow and the results are in.

If the text is small and hard to read, let me help. It’s 11 minutes (for the 10Gbit link) versus 6 hours and 18 minutes. Needless to say, if you’re moving VMs to and from a distant site, be that another private site or a VMware Cloud on AWS region and all you’ve got is a 100Mbit link, you’re probably going to reserve the use of vMotion for special occasions. In this scenario, bulk migration with scheduled switchover might be your best friend. Or at least a way to preserve your sanity.

Pushing Bandwidth

You know what I’m going to say next. Even taking the above dismal results into account, HCX still works on a 100Mbit link with 300ms latency. It’s still not broken. The territory I’m entering now is well within the realm of testing with a link that is of zero use for any kind of traffic. I’m more concerned with finding out what happens with low bandwidth right now, so I’ll drop the link speed significantly to 20Mbit and return the latency to the VMware quoted maximum of 150ms. I fully expect the migration to succeed, even if it does take a day to do so. Onward, and downward!

But not far enough downward, it seems: the migration still completed successfully at 20Mbit/150ms. It did take over 16 hours, mind you, but the end result is what matters here.

From here, I’m conflicted about taking the bandwidth any lower. If you’re thinking about a multi-site private cloud or a hybrid private/public deployment and you can’t get at least a 20Mbit link between your sites, it’s almost certainly time to re-evaluate the deployment plan. So let’s say the minimum bandwidth test result is a resounding pass. Even if all that’s available is VMware’s recommended minimum of 100Mbit, it’s going to be sufficient to migrate VMs between sites with relative ease.

Pushing Latency

So instead, I’m going to bring the bandwidth available back to a very healthy 500Mbit and focus instead on two things; Latency and packet loss. First up is latency, and as I’ve already shown that even 300ms (double the VMware quoted maximum) still results in a successful migration, I’m going to double the double to 600ms.

Despite the huge jump in bandwidth since the last run, the equally huge jump in latency puts the link firmly back into almost unusable territory. With the migration running, the charts show exactly that.

At peak, less than a fifth of the total bandwidth available is being used and the average use is less than half the peak. Nothing out of the ordinary for a link with such high constant latency. Accordingly, migration time is huge at just over 12 hours.
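That behaviour is basic TCP maths rather than anything HCX-specific: each stream can only keep one receive window’s worth of data in flight per round trip, so throughput tops out at roughly window size divided by RTT. The sketch below uses window sizes picked purely for illustration, and HCX runs multiple streams inside its tunnel, so this is only the shape of the problem rather than a model of the product.

```python
# Single TCP stream ceiling: throughput <= receive window / round-trip time.
def stream_mbps(window_bytes: int, rtt_ms: float) -> float:
    return (window_bytes * 8) / (rtt_ms / 1000) / 1e6

for window_mb in (1, 4, 16):
    print(f"{window_mb}MB window at 600ms RTT: ~{stream_mbps(window_mb * 1024**2, 600):.0f} Mbit/s")

# Filling the whole 500Mbit link at 600ms needs this much data in flight:
bdp_mb = (500e6 * 0.6 / 8) / 1024**2
print(f"Bandwidth-delay product: ~{bdp_mb:.0f} MB")
```

Keeping roughly 36MB in flight per round trip is a big ask, so seeing the link sit well under its 500Mbit ceiling at 600ms is exactly what you’d expect.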

Dropping Packets

My last attempt to prevent HCX from doing its job is to introduce packet loss to the link. The VMware table at the top of the post specifies a maximum loss of 0.1%. This feels like another worst case scenario kind of test; a high level of packet loss on any link isn’t something that would be tolerated for long. For this test, I’m going to remove the latency from the link, but maintain the 500Mbit bandwidth. I’m introducing 5% packet loss.

The results are nothing shocking. For anyone not familiar, packet loss on a link carrying TCP traffic forces retransmissions and causes TCP to back off, effectively slowing down any data transfer.

With 5% packet loss set, average bandwidth is a fraction of the potential 500Mbit link speed.

With no packet loss, practically the entire 500Mbit link is used.

As packet loss increases, the above downward trend would continue until the link is unusable. As with the other tests above, HCX appears no more sensitive to poor link quality than any other network traffic would be.
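The shape of that downward trend is captured reasonably well by the classic Mathis et al. approximation for TCP throughput under random loss, which scales with the inverse square root of the loss rate. The sketch below uses a simplified form of the formula (constant factor omitted), and the MSS and base round-trip time are assumptions of mine for illustration; it’s a rule of thumb for the trend, not a model of the HCX tunnel, which runs multiple streams.

```python
# Simplified Mathis et al. rule of thumb for a single TCP stream:
#   throughput ~= MSS / (RTT * sqrt(loss))
# MSS and base RTT below are assumed values for illustration only.
from math import sqrt

MSS_BYTES = 1460
RTT_S = 0.001            # assume ~1ms base round trip with no added latency

def mathis_mbps(loss: float) -> float:
    return (MSS_BYTES * 8) / (RTT_S * sqrt(loss)) / 1e6

for loss in (0.001, 0.01, 0.05):
    print(f"{loss:.1%} loss: ~{mathis_mbps(loss):.0f} Mbit/s per stream")
```

Whatever the exact numbers, the point matches the charts above: loss hurts every TCP transfer in the same inverse-square-root way, HCX included.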

Conclusion

It has become apparent that HCX will perform adequately on any link that is of relatively decent quality. It continues to function with acceptable performance well below the recommended figures that VMware quote in documentation. As I have stated in the tests above, I would expect that if the link between sites functions then HCX will also function.

To put a little more of a ‘real world’ slant on this, I performed a simple latency test to several AWS regions in which it’s possible to run VMware Cloud on AWS, and which could therefore act as the hybrid cloud target for HCX migrations.

My test environment is located in eastern US, so latency to US regions is lowest. Keeping in mind that I tested all the way to 600ms of latency and had successful, albeit slow transfers, it makes any of the available regions above seem viable.

It is obviously not realistic to expect HCX to be able to perform any magic tricks and work well (or at all) over a link so poor that any other network traffic would have issues traversing it. I am pleased that my proposition at the beginning of this post was incorrect. I assumed there would be a point at which a built-in timeout or process error would take the whole thing down and HCX would beg for a faster and more stable link. I also assumed that when that happened, I’d still be able to show that inter-site connectivity was up and somewhat functional outside of HCX.

A somewhat anticlimactic conclusion perhaps, but one that’ll be of great use for my next HCX conversation. At least now I know that when a colleague asks what kind of link they need for HCX, I can confidently answer “anything that works”.

VxRack SDDC to VCF on VxRail; Part 3. Installing VMware HCX.

Quick Links
Part 1: Building the VCF on VxRail management cluster
Part 2: Virtual Infrastructure Workload Domain creation
Part 3: Deploy, Configure and Test VMware HCX
Part 4: Expanding Workload Domains

I’ve got Cloud Foundation up and running and a VI workload domain created, so I’m ready to think about getting some VMs migrated. This is where VMware HCX comes in. The subject of moving VMs around is a sometimes contentious one. You could talk to ten different people and get ten entirely unique but no less valid methods of migrating VMs from one vCenter to another, across separate SSO domains. But I’m working with HCX because that was part of the scenario.

That doesn’t mean I don’t like HCX, quite the opposite. It takes a small amount of effort to get it running, but once it is running it’s a wonderful thing. It takes a lot of the headache out of getting your VMs running where you want them to be running. It’s a no-brainer for what appears to be its primary use case, moving VMs around in a hybrid cloud environment.

Installing VMware HCX on source and destination clusters.

This is stage 3 of the build, HCX installation. I’ve worked out what VMs I can migrate to allow me to free up some more resources on the VxRack. Moving some VMs off the VxRack will allow me to decommission and convert more nodes, then add more capacity to my VCF on VxRail environment. In something of a departure from the deployment norm, the installation starts on the migration destination, not the source.

There is something I need to cover up front, lest it cause mass hysteria and confusion when I casually refer to it further down in this post. ‘Source’ and ‘destination’ are somewhat interchangeable concepts here. Usually, you’d move something from a source to a destination. With HCX, you also have the option of reverse migration. You can move from a destination to a source. Using HCX as a one-time migration tool from VxRack SDDC to VCF on VxRail, it doesn’t matter too much which clusters are my source or destination. If I intended to use HCX with other clusters in the future, or with a service like VMware Cloud on AWS, I’d probably put my source on a VxRail cluster and my first destination on VxRack SDDC. Also important here is that one source appliance can link to several destinations.

Back to the install. The HCX installer OVA is deployed on the VxRail VI workload domain that I created in the last part. The deployment is like any other. I set my management network port group and give the wizard some IP and DNS details for the appliance. The host name of the appliance is already in DNS. After the deployment, the VM is powered on and left for about 5 minutes to allow all services to start up. As you might expect, attempting to load the UI before everything has properly started up will result in an error. When it’s ready to go, I’ll open up https://[DESTINATION-FQDN]:9443 in my browser and log in at the HCX Manager login prompt.
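Rather than refreshing the browser until the error goes away, a small loop like the one below can watch for the admin UI to come up. It’s nothing more than an HTTPS reachability check (not an HCX API call), the FQDN is a placeholder, and certificate verification is disabled because a fresh appliance presents a self-signed certificate.

```python
# Poll the HCX Manager admin UI (port 9443) until it answers, instead of
# refreshing the browser while appliance services start up.
import time
import requests
import urllib3

# A fresh appliance uses a self-signed certificate, so suppress the warning.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

URL = "https://DESTINATION-FQDN:9443"      # replace with the appliance FQDN

for attempt in range(60):                  # give it up to roughly 10 minutes
    try:
        r = requests.get(URL, verify=False, timeout=5)
        print(f"HCX Manager responded with HTTP {r.status_code} after ~{attempt * 10}s")
        break
    except requests.exceptions.RequestException:
        time.sleep(10)
else:
    print("HCX Manager still not responding; check the appliance console")
```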

The initial config wizard is displayed, and it’s quite a painless process. It’s notable though that internet access is needed to configure the HCX appliance. Proxy server support is available. I enter my NSX Enterprise Plus license key, leaving the HCX server URL at its default value.

HCX license entry and activation.

Click the activate button and, as I didn’t deploy the latest and greatest HCX build, a download & upgrade process begins. This takes several minutes, and the appliance reboots at the end to activate the update. Your mileage will no doubt vary, depending on the speed of the internet connection you’re working on.

HCX automatic download and upgrade

After the reboot, log back in at the same URL to continue the configuration. The next part involves picking a geographic location for your cluster. Feel free to be as imaginative as you like here. With all my clusters in the same physical location, I decided to take artistic license.

Location of the HCX destination cluster.

System name stays at the default, which is the FQDN with “cloud” tagged onto the end. 

HCX system name

“vSphere” is the instance type I’m configuring. Interestingly, VIO support appears to have been added in the very recent past and is now included in the instance type list.

HCX instance type

Next up are the login details for my VI workload domain vCenter and NSX manager instances.

HCX connection to vCenter and NSX

After which, the FQDN of the first PSC in the VCF management cluster.

HCX connection to external PSC

Then set the public access URL for the appliance/site. To avoid complications and potential for confusion down the road, this is set to the FQDN of the appliance.

HCX public access URL

Finally comes the now ubiquitous review dialog. Make sure all the settings are correct, then restart for the config to be made active.

Completed HCX initial setup

After the restart completes, additional vSphere roles can be mapped to HCX groups if necessary. The SSO administrators group is added as HCX system administrator by default, and that’s good enough for what I’m doing. This option is located within the configuration tab at the top of the screen, then under vSphere role mapping in the left side menu.

Deploying the OVA on the destination gives you what HCX calls a “Cloud” appliance. The other side of the HCX partnership is the “Enterprise” appliance. This is what I’m deploying on the VxRack SDDC VI workload domain. This is another potential source of confusion for those new to HCX. The enterprise OVA is sourced from within the cloud appliance UI. You click a button to generate a link, from which you download the OVA. To find this button, log out of the HCX manager, then drop the :9443 from the URL and log back in using SSO administrator credentials. Go to the system updates menu and click “Request Download Link”.

Requesting a download link for HCX Enterprise OVA

It may take a few seconds to generate the link, but the button will change to either allow you to copy the link or download the enterprise OVA directly.

HCX Enterprise OVA download link

I didn’t do this the first time around, because of an acute aversion to RTFM. Instead, I installed cloud and enterprise appliances that were of slightly different builds and ultimately, they did not cooperate. The site link came up just fine; I just wound up with VMs that would only migrate in one direction and lots of weird error messages referencing JSON issues.

The freshly downloaded enterprise appliance OVA gets deployed on the VxRack, and goes through much the same activation and initial configuration process as the cloud appliance did.

HCX has two methods of pairing sites: the legacy “Interconnect” method and the newer “Multi-Site Service Mesh”. The second is more complicated to set up, but the first is deprecated, so I guess the choice has been made for me.

Before I get to linking sites however, I need to create some profiles. This happens on both the cloud and the enterprise sites in an identical manner. I’ll create one compute profile per site, each containing three network profiles. The compute profile collects information on vSphere constructs such as datacenter, cluster and vSAN datastore. The network profiles are for my management, uplink and vMotion networks.

Still within the HCX UI, I move over to the interconnect menu under the infrastructure heading. The first prompt I get is to create a compute profile. I’ll try to make this less screenshot heavy than the above section.

1. First, give the compute profile a name. Something descriptive so it won’t end up a needle in a haystack of other compute profiles or service names. I name mine after the vSphere cluster it’s serving.

2. In services, I deselect a couple of options because I know I’m not going to use them. Those are network extension service and disaster recovery service. All others relate to migration services I’m going to need.

3. On the service resources screen, my VI workload domain data center and vSphere cluster are selected by default.

4. All I need to select on the deployment resources screen is the vSAN datastore relevant for this cluster. Only the resources within this cluster are displayed.

5. Now I get to the first of my network profiles, so back to the screenshots.

In the drop down menu for management network profile, click create network profile.

HCX service mesh network profile creation

Each network profile contains an IP pool, the size of which will vary depending on the quantity and complexity of services you want to set up. In my case, not very many or very complicated; each IP pool got just 2 addresses.

But wait a second, my uplink network profile is probably a little misleading. As I’m reusing the same IP subnet for the new environment, I created a management network profile with a sufficiently large IP pool to also serve as the uplink profile. So really, my management network profile got 4 IP addresses. I lied. Sorry about that.

The uplink profile might be a separate VLAN with an entirely different IP subnet to act as a transit network between the VxRack and VxRail. In my case, they’re on the same physical switches so that seems a little redundant. If my source and destination were in two different physical locations, my uplink port group would be using public IP addressing within my organization’s WAN. On that subject, there are ports that need to be open for this to work, but it’s nothing too out of the ordinary. TCP 443 and UDP 500 & 4500. Not a concern for me, as I have no firewalling in place between source and destination.
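For anyone who does have a firewall in the path, a quick pre-flight check from the source side can save some head scratching later. The sketch below only proves the TCP 443 half; UDP 500 and 4500 are connectionless, so a simple connect can’t confirm the IPsec ports, and the hostname is a placeholder.

```python
# Quick check that TCP 443 is reachable on the remote HCX/uplink address.
# UDP 500/4500 (IPsec) can't be verified with a simple connect like this,
# so treat this as a partial pre-flight check only.
import socket

REMOTE = "DESTINATION-FQDN"   # replace with the remote uplink IP or FQDN

try:
    with socket.create_connection((REMOTE, 443), timeout=5):
        print(f"TCP 443 to {REMOTE} is open")
except OSError as exc:
    print(f"TCP 443 to {REMOTE} failed: {exc}")
```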

Finally I’ll create a vMotion network profile using the same process as the management network profile. I don’t have a default gateway on the vMotion VLAN, so I left that blank along with DNS information.

HCX service mesh network profile creation

Next up is vSphere replication, and the management network profile is selected by default. Connection rules are then generated; these are only a concern if firewalls exist between source and destination. Otherwise, continue and then click finish to complete the compute profile on the destination.

Now do the exact same thing on the source appliance.

With all the profiles in place, I’ll move on to setting up the link. That is accomplished on the source appliance (or the HCX plugin within the vSphere web client) by entering the public access URL which was set up during the deployment of the cloud appliance, along with an SSO user that has been granted a sufficiently elevated role on the HCX appliance. Keeping things simple, I left it with the default administrator account. I’ll complete everything below from within the HCX source appliance UI.

First up, I’ll import the destination SSL certificate into the source appliance. If I don’t do this now, I’ll get an error when trying to link the sites in the next step. This is done by logging into the source appliance at https://[SOURCE-FQDN]:9443, clicking on the administration menu and then the trusted CA certificate menu. Click import and enter the FQDN of the destination appliance.

HCX import destination appliance certificate

After clicking apply, I get a success message and the certificate is listed. With source and destination clusters sharing the same SSL root, the amount of setup I need to do with certificates is minimal. If I was migrating VMs across different trusted roots, I’d need a lot more to get it working. I’m not covering it here, mostly because I couldn’t explain it any better than Ken has already done on his blog.
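If you want to eyeball what you’re about to trust before (or after) importing it, the destination certificate can be pulled with a couple of lines of standard library Python. The FQDN below is a placeholder.

```python
# Fetch the destination HCX appliance certificate so it can be inspected
# before trusting it on the source side.
import ssl

DEST_FQDN = "DESTINATION-FQDN"   # replace with the destination appliance FQDN

pem = ssl.get_server_certificate((DEST_FQDN, 443))
print(pem)
# Pipe the PEM through "openssl x509 -noout -text" for a readable
# subject/issuer/validity breakdown.
```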

Within the interconnect menu, open site pairing and click on the “Add a Site Pairing” button. Enter the public access URL of the destination site (remember I set it as the FQDN of the destination) and also enter a username and password for an SSO administrator account.

HCX site pairing dialog

If everything up to this point has been configured correctly, the site pairing will be created and then displayed.

HCX site pairing display

On the home stretch now, so I’m moving on to the service mesh. Within the service mesh menu, click on “Create Service Mesh”. The source appliance will be selected; click the drop down next to it to select the destination appliance. Now select compute profiles on both sites. Services to be enabled are shown. As expected, I’m missing the two I deselected during the compute profile creation. I could at this point choose entirely different network profiles if I wished. I don’t want to override the profiles created during the compute profile creation, so I don’t select anything here. The bandwidth limit for WAN optimization stays at its default 10Gbit/s. Finally a topology review and I’m done with the service mesh. Except not quite yet. I’ll give it a name, then click finish.

The service mesh will be displayed and I’ll open up the tasks view to watch the deployment progress. But alas, it fails after a couple of minutes. Thankfully, the error message doesn’t mess around and points to the exact problem. I don’t have a multicast address pool set up on my new NSX manager.

HCX failed service mesh deployment

That’s an easy one to fix. In vSphere web client, jump over to the NSX dashboard by selecting networking and security from the menu. Then into installation and upgrade and finally logical network settings. Click on edit under segment IDs. Enable multicast addressing and give it a pool of addresses that doesn’t overlap with any other pool configured on any other instance of NSX that may be installed on VxRail or VxRack clusters.

NSX segment ID settings

With that minor issue resolved, I go back to the HCX UI and edit the failed service mesh. Step through the dialog again (not changing anything) and hit finish. Now I’m back to watching the tasks view. This time it’s entirely more successful.

The above configuration deploys two VMs per site to the cluster and vSAN datastore chosen in the compute profile. A single, standalone ‘host’ (like a host, but more virtual) is added per site to facilitate the tunnel between sites.

Leaving the newly deployed service mesh to settle and do its thing for a few minutes, I returned to see that the services I chose to deploy are all showing up. Viewing the interconnect appliance status shows that the tunnel between the sites is up.

HCX appliance and tunnel status

In the vSphere web client, it’s time to test that tunnel and see if I can do some migrations. The HCX plugin is available in the menu, and the dashboard shows our site pairing and other useful info.

Into the migration menu and click on “Migrate Virtual Machines”. Because I don’t really want to have to migrate them one by one. I could have done that by right clicking on each VM and making use of the “HCX Actions” menu. That was labeled “Hybridity Actions” when I was running an earlier version. I imagine that was like nails on a chalkboard to the UX people.

Inside the migrate virtual machines dialog, my remote site is already selected. If I had more than one (when I have more than one), I’ll need to select it before I can go any further. I’m going to migrate three test VMs from the VxRack SDDC to the VxRail VI workload domain, using each of the three available migration options. Those are vMotion, bulk and cold.

The majority of my destination settings are the same, so I set default options which will be applied to VMs chosen from the list. The only things I’ll need to select when picking individual VMs are the destination network and either bulk or vMotion migration.

HCX VM migration dialog

A little info on migration options. When I select a powered off VM, cold migration is the only available option. For powered on VMs, I can choose bulk or vMotion. The difference being that vMotion (much like a local vMotion) will move the VM immediately with little to no downtime. Bulk migration has the added benefit of being able to select a maintenance window. That being, a time when the VM will be cut over to the destination site. Very useful for, as the name suggests, migrating VMs in bulk.

With all my options set, I advance to the validation screen. Unsurprisingly, it’s telling me that my vMotion might be affected by other migrations happening at the same time. My bulk migration might need to reboot the VM because my installation of VMware Tools is out of date. As this is a test, I’m not going to worry about it.

HCX VM migration status

As you’d expect, vMotion requires CPU compatibility between clusters. Not an issue for me, because I’m reusing the same hosts so all of the nodes have Intel Xeon 2600s. If this wasn’t the case, I’d have ended up enabling EVC. But better to figure out any incompatibility up front, because enabling EVC once you’ve already got VMs on the cluster isn’t a trivial matter. Also on this subject, be aware that when a VxRail cluster is built, EVC will be on by default. I already turned it off within my destination VxRail cluster.

I’m going to go out on a limb and guess that bulk migration is the one I’ll end up using the most. That way, I can schedule multiple VMs during the day and set my maintenance window at the same time. Data will be replicated there and then, with VM cutover only happening later on in the maintenance window. Great for those VMs that I can take a small amount of downtime on, knowing it’ll be back up on the VxRail in the time it takes to reboot the VM.

Second will probably be cold migration, for those VMs that I care so little about that I’ve already powered them off on the VxRack. Any high maintenance VMs will get the vMotion treatment, though almost certainly still within a brief maintenance window. HCX may whine at me for VMware Tools being out of date on (some) most of the VMs, so I’ll either upgrade tools or deal with HCX potentially needing to bulk migrate and reboot those VMs in order to move them.

As to why I left two services out of the service mesh, I won’t be using HCX in a disaster recovery scenario and I won’t be extending any layer 2 networks. The VxRack and VxRail share top of rack switching, so any and all important L2 networks will be trunked to the VxRail and have port groups created. 

That’s certainly leading on to a much larger conversation about networking and VLAN or VXLAN use. Both the VxRack SDDC and VCF on VxRail clusters have NSX installed by default, and I’m using NSX backed networks for some of my VMs. I’ll get to that in the near future as a kind of addendum to this process.

So;

How long is it going to take? – It was just a little under 2 days total before I touched HCX. A single source and destination install, along with configuration and site pairing, could make up the rest of day 2. All of that takes about 90 minutes.

And;

How much of it can be automated? – Depending on your chosen deployment strategy, HCX could be a one-time install. The relatively short time it takes to install (plus the potential for errors, as we’ve seen above) makes it a hard sell for automation.

With HCX installed and running, I can move onward. Out of the frying pan and into the fire. Getting some of those production VMs moving.