Deploying Cloud Foundation 3.9.1 on VxRail; Part 3

The next task on the list is to add a workload domain to the Cloud Foundation deployment.

The checklist of prerequisites includes the following;

  1. Additional VxRail nodes prepared to version 4.7.410
  2. All DNS records for the new cluster created
  3. IP addresses assigned and DNS records created for NSX flavour of choice
  4. User ‘vxadmin’ created in SSO (I’ll cover this in the video below)

Before starting the process shown in the video, I grabbed three additional VxRail E460F nodes and upgraded RASR to 4.7.410. I kicked off a factory reset and let that run while I got on with the initial creation of the workload domain.

Getting to the content of the video, I first created a new workload domain in SDDC Manager. I entered a workload domain name and all the required details for the new vCenter.

While the vCenter was deploying, I finished up the factory reset on my three new VxRail nodes and made VxRail Manager reachable. In my environment, this consists of the following (there’s a consolidated command sketch after the list);

  1. Log into the DCUI on two of the three nodes (KVM via each node’s iDRAC) and enable the shell in the troubleshooting menu.
  2. Set the VLAN ID of two port groups to match the management VLAN. This is done on the shell with the command esxcli network vswitch standard portgroup set -p "[port group name]" -v [VLAN ID]. I changed the VLAN IDs for the port groups ‘Private Management Network’ and ‘Private VM Network’.
  3. Restart loudmouth (the discovery service) on both nodes with the command /etc/init.d/loudmouth restart.
  4. Wait for the primary node to win the election and start the VxRail Manager VM. You can identify the primary by checking which node has the VxRail Manager VM booted: use the command vim-cmd vmsvc/getallvms to get the ID of the VxRail Manager VM, then use vim-cmd vmsvc/power.getstate [ID] to check whether the VM is powered on.
  5. On the primary node, set the ‘VM Network’ port group to the management VLAN (same command as above). Failing to set this will lead to massive confusion as to why you can’t reach the temporary management IP you’ll assign to the host in the next step. You’ll check VLANs, trunks, spanning tree and twenty other things before groaning loudly and going back to the node to set the VLAN. Ask me how I know.
  6. In the DCUI, give the primary node a temporary IP address on the management VLAN.
  7. Log into the vSphere client on the node and open the VxRail Manager VM console. Log in as root using the default password.
  8. Set a temporary IP on the VxRail Manager VM with the command /opt/vmware/share/vami/vami_set_network eth0 STATICV4 [ip address] [subnet mask] [gateway].
  9. Restart the marvin and loudmouth services on the VxRail Manager VM with systemctl restart vmware-marvin and systemctl restart vmware-loudmouth.
  10. Give it a moment for those services to restart, then open the temporary VxRail Manager IP in a browser.
  11. Go back to the third (and any subsequent) node(s) and perform steps 1 to 3 above.
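For reference, here are steps 2 to 9 condensed into the raw commands. Treat this as a sketch of my environment: the VLAN ID (100) and the temporary IP addressing are placeholders, so substitute your own values, and the VM ID for the power state check comes from the getallvms output.

    # On each node's ESXi shell: retag the private port groups to the management VLAN
    esxcli network vswitch standard portgroup set -p "Private Management Network" -v 100
    esxcli network vswitch standard portgroup set -p "Private VM Network" -v 100

    # Restart the discovery service
    /etc/init.d/loudmouth restart

    # Primary node only: confirm the VxRail Manager VM is powered on
    vim-cmd vmsvc/getallvms | grep -i "VxRail Manager"   # note the Vmid column
    vim-cmd vmsvc/power.getstate 1                       # replace 1 with that Vmid

    # Primary node only: retag 'VM Network' so the temporary IPs are reachable
    esxcli network vswitch standard portgroup set -p "VM Network" -v 100

    # On the VxRail Manager VM console (as root): set a temporary IP, then bounce services
    /opt/vmware/share/vami/vami_set_network eth0 STATICV4 192.168.100.200 255.255.255.0 192.168.100.1
    systemctl restart vmware-marvin
    systemctl restart vmware-loudmouth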

Before kicking off the VxRail build, I go back and remove the temporary management IP address I set on the primary node to prevent any confusion on the built cluster. I’ve found in the past that SDDC Manager sometimes isn’t too happy if there are two management IP addresses on a host; it tends to make the VCF bringup fail at around the NFS datastore mount stage.
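If you want to be certain the temporary address is gone before the build, a quick check from the ESXi shell lists every VMkernel interface and its IPv4 address; after cleanup there should be a single management IP (vmk0 in my case) on the management subnet.

    # List all VMkernel interfaces and their IPv4 addresses
    esxcli network ip interface ipv4 get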

Before anyone says anything; Yes, this would be a lot easier if I had DHCP in the environment and just used the VxRail default VLAN for node discovery. But this is a very useful process to know if you find yourself in an environment where there is no DHCP or there are other network complications that require a manual workaround. I may just have to create another short video on this at some stage soon.

With my vCenter ready and my VxRail ready to run, I fired up the wizard and allowed the node discovery process to run. After that, I chose to use an existing JSON configuration file I had from another workload domain I created not too long ago. I’d be changing pretty much everything for this run; it just saves some time to have some of the information prepopulated. I am of course building this VxRail cluster with an external vCenter; the same vCenter that SDDC Manager just created.

The installer kicks off and if I log into the SDDC management vCenter, I can watch the workload domain cluster being built.

A little while later the cluster build completes, but I’m not done yet. I need to go back into SDDC Manager and complete the workload domain addition. Under my new workload domain, which is currently showing as ‘activating’, I need to add my new VxRail cluster. SDDC Manager discovers the new VxRail Manager instance; I confirm password details for the nodes in the cluster and choose my preferred NSX deployment. In this case, I’m choosing NSX-V. I only have two physical 10Gbit NICs in the nodes, so NSX-T isn’t an option. Roll on Cloud Foundation 4.0 for the fix to that.

I enter all the details required to get NSX-V up and running; NSX Manager details, NSX controllers and passwords for everything. I choose licenses to apply for both NSX and vSAN, then let the workload domain addition complete. With that, the configuration state shows ‘active’ and I’m all done.

Except not quite. In the video, I also enabled vRealize Log Insight on the new workload domain before finishing up.

On the subject of the vRealize Suite, that’s up next.

Deploying Cloud Foundation 3.9.1 on VxRail; Parts 1 & 2

Before I move on and dedicate the majority of my time to Cloud Foundation 4, I created a relatively short series of screencasts detailing the process to deploy Cloud Foundation 3.9.1 on VxRail. 

I say detailing; I really mean quite a high-level overview. It’s by no means a replacement for actually reading and understanding the documentation. I’ve split the whole show into six parts;

  1. Deploying Cloud Builder
  2. Performing the Cloud Foundation bringup
  3. Creating a workload domain
  4. Deploying vRealize Lifecycle Manager
  5. Deploying vRealize Automation
  6. Deploying vRealize Operations Manager

It’s my hope that each of the fairly brief videos will provide an overview of the deployment process and maybe even help someone who is in a “what the hell is this screen and what do I do next?” scenario.

My environment for this series is seven VxRail E460F nodes. The nodes have had a RASR upgrade to 4.7.410 and four of them have already been built into a cluster for my Cloud Foundation management domain. It goes without saying that I’m following the VMware bill of materials for version 3.9.1.

Before we do anything, we need to get Cloud Builder running. That’s what I’ve done in part 1 below. For all the videos in this series, it’s better to view fullscreen. Unless you like squinting at microscopic text, of course.

Prerequisites for this part are easy; you need the Cloud Builder OVA. Unfortunately, the prerequisites aren’t going to remain this easy to satisfy throughout the rest of the series.
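I deploy the OVA through the vSphere client in the video, but for the command-line inclined, an ovftool run along these lines does the same job. Treat this as a sketch: the file name, datastore, network mapping and vi:// target below are placeholders from my lab, and the appliance will also want its OVF properties (passwords and network settings) supplied via --prop: flags or filled in through the deploy wizard.

    # Deploy the Cloud Builder OVA with ovftool (all names and paths are placeholders)
    ovftool --acceptAllEulas --powerOn \
      --name=cloudbuilder \
      --datastore=VxRail-Virtual-SAN-Datastore \
      --net:"Network 1"="Management Network" \
      ./VMware-Cloud-Builder.ova \
      'vi://administrator@vsphere.local@vcenter.lab.local/Datacenter/host/Management/'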

In the video above, I’ve also included two of the prerequisites for the next part;

  1. Externalising the vCenter server. Thankfully, this was made much easier in later VxRail builds.
  2. Converting the management portgroup to ephemeral binding.

Because simply deploying an OVA isn’t exactly face-meltingly exciting, I’m including the second part of the series in this post also.

That second part is the actual deployment/bringup of Cloud Foundation and establishing the management cluster.

The prerequisites for this part are slightly more demanding. In what could be a frustrating move, I’m going to insist that you go out and search for these yourself. Or just deploy Cloud Builder and check out the extensive list you get when you attempt a bringup. The three that concern me most are;

  1. Make sure you have end-to-end jumbo frames configured (MTU of 9000). VMware don’t specifically recommend this on all VLANs, but I usually go jumbo everywhere to save me time and potential troubleshooting headaches later. There’s a quick verification sketch after this list.
  2. Enable and configure BGP on your top of rack switches. In 3.9.1, we’re going with BGP right from the start with something VMware is calling “Application Virtual Networks” (AVNs). Or to everyone else, NSX-V logical switches. Two of these will be configured from day 1, so we’ll need to set up BGP peers on the ToRs and make sure the network can route to the AVN subnets (in the case where you’re not running dynamic routing everywhere).
  3. DHCP for VXLAN VTEPs. I don’t have DHCP readily available in the lab, so this has been a pain for me since the first VCF on VxRail deployment. I end up deploying pfSense onto the management cluster, configuring it, then shutting it down and removing it from inventory. Once the Cloud Foundation bringup validation is complete and bringup is running, I hop back into vCenter, add the VM back to inventory and power it up; that’s shown in the video below, and the vim-cmd equivalent is sketched after this list. The reason being that if any unknown VMs are running while bringup validation is in progress, it seems to make the validation fail.
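On the jumbo frames point, the quickest way I know to verify end-to-end MTU from an ESXi host is vmkping with the don’t-fragment bit set. The vmk interface and target IP below are placeholders for your environment; 8972 bytes of payload plus 28 bytes of ICMP/IP headers makes the full 9000-byte frame.

    # Test jumbo frames end to end with don't-fragment set
    vmkping -I vmk3 -d -s 8972 192.168.120.2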
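And for the pfSense trick, re-adding the VM doesn’t have to be done through vCenter; the same thing from the host’s shell looks like this. The .vmx path is a placeholder for wherever the VM lives on your datastore.

    # Re-register the pfSense VM from its .vmx; the command returns the new VM ID
    vim-cmd solo/registervm /vmfs/volumes/datastore1/pfsense/pfsense.vmx

    # Power it on using the ID returned above (42 is a placeholder)
    vim-cmd vmsvc/power.on 42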

Everything else is taken care of. I’ve configured all the DNS records and ensured the cluster nodes are healthy in vCenter.

A word of caution before continuing. Be sure, very sure, that your deployment parameter Excel spreadsheet is correctly completed. Make sure all the IP addresses and FQDNs you’ve entered are correct, everything is set up in DNS, and forward & reverse lookups are perfect. The bringup validation won’t necessarily catch every error, and if bringup kicks off or gets halfway through and then fails due to an incorrect IP address, you’re going to be resetting your VxRail and starting from scratch. Ask me how I know…
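One way to take some of the risk out of that: loop over the FQDNs from the parameter sheet and confirm forward and reverse lookups agree. A minimal sketch, runnable from any Linux box with dig; the hostnames below are placeholders for your own records.

    # Check forward then reverse resolution for each FQDN in the list
    for fqdn in sddc-manager.lab.local vcenter-mgmt.lab.local nsx-mgmt.lab.local; do
      ip=$(dig +short "$fqdn")
      echo "$fqdn -> $ip -> $(dig +short -x "$ip")"
    done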

Having a look at the Planning & Preparation guide is probably a wise choice before we go kicking off any bringups.

On with part 2 and getting the management domain up and running.

In the above video, you’ll see the bringup failed while validating BGP. When Cloud Builder deploys the NSX Edge Services Gateways for the AVN subnets, it doesn’t specify default gateways, so no traffic can get out of the two AVN NSX segments. Digging through the planning & prep guide, I can’t see any specific requirement for what I’ve done; that being to enable default-originate within the BGP neighbor config for each of the four peerings to the ESGs. That way, a default route is advertised to the ESGs and everybody is happy. Maybe this is environment specific, maybe it’s an omission from the guide. Either way, it works for me in my lab!
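For anyone wanting to replicate it, the relevant fragment of my ToR config looks roughly like this. Cisco-style syntax as a sketch only; the AS numbers and peer addresses are placeholders, and the same pair of lines repeats for each of the four ESG peerings.

    router bgp 65001
      neighbor 172.27.11.2 remote-as 65003
      neighbor 172.27.11.2 default-originate
      neighbor 172.27.11.3 remote-as 65003
      neighbor 172.27.11.3 default-originate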

That’s it for now. Next up, I’ll be adding a workload domain.