VMware Cloud on AWS; SDDC Groups & Advanced Connectivity

Now that we’ve got our first SDDC up and running with a VPN connection back to our office/data center, it’s time to tear it all down again and turn it into something a little more useful. I need to define the word useful before I go on, because this isn’t going to be a topology that everyone wants or needs. It’s simply a demo of what you can do with SDDCs and AWS at scale. If you have multiple SDDCs in a region or across multiple regions and need them to communicate with multiple AWS accounts (possibly managed by AWS Control Tower), then this will be relevant for you. Or if, like me, you’re a networking nerd and just want to see something cool, read on.

This can obviously also work on a smaller scale, but as some of the components I’ll be deploying will lead to increased AWS charges, there’s likely to be a cheaper way to accomplish your goals if you have a smaller environment.

For reference as we move through this exercise, the below diagram is roughly the end state I want for my environment.

On the left side of the diagram is where our VMware Cloud components live. Right now we’ve just got a single SDDC, so I’ll be creating an SDDC group and adding our SDDC to it. The functionality of an SDDC group goes well beyond what we’ll be doing here. It can connect multiple SDDCs together via high-bandwidth, low-latency links (what VMware calls “Transit Connect”) across up to three AWS regions. SDDC groups can only be formed with SDDCs that are within the same VMC organisation. If your intention isn’t to peer with an AWS account, you can also connect an AWS Direct Connect link to an SDDC group.

Getting back to our scenario, on the right side of the diagram is our existing AWS infrastructure. We’ve got AWS Control Tower set up and, to keep things neat and tidy, I’ve got a ‘Network Shared Services’ account where all the network interconnects terminate and where our transit gateway is set up. I’ve also got a couple of other accounts running various workloads. Some production, test & dev, sandbox, stuff like that. In some cases we’ve got VPCs created in our network account shared to those accounts, and other accounts are free to create their own VPCs and request to attach them to the transit gateway in the networking account.

The bottom of the diagram is the easiest part. I’ve got a VMware cluster on-premises which is currently attached to the AWS transit gateway using a route-based VPN connection. If I was in need of more bandwidth or lower latency than a VPN could provide, I’d look at a Direct Connect (or multiple Direct Connect links for redundancy).

With all that covered, let’s launch into the demo.

We start off pretty basic by creating the SDDC group, then things get a little more complicated. To establish the peering between the VMware managed TGW and our TGW, I needed to provide the account number where the TGW is located and the ID of the TGW itself. I then needed to accept the peering request manually. I haven’t enabled auto-accept on the TGW, and unless you’d happily give a set of your house keys to everyone that has access to your AWS organisation, you shouldn’t either. The potential for chaos on a grand scale is just too much of a risk to accept.
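
If you’d rather do the acceptance from the CLI than the console, it looks roughly like this (a minimal sketch; the attachment ID, account and region are placeholders for whatever your network account actually shows):

```bash
# Find the peering attachment the VMware-managed TGW has offered to our TGW
aws ec2 describe-transit-gateway-peering-attachments \
  --filters "Name=state,Values=pendingAcceptance" \
  --region eu-central-1

# Accept it manually - deliberately no auto-accept on this TGW
aws ec2 accept-transit-gateway-peering-attachment \
  --transit-gateway-attachment-id tgw-attach-0123456789abcdef0 \
  --region eu-central-1
```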

With that done, I decided to create a prefix list with VMC. That will make routing updates a little easier. A prefix list creates an aggregated list of all network routes from the VMC compute gateway and shares it with the AWS account you specify. So instead of manually adding/removing a route every time you add a new segment or remove an old one on VMC, the prefix list will take care of it and ensure our transit gateway has an up-to-date routing table. Better still, because we’re running BGP over the VPN the routing updates will also be pushed down to the on-premises router without us having to do anything.

To finish up the VMC prefix list setup, I need to accept the resource share from VMC. I then need to create a reference in the transit gateway routing table for it. That’s it. Almost no effort to have dynamic routing everywhere.
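
For anyone scripting this rather than clicking through the console, the two steps look something like the sketch below. The invitation ARN, route table, prefix list and attachment IDs are all placeholders; pull the real values from the RAM and TGW consoles or their describe calls.

```bash
# Accept the resource share that VMC sent to the network account
aws ram get-resource-share-invitations
aws ram accept-resource-share-invitation \
  --resource-share-invitation-arn arn:aws:ram:eu-central-1:111122223333:resource-share-invitation/example

# Reference the shared prefix list in the transit gateway route table,
# pointing it at the peering attachment to the VMware-managed TGW
aws ec2 create-transit-gateway-prefix-list-reference \
  --transit-gateway-route-table-id tgw-rtb-0123456789abcdef0 \
  --prefix-list-id pl-0123456789abcdef0 \
  --transit-gateway-attachment-id tgw-attach-0123456789abcdef0
```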

After creating the necessary firewall rules within the SDDC, I have full connectivity to my AWS accounts and my SDDCs from on-premises via a single VPN connection. I also have bi-directional connectivity from all of my AWS accounts to my SDDC. A new segment created in the VMC console is immediately pushed to the SDDC vCenter and almost immediately pushed to the AWS and on-premises routing tables. Temper your expectations though; it is BGP, after all.

The networking is now complete and, like all good projects, my introductory journey into VMware Cloud on AWS is experiencing some scope creep. So stay tuned for part three of my two-part series, in which I’ll cover the full setup of VMware HCX from my on-premises cluster to my VMC SDDC and migrate some VMs to the cloud.

VMware Cloud on AWS; Getting up and running

I make no bones about being a reformed server hugger. One of my more recent catchphrases is “I’d be a happy man if I never had to build another physical server spec”. So for those currently in the VMware ecosystem, I’ve been turning a lot of attention toward VMware Cloud on various hyperscalers. There is a particular emphasis on AWS and Azure here. GCP is also available, I just don’t get asked about it a lot. Sorry Google.

Some of the most frequent questions I hear cover how difficult the platform is to set up and maintain, how much knowledge of the underlying hyperscaler is required, the ubiquitous horror of “vendor lock-in”, and of course the overall cost of the solution. The latter is a whole series of articles in itself, so I’ll be tactically dodging that one here. The short answer, as it so often is, is “it depends”.

Another frequently asked question surrounds the whole idea of “why should I be in the cloud”. There are quite a few well documented use cases for VMware Cloud. The usual ones are there: disaster recovery, virtual desktops, data center extension, or even full-on cloud migration. One that’s sometimes overlooked is that because SDDCs are so quick to spin up and can be run with on-demand pricing, they’re great for short-lived test environments. Usually, you can create an SDDC, get connectivity to it and be in a position to migrate VMs from your on-premises cluster or DR solution within a couple of hours. No servers to rack, no software to install, no network ports to configure, no week-long change control process to trundle through.

For those organisations currently using native cloud services, there are a whole load of additional use cases. Interaction with cloud native services, big data, AI/ML, containers, and many many more buzzwords.

What I can do is show you how quickly you can get a VMC on AWS SDDC spun up and ready for use. The process is largely the same for Azure VMware Solution, with the notable difference that the majority of the VMC interaction in Azure is done via the Azure console itself. The video below gives you an idea of what to expect when you get access to the VMC console and what creating a new SDDC on AWS looks like.

I’ve included some detail on setting up a route-based VPN tunnel back to an on-premises device. Policy-based and layer 2 VPNs are also available, but as my device (a Ubiquiti EdgeRouter) is BGP capable I’m taking what I consider to be the easy option. I prefer to use a route-based VPN because it makes adding new networks easier and allows a great deal of control using all the usual BGP goodness any network team will be comfortable with. There is also some content covering firewall rules, because, as you might expect, they are a core component of controlling access to your newly created SDDC.

Something you may immediately notice (although not explicitly demonstrated in the video) is that the only interaction I had with AWS native services throughout the SDDC creation was to link an existing AWS account to the SDDC. For those without AWS knowledge, this is an incredibly straightforward process with every step walked through in just enough detail. If you’re creating an SDDC, you need to link it to an AWS account. If you don’t already have an AWS account, it’ll take less than five minutes to create one.

The only other requirement right now is to know a little bit about your network topology. Depending on how complex you want to get with VMC on AWS in the future, you’ll need to know what your current AWS and/or on-premises networks look like so you can define a strategy for IP addressing. But of course even if you never plan on connecting the SDDC to anything, you should still have an IP addressing strategy so it doesn’t all go horribly wrong in the future when you need to connect two things you had no plan to connect when you built them.

The one exception I’ll call out for the above is self-contained test environments. What I’m thinking here is a small SDDC hosting some VMs to be tested and potentially linked to an AWS VPC which hosts testing tools, jumpboxes, etc on EC2. If this never touches the production network, fill your boots with all the overlapping IP addresses you could possibly ever need. In this case, creating an environment as close as possible to production is critical to get accurate test results. Naturally, there’s also a pretty decent argument to be made here for lifecycling these types of environments with your infrastructure as code tool of choice.

On that subject, how do you connect a newly created SDDC to an AWS VPC?

This is the process at its most basic. Connecting a single VMC SDDC to a single AWS VPC in a linked AWS account. Great for the scenario above where you have self-contained test environments. Not so good if you have multiple production SDDCs in one or more AWS regions and several AWS accounts you want them to talk to. For that, we’ve got SDDC groups and transit gateways. That’s where the networking gets a little fancier and we’ll cover that in the next post.

I hope by now I’ve shown that a VMC on AWS SDDC is relatively easy to create and, with a bit of help from your network team or ISP, very easy to connect to and start actively using. I’ve tried to keep connectivity pretty basic so far with VPN. If you are already in the AWS ecosystem and are using one or more Direct Connect links, VMC ties in nicely with that too.

The concept of vendor lock-in comes up more than once here. It’s also something that’s come up repeatedly since Broadcom appeared on the VMware scene. To what extent do you consider yourself locked-in to using VMware and more importantly, what are the alternatives and would you feel any less locked-in to those if you had to do a full lift & shift to a new platform? If you went through the long and expensive process to go cloud native, would going multi-cloud solve your lock-in anxiety? Are all these questions making you break out in a cold sweat?

If you take nothing else away from all the above, I hope you’ve seen that despite some minor hyperscaler platform differences, VMware Cloud is the same VMware platform you’re using on-premises so there is no cloud learning curve for your VMware administrators. You can connect it to your on-premises clusters and use it seamlessly as an extension of your existing infrastructure. You can spin it up on demand and scale up and down hosts quickly if you need disaster recovery. You can test sensitive production apps and environments in isolation.

As I brought up those ‘minor hyperscaler differences’, how would they impact your choice to go with VMC on AWS, Azure VMware Solution, GCP, or another provider? As the VMware product is largely the same on any of the above, it comes down to what relationship you currently have (if any) with the cloud provider. The various providers will have different VMC server specs and connectivity options which would need to be properly accounted for depending on what your use case is for VMC. This is another instance of it being a whole topic by itself. If you’re curious, the comment box is below for your questions.

In the next post on this subject I’ll go into more detail about some complex networking and bring in the concept of SDDC groups, transit connect and AWS transit gateways. I’ll also cover a full end-to-end HCX demo in the next post, so you’ll have what is possibly the best way (in my biased opinion anyway) of getting your VMs into VMware Cloud on AWS.

Scaling out of your first meltdown on AWS

After a recent server hiccup that I’d deem larger than acceptable, I was forced to think about scaling and high availability. My old setup was as basic as you can get. After migrating from a shared hosting platform to AWS, I spun up a small instance on Lightsail for a couple of reasons. I was eager to try it out as it was new to me, and it looked very easy to get off the ground with. It is, but with a few caveats which I’ll go into throughout this post.

I went to AWS because I knew I’d eventually want to move beyond Lightsail and further into the big, bad AWS ecosystem and explore some of the other services available. But Lightsail offers a very easy way to get a server up and running for folks that maybe don’t have experience with AWS or just don’t want the complication of EC2, security groups, VPC, subnets, dozens of instance types, etc…

The old setup. A small Lightsail instance with a couple of basic firewall rules. To save instance disk space and hopefully speed up loading time, website assets are stored in an S3 bucket with CloudFront distributing them to website visitors. For those entirely new to AWS, S3 is a basic file storage service and CloudFront is a content delivery network (CDN). The CDN has multiple locations worldwide and caches files from the S3 bucket when website visitors request them. There is a much longer explanation for how CloudFront works, possibly too lengthy to go into here. Maybe in a future post.
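
For the curious, keeping that bucket up to date is a one-liner, and a cache invalidation after a big update isn’t much more. This is just a sketch with a made-up bucket name and distribution ID; the distribution itself was created in the console.

```bash
# Push the local assets folder to the S3 bucket CloudFront serves from
aws s3 sync ./assets s3://example-site-assets --delete

# Optionally tell CloudFront to drop its cached copies after a big change
aws cloudfront create-invalidation --distribution-id E123EXAMPLE --paths "/*"
```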

My little, cheap as chips Lightsail instance was humming away nicely until it wasn’t. A spike in visitors to one of the sites hosted on it caused a little meltdown. It did recover of course, without any intervention I’ll boastfully add, but it got me thinking about the people that are getting off the ground with their apps, wondering how they’ll move beyond that first little server that can’t really hack it anymore. Potentially even worse than that, startups without any IT people or developers that know a bit of AWS stuff might not actually know about any problems until one of their customers or potential customers calls to report the website or app is down.

I’m not “Scaling for your first ten million users” here, I just want to ride out the peaks and ensure that if something does go down, it wasn’t a server resource capacity issue.

But I’m not going to talk about any of that hypothetical stuff, I want to take a bit of time to show how I moved onto something more future-proof.

In summary:

What I’m keeping – S3 (Assets & DB backups), CloudFront.

What I’m not keeping – everything else.

Moving away from Lightsail

Lightsail, even though it’s a service within AWS, isn’t tightly integrated into the ecosystem. This is where the caveats I mentioned above start to come into play. I initially had the possibly flawed plan of taking my existing instance on Lightsail and just ‘moving’ it to EC2. But it’s not quite so straightforward. What I can do is take a snapshot of the instance and then export that snapshot to EC2.

I got the usual warnings/disclaimers about security (I’d need to review firewalling again later) and that my Lightsail SSH key would no longer work, so I’d need to create and apply a new one when launching the image in EC2.

As part of the export to EC2, an AMI (machine image) is also created. Once the export is finished, you can launch an EC2 instance from that AMI. Not overly difficult, just a little time-consuming. Also somewhat convoluted if you’re one of those people above who don’t understand a lot of the inner workings of AWS.
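
For reference, the snapshot and export can be driven from the CLI too. This is a sketch with placeholder names, not exactly what I ran; the export task takes a while, so the last call is just there to check on progress.

```bash
# Snapshot the running Lightsail instance
aws lightsail create-instance-snapshot \
  --instance-name web-prod-1 \
  --instance-snapshot-name web-prod-1-snap

# Export that snapshot to EC2 (this creates the AMI mentioned above)
aws lightsail export-snapshot \
  --source-snapshot-name web-prod-1-snap

# Check how the export task is getting on
aws lightsail get-export-snapshot-records
```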


Once the Lightsail export is done, it’s about as simple as right-clicking on the AMI and clicking Launch. I had to do all the normal instance launch stuff like picking an instance type and defining security groups. One important thing to note here is that the Lightsail snapshot is exported to EC2 in the same AWS region as the original Lightsail instance. Neither snapshots nor AMIs have global scope.

Having gone through the above, I then decided that I really didn’t want an exact copy of my existing instance. Changes in the infrastructure were planned that’d make my Lightsail instance obsolete.

Building the new instance

Once I had decided that a brand new instance was needed, I launched a small Ubuntu instance (my distro of choice for the last few years) and set about doing all the normal security hardening and services setup. I installed Nginx, PHP, some PHP extensions and a few more tools I’d need in the future to manage the instance. There is no magic here and there are multiple guides online. To be certain I don’t forget to do something, I usually have one of Digital Ocean’s excellent tutorials open in another window.

Another way I could have done this is to use a standard AWS machine image and employ any one of the configuration management tools (Ansible, Chef, Puppet) to do all the configuration and installation for me. But this is just a web server and won’t see a huge rate of configuration change. Tooling to manage change is probably overkill right now.

Taking advantage of EFS

I’d read a couple of blog posts on EFS, a relatively new managed file system service from AWS. I wanted to give it a try, but also wanted to be able to step back from it if it didn’t make sense in the long run. I created a new file system across all three availability zones in my chosen region, because ultimately I’ll end up load balancing across all three AZs if the shit hits the fan again. There’s nothing particularly difficult in creating the file system, I pretty much just took default options the whole way through.
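
If you wanted to script it instead, creating the file system and its three mount targets is only a handful of calls. The file system ID, subnet IDs and security group below are placeholders.

```bash
# Create the file system with the default-ish options I picked in the console
aws efs create-file-system \
  --performance-mode generalPurpose \
  --throughput-mode bursting \
  --tags Key=Name,Value=web-content

# One mount target per AZ, using a subnet in each
for SUBNET in subnet-aaa111 subnet-bbb222 subnet-ccc333; do
  aws efs create-mount-target \
    --file-system-id fs-0123456789abcdef0 \
    --subnet-id "$SUBNET" \
    --security-groups sg-0efs0000000000000
done
```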

Security group rules caught me out initially; I had forgotten to allow NFS traffic inbound from the security group that the web instances will use. I got a timeout when trying to mount the file system from the Ubuntu instance, which made the issue immediately obvious.
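
The missing rule is a one-liner if you’re doing it from the CLI; both group IDs here are placeholders for the EFS and web security groups respectively.

```bash
# Allow NFS (TCP 2049) into the EFS security group from the web instances' group
aws ec2 authorize-security-group-ingress \
  --group-id sg-0efs0000000000000 \
  --protocol tcp \
  --port 2049 \
  --source-group sg-0web0000000000000
```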

Web instance security group tag partially obfuscated in the above screenshot because of my (crippling) healthy web security paranoia.

On the topic of mounting EFS in the instance, there are a couple of ways to go about it. On Amazon Linux, all the tools are already installed. On Ubuntu, I had to install the EFS mount helper and then just put the mount entry into /etc/fstab to have it mount on startup. I did a quick write test to the file system and everything looked good. I’ll give it a little time to see if EFS will work for me and more importantly, how it’ll affect my monthly bill.
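
Roughly what that looks like on Ubuntu is below. It’s a sketch: the file system ID and mount point are placeholders, and the efs-utils build prerequisites change between versions, so check the project’s README before blindly copying.

```bash
# Build and install the EFS mount helper (amazon-efs-utils) from source
sudo apt-get update && sudo apt-get install -y git binutils
git clone https://github.com/aws/efs-utils && cd efs-utils
./build-deb.sh && sudo apt-get install -y ./build/amazon-efs-utils*.deb

# Mount at boot via fstab, with TLS enabled
sudo mkdir -p /mnt/efs
echo 'fs-0123456789abcdef0:/ /mnt/efs efs _netdev,tls 0 0' | sudo tee -a /etc/fstab
sudo mount -a
```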

As a last act on EFS setup, I cloned the git repo containing my web content to the EFS mount point and configured Nginx to serve from that location.

Auto scaling and load balancing

With my Ubuntu instance setup complete, I powered it down and created an AMI.

After this, I created a launch template in the EC2 dashboard to use the new AMI and set a few specific options. First, I changed the instance type. I want to try out some of the new(ish) t3a instance types. Second, I disabled public IP address assignment on the instances launched by this template. The only access to these instances should be through the load balancer I’ll create later. Lastly, I duplicated the existing security group applied to the staging instance (the one I created the AMI from above) and removed SSH access. I won’t need to log in to the production instances, so this reduces the potential attack surface a bit.
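
As a rough CLI equivalent (the AMI and security group IDs are placeholders, and t3a.small is just an example of the t3a family):

```bash
# Launch template: custom AMI, t3a instance, no public IP, locked-down web SG
aws ec2 create-launch-template \
  --launch-template-name web-template \
  --launch-template-data '{
    "ImageId": "ami-0123456789abcdef0",
    "InstanceType": "t3a.small",
    "NetworkInterfaces": [{
      "DeviceIndex": 0,
      "AssociatePublicIpAddress": false,
      "Groups": ["sg-0web0000000000000"]
    }]
  }'
```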

Next, I created a classic load balancer. Gave it the same security group as above and changed the health check to ping the HTTP port of the web instances and look for a specific URL. When the scaling group I’ll create next scales up, it’ll add the new instance to the load balancer and once HTTP becomes active and the URL I’ve given to it is reachable, the load balancer will start sending traffic to the instance.
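
Done in the console in my case, but the same thing from the CLI looks roughly like this; the health check path, subnets and security group are placeholders.

```bash
# Classic load balancer listening on HTTP across the three public subnets
aws elb create-load-balancer \
  --load-balancer-name WebLB \
  --listeners "Protocol=HTTP,LoadBalancerPort=80,InstanceProtocol=HTTP,InstancePort=80" \
  --subnets subnet-aaa111 subnet-bbb222 subnet-ccc333 \
  --security-groups sg-0elb0000000000000

# Health check against a specific URL on the web instances
aws elb configure-health-check \
  --load-balancer-name WebLB \
  --health-check Target=HTTP:80/index.html,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
```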

I’m not adding instances to the load balancer right now because, well, there aren’t any. When I create the auto scaling group next, I’ll point it at the load balancer and it’ll handle the registration and deregistration of instances.

For the auto scaling group, I decided to try out the new (and improved?) method of first creating a launch template and applying that to an auto scaling group. In theory, I can version launch templates if I decide to change instance type, image or other options so I can swap out instances with less effort. Also, it’s another new thing to play with. Mostly that.

Hopefully the above isn’t too much of an eye chart. I’ve given the ASG a name, picked my launch template version from the list (you can also choose ‘Latest’, which always takes the newest) and defined my VPC and subnets. As I mentioned above, I’ve linked the load balancer to the ASG so all the load balancing wizardry will be taken care of.
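
For the terminally console-averse, the equivalent CLI call is sketched below; the group name, subnets and sizes are placeholders, and the ELB health check type means the ASG replaces instances the load balancer marks as unhealthy.

```bash
# Auto scaling group using the launch template and the classic load balancer
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template "LaunchTemplateName=web-template,Version=\$Latest" \
  --min-size 1 --max-size 3 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaa111,subnet-bbb222,subnet-ccc333" \
  --load-balancer-names WebLB \
  --health-check-type ELB \
  --health-check-grace-period 300
```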

For the scaling policy, I tried to strike a balance between being a cheapskate and having unavailable websites. Usually, scaling at about 40% usage is good for availability. Up to 70% is good for cost. So I picked a number in between.
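
Expressed as a target tracking policy, that looks roughly like the below. The 55% figure is just an example of ‘a number in between’, not a recommendation.

```bash
# Scale the group to keep average CPU around the target value
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu-target-55 \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 55.0
  }'
```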

For notifications, I want to know when a scale out, scale in or instance failure occurs. I created a new SNS topic here and gave it my email address. I might change this out later on to log elsewhere, depending on how spammy it gets.
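
The CLI version of that wiring, with a placeholder topic name and email address:

```bash
# SNS topic with an email subscription (confirm the subscription from your inbox)
TOPIC_ARN=$(aws sns create-topic --name web-asg-events --query TopicArn --output text)
aws sns subscribe --topic-arn "$TOPIC_ARN" --protocol email --notification-endpoint me@example.com

# Tell the ASG to publish scale out/in and failure events to the topic
aws autoscaling put-notification-configuration \
  --auto-scaling-group-name web-asg \
  --topic-arn "$TOPIC_ARN" \
  --notification-types \
    "autoscaling:EC2_INSTANCE_LAUNCH" \
    "autoscaling:EC2_INSTANCE_TERMINATE" \
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR" \
    "autoscaling:EC2_INSTANCE_TERMINATE_ERROR"
```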

I finished creating the ASG and it fired up the first instance. Looking at the load balancer, the new instance was out of service and so the load balancer never kicked in. A few minutes of head scratching later, I realised I’d been caught again by my old friend the security group. Extra points if you can guess what I’ve done without rereading the entire post.

No? When I duplicated the security group, I forgot to add an inbound NFS rule for the new group on the security group attached to the EFS file system. So instances launched with the new security group couldn’t mount EFS. With that little issue resolved, I relaunched the instances (deleted them, the ASG took care of launching new ones) and hey presto, instance in service.

I also verified that instances were launching in separate availability zones by temporarily scaling the group to 3 instances.

Finally, because WebLB-1855773837.eu-central-1.elb.amazonaws.com doesn’t exactly roll off the tongue to advertise as your website address, I created a record set in Route53 to point to the load balancer.
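
As a sketch, the Route53 change looks like this. The hosted zone ID, domain, and the ELB’s alias hosted zone ID are placeholders; the latter comes from the CanonicalHostedZoneNameID field of `aws elb describe-load-balancers`.

```bash
# Alias A record pointing the friendly hostname at the classic load balancer
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "www.example.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "ZELBZONEEXAMPLE",
          "DNSName": "WebLB-1855773837.eu-central-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'
```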

Database

This part is going to be a little anticlimactic after what I’ve just done above. For the database, I launched a small Ubuntu instance with an IAM role to allow it to interact with an S3 bucket I’m going to use in the next step to send database backups to. For those unfamiliar with IAM, it’s basically access & permissions management. Like the web instance, the DB instance is internal only. It does not have a public IP address.

I configured all the basics, installed and configured MySQL, then did some security hardening. Nothing too taxing. I also applied a new security group to the instance that only permits MySQL traffic to and from the web instance(s).
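
That security group boils down to a single rule, sketched below with placeholder group IDs for the DB and web security groups.

```bash
# Allow MySQL (TCP 3306) into the DB security group from the web instances' group only
aws ec2 authorize-security-group-ingress \
  --group-id sg-0db00000000000000 \
  --protocol tcp \
  --port 3306 \
  --source-group sg-0web0000000000000
```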

I could have gone crazy here and gone with Amazon’s RDS to host the MySQL databases. That’d give me all kinds of backup, availability and scalability options out of the box. Nice option to have, but complete overkill for me. It’d also be a hell of a lot more expensive than running a small Ubuntu instance. In the future, if I find I’m getting bottlenecks on the database instance or the MySQL service is complaining about resources, I’ll just take a short period of downtime, shut down and snapshot the instance, then launch a larger instance from that snapshot.

Backup and monitoring

My requirements here are easily satisfied. My databases are backed up by a very simple bash script I wrote to regularly dump the databases to files, then move those files to an encrypted S3 bucket. The S3 bucket is lifecycled so only the last 10 backups are kept. My rate of change is low, so I might even change that to store fewer files. The files are tiny, so it’s not really costing me anything to store them.
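
This isn’t my actual script, but the general shape of it is something like the sketch below. It assumes MySQL credentials live in ~/.my.cnf, the instance’s IAM role covers S3 access, and the bucket name is a placeholder; run it from cron at whatever interval suits.

```bash
#!/usr/bin/env bash
# Dump each non-system database, compress it and ship it to the backup bucket
set -euo pipefail

STAMP=$(date +%Y%m%d-%H%M)
BACKUP_DIR=/tmp/db-backups
BUCKET=s3://example-db-backups

mkdir -p "$BACKUP_DIR"

for DB in $(mysql -N -e 'SHOW DATABASES' | grep -Ev '^(information_schema|performance_schema|mysql|sys)$'); do
  mysqldump --single-transaction "$DB" | gzip > "$BACKUP_DIR/${DB}-${STAMP}.sql.gz"
  aws s3 cp "$BACKUP_DIR/${DB}-${STAMP}.sql.gz" "$BUCKET/${DB}/" --sse AES256
done

rm -rf "$BACKUP_DIR"
```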

The web assets are also on an S3 bucket, being stored in the standard storage tier. That has something crazy like eleven 9’s of durability so I’m not overly worried about anything disappearing. Even if I mistakenly delete something, I can easily upload it again from my PC/laptop/phone/ipad. I could enable versioning on the bucket if I was super paranoid.

The web code is stored in a private repo on Github. I briefly thought about moving it to CodeCommit on AWS but came to the conclusion that it was more trouble than it’s worth.

Monitoring is also pretty relaxed. With the new auto scaling group, I’m notified when a scale up/down occurs or if an instance has failed to launch or terminate. That’s useful because I know immediately if I’m getting a spike in traffic or if something has gone horribly wrong. I’m also notified every time a database backup finishes successfully. All of this is using the Simple Notification Service (SNS) and reports are sent to my email address.

I take a look at CloudWatch every now and then if I want to see a bit more about what’s going on with the instances or if I want to keep an eye on my S3 usage. I’m not currently overly fascinated with creating custom metrics or turning on enhanced monitoring. As before, that’d be overkill. I may create an alarm within CloudWatch to monitor if the healthy instance number in the load balancer goes to zero and report it to an SNS topic. That way I’d know if all instances are down and/or if they just stop serving traffic for some reason.
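
If I do get around to that alarm, it’d be something like the sketch below, with the SNS topic ARN as a placeholder and the classic ELB’s HealthyHostCount metric doing the watching.

```bash
# Alarm when the load balancer reports zero healthy instances for a minute
aws cloudwatch put-metric-alarm \
  --alarm-name weblb-no-healthy-hosts \
  --namespace AWS/ELB \
  --metric-name HealthyHostCount \
  --dimensions Name=LoadBalancerName,Value=WebLB \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:eu-central-1:111122223333:web-asg-events
```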

Wrap up

I chose to set everything up manually for a few reasons. Chief among those is my love of knowing how everything is put together in case/when I need to replace or upgrade a bit of it. I realise I could have clicked a couple of buttons, saved myself some time and just got this all going on Elastic Beanstalk. I love EB for ease of setup. For working with several environments. For dev, staging and production. Blue/Green deployments. CI/CD. All that stuff. I know all the buzzwords. That being said, if I needed repeatable environments I think I’d just go the infrastructure as code route and use CloudFormation or Terraform to manage the environments outside of anything that could be potentially restrictive, like Elastic Beanstalk.

I could have also containerised the web instance and run it on EKS. I could have run it on ECS. As with many things, there are many ways to achieve the same goal. I’d like to think I took something simple that was approaching its limits and scaled it without transforming it or complicating it too much. I also added a dev/staging server that I can power on and work on the code without interrupting production. Additionally, it’s all properly version controlled now. It’s also possibly a bit more secure.

It’s now possible for me to do OS patching and upgrading without taking the site down. I can introduce a new custom built AMI into the mix and do a rolling redeploy of the web instances without downtime. I can probably do a lot more, I just haven’t really thought about it yet.