VMware Cloud on AWS: From Zero to TKG

Gilles Chekroun
Lead VMware Cloud on AWS Solutions Architect
With the recent release of Tanzu Kubernetes Grid (aka TKG), the updated preview from William Lam and the excellent post from Alex Dess, I wanted to reuse the Terraform work from my previous blogs (here and here) and try to automate the complete deployment: creating the SDDC, configuring the NSX-T networking and security, and deploying the TKG clusters.
I also want to give credit to Tom Schwaller for helping me through various traps in this whole process.

Terraform + Ansible = buddies

In this exercise, I will use Terraform to deploy the VMware Cloud on AWS infrastructure and Ansible to configure and deploy the TKG clusters.

Recap on TKG+ on VMC

There are many posts about TKG. The short description is that Tanzu Kubernetes Grid leverages Cluster API to bring declarative statements to the creation, configuration and management of Kubernetes clusters.

VMware Tanzu Kubernetes Grid Plus on VMware Cloud on AWS enables you to deploy your SDDC in the cloud, with all the required components needed to architect and scale Kubernetes to fit your needs.

Software Setup

My setup is the following:
- venv virtual environment with Python 3
- MacBook as local host to run Terraform and Ansible playbooks

(vmc)$ terraform version
Terraform v0.12.24 

(vmc)$ ansible --version
ansible 2.9.6

(vmc)$ python --version
Python 3.6.5

Lab Setup

VMware Cloud on AWS SDDC

  • Deployed using Terraform VMC provider.

Attached VPC

  • Deployed using Terraform AWS provider

EC2 as TKG CLI Host

  • Using the official TKG AMIs with Kubernetes installed (available here); the AMI IDs are also coded in variables.tf

S3 hosting all TKG Binaries

  • Simplest way to host our TKG OVAs (download from here) and use GOVC to deploy templates in VMC vCenter

Step 1

Credentials and bash script

I decided to export my credentials to my ENV variables for a few reasons:
- My AWS console is now controlled internally by VMware, which rotates my AWS access keys and secret keys on a regular basis
- I need to get these variables to my EC2 with Ansible so it can sync the content of the S3 bucket using the AWS CLI
- Terraform can also import variables if they are in the format TF_VAR_xxxx
- It is easy to export variables in a shell script
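
Since the whole deployment relies on these exported variables, a small pre-flight check can catch a value that was left empty before Terraform runs. This is only a sketch: the helper name require_vars is hypothetical and not part of my scripts, and the variable names in the usage comment are the ones deploy-lab.sh exports.

```shell
# Hypothetical helper (not part of deploy-lab.sh): report every named
# environment variable that is unset or empty; return non-zero if any is.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup: read the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before running terraform apply:
# require_vars TF_VAR_my_org_id TF_VAR_vmc_token TF_VAR_AWS_account \
#              TF_VAR_access_key TF_VAR_secret_key || exit 1
```

The check prints only variable names, never their values, so it is safe to use with the AWS secret key.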

To make things easy, I have a deploy-lab.sh script that will prompt for all the variables needed. This script actually deploys the complete environment.

#!/usr/bin/env bash

echo -e "\033[1m"   #Bold ON
echo " ==========================="
echo "    TKG on VMC deployment"
echo " ==========================="
echo "===== Credentials ============="
echo -e "\033[0m"   #Bold OFF

# DEF_ORG_ID, DEF_TOKEN, ACCOUNT, ACCESS and SECRET are expected to be set beforehand.
# Pressing Enter at a prompt keeps the displayed default.
read -p "Enter your ORG ID (long format) [default=$DEF_ORG_ID]: " TF_VAR_my_org_id
export TF_VAR_my_org_id=${TF_VAR_my_org_id:-$DEF_ORG_ID}
echo ".....Exporting $TF_VAR_my_org_id"
echo ""
read -p "Enter your VMC API token [default=$DEF_TOKEN]: " TF_VAR_vmc_token
export TF_VAR_vmc_token=${TF_VAR_vmc_token:-$DEF_TOKEN}
echo ".....Exporting $TF_VAR_vmc_token"
echo ""

read -p "Enter your AWS Account [default=$ACCOUNT]: " TF_VAR_AWS_account
export TF_VAR_AWS_account=${TF_VAR_AWS_account:-$ACCOUNT}
echo ".....Exporting $TF_VAR_AWS_account"
echo ""

read -p "Enter your AWS Access Key [default=$ACCESS]: " TF_VAR_access_key
export TF_VAR_access_key=${TF_VAR_access_key:-$ACCESS}
echo ".....Exporting $TF_VAR_access_key"
echo ""

read -p "Enter your AWS Secret Key [default=$SECRET]: " TF_VAR_secret_key
export TF_VAR_secret_key=${TF_VAR_secret_key:-$SECRET}
echo ".....Exporting $TF_VAR_secret_key"
echo ""

echo ""

echo -e "\033[1m"   #Bold ON
echo "===== PHASE 1: Creating SDDC ==========="
echo -e "\033[0m"   #Bold OFF
cd ./p1/main
terraform apply
cd ../../
export TF_VAR_host=$(terraform output -state=./phase1.tfstate proxy_url)

read  -p $'Press enter to continue (^C to stop)...\n'
cd ./p2/main

echo -e "\033[1m"   #Bold ON
echo "===== PHASE 2: Networking and Security ==========="
echo -e "\033[0m"   #Bold OFF
echo ".....Importing CGW and MGW into Terraform phase2."

if [[ ! -f ../../phase2.tfstate ]]
then
  echo "Importing . . . . ."
  terraform import -lock=false module.NSX.nsxt_policy_gateway_policy.mgw mgw/default
  terraform import -lock=false module.NSX.nsxt_policy_gateway_policy.cgw cgw/default
else
  echo ".....CGW, MGW already imported."
fi
terraform apply
echo ""

read  -p $'Press enter to continue (^C to stop)...\n'
echo -e "\033[1m"   #Bold ON
echo "===== Ansible will prepare the TKG environment ==========="
echo -e "\033[0m"   #Bold OFF
cd ../../ansible/playbooks

echo "====== 1) Gathering Terraform outputs ========"
ansible-playbook ./10-terraform-info.yaml

echo "====== 2) Open Terminal window ========"
ansible-playbook ./11-open_terminal.yaml

echo "====== 3) Prepare EC2 ========"
ansible-playbook ./12-copy_files_to_EC2.yaml

echo "====== 4) Deploy templates in vCenter ========"
ansible-playbook ./13-deploy_templates.yaml

echo "====== 5) Deploy TKG Clusters ========"
ansible-playbook ./14-Deploy_TKG_clusters.yaml
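
The script above runs the five playbooks one after the other, whatever the result of the previous one. If you prefer to stop at the first failure, a small wrapper could look like the sketch below. The function name run_playbooks and the RUNNER override are hypothetical additions of mine, not part of deploy-lab.sh.

```shell
# Hypothetical wrapper: run each playbook in order and stop at the first
# failure. RUNNER defaults to ansible-playbook and can be overridden,
# for example with a stub for a dry run.
run_playbooks() {
  runner=${RUNNER:-ansible-playbook}
  for playbook in "$@"; do
    echo "====== $playbook ======"
    if ! "$runner" "$playbook"; then
      echo "Stopped: $playbook failed" >&2
      return 1
    fi
  done
}

# Usage:
# run_playbooks ./10-terraform-info.yaml ./11-open_terminal.yaml \
#               ./12-copy_files_to_EC2.yaml ./13-deploy_templates.yaml \
#               ./14-Deploy_TKG_clusters.yaml
```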

Terraform variables

The variables.tf file will contain important parameters to set BEFORE we can start anything.
variable "AWS_region"     {default = "eu-central-1"}
variable "TKG_net_name"   {default = "tkg-network"}
variable "TKG_photon"     {default = "photon-3-v1.17.3_vmware.2"}
variable "TKG_haproxy"    {default = "photon-3-capv-haproxy-v0.6.3_vmware.1"}
variable "TKG_EC2"        {default = "tkg-linux-amd64-v1.0.0_vmware.1"}
variable "TKG_S3_bucket"  {default = "set-tkg-ova"}
Note that the file names for photon, haproxy and EC2 have NO EXTENSIONS.
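
As a small illustration of that note: the version strings themselves contain dots, so only the known suffixes should be removed when going from a downloaded file name to the value in variables.tf. The helper name artifact_name is hypothetical; the .ova and .gz suffixes are the ones used later in this post.

```shell
# Hypothetical helper: turn a downloaded file name into the extension-free
# value expected in variables.tf. Only .ova and .gz are stripped, because
# the version strings themselves contain dots.
artifact_name() {
  case "$1" in
    *.ova) echo "${1%.ova}" ;;
    *.gz)  echo "${1%.gz}" ;;
    *)     echo "$1" ;;
  esac
}

artifact_name "photon-3-v1.17.3_vmware.2.ova"       # prints photon-3-v1.17.3_vmware.2
artifact_name "tkg-linux-amd64-v1.0.0_vmware.1.gz"  # prints tkg-linux-amd64-v1.0.0_vmware.1
```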

Step 2

Terraform Phase 1

In this lab I will use 2 phases:
- Phase 1 for:
  • Deploying the AWS attached VPC with a subnet and an EC2 that will be our TKG CLI host.
  • Deploying a 1 node SDDC
- Phase 2 for:
  • Configuring all NSX-T segments, groups and firewall rules needed for TKG
First we need to compile the Terraform VMC provider from the source.
- Create a tmp directory and execute:
git clone https://github.com/terraform-providers/terraform-provider-vmc/
cd terraform-provider-vmc/
go get
go build -o terraform-provider-vmc
chmod 755 terraform-provider-vmc
Place the compiled binary in the main terraform directory for phase 1 and do:
rm -rf .terraform

terraform init
The most important part in phase 1 is the terraform output.
We will export all output parameters in JSON format and use them for our ansible playbooks as well.
The outputs appear at the end of terraform apply in the following format:

GOVC_vc_url = https://vcenter.sddc-3-127-179-50.vmwarevmc.com/sdk
SDDC_mgmt =
TKG_DNS = ec2-3-125-18-211.eu-central-1.compute.amazonaws.com
TKG_EC2 = tkg-linux-amd64-v1.0.0_vmware.1
TKG_S3_bucket = set-tkg-ova
TKG_haproxy = photon-3-capv-haproxy-v0.6.3_vmware.1
TKG_net_name = tkg-network
TKG_photon = photon-3-v1.17.3_vmware.2
cloud_password = <sensitive>
cloud_username = cloudadmin@vmc.local
key_pair = keypair
proxy_url = nsx-3-127-179-50.rp.vmwarevmc.com/vmc/reverse-proxy/api/orgs/84e84f83-bb0e-4e12-9fe0-aaf3a4efcd87/sddcs/a4565d5c-1d34-42e9-95c6-07ec52870510
vc_url = vcenter.sddc-3-127-179-50.vmwarevmc.com
They can be converted to JSON with a simple command:
terraform output -state=../../phase1.tfstate -json > outputs.json
and we get the outputs.json file in the format:
{
  "GOVC_vc_url": {
    "sensitive": false,
    "type": "string",
    "value": "https://vcenter.sddc-3-127-179-50.vmwarevmc.com/sdk"
  },
  "SDDC_mgmt": {
    "sensitive": false,
    "type": "string",
    "value": ""
  },
  "TKG_DNS": {
    "sensitive": false,
    "type": "string",
    "value": "ec2-3-122-115-96.eu-central-1.compute.amazonaws.com"
  },
  "TKG_EC2": {
    "sensitive": false,
    "type": "string",
    "value": "tkg-linux-amd64-v1.0.0_vmware.1"
  },

etc. . . 
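
Each output is wrapped in an object, so the actual setting always sits under the "value" key. Here is a minimal sketch of how a single value can be pulled out on the command line, assuming jq is installed where you run it. The helper name tf_output is hypothetical.

```shell
# Hypothetical helper: print the "value" field of one Terraform output.
# Assumes jq is installed and the file has the layout shown above.
tf_output() {
  jq -r --arg key "$2" '.[$key].value' "$1"
}

# Usage:
# tf_output outputs.json TKG_net_name
```

This is the same reason the playbooks later refer to variables as TKG_S3_bucket.value or TKG_EC2.value.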

The TKG EC2 instance

I am using the AMI provided by VMware for every AWS region and this includes Kubernetes already. I just need to add docker and we are good to go.
Since I want to use this EC2 to provision the TKG templates in my vCenter, I will also install GOVC and JQ (I like JQ).
To do that, at EC2 instance creation, I can supply a "user-data.ini" script that is executed at first boot.
I am using a t2.medium instance with 20GB of disk.
resource "aws_network_interface" "TKG-Eth0" {
  subnet_id                     = var.Subnet10-vpc1
  security_groups               = [var.GC-SG-VPC1]
  private_ips                   = [cidrhost(var.Subnet10-vpc1-base, 200)]
}

resource "aws_instance" "TKG" {
  ami                           = var.TKG-AMI[var.AWS_region]
  instance_type                 = "t2.medium"
  root_block_device {
    volume_type                 = "gp2"
    volume_size                 = 20
    delete_on_termination       = true
  }
  network_interface {
    network_interface_id        = aws_network_interface.TKG-Eth0.id
    device_index                = 0
  }
  key_name                      = var.key_pair[var.AWS_region]
  user_data                     = file("${path.module}/user-data.ini")

  tags = {
    Name = "GC-TKG-vpc1"
  }
}
The user-data looks like:
#!/bin/bash
# Runs as root at first boot of the EC2 instance.
sudo yum update -y
# Install govc (the govmomi CLI) to drive vCenter from this host
wget https://github.com/vmware/govmomi/releases/download/v0.22.1/govc_linux_amd64.gz
gunzip govc_linux_amd64.gz
mv govc_linux_amd64 govc
sudo chown root govc
sudo chmod 755 govc
sudo mv govc /usr/bin/.
sudo yum install jq -y
# Install and start docker, and let ec2-user use it without sudo
sudo amazon-linux-extras install docker -y
sudo service docker start
sudo groupadd docker
sudo usermod -aG docker ec2-user
sudo chmod 666 /var/run/docker.sock

Terraform Phase 2

Before we can start with phase 2, we need to compile the NSX-T Terraform provider from source, as we did for the VMC provider.
- Create a tmp directory and execute:
git clone https://github.com/terraform-providers/terraform-provider-nsxt/
cd terraform-provider-nsxt/
go get
go build -o terraform-provider-nsxt
chmod 755 terraform-provider-nsxt
Place the compiled binary in the main terraform directory for phase 2 and do:
rm -rf .terraform
terraform init
Define the TKG network to be created by the NSX module. It must be DHCP enabled, with enough IP addresses for our cluster deployments:
# Subnets IP ranges
variable "VMC_subnets" {
  default = {
    TKG_net             = ""
    TKG_net_gw          = ""
    TKG_net_dhcp        = ""
  }
}
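
To get a feeling for "enough IP addresses", here is a rough sketch that counts the addresses in a DHCP range and compares the result with the number of VMs you plan to deploy. The helper names and the addresses are made up for illustration and are not the values of my lab. The small cluster created at the end of this post uses 6 VMs (4 workers, 1 control plane, 1 load balancer), and the management cluster needs addresses too.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Number of addresses in the inclusive range FIRST..LAST.
range_size() {
  echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# Succeed when the range FIRST..LAST holds at least NEEDED addresses.
range_fits() {
  [ "$(range_size "$1" "$2")" -ge "$3" ]
}

# Example with made-up addresses and the 6 VMs of the small cluster.
range_size 192.168.2.10 192.168.2.100    # prints 91
range_fits 192.168.2.10 192.168.2.100 6 && echo "range is large enough"
```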
Once that's done, we need to import the SDDC NSX-T components into Terraform since VMC is a pre-built architecture.
To do that we run:

terraform import -lock=false module.NSX.nsxt_policy_gateway_policy.mgw mgw/default
terraform import -lock=false module.NSX.nsxt_policy_gateway_policy.cgw cgw/default
only once, and only if we have NO Terraform state file for Phase 2 (deploy-lab.sh takes care of that).
Then, the NSX module will create the TKG segment, the Management Gateway rules and the Compute Gateway rules.

Step 3

Ansible setup

Disclaimer - I am not an expert in Ansible and I am sure there are better ways to achieve what I want but so far, I am happy with what I did ;)


Here we have a super simple environment that consists of 2 hosts:
- My MacBook as localhost
- The EC2 instance we want to configure
Since the EC2 is a dynamic resource and will have a dynamic public IP, I add it to my inventory within my playbooks.
There are other ways to do that, like using the ec2.py dynamic inventory script, which returns a list of instances, but here I only have one instance.


1) Get the terraform output variables

Nothing special here

2) Open a terminal window

Cool command using Mac osascript:
osascript -e 'tell app "Terminal" to do script "ssh -oStrictHostKeyChecking=no -i '{{aws_dir}}{{key_pair.value}}.pem' ec2-user@{{ TKG_IP.value }}"'

3) Copy files to EC2

I will need the json outputs, some shell scripts and ini files

4) Sync my S3 bucket and deploy the templates

This one is a bit tricky.
The AWS CLI comes preinstalled on Amazon Linux EC2 instances but needs the AWS credentials.
Since I did not want to transfer my credentials (I could, but...), I look up the AWS access and secret keys in my local ENV variables and pass them to the EC2 with the "environment:" keyword.
# =================================================
#   Sync templates from S3 and deploy
# =================================================
- name: Get templates from S3
  hosts: TKG_EC2
  gather_facts: true
  vars_files:
    - ../credentials.yaml
    - ../outputs.json
  environment:
    AWS_ACCESS_KEY_ID:  "{{ lookup('env','TF_VAR_access_key') }}"
    AWS_SECRET_ACCESS_KEY:  "{{ lookup('env','TF_VAR_secret_key') }}"
  tasks:
    - name: Sync templates from S3 using the AWS CLI, deploy in vCenter and install TKG binaries
      shell: |
        aws s3 sync s3://{{ TKG_S3_bucket.value }} .
        gunzip "{{ TKG_EC2.value }}".gz
        sudo mv "{{ TKG_EC2.value }}" /usr/bin/tkg
        sudo chmod 755 /usr/bin/tkg
        tkg get mc
The next step is syncing my S3 bucket and deploying my templates.
In that script, I check whether my TKG folder and Resource Pool exist, and I use GOVC for the template deployment.
Tom told me that it's better to deploy a VM, take a snapshot and mark it as a template. This will be faster for future cloning! (Thanks Tom)
govc import.spec ${PHOTON}.ova | jq ".Name=\"$PHOTON\"" | jq ".NetworkMapping[0].Network=\"$NETWORK\"" > ${PHOTON}.json
govc import.ova -dc="SDDC-Datacenter" -ds="WorkloadDatastore" -pool="Compute-ResourcePool" -folder="Templates" -options=${PHOTON}.json ${PHOTON}.ova
govc snapshot.create -vm ${PHOTON} root
govc vm.markastemplate ${PHOTON}
The first line imports the OVA specs and adds the proper TKG network name.
The second line imports the OVA into the "Templates" folder using the imported specs.
The third line creates a snapshot.
And the last line marks it as a template.
After the templates deployment, I simply unzip the TKG CLI binary, place it in /usr/bin and mark it executable.
The command tkg get mc creates the .tkg directory but returns an empty list of management clusters.
The last script, "config.sh", appends a few variables that are specific to our environment to the config.ini file to create a config.yaml file for the management cluster deployment.
The variables are:
VSPHERE_NETWORK: tkg-network
VSPHERE_SERVER: vcenter.sddc-3-127-179-50.vmwarevmc.com
VSPHERE_USERNAME: cloudadmin@vmc.local

VSPHERE_PASSWORD: <encoded:xxxxxxxxxx>
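
The idea behind config.sh can be sketched in a few lines: copy the static settings, then append the environment-specific KEY: value lines. This is not my actual script, and the function name build_config is hypothetical. The values in the usage comment are the ones listed above.

```shell
# Sketch of the idea behind config.sh (not the actual script):
# copy the static settings, then append environment-specific lines.
build_config() {
  template=$1   # e.g. config.ini
  target=$2     # e.g. config.yaml
  shift 2
  cp "$template" "$target"
  # The remaining arguments are KEY=value pairs, written as "KEY: value".
  for pair in "$@"; do
    printf '%s: %s\n' "${pair%%=*}" "${pair#*=}" >> "$target"
  done
}

# Usage:
# build_config config.ini config.yaml \
#   VSPHERE_NETWORK=tkg-network \
#   VSPHERE_SERVER=vcenter.sddc-3-127-179-50.vmwarevmc.com \
#   VSPHERE_USERNAME=cloudadmin@vmc.local
```

The sketch assumes the template file ends with a newline, otherwise the first appended variable would land on the last template line.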

5) Finally, deploy a TKG management cluster and a small cluster

    - name: Ensure docker daemon is running
      service:    # module name assumed; any service-management module works here
        name: docker
        state: started
      become: true
    - name: Deploy TKG clusters - this task will take 20 mins - check VMC vCenter
      shell: |
        yes | tkg init --infrastructure=vsphere
        tkg create cluster --worker-machine-count=4 --plan=dev tkg-cluster-01
The "yes | tkg" will bypass the warning: You are about to provision a Kubernetes cluster on a vSphere 7.0 cluster that has not been optimized for Kubernetes.
The last line creates a small cluster called "tkg-cluster-01" with 4 worker nodes, 1 control plane and 1 load balancer. The --plan=dev can be changed to prod for a more substantial environment.


Complete code in my GitHub here.

Thanks for reading.


