<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>boche.net &#8211; VMware vEvangelist</title>
	<atom:link href="http://www.boche.net/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://www.boche.net/blog</link>
	<description></description>
	<lastBuildDate>Thu, 25 Jul 2024 04:03:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>Sysprep Fails after Installing Latest Windows Patches</title>
		<link>http://www.boche.net/blog/index.php/2024/02/08/sysprep-fails-after-installing-latest-windows-patches/</link>
					<comments>http://www.boche.net/blog/index.php/2024/02/08/sysprep-fails-after-installing-latest-windows-patches/#respond</comments>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Thu, 08 Feb 2024 20:35:15 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[VMware Tools]]></category>
		<category><![CDATA[vSphere]]></category>
		<category><![CDATA[Windows]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5446</guid>

					<description><![CDATA[After returning to the lab from Christmas 2023 break, I noticed that a significant number of the Microsoft Windows Server 2022 virtual machines I was deploying from template were failing to complete the guest customization process as defined by a VM Customization Specification &#8211; one that I had been using for quite a long time. [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>After returning to the lab from Christmas 2023 break, I noticed that a significant number of the Microsoft Windows Server 2022 virtual machines I was deploying from template were failing to complete the guest customization process as defined by a VM Customization Specification &#8211; one that I had been using for quite a long time. Symptoms were the guests were not assuming their newly assigned computer name and were not joining the Active Directory domain in the lab. After verifying the health of the Active Directory domain, I suspected the sysprep process was failing at some point.</p>



<p>I hadn&#8217;t looked at sysprep log files in eons. <a href="https://kb.vmware.com/s/article/2001932">VMware KB2001932</a> Locations of sysprep log files reveals their location. For Windows Server 2022, that&#8217;s going to be the <strong>setupact.log</strong> and <strong>setuperr.log</strong> files located in  <strong>C:\Windows\System32\Sysprep\Panther\</strong>.</p>
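<p>If you copy those logs over to a Linux admin box for review, a quick filter surfaces the failures. Only the log file names come from KB2001932; the helper below is an illustrative grep wrapper of my own, not VMware tooling:</p>

```shell
# Illustrative helper: pull error/failure lines out of a copied setupact.log
# or setuperr.log. The file names come from VMware KB2001932; the function
# name and pattern are my own.
sysprep_errors() {
  grep -iE 'error|fail' "$1"
}

# Example usage, assuming the log was copied to the current directory:
# sysprep_errors setuperr.log
```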



<p>What I found is that sysprep was being invoked and was running. However, without getting into specifics, a critical portion of it was failing and not being executed. Not surprisingly, this is the reason the computer name wasn&#8217;t being changed, the computer wasn&#8217;t being joined to the Active Directory domain, the time zone wasn&#8217;t being set, etc.</p>



<p>But what was the cause? Why the recent change in behavior, which had been rock solid since&#8230; well, forever? I knew of one recent change: a batch of Microsoft Windows updates had been applied to the template:</p>



<ul>
<li>Security Update for Microsoft Windows (KB5034129)
<ul>
<li>Release Date 1/9/2024</li>



<li>Long list of security updates</li>
</ul>
</li>



<li>Servicing Stack 10.0.20348.2200
<ul>
<li>Confusion as to what this specific version is, when it was released, and what it contains</li>
</ul>
</li>



<li>Update for Microsoft Windows (KB5033914)
<ul>
<li>Release Date 1/9/2024</li>



<li>Cumulative Update for .NET Framework 3.5 and 4.8 for Windows Server 2022</li>
</ul>
</li>
</ul>



<p>Network rules in this particular lab environment require patching and provide no allowance for running unpatched Windows operating systems.</p>



<p>I happened along <a href="https://kb.vmware.com/s/article/82615">VMware KB82615</a> Sysprep Fails for Linked Clones after Installing Latest Windows Patches. Although it addresses VMware Horizon 7 and 8 linked clones only, there was a hope and a prayer as I read through it.</p>



<blockquote class="wp-block-quote">
<p>Cause:</p>



<p>The problem tends to manifest when installing the latest windows patches. New behaviour seen after installing the new patches, the guestOS-respecialize operation is triggering the reboot operation when it detects the hardware. The triggered reboot operation is interrupting the Horizon Sysprep customization operation and that is causing an issue.</p>
</blockquote>



<p>Not sure if this is my specific case but it&#8217;s a lab workload and I&#8217;m willing to attempt the workaround.</p>



<blockquote class="wp-block-quote">
<p>Workaround:</p>



<p>Add in a delayed start to tools to allow a grace period for the guestOS-respecialize process to complete<br>The delay interval may vary based on the GuestOs-Image and the interval setting can be modified using registry settings.&nbsp;These registry changes can be set in the parent&nbsp;VM and provisioned out as as a fresh snapshot to your impacted pools.</p>



<p>[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\VMTools]<br>&#8220;DelayedAutostart&#8221;=dword:00000001 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&#8212;&#8212;&#8212;&#8212;-&gt;DelayedAutoStart Enabled<br>&#8220;AutoStartDelay&#8221;=dword:00000014 &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&#8212;&#8212;&#8212;&#8212;-&gt;20 secs(in decimal)</p>
</blockquote>



<p>Although VMware Tools takes longer to start, this solved the problem.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.boche.net/blog/index.php/2024/02/08/sysprep-fails-after-installing-latest-windows-patches/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
			</item>
		<item>
		<title>vCenter 8 upgrade and vami_config_net</title>
		<link>http://www.boche.net/blog/index.php/2022/11/07/vcenter-8-upgrade-and-vami_config_net/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Mon, 07 Nov 2022 23:11:42 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[vCenter Server]]></category>
		<category><![CDATA[VMware]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5433</guid>

					<description><![CDATA[Last week, I attempted to upgrade a vCenter Server from version 7.0.3 to 8.0. During the upgrade process, I received an error message along with a resolution: Error: The source appliance FQDN must be the same as the source appliance primary network identifier. Resolution: Change the source appliance FQDN according to the officially supported process [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Last week, I attempted to upgrade a vCenter Server from version 7.0.3 to 8.0. During the upgrade process, I received an error message along with a resolution:</p>



<blockquote class="wp-block-quote">
<p><strong>Error</strong>: The source appliance FQDN must be the same as the source appliance primary network identifier.</p>



<p><strong>Resolution</strong>: Change the source appliance FQDN according to the officially supported process <a href="https://blogs.vmware.com/vsphere/2019/08/changing-your-vcenter-servers-fqdn.html">https://blogs.vmware.com/vsphere/2019/08/changing-your-vcenter-servers-fqdn.html</a></p>
</blockquote>



<p>Somehow, the appliance FQDN was tattooed with a host name in upper case when it needed to be all lower case.</p>



<p>Changing the appliance FQDN to all lower case is fairly straightforward and can be performed using the VAMI to change the Hostname in the Network Settings. However, when clicking the link to edit the Network Settings, after selecting the network adapter, the settings for the selected network adapter are blocked from view with a solid black box. This appears to be a UI bug in the VAMI (the same issue no longer exists in vCenter 8). As a result, the Network Settings cannot be reliably changed using this method.</p>
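<p>Before kicking off the upgrade, it&#8217;s easy to verify the case of the appliance FQDN from a shell. This is a hypothetical pre-check of my own (the example FQDN is made up), not part of the installer:</p>

```shell
# Hypothetical pre-upgrade check: print the lowercased FQDN and return
# non-zero if the current value contains upper case characters -- the
# condition that trips the vCenter 8 upgrade validation. The example
# FQDN below is made up.
fqdn_is_lower() {
  lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  printf '%s\n' "$lower"
  [ "$1" = "$lower" ]
}

fqdn_is_lower "VCENTER01.boche.lab" || echo "FQDN needs to be changed to all lower case"
```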



<p>Fortunately, another method exists to change network settings via command line. It&#8217;s spelled out in Michls Tech Blog <a href="https://michlstechblog.info/blog/vmware-change-network-configuration-of-a-vcenter-appliance-vcsa/">here</a>. In short, SSH into the vCenter Server appliance and run the following command:</p>



<pre class="wp-block-code"><code>/opt/vmware/share/vami/vami_config_net</code></pre>



<p>A menu will be presented with one of the options providing the ability to set the Hostname.</p>



<p>After setting the Hostname, restart the VCSA as well as the upgrade process.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Deploying Amazon EKS Anywhere on vSphere</title>
		<link>http://www.boche.net/blog/index.php/2021/10/13/deploying-amazon-eks-anywhere-on-vsphere/</link>
					<comments>http://www.boche.net/blog/index.php/2021/10/13/deploying-amazon-eks-anywhere-on-vsphere/#comments</comments>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Wed, 13 Oct 2021 07:00:20 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[ESXi]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[vCenter Server]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5399</guid>

					<description><![CDATA[Last month, the general availability of Amazon Elastic Kubernetes Service Anywhere was announced. Much like vSphere with Tanzu (TKGs) and Tanzu Kubernetes Grid (TKGm), EKS Anywhere (open source) is a deployment option for Amazon EKS that enables the deployment of Kubernetes clusters on premises using VMware vSphere 7. I had an opportunity this past week [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Last month, the general availability of Amazon Elastic Kubernetes Service Anywhere was <a href="https://aws.amazon.com/blogs/aws/amazon-eks-anywhere-now-generally-available-to-create-and-manage-kubernetes-clusters-on-premises/">announced</a>. Much like vSphere with Tanzu (TKGs) and Tanzu Kubernetes Grid (TKGm), EKS Anywhere (open source) is a deployment option for Amazon EKS that enables the deployment of Kubernetes clusters on premises using VMware vSphere 7.</p>



<p>I had an opportunity this past week to install EKS Anywhere in two different lab environments. Having worked with vSphere with Tanzu quite a bit last year, I was excited to see how the two compared. The <a href="https://anywhere.eks.amazonaws.com/docs/">EKS Anywhere documentation</a> covers the requirements, configuration of the administrative machine, as well as the creation of a local or production cluster. I found the documentation to be fairly straightforward. In a perfect world with all steps working correctly, deployment start to finish could take 30 minutes or less. However, I did run into some challenges. With EKS Anywhere basically being brand new (current version 0.5.0), I found there is little to no troubleshooting information available in the community, so I did the best I could and took many notes along the way until I achieved successful and repeatable deployments. In this blog post, I&#8217;ll step through the deployment process, highlight the challenges I encountered, and share the corresponding resolutions or workarounds.</p>



<p>Reminder: I&#8217;m stepping through the <a href="https://anywhere.eks.amazonaws.com/docs/">EKS Anywhere documentation</a>. For the following sections, it may be helpful to have this document open for reference. In addition, I&#8217;m not covering every step. The intent is to bridge some gaps where things didn&#8217;t go so smoothly.</p>






<p><strong>Install EKS Anywhere</strong></p>



<p>The documentation has us start by getting what they call the administrative machine set up. I&#8217;m using <a href="https://ubuntu.com/download/server">Ubuntu Server</a> (Option 2 &#8211; Manual server installation ubuntu-20.04.3-live-server-amd64.iso). Nothing to do with EKS Anywhere but rather three basic Linux tips here:</p>



<ul><li>When installing Ubuntu Server, enable OpenSSH when prompted so that remote SSH access is available later</li><li>After installing Ubuntu Server, install net-tools: <em>sudo apt install net-tools</em></li><li>open-vm-tools will be installed by default. No need to install VMware Tools</li></ul>



<p>Looking at the EKS Anywhere Create Cluster diagram and reading slightly ahead, we know this administrative machine will host the bootstrap cluster which is used to build out the EKS Anywhere control plane and worker nodes on vSphere. So after installing Ubuntu Server, we&#8217;re going to need to install some additional tools.</p>



<p>Install <a href="https://brew.sh/">Homebrew</a> prerequisites:</p>



<p><em>sudo apt update<br>sudo apt-get install build-essential procps curl file git</em></p>



<p>Install <a href="https://brew.sh/">Homebrew</a>:</p>



<p><em>sudo apt update<br>/bin/bash -c &#8220;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)&#8221;</em></p>



<p>When Homebrew installation is complete, be sure to follow the two next steps to add Homebrew to the path so future shell commands are successful. Basically copy and paste the two commands that are provided. For my installation under the administrator created account, this was:</p>



<p><em>echo &#8216;eval &#8220;$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)&#8221;&#8216; &gt;&gt; /home/administrator/.profile<br>eval &#8220;$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)&#8221;</em></p>



<p><a href="https://docs.docker.com/engine/install/ubuntu/">Set up repository and install Docker Engine</a> (I used the Docker repository method):</p>



<p><em>sudo apt-get update<br>sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release<br>curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg &#8211;dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg<br>echo &#8220;deb [arch=$(dpkg &#8211;print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable&#8221; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null<br>sudo apt-get update<br>sudo apt-get install docker-ce docker-ce-cli containerd.io</em></p>



<p>When testing Docker Engine commands (<em>docker run hello-world</em> as an example), you may run into an error similar to:</p>



<p><em>docker: Got permission denied while trying to connect to the Docker daemon socket at unix [redacted]</em></p>



<p>Fix using the following commands (<a href="https://stackoverflow.com/questions/48957195/how-to-fix-docker-got-permission-denied-issue">source</a>, this is also covered in the Troubleshooting section of the EKS Anywhere documentation):</p>



<p><em>sudo groupadd docker</em> (may produce an error that docker group already exists &#8211; ok)<em><br>sudo usermod -aG docker $USER<br>newgrp docker</em> (might be unnecessary &#8211; I performed anyway)</p>



<p>Next up, we install eksctl and eksctl-anywhere using Homebrew. According to the documentation: <em>This package will also install&nbsp;<code>kubectl</code>&nbsp;and the&nbsp;<code>aws-iam-authenticator</code>&nbsp;which will be helpful to test EKS clusters.</em> I found this to be false, for kubectl anyway. Go ahead and follow the instructions:</p>



<p><em>brew install aws/tap/eks-anywhere</em></p>



<p>In my experience, kubectl was nowhere to be found when issuing kubectl commands in the shell. To remedy this, I used Homebrew to install kubectl:</p>



<p><em>brew install kubernetes-cli</em></p>



<p><strong>Update 11/9/21</strong>: AWS Development has confirmed kubectl is not being installed as part of the eks-anywhere Homebrew installer. This is a bug and they are working on a Github update <a href="https://github.com/Homebrew/homebrew-core/issues/89064">here</a> and <a href="https://github.com/aws/homebrew-tap/pull/252">here</a>.</p>
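<p>Until that fix lands, a quick sanity check that everything the documentation expects is actually on the PATH saves a surprise later. A minimal sketch (the helper is my own; the tool list reflects this install):</p>

```shell
# Minimal PATH sanity check for the administrative machine. The helper is
# illustrative; the tool list matches what the EKS Anywhere docs expect
# to be present after the Homebrew and Docker installs above.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
  fi
}

for tool in docker eksctl eksctl-anywhere kubectl; do
  check_tool "$tool"
done
```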



<p>This wraps up the challenges for the administrative machine. Not too bad and there was plenty of community help to get me through this part. I think with a little documentation clean up, there&#8217;d be no surprises here.</p>






<p><strong>Create production cluster</strong></p>



<p>With the administrative machine ready to go, it&#8217;s time to get to the fun stuff &#8211; deploying EKS Anywhere. I didn&#8217;t bother following the instructions to create a local cluster because I wanted EKS Anywhere deployed to vSphere infrastructure. That means skipping ahead to the section Create production cluster.</p>



<p>The Prerequisite Checklist has us create a few objects which might not already exist using the vSphere Client. Namely an arbitrarily named Resource Pool and VM folder. Of course a vCenter Server, a Datacenter, a Datastore, and a Network (portgroup) must also exist. <strong>Note</strong>: Datastore Clusters and Datastores that are a member of a Datastore Cluster will not work with EKS Anywhere deployment &#8211; use a standard block, file, or vVol Datastore (or vSAN). There&#8217;s also one required resource that isn&#8217;t mentioned in the documentation, which I&#8217;ll cover shortly.</p>



<p>After generating the cluster config yaml file and applying it, the administrative machine autonomously goes through a large number of steps to set up the control plane and worker nodes on vSphere. It is this series of steps where I ran into a number of EKS Anywhere related challenges that I had to work through. I&#8217;ll go through each of them in roughly the order that I remember they presented themselves. By the way, any time the autonomous script fails, the process is to clean up what it did and start over by re-running it. Lastly, what I&#8217;m going to continue calling &#8220;the script&#8221; merely means applying the cluster deployment eksa-cluster.yaml.</p>



<p>After validating the vSphere environment, one of the first steps that is performed is the download and templating of the <a href="https://aws.amazon.com/bottlerocket/">Bottlerocket container image</a>. The image is initially stored as a vSphere content library item so a content library named <strong>eks-a-templates</strong> is first created. Then the download and import into the content library occurs. For whatever reason, the download process fails, <strong>a lot</strong>, with the following error:</p>



<p><em>Validation failed {&#8220;validation&#8221;: &#8220;vsphere Provider setup is valid&#8221;, &#8220;error&#8221;: &#8220;failed importing template into library: error importing template: govc: The import of library item 3df0dd7f-2d88-458b-835d-ab83fa6a9107 has failed. Reason: Error transferring file bottlerocket-v1.21.2-eks-d-1-21-4-eks-a-1-amd64.ova to ds:///vmfs/volumes/vvol:afedfe12b3e24fb4-8a0daa5002ac9644//contentlib-d7538b36-1c53-4cba-9f7c-84e78824e456/3df0dd7f-2d88-458b-835d-ab83fa6a9107/bottlerocket-v1.21.2-eks-d-1-21-4-eks-a-1-amd64_fb1094fd-0a42-4ff2-928c-3c0a9cdd30fd.ova?serverId=c2fdf804-b0a0-4bcb-99be-9dab04afa64f. Reason: Error during transfer of ds:///vmfs/volumes/vvol:afedfe12b3e24fb4-8a0daa5002ac9644//contentlib-d7538b36-1c53-4cba-9f7c-84e78824e456/3df0dd7f-2d88-458b-835d-ab83fa6a9107/bottlerocket-v1.21.2-eks-d-1-21-4-eks-a-1-amd64_fb1094fd-0a42-4ff2-928c-3c0a9cdd30fd.ova?serverId=c2fdf804-b0a0-4bcb-99be-9dab04afa64f: <strong>IO error during transfer</strong> of ds:/vmfs/volumes/vvol:afedfe12b3e24fb4-8a0daa5002ac9644/contentlib-d7538b36-1c53-4cba-9f7c-84e78824e456/3df0dd7f-2d88-458b-835d-ab83fa6a9107/bottlerocket-vmware-k8s-1.21-x86_64-1.2.0-ccf1b754_fb1094fd-0a42-4ff2-928c-3c0a9cdd30fd.vmdk: <strong>Pipe closed</strong>.\n&#8221;, &#8220;remediation&#8221;: &#8220;&#8221;}<br>Error: failed to create cluster: validations failed</em></p>



<p>If the script is run again with no cleanup, the following error will occur:</p>



<p><em>Validation failed {&#8220;validation&#8221;: &#8220;vsphere Provider setup is valid&#8221;, &#8220;error&#8221;: &#8220;failed deploying template: error deploying template: govc: 400 Bad Request: {\&#8221;type\&#8221;:\&#8221;com.vmware.vapi.std.errors.invalid_argument\&#8221;,\&#8221;value\&#8221;:{\&#8221;error_type\&#8221;:\&#8221;INVALID_ARGUMENT\&#8221;,\&#8221;messages\&#8221;:[{\&#8221;args\&#8221;:[],\&#8221;default_message\&#8221;:\&#8221;Specified library item is not an OVF.\&#8221;,\&#8221;id\&#8221;:\&#8221;com.vmware.ovfs.ovfs-main.ovfs.invalid_library_item\&#8221;}]}}\n&#8221;, &#8220;remediation&#8221;: &#8220;&#8221;}<br>Error: failed to create cluster: validations failed</em></p>



<p>It was at this point that I learned that the <strong>eks-a-templates</strong> content library needs to be deleted before re-running the script. Continue repeating this process until you get a good Bottlerocket download. Eventually it will complete 100% without failing. Once you get a good download of Bottlerocket imported into the content library, you won&#8217;t have to go through this process any more, even for future cluster deployments &#8211; that is, assuming you don&#8217;t delete the content library or the Bottlerocket image.</p>
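<p>The fail/clean-up/re-run loop can be sketched as a generic retry wrapper. This is purely illustrative; the actual cleanup between attempts (deleting the <strong>eks-a-templates</strong> content library by hand) is left as a comment:</p>

```shell
# Illustrative retry wrapper for a flaky step: re-run a command until it
# succeeds or attempts are exhausted. In this scenario, the "clean up"
# between attempts means manually deleting the eks-a-templates content
# library before re-running the create cluster script.
retry() {
  max="$1"; shift
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $n attempts"
      return 1
    fi
    echo "attempt $n failed; clean up, then retrying"
    n=$((n + 1))
  done
}
```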



<p>One last hurdle with the Bottlerocket templating function: the following error will occur:</p>



<p><em>Validation failed {&#8220;validation&#8221;: &#8220;vsphere Provider setup is valid&#8221;, &#8220;error&#8221;: &#8220;failed deploying template: error deploying template: govc: folder &#8216;/Galleon Datacenter/vm/Templates&#8217; not found\n&#8221;, &#8220;remediation&#8221;: &#8220;&#8221;}<br>Error: failed to create cluster: validations failed</em></p>



<p>This happens because the Bottlerocket templating function is looking for a VM folder named <strong>Templates</strong> and it doesn&#8217;t exist. I mentioned this earlier in the Prerequisite Checklist section. It would appear we were supposed to create a VM folder named <strong>Templates</strong> right off the Datacenter object. However, I wasn&#8217;t able to find this in the documentation. If the create cluster script is supposed to create it, it&#8217;s not doing it. The fix is of course to create a VM folder named <strong>Templates</strong> off the Datacenter object and rerun the script. It should now successfully import the Bottlerocket container image into the content library and then create a template from it which will be used to create the control plane and worker nodes.</p>



<p>A few final thoughts on container images and storage:</p>



<ul><li>In the end, both Bottlerocket and Ubuntu container images worked in my deployments. The Ubuntu images are much larger in size (Bottlerocket 617MB vs. Ubuntu 4.25GB), thus they consume more storage capacity and take longer to deploy the cluster, particularly on slower storage. Bottlerocket is the default. To use Ubuntu, simply change the osFamily value from bottlerocket to ubuntu in the yaml.</li><li>EKS Anywhere deploys the template to local host storage. For most environments with shared storage, this won&#8217;t be a best practice for a variety of reasons. I moved the Bottlerocket and Ubuntu templates from local host storage to shared storage and it didn&#8217;t cause an issue with EKS Anywhere. Simply convert the template(s) to VM(s), migrate storage, then convert VM(s) back to template(s). The two essential tags are maintained throughout the process. Interestingly enough, the content library is created on the same datastore specified in the VSphereMachineConfig section of the yaml, which in my case was a shared datastore, so no issue with content library storage. I really don&#8217;t know why EKS Anywhere was designed to use local host storage for the template(s). Perhaps it was a $/GB savings decision, but considering the small amount of capacity each template consumes, especially Bottlerocket at well under 1GB, this doesn&#8217;t make much sense.</li></ul>



<p>Moving further, the next challenge I ran into was the following error:</p>



<p><em>Validation failed {&#8220;validation&#8221;: &#8220;vsphere Provider setup is valid&#8221;, &#8220;error&#8221;: &#8220;failed setup and validations: provided VSphereMachineConfig sshAuthorizedKey is invalid: ssh: no key found&#8221;, &#8220;remediation&#8221;: &#8220;&#8221;}<br>Error: failed to create cluster: validations failed</em></p>



<p>This error occurs because of the automatically generated sshAuthorizedKeys in the VSphereMachineConfig section of the eksa-cluster.yaml file:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; title: ; notranslate">
    sshAuthorizedKeys:
    - ssh-rsa AAAA...
</pre></div>


<p>Each of the three instances needs to be changed to the following (note the double quotes):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; title: ; notranslate">
    sshAuthorizedKeys:
    - &quot;&quot;
</pre></div>


<p>Performing the above and re-running the script will result in success:</p>



<p><em>Provided VSphereMachineConfig sshAuthorizedKey is not set or is empty, auto-generating new key pair…<br>VSphereDatacenterConfig private key saved to prod/eks-a-id_rsa. Use &#8216;ssh -i prod/eks-a-id_rsa ec2-user@&#8217; to login to your cluster VM</em></p>



<p>DNS caused problems in one of the labs I was working in. This error is displayed on one or more of the Bottlerocket container image consoles and this error is fatal in that it will halt the deployment of further control plane or worker nodes. This error will also prevent the reporting of a successful deployment to the administrative machine resulting in an unhealthy and dysfunctional EKS Anywhere platform. I used what I learned to recreate the problem in my home lab:</p>



<p><em>Error deserializing HashMap to Settings: Error deserializing scalar value: Unable to deserialize into ValidLinuxHostname: Invalid hostname &#8216;WinTest022.boche.lab&#8217;: must only be [0-9a-z.-], and 1-253 chars long</em></p>



<p>The particular lab I was working in leveraged shared DNS infrastructure. Unbeknownst to me, the shared DNS infrastructure had many DNS Reverse Lookup Zone entries with camel case host names. Through troubleshooting this, I learned that Bottlerocket Linux is sensitive to host names and does not tolerate upper case characters. What happens behind the scenes is that each of the control plane and worker nodes receives a DHCP assigned IP address. Bottlerocket performs a reverse lookup on the IP address it receives. Bottlerocket uses the reverse lookup results to construct a host name for itself. If there is no reverse lookup record in DNS, everything works well. However, if the reverse lookup returns a host name with upper case characters, the error above results and the deployment fails. The remedy is to delete stale and unused reverse lookup records, especially those which contain upper case characters. After doing so, re-run the script and all seven control plane and worker nodes should deploy successfully. I do not know if the Ubuntu container image has the same DNS sensitivity that Bottlerocket has.</p>
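<p>Bottlerocket&#8217;s hostname rule is simple enough to check up front. The sketch below mirrors the rule from the error message ([0-9a-z.-], 1-253 chars) so suspect reverse lookup results can be caught before a deployment, using the same host name from my lab failure:</p>

```shell
# Mirror of the hostname rule from the Bottlerocket error message:
# characters must be [0-9a-z.-] and length must be 1-253. The helper
# itself is illustrative, not Bottlerocket code.
valid_bottlerocket_hostname() {
  name="$1"
  [ -n "$name" ] &&
    [ "${#name}" -le 253 ] &&
    printf '%s' "$name" | grep -Eq '^[0-9a-z.-]+$'
}

valid_bottlerocket_hostname "WinTest022.boche.lab" || echo "deployment will fail: fix the PTR record"
valid_bottlerocket_hostname "wintest022.boche.lab" && echo "hostname is acceptable"
```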



<p>Sometimes, with all of the above issues addressed and for reasons unknown, the cluster deployment may still fail. In my experience, the first node deploys and powers on. Then the next two nodes deploy and power on. At this point, there is a very long wait and the remaining four nodes do not deploy. After a timeout is reached, the following error is reported on the administrative machine:</p>



<p><em>Creating new bootstrap cluster<br>Installing cluster-api providers on bootstrap cluster<br>Provider specific setup<br>Creating new workload cluster</em><br><em>Error: failed to create cluster: error waiting for workload cluster control plane to be ready: error executing wait: error: timed out waiting for the condition on clusters/prod</em><br>or<em><br>Error: failed to create cluster: error waiting for external etcd for workload cluster to be ready: error executing wait: error: timed out waiting for the condition on clusters/prod</em></p>



<p>Attempting to re-run the deployment script results in further errors because cleanup needs to be performed:</p>



<p><em>eksctl anywhere create cluster -f eksa-cluster.yaml</em><br><em>Error: failed to create cluster: error creating bootstrap cluster: error executing create cluster: ERROR: failed to create cluster: node(s) already exist for a cluster with the name &#8220;prod-eks-a-cluster&#8221;<br>, try rerunning with &#8211;force-cleanup to force delete previously created bootstrap cluster</em></p>



<p>Attempting to re-run the deployment script with the &#8211;force-cleanup parameter results in further errors because the &#8211;force-cleanup parameter doesn&#8217;t actually perform all of the cleanup that is necessary (this is noted in GitHub issue <a href="https://github.com/aws/eks-anywhere/issues/225">Improve resource cleanup #225</a>):</p>



<p><em>eksctl anywhere create cluster -f eksa-cluster.yaml &#8211;force-cleanup<br>Error: failed to create cluster: error deleting bootstrap cluster: management cluster in bootstrap cluster</em></p>



<p>Cleanup is manual at this point. Power off and delete the EKS Anywhere nodes from vSphere inventory. Cleaning up the administrative machine is another matter; at the current time I do not know the correct process for it. However, one helpful tip was offered to try at the link above and that is:</p>



<p><em>kind delete cluster &#8211;name prod-eks-a-cluster</em></p>



<p>Beyond that, the best advice I can offer for cleaning up the administrative machine is to create a snapshot of it just before deploying the EKS Anywhere cluster. If the cluster deployment fails, simply revert to the snapshot. You may now re-run the cluster creation script and keep repeating this process as necessary until the cluster deployment is successful. It can be hit or miss sometimes.</p>



<p>Having worked through each of the errors I encountered, I eventually reached a point where each EKS Anywhere cluster deployment is successful.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
administrator@ubuntu-server:~$ eksctl anywhere create cluster -f eksa-cluster.yaml
Performing setup and validations
Warning: VSphereDatacenterConfig configured in insecure mode
&amp;#x2705; Connected to server
&amp;#x2705; Authenticated to vSphere
&amp;#x2705; Datacenter validated
&amp;#x2705; Network validated
&amp;#x2705; Datastore validated
&amp;#x2705; Folder validated
&amp;#x2705; Resource pool validated
&amp;#x2705; Datastore validated
&amp;#x2705; Folder validated
&amp;#x2705; Resource pool validated
&amp;#x2705; Datastore validated
&amp;#x2705; Folder validated
&amp;#x2705; Resource pool validated
&amp;#x2705; Control plane and Workload templates validated
Provided VSphereMachineConfig sshAuthorizedKey is not set or is empty, auto-generating new key pair...
VSphereDatacenterConfig private key saved to prod/eks-a-id_rsa. Use &#039;ssh -i prod/eks-a-id_rsa ec2-user@&lt;VM-IP-Address&gt;&#039; to login to your cluster VM
&amp;#x2705; Vsphere Provider setup is valid
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific setup
Creating new workload cluster
Installing networking on workload cluster
Installing storage class on workload cluster
Installing cluster-api providers on workload cluster
Moving cluster management from bootstrap to workload cluster
Installing EKS-A custom components (CRD and controller) on workload cluster
Creating EKS-A CRDs instances on workload cluster
Installing AddonManager and GitOps Toolkit on workload cluster
GitOps field not specified, bootstrap flux skipped
Writing cluster config file
Deleting bootstrap cluster
&#x1f389; Cluster created!
administrator@ubuntu-server:~$

</pre></div>


<p>This wraps up the challenges for the EKS Anywhere create production cluster deployment. I spent considerable time working through each of these issues. The deployment process isn&#8217;t bad, but it could use improvement, namely much better error trapping. The several hoops I had to jump through just to get the Bottlerocket image into the content library and get a template out of it were mostly nonsense. All of the errors could be anticipated and handled without terminating (and leaving a mess behind). Beefing up documentation around some of the issues encountered would also be helpful.</p>



<p>After a successful EKS Anywhere deployment, I proceeded to deploy a handful of containerized demo applications. William Lam has a number of interesting ones <a href="https://williamlam.com/2020/06/interesting-kubernetes-application-demos.html">here</a>. Since I hadn&#8217;t installed a load balancer yet, I used the <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Kubernetes NodePort service</a> to make each demo application accessible on the network. I may try tackling an EKS Anywhere external load balancer and ingress controller next. It looks like Kube-Vip and Emissary-ingress come highly recommended.</p>
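<p>For reference, exposing one of those demo applications via NodePort takes only a small Service manifest. A minimal sketch follows; the application name, ports, and selector labels are hypothetical and would need to match the actual demo deployment:</p>

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-app          # hypothetical application name
spec:
  type: NodePort
  selector:
    app: demo-app         # must match the labels on the demo app's pods
  ports:
  - port: 80              # cluster-internal service port
    targetPort: 8080      # container port the app listens on
    nodePort: 30080       # reachable on the network at <node-IP>:30080
```

<p>Apply it with <strong>kubectl apply -f</strong> and the application becomes reachable on any node&#8217;s IP at the nodePort.</p>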



<p>In closing, I&#8217;ll share my <strong>eksa-cluster.yaml</strong> which can be used for comparison purposes. There really isn&#8217;t too much to alter from the default which eksctl creates. Provide information about the vCenter Server address and objects where EKS Anywhere will be deployed, fix the sshAuthorizedKeys, and that&#8217;s about it.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: yaml; title: ; notranslate">
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: prod
spec:
  clusterNetwork:
    cni: cilium
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 2
    endpoint:
      host: &quot;192.168.110.40&quot;
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: prod
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod-etcd
  kubernetesVersion: &quot;1.21&quot;
  workerNodeGroupConfigurations:
  - count: 2
    machineGroupRef:
      kind: VSphereMachineConfig
      name: prod

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: prod
spec:
  datacenter: &quot;Galleon Datacenter&quot;
  insecure: true
  network: &quot;vlan110&quot;
  server: &quot;vc.boche.lab&quot;
  thumbprint: &quot;&quot;

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: prod-cp
spec:
  datastore: &quot;freenas1_nfs_share1&quot;
  diskGiB: 25
  folder: &quot;eksa&quot;
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: &quot;eksa&quot;
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - &quot;&quot;

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: prod
spec:
  datastore: &quot;freenas1_nfs_share1&quot;
  diskGiB: 25
  folder: &quot;eksa&quot;
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: &quot;eksa&quot;
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - &quot;&quot;

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: prod-etcd
spec:
  datastore: &quot;freenas1_nfs_share1&quot;
  diskGiB: 25
  folder: &quot;eksa&quot;
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: &quot;eksa&quot;
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - &quot;&quot;

---


</pre></div>


<p></p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.boche.net/blog/index.php/2021/10/13/deploying-amazon-eks-anywhere-on-vsphere/feed/</wfw:commentRss>
			<slash:comments>7</slash:comments>
		
		
			</item>
		<item>
		<title>Blog Comments and Discussions Disabled</title>
		<link>http://www.boche.net/blog/index.php/2021/10/12/blog-comments-and-discussions-disabled/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Tue, 12 Oct 2021 22:42:08 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[WordPress]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5392</guid>

					<description><![CDATA[Since the boche.net blog began back in 2008, there have been many relevant and valuable comments left on blog articles here. Over time, I have implemented various plug-ins, security measures, as well as backup and recovery mechanisms to combat the growing amount of comment spam and hacking attempts on this blog. Akismet Anti-Spam (a plug-in [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Since the boche.net blog began back in 2008, there have been many relevant and valuable comments left on blog articles here. Over time, I have implemented various plug-ins, security measures, as well as backup and recovery mechanisms to combat the growing amount of comment spam and hacking attempts on this blog.</p>



<p>Akismet Anti-Spam (a plug-in I have been using) recently announced that their free-to-use plug-in is moving towards a payware model. This change took effect in mid-August and my free use account was disabled. Fast forward to today and I found that I had nearly 700,000 comments (probably all spam &#8211; I only looked at the first 100) waiting to be approved. While these unmoderated comments didn&#8217;t make their way to polluting the blog, attempts to delete them did cause problems with the blog and MySQL database resources on the server.</p>



<p>The good news is that after some unplanned time spent, I got it all cleaned up and the blog appears to be healthy and functional.</p>



<p>The bad news is that, for now, I&#8217;m waving the white flag on combating spam. Until further notice, blog comments and discussions will effectively be disabled for pages as well as blog posts older than 14 days. If I find that is not effective enough, I will disable comments and discussions across the board. For any feedback, discussion, questions, etc. on any blog post, please reach out to me via Twitter or Email.</p>



<p>Thank you,</p>



<p>Jas</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Configure ntpd to start with host via CLI</title>
		<link>http://www.boche.net/blog/index.php/2020/08/04/configure-ntpd-to-start-with-host-via-cli/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Tue, 04 Aug 2020 19:02:07 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[ESX]]></category>
		<category><![CDATA[ESXi]]></category>
		<category><![CDATA[Scripting]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5386</guid>

					<description><![CDATA[Good afternoon. I hope you&#8217;re all staying safe and healthy in the midst of the COVID-19 pandemic. I had someone reach out to me yesterday with a need to script NTP configuration on ESXi hosts. He had all of the NTP configuration working except enabling the ntpd daemon to start automatically with the host. That&#8217;s [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Good afternoon. I hope you&#8217;re all staying safe and healthy in the midst of the COVID-19 pandemic.</p>



<p>I had someone reach out to me yesterday with a need to script NTP configuration on ESXi hosts. He had all of the NTP configuration working except enabling the ntpd daemon to start automatically with the host. That&#8217;s easy enough, I said. I use the following PowerCLI script block to configure many of my vSphere hosts. The third line takes care of automatically starting the daemon.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Get-VMHost | Get-VMHostFirewallException | Where-Object {$_.Name -eq &quot;NTP client&quot;} | Set-VMHostFirewallException -Enabled:$true
Get-VMHost | Get-VmHostService | Where-Object {$_.key -eq &quot;ntpd&quot;} | Start-VMHostService
Get-VMhost | Get-VmHostService | Where-Object {$_.key -eq &quot;ntpd&quot;} | Set-VMHostService -policy &quot;on&quot;
Get-VMHost | Get-VmHostService | Where-Object {$_.key -eq &quot;ntpd&quot;}
Get-VMHost | Get-VMHostFirewallException | Where-Object {$_.Name -eq &quot;NTP client&quot;}
</pre></div>


<p>However, he didn&#8217;t want to use PowerCLI &#8211; he needed an ESXi command line method. He had scoured the internet finding no solution and actually turned up many discussions saying it can&#8217;t be done.</p>



<p>I dug deep into my ESX 3.5.0 build document, last updated 1/23/10 (just prior to my VCDX defense, which occurred in early February 2010). &#8220;Try this&#8221;, I said:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
chkconfig ntpd on
</pre></div>


<p>He responded that it didn&#8217;t work. He wanted the UI to show the <strong>NTP Service startup policy</strong> as <strong>Start and stop with host</strong>. It was still grayed out showing <strong>Start and stop manually</strong>.</p>



<p>&#8220;Ok do this &#8211; restart hostd&#8221;:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
/etc/init.d/hostd restart
</pre></div>


<p>That worked and was likely the missing step from many of the forum threads saying this can&#8217;t be done.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>vSphere with Kubernetes</title>
		<link>http://www.boche.net/blog/index.php/2020/05/17/vsphere-with-kubernetes/</link>
					<comments>http://www.boche.net/blog/index.php/2020/05/17/vsphere-with-kubernetes/#comments</comments>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Mon, 18 May 2020 05:01:34 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[ESXi]]></category>
		<category><![CDATA[Kubernetes]]></category>
		<category><![CDATA[vCenter Server]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5321</guid>

					<description><![CDATA[During the past couple of months, I&#8217;ve had the opportunity to participate in both the vSphere 7 and Project Pacific beta programs. While the vSphere 7 beta was fairly straightforward (no intent to downplay the incredible amount of work that went into one of VMware&#8217;s biggest and most anticipated releases in company history), Project Pacific [&#8230;]]]></description>
										<content:encoded><![CDATA[
<div class="wp-block-image"><figure class="alignright size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49906469862_60e72ece45_n.jpg" alt=""/></figure></div>



<p>During the past couple of months, I&#8217;ve had the opportunity to participate in both the vSphere 7 and Project Pacific beta programs. While the vSphere 7 beta was fairly straightforward (no intent to downplay the incredible amount of work that went into one of VMware&#8217;s biggest and most anticipated releases in company history), Project Pacific marks the start of my Kubernetes journey &#8211; something I&#8217;ve wanted to get moving on once the busy hockey season concluded (I just wrapped up my 4th season coaching at the Peewee level).</p>



<p>For myself, the learning process can be broken down into two distinct parts:</p>



<ol><li>Understanding the architecture and deploying Kubernetes on vSphere. Sidebar: <a href="https://cormachogan.com/2020/05/05/understanding-the-tanzu-portfolio-and-the-new-names-for-vmware-modern-app-products/">Understanding the Tanzu portfolio (and the new names for VMware modern app products)</a>. To accomplish this in vSphere 7, we need to deploy NSX-T and then enable Workload Management in the UI. That short description easily represents several hours of work when you consider planning and, in my case, failing a few times. I&#8217;ve seen a few references made on how easy this process is. Perhaps if you already have a strong background in NSX-T. I found it challenging during the beta.</li><li>Day 1 Kubernetes. The supervisor cluster is up and running (I think). Now how do I use it? YAML? Pods? What&#8217;s a persistent volume claim (PVC)? Do I now have a <a href="https://blogs.vmware.com/vsphere/2020/03/vsphere-7-tanzu-kubernetes-clusters.html">Tanzu Kubernetes Grid Cluster</a>? No, not yet.</li></ol>



<p>This blog post is going to focus mainly on part 1 &#8211; deployment of the Kubernetes platform on vSphere 7, learning the ropes, and some of the challenges I overcame to achieve a successful deployment.</p>



<p>During the Project Pacific beta, we had a wizard which deployed most of the NSX-T components. The NSX-T Manager, the Edges, the Tier-0 Gateway, the Segment, the uplinks, it was all handled by the wizard. I&#8217;m an old hand with vShield Manager, and NSX Manager after that, for vCloud Director, but NSX-T is a beast. If you don&#8217;t know your way around NSX-T yet, the wizard was a blessing because all we had to do was understand what was needed, and then supply the correlating information to the wizard. I think the wizard also helped drive the beta program to success within the targeted start and end dates (these are typical beta program constraints).</p>



<p>When vSphere 7 went GA, a few notable things had changed.</p>



<ol><li><a href="https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-kubernetes/GUID-9A190942-BDB1-4A19-BA09-728820A716F2.html">Licensing</a>. Kubernetes deployment on vSphere 7 requires VMware vSphere Enterprise Plus with Add-on for Kubernetes. Right now I believe the only path is through VMware Cloud Foundation (VCF) 4.0 licensing.</li><li>Unofficially you can deploy Kubernetes on vSphere 7 without VCF. All of the bits needed already exist in vCenter, ESXi, and NSX-T 3.0. But as the Kubernetes features seem to be buried in the ESXi license key, it involves just a bit of trickery. More on that in a bit.</li><li>Outside of VCF, there is no wizard based installation like we had in the Project Pacific beta. It&#8217;s a manual deployment and configuration of NSX-T. To be honest and from a learning perspective, this is a good thing. There&#8217;s no better way to learn than to crack open the books, read, and do.</li></ol>



<p>So here&#8217;s VMware&#8217;s book to follow:</p>



<p><strong>vSphere with Kubernetes Configuration and Management</strong> (<a href="https://docs.vmware.com/en/VMware-vSphere/7.0/vsphere-esxi-vcenter-server-70-vsphere-with-kubernetes-guide.pdf">PDF</a>, <a href="https://docs.vmware.com/en/VMware-vSphere/7.0/vmware-vsphere-with-kubernetes/GUID-152BE7D2-E227-4DAA-B527-557B564D9718.html">online library</a>). </p>



<p>It&#8217;s a good guide and should cover everything you need from planning to deployment of both NSX-T as well as Kubernetes. If you&#8217;re going to use it, be aware that it does tend to get updated so watch for those changes to stay current. To that point, I may make references to specific page numbers that could change over time.</p>



<p>I&#8217;ve made several mentions of NSX-T. If you haven&#8217;t figured it out by now, the solution is quite involved when it comes to networking. It&#8217;s important to understand the networking architecture and how that will overlay your own network as well as utilize existing infrastructure resources such as DNS, NTP, and internet access. When it comes to filling in the blanks for the various VLANs, subnets, IP addresses, and gateways, it&#8217;s important to provide the right information and configure correctly. Failure to do so will either end up in a failed deployment, or a deployment that from the surface appears successful but Kubernetes work later on fails miserably. Ask me how I know.</p>



<p>There are several network diagrams throughout VMware&#8217;s guide. You&#8217;ll find more browsing the internet. I borrowed this one from the UI.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49906799087_d5f4391f9a.jpg" alt=""/></figure>



<p>They all look about the same. Don&#8217;t worry so much about the internal networking of the supervisor cluster or even the POD or Service CIDRs. For the most part these pieces are autonomous. The workload enablement wizard assigns these CIDR blocks automatically so that means if you leave them alone, you can&#8217;t possibly misconfigure them.</p>



<p><strong>What is important can be boiled down to just three required VLANs</strong>. Mind you I&#8217;m talking solely about Kubernetes on vSphere in the lab here. For now, forget about production VCF deployments and the VLAN requirements it brings to the table (but do scroll down to the end for a link to a click through demo of Kubernetes with VCF).</p>



<p>Just three VLANs. It does sound simple, but where some of the confusion may start is terminology &#8211; depending on the source, I&#8217;ve seen these VLANs referred to in different ways using different terms. I&#8217;ll try to simplify as much as I can.</p>



<ol><li><strong>ESXi host TEP VLAN</strong> &#8211; Just a private empty VLAN. Must route to Edge node TEP VLAN. Must support minimum 1600 MTU (jumbo frames) both intra VLAN as well as routing jumbo frames to the Edge node TEP VLAN. vmk10 is tied to this VLAN.</li><li><strong>Edge node TEP VLAN</strong>&#8211; Another private empty VLAN. Must route to ESXi host TEP VLAN. Must support minimum 1600 MTU (jumbo frames) both intra VLAN as well as routing jumbo frames to the ESXi host TEP VLAN. The Edge TEP is tied to this VLAN.<br><img decoding="async" loading="lazy" width="310" height="160" class="wp-image-5360" style="width: 310px;" src="http://www.boche.net/blog/wp-content/uploads/2020/05/2020-05-18_9-10-27.jpg" alt="" srcset="http://www.boche.net/blog/wp-content/uploads/2020/05/2020-05-18_9-10-27.jpg 310w, http://www.boche.net/blog/wp-content/uploads/2020/05/2020-05-18_9-10-27-300x155.jpg 300w" sizes="(max-width: 310px) 100vw, 310px" /><br>A routed tunnel is established between the ESXi host tunnel endpoints on vmk10 (and vmk11 if you&#8217;re deploying with redundancy in mind) and each Edge node TEP interface. If jumbo frames aren&#8217;t making it <strong>unfragmented</strong> through this tunnel, you&#8217;re dead in the water.</li><li>The third VLAN is what VMware calls the Tier 0 gateway and uplink for transport node on page 49 of their guide. I&#8217;ve seen this called the Overlay network. I&#8217;ve seen this called the Edge uplink network. The Project Pacific beta quickstart guide called it the Edge Logical Router uplink VLAN as well as the Workload Network VLAN. Later in the wizard it was simply referred to as the Uplink VLAN. Don&#8217;t ever confuse this with the TEP VLANs. In all diagrams it&#8217;s going to be the <strong>External Network</strong> or the network where the <strong>DevOps</strong> staff live. 
The Tier-0 gateway provides the north/south connectivity between the external network and the Kubernetes stack (which also includes a Tier-1 gateway). Another helpful correlation: <strong>The Egress and Ingress CIDRs live on this third VLAN</strong>. You&#8217;ll find out sooner or later that existing resources must exist on this external network, such as DNS, NTP, and internet access.</li></ol>



<p>All of the network diagrams I&#8217;ve seen, including the one above, distinguish between the <strong>external network</strong> and the <strong>management network</strong>. For the home labbers out there, these two will most often be the same network. In my initial deployment, I made the mistake of deploying Kubernetes with a management VLAN and a separate DevOps VLAN that had no route to the internet. Workload enablement was successful but I found out later that applying a simple YAML resulted in endless failed pods being created. This is because the ESXi host based image fetcher container runtime executive (CRX) had no route to the internet to access public repository images (a firewall blocking traffic can cause this as well). I was seeing errors such as the following in <strong>/var/log/spherelet.log</strong> on the vSphere host where the pod was placed:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; gutter: false; title: ; notranslate">
Failed to resolve image: Http request failed. Code 400: ErrorType(2) failed to do request: Head https://registry-1.docker.io/v2/library/nginx/manifests/alpine: dial tcp 34.197.189.129:443: connect: network is unreachable
</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; gutter: false; title: ; notranslate">
spherelet.log:time=&quot;2020-03-25T02:47:24.881025Z&quot; level=info msg=&quot;testns1/nginx-3a1d01bf5d03a391d168f63f6a3005ff4d17ca65-v246: Start new image fetcher instance. Crx-cli cmd args &#x5B;/bin/crx-cli ++group=host/vim/vmvisor/spherelet/imgfetcher run --with-opaque-network nsx.LogicalSwitch --opaque-network-id a2241f05-9229-4703-9815-363721499b59 --network-address 04:50:56:00:30:17 --external-id bfa5e9d2-8b9d-4b34-9945-5b7452ee76a0 --in-group host/vim/vmvisor/spherelet/imgfetcher imgfetcher]\n&quot;
</pre></div>
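<p>For context, the YAML that triggers this image fetch can be as simple as a bare pod pulling a public image. A minimal sketch (the pod name is illustrative; the namespace and image match the log output above):</p>

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test        # illustrative name
  namespace: testns1      # namespace seen in the spherelet.log entry
spec:
  containers:
  - name: nginx
    image: nginx:alpine   # resolved via registry-1.docker.io, hence the internet requirement
```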


<p>The NSX-T Manager and Edge nodes both have management interfaces that tie to the management network, but much like vCenter and ESXi management interfaces, these are for management only and are not in the data path nor are they a part of the <a href="https://www.spillthensxt.com/how-to-validate-mtu-in-an-nsx-t-environment/">Geneve tunnel</a>. As such, the management network does not require jumbo frames.</p>



<p>Early on in the beta, I took some lumps trying to deploy Kubernetes on vSphere. These attempts were unsuccessful for a few reasons and 100% of the cause was networking problems.</p>



<p><strong>First networking problem</strong>: My TEP VLANs were not routed. That was purely my fault for not understanding in full the networking requirements for the two TEP VLANs. Easy fix &#8211; I contacted my lab administrator and had him add two default gateways, one for each of the TEP VLANs. Problem solved.</p>



<p><strong>Second networking problem</strong>: My TEP VLANs supported jumbo frames at Layer 2 (hosts on the same VLAN can successfully send and receive unfragmented jumbo frames all day), but did not support the routing of jumbo frames. (Using <strong>vmkping</strong> with the <strong>-d</strong> switch is very important in testing for jumbo frame success; the command looks something like <strong>vmkping -I vmk10 &lt;edge TEP IP&gt; -S vxlan -s 1572 -d</strong>.) In other words, when trying to send a jumbo frame from an ESXi host TEP to an Edge TEP on the other VLAN, standard MTU frames make it through, but jumbo frames are dropped at the physical switch interface which was performing the intra-switch intervlan routing.</p>
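<p>A note on that <strong>-s 1572</strong> value: it is not arbitrary. It is the 1600-byte minimum MTU minus the IPv4 and ICMP headers, so the probe fills a full 1600-byte frame. A quick sketch of the arithmetic (the vmkping line itself is ESXi-only and shown as a comment):</p>

```shell
# The -s payload equals the minimum MTU minus the IPv4 header (20 bytes)
# and the ICMP header (8 bytes): 1600 - 20 - 8 = 1572
MTU=1600
PAYLOAD=$((MTU - 20 - 8))
echo "payload=${PAYLOAD}"   # prints payload=1572

# Then, from the ESXi shell (vmk10 is the host TEP vmkernel port):
# vmkping -I vmk10 &lt;edge TEP IP&gt; -S vxlan -s 1572 -d
```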



<p>A problem with jumbo frames can manifest itself into somewhat of a misleading problem and resulting diagnosis. When a jumbo frames problem exists between the two TEP VLANs:</p>



<ul><li>A workload enablement appears successful and healthy in the UI</li><li>The Control Plane Node IP Address is pingable</li><li>The individual supervisor cluster nodes are reachable on their respective IP addresses and accept kubectl API commands</li><li>The Harbor Image Registry is successfully deployed</li></ul>



<p>But&#8230;</p>



<ul><li>The Control Plane Node IP Address is not reachable over https in a web browser</li><li>The Harbor Image Registry is unreachable via web browser at its published IP address</li></ul>



<p>These are symptoms of an underlying jumbo frames problem but they can be misidentified as a load balancer issue.</p>



<p>I spent some time on this because my lab administrator assured me jumbo frames were enabled on the physical switch. It took some more digging to find out intervlan routing of jumbo frames was a separate configuration on the switch. To be fair, I didn&#8217;t initially ask for this configuration (I didn&#8217;t know what I didn&#8217;t know at the time). Once that configuration was made on the switch, jumbo frames were making it to both ends of the tunnel across the two VLANs. Problem solved.</p>



<p>Just one more note on testing for intervlan routing of jumbo frames. Although the switch may be properly configured and jumbo frames are making it through between VLANs, I have found that sending vmkping commands with jumbo frames to the switch interfaces themselves (this would be the default gateway for the VLAN) can be successful and it can also fail. I think it all depends on the switch make and model. Call it a red herring and try not to pay attention to it. What&#8217;s important is that the jumbo frames ultimately make it through to the opposite tunnel endpoint.</p>



<p><strong>Third networking problem</strong>: The third critical VLAN mentioned above (call it the overlay, call it the Edge Uplink, call it the External Network, call it the DevOps network) was not well understood and was implemented incorrectly. There are a few ways you can go wrong here.</p>



<ol><li>Use the wrong VLAN &#8211; in other words a VLAN which has no reachable network services such as DNS, NTP, or a gateway to the internet. You&#8217;ll be able to deploy the Kubernetes framework but the deployment of pods requiring access to a public image repository will fail. During deployment, failed pods will quickly stack up in the namespace.</li><li>Using the correct VLAN but using Egress and Ingress CIDR blocks from some other VLAN. This won&#8217;t work and it is pretty well spelled out everywhere I&#8217;ve looked that the Egress and Ingress CIDR blocks need to be on the same VLAN which represents the External Network. Among other things, the <strong>Egress </strong>portion is used for outbound traffic through the Tier-0 Gateway to the external network. The image fetcher CRX is one such function which uses Egress. The <strong>Ingress </strong>portion is used for inbound traffic from the External Network through  the Tier-0 Gateway. In fact, the first usable IP address in this block ends up being the Control Plane Node IP Address for the Supervisor Cluster where all the <strong>kubectl</strong> API commands come through. If you&#8217;ve just finished enabling workload management and your Control Plane Node IP Address is not on your external network, you&#8217;ve likely assigned the wrong Egress/Ingress CIDR addresses.</li></ol>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49907072368_96015338a5.jpg" alt=""/></figure>



<p><strong>Fourth networking problem</strong>: The sole edge portgroup that I had created on the distributed switch needs to be in <strong>VLAN Trunking</strong> mode (1-4094), not VLAN with a specified tag. This one can be easy to miss and early in the beta I missed it. I followed the Project Pacific quickstart guide but don&#8217;t ever remember seeing the requirement. It is well documented now on page 55 of the VMware document I mentioned early on.</p>



<p>To summarize the common theme thus far, <strong>networking</strong>, understanding the requirements, mapping to your environment, and getting it right for a successful vSphere with Kubernetes deployment.</p>



<p>Once the beta had concluded and vSphere 7 was launched, I was anxious to deploy Kubernetes on GA bits. After deploying and configuring NSX-T in the lab, I ran into the licensing obstacle. During the Project Pacific beta, license keys were not an issue. The problem is when we try to enable Workload Management after NSX-T is ready and waiting. Without the proper licensing, I was greeted with <strong>This vCenter does not have the license to support Workload Management</strong>. You won&#8217;t see this in a newly stood up greenfield environment if you haven&#8217;t gotten a chance to license the infrastructure. Where you will see it is if you&#8217;ve already licensed your ESXi hosts with Enterprise Plus licenses for instance. Since Enterprise Plus licenses by themselves are not entitled to the Kubernetes feature, they will disable the Workload Management feature.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49907576537_cd6d2175a0_w.jpg" alt=""/></figure>



<p>The temporary workaround I found is to simply remove the Enterprise Plus license keys and apply the Evaluation License to the hosts. Once you do this and refresh the Workload Management page, the padlock disappears and I was able to continue with the evaluation.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49907636902_9642acc106.jpg" alt=""/></figure>



<p>Unfortunately the ESXi host Evaluation License keys are only good for 60 days. As of this writing, vSphere 7 has not yet been GA for 60 days so anyone who stood up a vSphere 7 GA environment on day 1 still has a chance to evaluate vSphere with Kubernetes.</p>



<p>One other minor issue I&#8217;ve run into that I&#8217;ll mention has to do with NSX-T Compute Managers. A Compute Manager in NSX-T is a vCenter Server registration. You might be familiar with the process of registering a vCenter Server with other products such as storage array or data protection software. This really is no different.</p>



<p>However, a problem can present itself whereby a vCenter Server has been registered with an NSX-T Manager previously, that NSX-T Manager is decommissioned (improperly), and then an attempt is made sometime later to register the vCenter Server with a newly commissioned NSX-T Manager. The issue itself is a little deceptive because at first glance that subsequent registration with a new NSX-T Manager appears successful &#8211; no errors are thrown in the UI and we continue our work setting up the fabric in NSX-T Manager.</p>



<p>What lurks in the shadows is that the registration wasn&#8217;t <em>entirely</em> successful. The Registration Status shows <strong>Not Registered</strong> and the Connection Status shows <strong>Down</strong>. It&#8217;s a simple fix really &#8211; not something ugly that you might expect in the CLI. Simply click on the Status link and you&#8217;re offered an opportunity to <strong>Select errors to resolve</strong>. I&#8217;ve been through this motion a few times and the resolution is quick and effortless. Within a few seconds the Registration Status is <strong>Registered </strong>and the Connection Status is <strong>Up</strong>.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49907357641_2726298cf6.jpg" alt=""/></figure>



<p>Deploying Kubernetes on vSphere can be challenging, but in the end it is quite satisfying. It also ushers in Day 1 Kubernetes. Becoming a <strong>kubectl</strong> Padawan. Observing the rapid deployment and tear down of applications on native VMware integrated Kubernetes pods. Digging into the persistent storage aspects. Deploying a Tanzu Kubernetes Grid Cluster.</p>



<p><a href="https://banzaicloud.com/blog/bring-your-own-kubernetes/">Day 2 Kubernetes</a> is also a thing. Maintenance, optimization, housekeeping, continuous improvement, backup and restoration. Significant dependencies now tie into vSphere and NSX-T infrastructure. Keeping these components healthy and available will be more important than ever to maintain a productive and happy DevOps organization.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49906833213_938123b463.jpg" alt=""/></figure>



<p>I would be remiss if I ended this before calling out a few fantastic resources. David Stamen&#8217;s blog series <a href="https://davidstamen.com/series/deploying-vsphere-with-kubernetes/">Deploying vSphere with Kubernetes</a> provides a no nonsense walk through highlighting all of the essential steps from configuring NSX-T to enabling workload management. He wraps up the series with demo applications and a Tanzu Kubernetes Grid Cluster deployment.</p>



<p>It should be no surprise that William Lam&#8217;s name comes up here as well. William has done some incredible work in the areas of automated vSphere deployments for lab environments. In his <a href="https://www.virtuallyghetto.com/2020/04/deploying-a-minimal-vsphere-with-kubernetes-environment.html">Deploying a minimal vSphere with Kubernetes environment</a> article, he shows us how we can deploy Kubernetes on a two or even one node vSphere cluster (this is unsupported of course &#8211; a minimum of three vSphere hosts is required as of this writing). This is a great tip for those who want to get their hands on vSphere with Kubernetes but have a limited number of vSphere hosts in their cluster to work with. I did run into one caveat with the two node cluster in my own lab &#8211; I was unable to deploy a Tanzu Kubernetes Grid Cluster. After deployment and power on of the TKG control plane VM, it waits indefinitely to deploy the three worker VMs. I believe the TKG cluster is looking for three supervisor control plane VMs. Nonetheless, I was able to deploy applications on native pods and it demonstrates that William&#8217;s efforts enable the community at large to do more with less in their home or work lab environments. If you find his work useful, make a point to reach out and thank him.</p>



<p><a href="https://www.spillthensxt.com/how-to-validate-mtu-in-an-nsx-t-environment/">How to Validate MTU in an NSX-T Environment</a> &#8211; This is a beautifully written chapter. Round up the usual <strong>vmkping</strong> suspects (Captain Louis Renault, <a href="https://www.imdb.com/title/tt0034583/">Casablanca</a>). You&#8217;ll find them all here. The author also utilizes <strong>esxcli</strong> (general purpose ESXi CLI), <strong>esxcfg-vmknic</strong> (vmkernel NIC CLI), <strong>nsxdp-cli</strong> (NSX datapath), and edge node CLI for diagnostics.</p>
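<p>To put numbers to that <strong>vmkping</strong> testing: the payload size passed with <strong>-s</strong> must leave room for the IP and ICMP headers, which is why jumbo-frame checks use 8972 rather than 9000. A minimal sketch of the arithmetic (the vmkping line is ESXi-only and shown as a comment for reference):</p>

```shell
# On ESXi, end-to-end MTU is commonly validated with a don't-fragment ping, e.g.:
#   vmkping ++netstack=vxlan -d -s 8972 <destination-vtep-ip>
# -d sets "don't fragment"; the payload is the MTU minus protocol headers.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))   # subtract IPv4 header (20) and ICMP header (8)
echo "$PAYLOAD"
```

<p>For a standard 1500-byte MTU the same math yields 1472.</p>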



<p><a href="https://www.simongreaves.co.uk/nsx-t-cli-reference-guide/">NSX-T Command Line Reference Guide</a> &#8211; I stumbled onto this guide covering <strong>nsxcli</strong>. Although I didn&#8217;t use it for getting the Kubernetes lab off the ground, it looks very interesting and useful for NSX-T Datacenter, so I&#8217;m bookmarking it for later. What&#8217;s interesting to note here is that <strong>nsxcli</strong> is run from the ESXi host via an installed kernel module, as well as from the NSX-T Manager and the edge nodes.</p>



<p><a href="https://volumes.blog/2020/05/20/what-is-dell-technologies-powerstore-part-13-integrating-with-vmware-tanzu-kubernetes-grid/">What is Dell Technologies PowerStore – Part 13, Integrating with VMware Tanzu Kubernetes Grid</a> &#8211; Itzik walks through VMware Tanzu Kubernetes Grid with vSphere 7 and Tanzu Kubernetes clusters, then steps through the creation of a cluster with Dell EMC PowerStore and vVols.</p>



<p>Lastly, here&#8217;s a cool click-through demo of a Kubernetes deployment on VCF 4.0 &#8211; <a href="https://storagehub.vmware.com/t/vsphere-with-kubernetes-on-cloud-foundation/">vSphere with Kubernetes on Cloud Foundation</a> (credit William Lam for sharing this link).</p>



<p>With that, I introduce a new blog tag: Kubernetes</p>



<p>Peace out.</p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.boche.net/blog/index.php/2020/05/17/vsphere-with-kubernetes/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
			</item>
		<item>
		<title>Site Recovery Manager Firewall Rules for Windows Server</title>
		<link>http://www.boche.net/blog/index.php/2020/04/29/site-recovery-manager-firewall-rules-for-windows-server/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Wed, 29 Apr 2020 20:09:12 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Security]]></category>
		<category><![CDATA[SRM]]></category>
		<category><![CDATA[vCenter Server]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[Windows]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5298</guid>

					<description><![CDATA[I have a hunch Google sent you here. Before we get to what you&#8217;re looking for, I&#8217;m going to digress a little. tl;dr folks feel free to jump straight to the frown emoji below for what you&#8217;re looking for. Since the Industrial Revolution, VMware has supported Microsoft Windows and SQL Server platforms to back datacenter [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>I have a hunch Google sent you here. Before we get to what you&#8217;re looking for, I&#8217;m going to digress a little. tl;dr folks feel free to jump straight to the frown emoji below for what you&#8217;re looking for.</p>



<p>Since the <a href="https://en.wikipedia.org/wiki/Industrial_Revolution">Industrial Revolution</a>, VMware has supported Microsoft Windows and SQL Server platforms to back datacenter and cloud infrastructure products such as vCenter Server, Site Recovery Manager, vCloud Director (rebranded recently to VMware Cloud Director), and so on. However, if you&#8217;ve been paying attention to product documentation and compatibility guides, you will have noticed support for Microsoft platforms diminishing in favor of easy-to-deploy appliances based on <a href="https://vmware.github.io/photon/">Photon OS</a> and VMware Postgres (vPostgres). This is a good thing &#8211; spoken by a salty IT veteran with a strong Windows background.</p>



<p>2019 is where we really hit a brick wall. vCenter Server 6.7 is the last version that supports installation on Windows and that ended on Windows Server 2016 &#8211; there was never support for Windows Server 2019 (reference <a href="https://kb.vmware.com/s/article/2091273">VMware KB 2091273 &#8211; Supported host operating systems for VMware vCenter Server installation</a>). In vSphere 7.0, vCenter Server for Windows has been removed and support is not available. For more information, see&nbsp;<a rel="noreferrer noopener" href="https://blogs.vmware.com/vsphere/2017/08/farewell-vcenter-server-windows.html" target="_blank">Farewell, vCenter Server for Windows</a>. Likewise, Microsoft SQL Server 2016 was the last version to support vCenter Server (<a href="https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#db&amp;2=">matrix reference</a>).</p>



<p>Site Recovery Manager (SRM) is in the same boat. It was born and bred on Windows and SQL Server back ends. But once again we find a Photon OS-based appliance with embedded vPostgres available, along with product documentation which highlights diminishing support for Microsoft Windows and SQL.</p>



<p>Taking a closer look at the documentation&#8230;</p>



<p><a href="https://docs.vmware.com/en/Site-Recovery-Manager/8.2/rn/srm-compat-matrix-8-2.html">Compatibility Matrices for VMware Site Recovery Manager 8.2</a></p>



<ul><li>&#8220;Site Recovery Manager Server 8.2&nbsp;supports the same Windows host operating systems that vCenter Server 6.7 supports.&#8221;</li><li><a href="https://kb.vmware.com/s/article/2091273">Supported host operating systems for VMware vCenter Server installation (2091273)</a></li><li><strong>Takeaway</strong>: SRM 8.2 is supported on Windows Server 2016, but not 2019.</li></ul>



<p><a href="https://docs.vmware.com/en/Site-Recovery-Manager/8.3/rn/srm-compat-matrix-8-3.html">Compatibility Matrices for VMware Site Recovery Manager 8.3</a></p>



<ul><li>&#8220;Site Recovery Manager Server 8.3 supports the same Windows host operating systems that vCenter Server 7.0 supports.&#8221; <a href="https://www.vmware.com/resources/compatibility/sim/interop_matrix.php#interop&amp;505=&amp;2=">SRM 8.3 supports vCenter Server 6.7 as well</a>, so that should have been included here too &#8211; probably an oversight.</li><li><a href="https://kb.vmware.com/s/article/2091273">Supported host operating systems for VMware vCenter Server installation (2091273)</a></li><li><strong>Takeaway</strong>: vCenter Server 7 cannot be installed on Windows. This implies SRM 8.3 supports no version of Windows Server for installation (the implication is incorrect &#8211; SRM 8.3 ships as a Windows executable installer for vSphere 6.x environments). Not a great spot to be in, since the Photon OS-based SRM appliance employs a completely different Storage Replication Adapter (SRA) than the Windows installation, and not all storage vendors support both (yet).</li></ul>



<p>Ignoring the labyrinth of supported product and platform compatibility matrices above, one may choose to forge ahead and install SRM on Windows Server 2019 anyway. I&#8217;ve done it several times in the lab but there was a noted takeaway.</p>



<p>When I logged into the vSphere Client, the SRM plug-in was not visible. In my travels, there are a few reasons why this symptom can occur.</p>



<div class="wp-block-image"><figure class="alignright size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49833651508_be4e75928f_o.jpg" alt=""/></figure></div>



<ul><li>The SRM services are not started.</li><li>The logged on user account is not a member of the SRM Administrators group (yes even super users like administrator@vsphere.local will need to be added to this group for SRM management).</li><li>The Windows Firewall is blocking ports used to present the plug-in.</li></ul>



<p>Wait, what? The Windows Firewall wasn&#8217;t typically a problem in the past. That is correct. The SRM installation does create four inbound Windows Firewall rules (none outbound) on Windows Server up through 2016. However, for whatever reason, the SRM installation does not create these needed firewall rules on Windows Server 2019. The lack of firewall rules allowing SRM-related traffic will block the plug-in. Reference <a href="https://docs.vmware.com/en/Site-Recovery-Manager/8.3/com.vmware.srm.install_config.doc/GUID-499D3C83-B8FD-4D4C-AE3D-19F518A13C98.html">Network Ports for Site Recovery Manager</a>.</p>
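<p>A quick way to confirm it&#8217;s the firewall (rather than SRM itself) blocking things is to probe the listener ports from another machine. A hedged sketch using bash&#8217;s built-in /dev/tcp &#8211; the 127.0.0.1 address is a placeholder, substitute your SRM server:</p>

```shell
# Probe the SRM HTTPS listener port (TCP 9086, per the SRM network ports doc).
# Replace 127.0.0.1 with the address of the SRM server.
if timeout 2 bash -c 'cat < /dev/null > /dev/tcp/127.0.0.1/9086' 2>/dev/null; then
  echo "9086 reachable"
else
  echo "9086 blocked or closed"
fi
```

<p>The same probe against port 443 covers the client listener rule.</p>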



<p>One obvious workaround here would be to disable the Windows Firewall, but what fun would that be? It may also violate IT security plans, trigger an audit, or require exception filings. Been there, done that, ish. Let&#8217;s dig a little deeper.</p>



<p>The four inbound Windows Firewall rules ultimately wind up in the Windows registry. A registry export of the four rules actually created by an SRM installation is shown below. Through trial and error I&#8217;ve found that importing the rules into the Windows registry with a .reg file results in broken rules so for now I would not recommend that method.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
Windows Registry Editor Version 5.00

&#x5B;HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SharedAccess\Parameters\FirewallPolicy\FirewallRules]
&quot;{B37EDE84-6AC1-4F7D-9E42-FA44B8D928D0}&quot;=&quot;v2.26|Action=Allow|Active=TRUE|Dir=In|App=C:\\Program Files\\VMware\\VMware vCenter Site Recovery Manager\\bin\\vmware-dr.exe|Name=VMware vCenter Site Recovery Manager|Desc=This feature allows connections using VMware vCenter Site Recovery Manager.|&quot;
&quot;{F6EAE3B7-C58F-4582-816B-C3610411215B}&quot;=&quot;v2.26|Action=Allow|Active=TRUE|Dir=In|Protocol=6|LPort=9086|Name=VMware vCenter Site Recovery Manager - HTTPS Listener Port|&quot;
&quot;{F6E7BE93-C469-4CE6-80C4-7069626636B0}&quot;=&quot;v2.26|Action=Allow|Active=TRUE|Dir=In|App=C:\\Program Files\\VMware\\VMware vCenter Site Recovery Manager\\external\\commons-daemon\\prunsrv.exe|Name=VMware vCenter Site Recovery Manager Client|Desc=This feature allows connections using VMware vCenter Site Recovery Manager Client.|&quot;
&quot;{66BF278D-5EF4-4E5F-BD9E-58E88719FA8E}&quot;=&quot;v2.26|Action=Allow|Active=TRUE|Dir=In|Protocol=6|LPort=443|Name=VMware vCenter Site Recovery Manager Client - HTTPS Listener Port|&quot;

</pre></div>


<p></p>



<p>The four rules needed can be created by hand in the Windows Firewall UI, configured centrally via Group Policy Object (GPO), or scripted with <strong>netsh</strong> or <strong>PowerShell</strong>. I&#8217;ve chosen PowerShell and created the script below for the purpose of adding the rules. <strong>Pay close attention: two of these rules are application-path specific.</strong> Change the drive letter and path to the applications as necessary or the two rules won&#8217;t work properly.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: powershell; title: ; notranslate">
# Run this PowerShell script directly on a Windows based VMware Site
# Recovery Manager server to add four inbound Windows firewall rules
# needed for SRM functionality.
# Jason Boche
# http://boche.net
# 4/29/20

New-NetFirewallRule -DisplayName &quot;VMware vCenter Site Recovery Manager&quot; -Description &quot;This feature allows connections using VMware vCenter Site Recovery Manager.&quot; -Direction Inbound -Program &quot;C:\Program Files\VMware\VMware vCenter Site Recovery Manager\bin\vmware-dr.exe&quot; -Action Allow

New-NetFirewallRule -DisplayName &quot;VMware vCenter Site Recovery Manager - HTTPS Listener Port&quot; -Direction Inbound -LocalPort 9086 -Protocol TCP -Action Allow

New-NetFirewallRule -DisplayName &quot;VMware vCenter Site Recovery Manager Client&quot; -Description &quot;This feature allows connections using VMware vCenter Site Recovery Manager Client.&quot; -Direction Inbound -Program &quot;C:\Program Files\VMware\VMware vCenter Site Recovery Manager\external\commons-daemon\prunsrv.exe&quot; -Action Allow

New-NetFirewallRule -DisplayName &quot;VMware vCenter Site Recovery Manager Client - HTTPS Listener Port&quot; -Direction Inbound -LocalPort 443 -Protocol TCP -Action Allow

</pre></div>


<p></p>



<p>Test execution was a success.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
PS S:\PowerShell scripts&gt; .\srmaddwindowsfirewallrules.ps1


Name                  : {10ba5bb3-6503-44f8-aad3-2f0253c980a6}
DisplayName           : VMware vCenter Site Recovery Manager
Description           : This feature allows connections using VMware vCenter Site Recovery Manager.
DisplayGroup          :
Group                 :
Enabled               : True
Profile               : Any
Platform              : {}
Direction             : Inbound
Action                : Allow
EdgeTraversalPolicy   : Block
LooseSourceMapping    : False
LocalOnlyMapping      : False
Owner                 :
PrimaryStatus         : OK
Status                : The rule was parsed successfully from the store. (65536)
EnforcementStatus     : NotApplicable
PolicyStoreSource     : PersistentStore
PolicyStoreSourceType : Local

Name                  : {ea88dee8-8c96-4218-a23d-8523e114d2a9}
DisplayName           : VMware vCenter Site Recovery Manager - HTTPS Listener Port
Description           :
DisplayGroup          :
Group                 :
Enabled               : True
Profile               : Any
Platform              : {}
Direction             : Inbound
Action                : Allow
EdgeTraversalPolicy   : Block
LooseSourceMapping    : False
LocalOnlyMapping      : False
Owner                 :
PrimaryStatus         : OK
Status                : The rule was parsed successfully from the store. (65536)
EnforcementStatus     : NotApplicable
PolicyStoreSource     : PersistentStore
PolicyStoreSourceType : Local

Name                  : {a707a4b8-b0fd-4138-9ffa-2117c51e8ed4}
DisplayName           : VMware vCenter Site Recovery Manager Client
Description           : This feature allows connections using VMware vCenter Site Recovery Manager Client.
DisplayGroup          :
Group                 :
Enabled               : True
Profile               : Any
Platform              : {}
Direction             : Inbound
Action                : Allow
EdgeTraversalPolicy   : Block
LooseSourceMapping    : False
LocalOnlyMapping      : False
Owner                 :
PrimaryStatus         : OK
Status                : The rule was parsed successfully from the store. (65536)
EnforcementStatus     : NotApplicable
PolicyStoreSource     : PersistentStore
PolicyStoreSourceType : Local

Name                  : {346ece5b-01a9-4a82-9598-9dfab8cbfcda}
DisplayName           : VMware vCenter Site Recovery Manager Client - HTTPS Listener Port
Description           :
DisplayGroup          :
Group                 :
Enabled               : True
Profile               : Any
Platform              : {}
Direction             : Inbound
Action                : Allow
EdgeTraversalPolicy   : Block
LooseSourceMapping    : False
LocalOnlyMapping      : False
Owner                 :
PrimaryStatus         : OK
Status                : The rule was parsed successfully from the store. (65536)
EnforcementStatus     : NotApplicable
PolicyStoreSource     : PersistentStore
PolicyStoreSourceType : Local



PS S:\PowerShell scripts&gt;
</pre></div>


<div class="wp-block-image"><figure class="alignright size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49835043947_2afd89d019_o.jpg" alt=""/></figure></div>



<p>After logging out of the vSphere Client and logging back in, the Site Recovery plug-in loads and is available.</p>



<p>Feel free to use this script but be advised, as with anything from this site, it comes without warranty. Practice due diligence. Test in a lab first. Etc.</p>



<p>With virtual appliances being fairly mainstream at this point, this article probably won&#8217;t age well, but someone may end up here. Maybe me. It has happened before.</p>



<p></p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>vSphere 7 Deploy OVF Template: A fatal error has occurred. Unable to continue.</title>
		<link>http://www.boche.net/blog/index.php/2020/04/26/vsphere-7-deploy-ovf-template-a-fatal-error-has-occurred-unable-to-continue/</link>
					<comments>http://www.boche.net/blog/index.php/2020/04/26/vsphere-7-deploy-ovf-template-a-fatal-error-has-occurred-unable-to-continue/#comments</comments>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Sun, 26 Apr 2020 23:52:47 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<category><![CDATA[vSphere Client]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5293</guid>

					<description><![CDATA[vSphere 7 is one of the most anticipated releases of VMware&#8217;s flagship product ever (prior to that, it&#8217;s probably going to be vCenter 2.0 and ESX 3.0 which introduced clustering, DRS, and HA). I spent quite a bit of time with it during beta. Now that it has been GA for a few weeks, I&#8217;m [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>vSphere 7 is one of the most anticipated releases of VMware&#8217;s flagship product ever (prior to that, it was probably vCenter 2.0 and ESX 3.0, which introduced clustering, DRS, and HA). I spent quite a bit of time with it during beta. Now that it has been GA for a few weeks, I&#8217;m having quite a good time with the refresh and the cool new features and enhancements that come with it.</p>



<p>If you&#8217;re anything like me, you habitually do two things in the lab:</p>



<ul><li>Configure <strong>session.timeout = 0</strong> in <strong>/etc/vmware/vsphere-ui/webclient.properties</strong></li><li>Stay logged into the vSphere Client for weeks</li></ul>
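<p>For reference, that first habit is a one-liner on the VCSA. A hedged sketch &#8211; the properties path comes from the bullet above; restarting the <strong>vsphere-ui</strong> service afterward (commented out) is what I&#8217;d expect to be needed for the change to take effect:</p>

```shell
PROPS=/etc/vmware/vsphere-ui/webclient.properties
if [ -f "$PROPS" ]; then
  # On a VCSA: disable the vSphere Client idle session timeout
  echo "session.timeout = 0" >> "$PROPS"
  # service-control --stop vsphere-ui && service-control --start vsphere-ui
else
  echo "webclient.properties not found (not a VCSA?)"
fi
```
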



<p>I&#8217;ve found in vSphere 7 that this can lead to a problem. For instance, I was recently attempting to deploy a vRealize Operations Manager OVF template. At step three, <em>Select a compute resource</em>, I was presented with the error <strong>A fatal error has occurred. Unable to continue.</strong></p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49823528491_9cb3212c89.jpg" alt=""/></figure>



<p>Indeed I was unable to continue no matter which compute resource I selected. </p>



<p>I cancelled and tried again. Same result.</p>



<p>I cancelled and tried a different OVF template (vCloud Director for Service Providers 10.x). Same result.</p>



<p>This smelled like a vSphere Client issue and my first thought was to recycle the vSphere Client service (<strong>vsphere-ui</strong> under the hood of the VCSA) and if need be, I&#8217;d recycle the VCSA itself.</p>



<p>Before all of that though it occurred to me that this Google Chrome browser tab had been open for probably several days and likewise my vSphere Client session had also been logged in for several days.</p>



<p>Logging out of the vSphere Client session and logging back in proved to resolve the issue. I was able to proceed with deploying the OVF template with no further problems.</p>



<p>On the subject of vSphere 7 UI anomalies, I thought I&#8217;d mention another one I ran into a week or two ago and it has to do with the nesting of blue folders on the VMs and Templates view. The basic gist of it is that objects nested three or more layers deep become &#8220;hidden&#8221; and won&#8217;t be displayed in the left side pane of the UI. In the example below, an assortment of virtual machines represent the objects that are nested three levels deep. What we should see in the left pane is that we can drill down further into the <strong>V15 (vSphere 6.7u2)</strong> folder, but we don&#8217;t. This becomes even more evident when the objects that are hidden are nested blue folders themselves. They will simply disappear from view, although they do still exist.</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49823528476_daccf9610e.jpg" alt=""/></figure>



<p>No real worries here as it&#8217;s just a UI nuisance which is fairly easy to work around and VMware is already aware of it. I&#8217;d expect a patch in a coming release.</p>



<p>Last but not least&#8230;</p>



<figure class="wp-block-image size-large"><img decoding="async" src="https://live.staticflickr.com/65535/49831557983_e47703502b.jpg" alt=""/></figure>
]]></content:encoded>
					
					<wfw:commentRss>http://www.boche.net/blog/index.php/2020/04/26/vsphere-7-deploy-ovf-template-a-fatal-error-has-occurred-unable-to-continue/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>vCloud Director web page portal fails to load</title>
		<link>http://www.boche.net/blog/index.php/2019/03/25/vcloud-director-web-page-portal-fails-to-load/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Mon, 25 Mar 2019 19:22:53 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[vCloud Director]]></category>
		<category><![CDATA[VMware]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5284</guid>

					<description><![CDATA[Last week I went through the process to upgrade a vCloud Director for Service Providers environment to version 9.5.0.2. All seemed to go well with the upgrade. However, after all was said and done, the vCloud Director web page portal failed to open. It would partially load&#8230; but then failed. I seem to recall this [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Last week I went through the process to upgrade a vCloud Director for Service Providers environment to version 9.5.0.2. All seemed to go well with the upgrade. However, after all was said and done, the vCloud Director web page portal failed to open. It would partially load&#8230; but then failed.</p>



<p>I seem to recall this happening at some point in the past but couldn&#8217;t remember the root cause/fix nor could I find it documented on my blog. So&#8230; time to dig into the logs.</p>



<p>The watchdog log showed the cell services recycling over and over.</p>



<p>[root@vcdcell1 logs]# <strong>tail -f vmware-vcd-watchdog.log</strong><br> 2019-03-22 11:25:25 | WARN  | Server status returned HTTP/1.1 404<br> 2019-03-22 11:26:25 | ALERT | vmware-vcd-cell is dead but /var/run/vmware-vcd-cell.pid exists, attempting to restart it<br> 2019-03-22 11:26:33 | INFO  | Started vmware-vcd-cell (pid=10238)<br> 2019-03-22 11:26:36 | WARN  | Server status returned HTTP/1.1 404<br> 2019-03-22 11:27:36 | ALERT | vmware-vcd-cell is dead but /var/run/vmware-vcd-cell.pid exists, attempting to restart it<br> 2019-03-22 11:27:43 | INFO  | Started vmware-vcd-cell (pid=10827)<br> 2019-03-22 11:27:46 | WARN  | Server status returned HTTP/1.1 404<br> 2019-03-22 11:28:46 | ALERT | vmware-vcd-cell is dead but /var/run/vmware-vcd-cell.pid exists, attempting to restart it</p>



<p>The cell log showed a problem with Transfer Server Storage: <strong>Error starting application: Unable to create marker file in the transfer spooling area</strong>.</p>



<p>[root@vcdcell1 logs]# <strong>tail -f cell.log</strong><br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; 20% complete. Subsystem &#8216;com.vmware.vcloud.common-cell-impl&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 12% complete. Subsystem &#8216;com.vmware.vcloud.common-util&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; 42% complete. Subsystem &#8216;com.vmware.vcloud.common-util&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; 40% complete. Subsystem &#8216;com.vmware.vcloud.common-util&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; 57% complete. Subsystem &#8216;com.vmware.vcloud.cloud-proxy-services&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 16% complete. Subsystem &#8216;com.vmware.vcloud.api-framework&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; 71% complete. Subsystem &#8216;com.vmware.vcloud.hybrid-networking&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; 85% complete. Subsystem &#8216;com.vmware.vcloud.hbr-aware-plugin&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; 100% complete. Subsystem &#8216;com.vmware.vcloud.cloud-proxy-web&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.cloud-proxy-server&#8217; complete.<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 20% complete. Subsystem &#8216;com.vmware.vcloud.common-vmomi&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 25% complete. Subsystem &#8216;com.vmware.vcloud.jax-rs-activator&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 29% complete. 
Subsystem &#8216;com.vmware.pbm.placementengine&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 33% complete. Subsystem &#8216;com.vmware.vcloud.vim-proxy&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 37% complete. Subsystem &#8216;com.vmware.vcloud.fabric.foundation&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 41% complete. Subsystem &#8216;com.vmware.vcloud.imagetransfer-server&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 45% complete. Subsystem &#8216;com.vmware.vcloud.fabric.net&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; 60% complete. Subsystem &#8216;com.vmware.vcloud.fabric.net&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 50% complete. Subsystem &#8216;com.vmware.vcloud.fabric.storage&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 54% complete. Subsystem &#8216;com.vmware.vcloud.fabric.compute&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 58% complete. Subsystem &#8216;com.vmware.vcloud.service-extensibility&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 62% complete. Subsystem &#8216;com.vmware.vcloud.backend-core&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; 80% complete. Subsystem &#8216;com.vmware.vcloud.backend-core&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 66% complete. Subsystem &#8216;com.vmware.vcloud.vapp-lifecycle&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; 100% complete. 
Subsystem &#8216;com.vmware.vcloud.networking-web&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.networking-server&#8217; complete.<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 70% complete. Subsystem &#8216;com.vmware.vcloud.content-library&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 75% complete. Subsystem &#8216;com.vmware.vcloud.presentation-api-impl&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 79% complete. Subsystem &#8216;com.vmware.vcloud.metrics-core&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; 33% complete. Subsystem &#8216;com.vmware.vcloud.h5-webapp-provider&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 83% complete. Subsystem &#8216;com.vmware.vcloud.multi-site-core&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 87% complete. Subsystem &#8216;com.vmware.vcloud.multi-site-api&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; 50% complete. Subsystem &#8216;com.vmware.vcloud.h5-webapp-tenant&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; 66% complete. Subsystem &#8216;com.vmware.vcloud.h5-webapp-auth&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; 83% complete. Subsystem &#8216;com.vmware.vcloud.h5-swagger-doc&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; 100% complete. Subsystem &#8216;com.vmware.vcloud.h5-swagger-ui&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.ui.h5cellapp&#8217; complete.<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 91% complete. 
Subsystem &#8216;com.vmware.vcloud.rest-api-handlers&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 95% complete. Subsystem &#8216;com.vmware.vcloud.jax-rs-servlet&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; 100% complete. Subsystem &#8216;com.vmware.vcloud.ui-vcloud-webapp&#8217; started<br> Application Initialization: &#8216;com.vmware.vcloud.common.core&#8217; complete.<br> Successfully handled all queued events.<br> Error starting application: Unable to create marker file in the transfer spooling area: /opt/vmware/vcloud-director/data/transfer/cells/8a483603-43b8-4215-b33f-48270582f03d</p>



<p>To be honest, the NFS server which hosts Transfer Server Storage in this environment isn&#8217;t always reliable but upon checking, it was up and healthy. Furthermore, I was able to manually create a test file within this Transfer Server Storage space from the vCD cell itself.</p>
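<p>That write test can be scripted so it&#8217;s repeatable whenever the NFS server misbehaves. A minimal sketch &#8211; on a real cell the path would be the transfer area from the log above; a temp directory stands in for it here so the snippet is safe to run anywhere:</p>

```shell
# Check that the transfer spooling area accepts writes, the way vCD's
# marker file does at cell startup. On a real cell:
#   SPOOL=/opt/vmware/vcloud-director/data/transfer/cells
SPOOL=$(mktemp -d)               # stand-in for the NFS-backed transfer area
if touch "$SPOOL/write-test" 2>/dev/null; then
  echo "transfer area writable"
else
  echo "transfer area NOT writable"
fi
rm -rf "$SPOOL"
```
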



<p>Walking the directory structure and looking at permissions, a few things didn&#8217;t look right.</p>



<pre class="wp-block-code"><code>[root@vcdcell1 data]# ls -l -h
total 4.0K
drwx------. 3 vcloud vcloud   27 Mar 22 11:39 activemq
drwxr-x---. 2 vcloud vcloud    6 Mar 15 04:58 generated-bundles
drwxr-x---. 2 vcloud vcloud 4.0K Mar 15 04:58 transfer
[root@vcdcell1 data]# pwd
/opt/vmware/vcloud-director/data
[root@vcdcell1 data]#
[root@vcdcell1 data]# cd transfer/
[root@vcdcell1 transfer]# ls -l -h
total 1.0K
drwx------. 2 <strong>1002 1002</strong>  64 Mar 22 11:38 cells
-rw-------. 1 root root 386 Mar 21 11:51 responses.properties
[root@vcdcell1 transfer]# cd cells/
[root@vcdcell1 cells]# ls -l -h
total 512
-rw-------. 1 <strong>1002 1002</strong> 0 May 27  2018 8a483603-43b8-4215-b33f-48270582f03d.old
-rw-r--r--. 1 root root 6 Mar 22 11:38 jgbtest.txt</code></pre>



<p>Looking at some of the pieces above, I seem to recall <strong>vcloud</strong> is supposed to be the <strong>owner</strong> and <strong>group</strong> for the vCD file and directory structure. I further verified this by restoring my old vCD cell from a previous snapshot and spot checking. So let&#8217;s fix it using the chown example on page 53 of the  vCloud Director Installation and Upgrade Guide.</p>



<pre class="wp-block-code"><code>[root@vcdcell1 cells]# <strong>chown -R vcloud:vcloud /opt/vmware/vcloud-director</strong>
[root@vcdcell1 cells]#
[root@vcdcell1 cells]# ls -l -h
total 512
-rw-------. 1 <strong>vcloud vcloud</strong> 0 May 27  2018 8a483603-43b8-4215-b33f-48270582f03d.old
-rw-r--r--. 1 vcloud vcloud 6 Mar 22 11:38 jgbtest.txt</code></pre>



<p>The watchdog daemon followed up by restarting the vCD cell. With the correct permissions in place, a new cell file was created and the vCD cell started successfully. I deleted the .old cell file and of course my jgbtest.txt file.</p>



<pre class="wp-block-code"><code>[root@vcdcell1 cells]# ls -l -h
total 512
-rw-------. 1 vcloud vcloud 0 Mar 22 12:23 8a483603-43b8-4215-b33f-48270582f03d
-rw-------. 1 vcloud vcloud 0 May 27  2018 8a483603-43b8-4215-b33f-48270582f03d.old
-rw-r--r--. 1 vcloud vcloud 6 Mar 22 11:38 jgbtest.txt</code></pre>



<p>How did this happen? I&#8217;m pretty sure it was my own fault. Last week I was also doing some deployment testing with the vCD appliance. At the time I felt it was safe for this test cell to use the same Transfer Server Storage NFS mount (so that I wouldn&#8217;t have to go through the steps to create another one). Upon further investigation, the vCD appliance cell tattooed the folders and files with the <strong>1002</strong> owner and group seen above.</p>
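<p>A quick way to catch this class of problem is to scan the transfer area for anything not owned by the expected service account. The sketch below is illustrative only &#8211; it runs against a scratch directory so it is safe to try anywhere; on a real cell you would point it at /opt/vmware/vcloud-director/data/transfer and compare against the <strong>vcloud</strong> user.</p>

```shell
# Illustrative ownership scan. A scratch directory stands in for the real
# transfer area; the vcloud account and path are assumptions from this post.
TRANSFER=$(mktemp -d)
mkdir -p "$TRANSFER/cells"
touch "$TRANSFER/cells/marker"

# Anything not owned by the expected account points at a UID mismatch,
# e.g. the appliance cell's uid 1002 vs. this cell's vcloud user.
mismatched=$(find "$TRANSFER" ! -user "$(id -un)" | wc -l)
echo "files with unexpected owner: $mismatched"

rm -rf "$TRANSFER"
```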



<p>All is well with the vCD world now and I&#8217;ve got it documented so the next time my vCD web portal doesn&#8217;t load, I&#8217;ll know just where to look.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>vSphere 6.7 Storage and action_OnRetryErrors=on</title>
		<link>http://www.boche.net/blog/index.php/2019/02/08/vsphere-6-7-storage-and-action_onretryerrorson/</link>
					<comments>http://www.boche.net/blog/index.php/2019/02/08/vsphere-6-7-storage-and-action_onretryerrorson/#comments</comments>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Fri, 08 Feb 2019 19:30:40 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[Cluster]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<category><![CDATA[Windows]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5261</guid>

					<description><![CDATA[VMware introduced a new storage feature in vSphere 6.0 which was designed as a flexible option to better handle certain storage problems. Cormac Hogan did a fine job introducing the feature here. Starting with vSphere 6.0 and continuing on in vSphere 6.5, each block storage device (VMFS or RDM) is configured with an option called [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>VMware introduced a new storage feature in vSphere 6.0 which was designed as a flexible option to better handle certain storage problems. Cormac Hogan did a fine job introducing the feature <a href="https://cormachogan.com/2015/02/19/vsphere-6-0-storage-features-part-6-action_onretryerrors/">here</a>. Starting with vSphere 6.0 and continuing on in vSphere 6.5, each block storage device (VMFS or RDM) is configured with an option called <strong>action_OnRetryErrors</strong>. Note that in vSphere 6.0 and 6.5, the default value is <strong>off</strong>, meaning the new feature is effectively <strong>disabled</strong> and there is no new storage error handling behavior observed.</p>



<p>This value can be seen with the <strong>esxcli storage nmp device list</strong> command.</p>



<p><strong>vSphere 6.0/6.5:</strong><br>esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1<br> naa.6000d3100002b90000000000000ec1e1<br>    Device Display Name: sqldemo1vmfs<br>    Storage Array Type: VMW_SATP_ALUA<br>    Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; <strong>action_OnRetryErrors=off</strong>; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}<br>    Path Selection Policy: VMW_PSP_RR<br>    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}<br>    Path Selection Policy Device Custom Config:<br>    Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141<br>    Is USB: false</p>



<p>If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down the given path to check path state. When <strong>action_OnRetryErrors=off</strong>, vSphere will continue to retry for an amount of time because it expects the path to recover. It is important to note here that a path is not immediately marked <strong>dead</strong> when the first Test Unit Ready command is unsuccessful and results in a <strong>retry</strong>. There are many retries in fact, and you can see them in <strong>/var/log/vmkernel.log</strong>. Also note that a device typically has multiple paths and the process will be repeated for each additional path tried, assuming the first path is eventually marked as <strong>dead</strong>.</p>



<p>Starting with vSphere 6.7,  <strong>action_OnRetryErrors</strong> is enabled by default.</p>



<p><strong>vSphere 6.7:</strong><br>esxcli storage nmp device list | grep -A9 naa.6000d3100002b90000000000000ec1e1 <br> naa.6000d3100002b90000000000000ec1e1<br>    Device Display Name: sqldemo1vmfs<br>    Storage Array Type: VMW_SATP_ALUA<br>    Storage Array Type Device Config: {implicit_support=on; explicit_support=off; explicit_allow=on; alua_followover=on; <strong>action_OnRetryErrors=on</strong>; {TPG_id=61459,TPG_state=AO}{TPG_id=61460,TPG_state=AO}{TPG_id=61462,TPG_state=AO}{TPG_id=61461,TPG_state=AO}}<br>    Path Selection Policy: VMW_PSP_RR<br>    Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=2: NumIOsPending=0,numBytesPending=0}<br>    Path Selection Policy Device Custom Config:<br>    Working Paths: vmhba1:C0:T2:L141, vmhba1:C0:T3:L141, vmhba2:C0:T3:L141, vmhba2:C0:T2:L141<br>    Is USB: false</p>



<p>If vSphere loses access to a device on a given path, the host will send a Test Unit Ready (TUR) command down the given path to check path state. When <strong>action_OnRetryErrors=on</strong>, vSphere will immediately mark the path <strong>dead</strong> when the first <strong>retry</strong> is returned. vSphere will not continue retrying TUR commands on a dead path.</p>



<p>This is the part where VMware thinks it&#8217;s doing the right thing by immediately fast failing a misbehaving/dodgy/flaky path. The assumption here is that other good paths to the device are available and instead of delaying an application while waiting for path failover during the intensive TUR retry process, let&#8217;s fail this one bad path right away so that the application doesn&#8217;t have to spin its wheels.</p>



<p>However, if all other paths to the device are impacted by the same underlying (and let&#8217;s call it transient) condition, what happens is that each additional path iteratively goes through the process of TUR, no retry, immediately mark path as dead, move on to the next path. When all available paths have been exhausted, All Paths Down (APD) for the device kicks in. If and when paths to an APD device become available again, they will be picked back up upon the next storage fabric rescan, whether that&#8217;s done manually by an administrator, or automatically by default every 300 seconds for each vSphere host (<strong>Disk.PathEvalTime</strong>). From an application/end user standpoint, I/O delay for up to 5 minutes can be a <strong>painfully</strong> long time to wait. The irony here is that VMware can potentially turn a transient condition lasting only a few seconds into more of a Permanent Device Loss-like condition.</p>
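<p>For what it&#8217;s worth, the 300 second interval is visible (and tunable) as an advanced host setting. A hedged sketch, since <strong>esxcli</strong> only exists on an ESXi host:</p>

```shell
# Inspect Disk.PathEvalTime, the path re-evaluation interval discussed above.
# Guarded so this is a harmless no-op on a machine without esxcli.
if command -v esxcli >/dev/null 2>&1; then
  msg=$(esxcli system settings advanced list -o /Disk/PathEvalTime)
else
  msg="esxcli not available; run this on an ESXi host"
fi
echo "$msg"
```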



<p>All of the above leads me to a support escalation I got involved in with a customer having an Active/Passive block storage array. Active/Passive is a type of array which has multiple storage processors/controllers (usually two) and LUNs are distributed across the controllers in an ownership model whereby each controller owns both the LUNs and the available paths to those LUNs. When an active controller fails or is taken offline proactively (think storage processor reboot due to a firmware upgrade), the paths to the active controller go dark, the passive controller takes ownership of the LUNs and lights up the paths to them &#8211; a process which can be measured in seconds, typically more than 2 or 3, often much more than that (this dovetails into the discussion of virtual machine disk timeout best practices). With <strong>action_OnRetryErrors=off</strong>, vSphere tolerates the transient path outage during the controller failover. With <strong>action_OnRetryErrors=on</strong>, it doesn&#8217;t &#8211; each path that goes dark is immediately failed and we have APD for all the volumes on that controller in a fraction of a second.</p>



<p>The problem which was occurring in this customer escalation was a convergence of circumstances:</p>



<ul><li>The customer was using vSphere 6.7 and its default of <strong>action_OnRetryErrors=on</strong></li><li>The customer was using an Active/Passive storage array</li><li>The customer virtualized Microsoft Windows SQL cluster servers (cluster disk resources are extremely sensitive to APDs in the hypervisor and immediately fail when the cluster detects a dependent cluster disk has been removed &#8211; a symptom introduced by APD)</li><li>The customer was testing controller failovers</li></ul>



<figure class="wp-block-image"><img decoding="async" src="https://c2.staticflickr.com/8/7860/40043955743_be059ae366.jpg" alt=""/><figcaption>Windows failover clusters have zero tolerance for APD disk</figcaption></figure>



<p>To resolve the problem in vSphere 6.7,  <strong>action_OnRetryErrors</strong> needs to be disabled for each device backed by the Active/Passive storage array. This must be performed on every host in the cluster having access to the given devices (again, these can be VMFS volumes and/or RDMs). There are a few ways to go about this. </p>



<p>To modify the configuration without a host reboot, take a look at the following example. A command such as this would need to be run on every host in the cluster, and for each device (i.e. in an 8-host cluster with 8 VMFS/RDMs, we need to identify the applicable naa.xxx IDs and run 64 commands. Yes, this could be scripted. Be my guest.):</p>



<p><strong>esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d naa.6000d3100002b90000000000000ec1e1</strong></p>



<p>I don&#8217;t prefer that method a whole lot. It&#8217;s tedious and error prone. It could result in cluster inconsistencies. But on the plus side, a host reboot isn&#8217;t required, and this setting will persist across reboots. That also means a configuration set at this device level will override any claim rules that could also apply to this device. Keep this in mind if a claim rule is configured but you&#8217;re not seeing the desired configuration on any specific device.</p>



<p>The above could also be scripted for a number of devices on a host. Here&#8217;s one example. Be very careful that the base <strong>naa.xxx</strong> string matches all of the devices from one array that should be configured, and does not modify devices from other array types that should not be configured. Also note that this script is a one-liner, but for blog formatting purposes I manually added a line break before the line starting with <strong>esxcli</strong>:</p>



<pre class="wp-block-code"><code>for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp satp generic deviceconfig set -c disable_action_OnRetryErrors -d $i; done</code></pre>



<p></p>



<p>Now to verify:</p>



<pre class="wp-block-code"><code>for i in `ls /vmfs/devices/disks | grep -v ":" | grep -i naa.6000D31`; do echo $i; 
esxcli storage nmp device list | grep -A2 $i | egrep -io action_OnRetryErrors=\\w+; done</code></pre>



<p></p>



<p>I like adding a SATP claim rule using a vendor device string a lot better, although changes to claim rules for existing devices generally requires a reboot of the host to reclaim existing devices with the new configuration. Here&#8217;s an example:</p>



<p><strong>esxcli storage nmp satp rule add -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors</strong></p>



<p>Here&#8217;s another example using quotes which is also acceptable and necessary when setting multiple option string parameters (<a href="https://pubs.vmware.com/vsphere-6-5/topic/com.vmware.vcli.examples.doc/GUID-0DE75E3B-29B7-41C1-93A0-D5B1373470E3.html">refer to this</a>):</p>



<p><strong>esxcli storage nmp satp rule add -s &#8220;VMW_SATP_ALUA&#8221; -V &#8220;COMPELNT&#8221; -P &#8220;VMW_PSP_RR&#8221; -o &#8220;disable_action_OnRetryErrors&#8221;</strong></p>



<p>When a new claim rule is added, claim rules can be reloaded with the following command.</p>



<p><strong>esxcli storage core claimrule load</strong></p>



<p>Keep in mind the new claim rule will only apply to unclaimed devices. Newly presented devices will inherit the new claim rule. Existing devices which are already claimed will not until the next vSphere host reboot. Devices <strong>can</strong> be unclaimed without a host reboot but all I/O to the device must be halted &#8211; somewhat of a conundrum if we&#8217;re dealing with production volumes, datastores being used for heartbeating, etc. Assuming we&#8217;re dealing with multiple devices, a reboot is just going to be easier and cleaner.</p>
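<p>For completeness, a per-device reclaim without a reboot would look something like the sketch below. This is a hedged example, not a recommendation &#8211; all I/O to the device must already be stopped, and the device ID shown is just the example used throughout this post.</p>

```shell
# Hypothetical reboot-free reclaim of a single quiesced device
# (device ID is the example from this post). Guarded so the snippet
# does nothing on a machine without esxcli.
DEVICE=naa.6000d3100002b90000000000000ec1e1
if command -v esxcli >/dev/null 2>&1; then
  esxcli storage core claiming unclaim -t device -d "$DEVICE"
  esxcli storage core claimrule run
else
  echo "esxcli not available; skipping unclaim of $DEVICE"
fi
```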



<p>I like claim rules here better because of the global nature. It&#8217;s one command line per host in the cluster and it&#8217;ll take care of all devices from the Active/Passive storage array vendor. No need to worry about coming up with and testing a script. No need to worry about spending hours identifying the naa.xxx IDs and making all of the changes across hosts. No need to worry about tagging other storage vendor devices with an improper configuration. Lastly, the claim rule in effect is visible in a SATP claim rule list:</p>



<p><strong>esxcli storage nmp satp rule list</strong></p>



<pre class="wp-block-code"><code>Name           Device  Vendor    Model  Driver  Transport  Options                       Rule Group  Claim Options  Default PSP  PSP Options  Description
-------------  ------  --------  -----  ------  ---------  ----------------------------  ----------  -------------  -----------  -----------  -----------
VMW_SATP_ALUA          COMPELNT                            disable_action_OnRetryErrors  user                       VMW_PSP_RR</code></pre>



<p>By the way&#8230; to remove the SATP claim rules above respectively:</p>



<p><strong>esxcli storage nmp satp rule remove -s VMW_SATP_ALUA -V COMPELNT -P VMW_PSP_RR -o disable_action_OnRetryErrors</strong></p>



<p><strong>esxcli storage nmp satp rule remove -s &#8220;VMW_SATP_ALUA&#8221; -V &#8220;COMPELNT&#8221; -P &#8220;VMW_PSP_RR&#8221; -o &#8220;disable_action_OnRetryErrors&#8221;</strong></p>



<p>The bottom line here is there may be a number of VMware customers with Active/Passive storage arrays, running vSphere 6.7. If and when planned or unplanned controller/storage processor failover occurs, APDs may unexpectedly occur, impacting virtual machines and their applications, whereas this was not the case with previous versions of vSphere.</p>



<p>In closing, I want to thank VMware Staff Technical Support Engineering for their work on this case and ultimately exposing &#8220;what changed in vSphere 6.7&#8221;. We had spent some time trying to reproduce this problem on vSphere 6.5 in an environment similar to the customer&#8217;s and we just weren&#8217;t seeing any problems.</p>



<p>References: </p>



<p><a href="https://pubs.vmware.com/vsphere-6-5/index.jsp?topic=%2Fcom.vmware.vcli.examples.doc%2FGUID-0DE75E3B-29B7-41C1-93A0-D5B1373470E3.html">Managing SATPs</a></p>



<p><a href="https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.troubleshooting.doc/GUID-3BE0A341-F439-48B3-A2A2-74BF6BD60FBE.html">No Failover for Storage Path When TUR Command Is Unsuccessful</a></p>



<p><a href="https://kb.vmware.com/s/article/2106770">Storage path does not fail over when TUR command repeatedly returns retry requests (2106770)</a></p>



<p><a href="https://docs.vmware.com/en/VMware-vSphere/6.5/com.vmware.vsphere.storage.doc/GUID-75214ECC-DC3B-4C06-95D4-221A8C03B94A.html">Handling Transient APD Conditions</a></p>



<p><a href="https://cormachogan.com/2015/02/19/vsphere-6-0-storage-features-part-6-action_onretryerrors/">VSPHERE 6.0 STORAGE FEATURES PART 6: ACTION_ONRETRYERRORS</a></p>



<p><strong>Updated 2-20-19:</strong> VMware published a KB article on this issue today:  <br><a href="https://kb.vmware.com/s/article/67006">ESXi 6.7 hosts with active/passive or ALUA based storage devices may see premature APD events during storage controller fail-over scenarios (67006)</a> </p>
]]></content:encoded>
					
					<wfw:commentRss>http://www.boche.net/blog/index.php/2019/02/08/vsphere-6-7-storage-and-action_onretryerrorson/feed/</wfw:commentRss>
			<slash:comments>4</slash:comments>
		
		
			</item>
		<item>
		<title>MegaSceneryEarth automated HTTP download PowerShell script</title>
		<link>http://www.boche.net/blog/index.php/2018/07/30/megasceneryearth-automated-http-download-powershell-script/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Mon, 30 Jul 2018 20:36:20 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Scripting]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5233</guid>

					<description><![CDATA[Regulars: Thank you for stopping by but just as a heads up, this post is not VMware virtualization related. After a bit of hardware trouble from the local store, my replacement flight sim rig is up and running Lockheed Martin Prepar3D 4.3. I&#8217;m trying to shake the rust off after not having flown a leg [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Regulars: Thank you for stopping by but just as a heads up, this post is not VMware virtualization related.</p>
<p>After a bit of hardware trouble from the local store, my replacement flight sim rig is up and running <a href="https://www.prepar3d.com/">Lockheed Martin Prepar3D 4.3</a>. I&#8217;m trying to shake the rust off after not having flown a leg for quite some time, but everything sim and add-on related is looking good and I was able to fly KMSP-KLAX with many more FPS than I&#8217;ve been used to the past many years on FSX.</p>
<p>Now, out of sheer coincidence late last week I received a promotion from MegaSceneryEarth. Normally I ignore these but since this was a 50% off deal and I&#8217;m essentially starting from ground zero with no scenery for Prepar3D, I took a look. After some chin scratching, I checked out with Minnesota, Arizona, Northern California, and Southern California. I felt like quite the opportunist until the email arrived with file download instructions. Each scenery is broken up into 2GB download chunks. Between the four sceneries, I was presented with 123 download links. With no FTP option that I&#8217;m aware of, I&#8217;m going to have to babysit this process. With each 2GB download taking anywhere from 5-10 minutes, this is going to feel a lot like installing Microsoft Office 2016 via floppy disk. Except longer. MegaSceneryEarth is available on shipped DVDs or USB sticks, but those cost a small fortune, and of course there&#8217;s the shipping and handling time. This is going to be painful.</p>
<p>But&#8230; there&#8217;s a cheaper, faster, and fully automated way. PowerShell scripting.</p>
<p>Now admittedly for my first scenery, I threw the v1.0 PowerShell script together quickly using the <strong>Invoke-WebRequest</strong> cmdlet with static, scenery-specific lines of code to serve my own needs. And it worked. I launched the script, walked away for a few hours, and when I came back I had a directory full of MegaSceneryEarth .zip files ready for installation. It worked beautifully and it didn&#8217;t consume bunches of my time.</p>
<p>I still had three sceneries left to download. While I could have made new static scripts for the remaining three sceneries, I was already thinking that the script would be more powerful if a single script worked on a variety of MegaSceneryEarth downloads. That would make it flexible for my future MegaSceneryEarth sceneries as well as extensible if shared with the community.</p>
<p>So on Sunday, version 2.0 is what I came up with. It takes three inputs in the form of modifying variables at the beginning of the script before running it:</p>
<ol>
<li>The download URL for the first file of the scenery</li>
<li>The path on the local hard drive where all of the .zip files should be downloaded to</li>
<li>The starting point or first file to download. Usually this will be 1 but if for some reason a previous download was stopped or interrupted at say 6, we&#8217;d want to continue from there assuming we don&#8217;t want to re-download the first 5 files all over again.</li>
</ol>
<p>The script does the rest of the heavy lifting. All three of the above are required for functionality but by far the first variable is the most important and where I spent almost all of my time making the script flexible so that it works across MegaSceneryEarth sceneries. When provided with the download URL of the first file of the scenery, the script does a bunch of parsing, splitting, and concatenation of stuff to automatically determine some things:</p>
<ol>
<li>It figures out what the base path and file name URL is for the scenery. For any given scenery, these parts never change across all of the download links within a scenery but they will change across sceneries.</li>
<li>It figures out how many .zip files there are to download in total for the entire scenery.</li>
<li>Because MegaSceneryEarth likes to use leading zeroes as part of their naming convention, and leading zero strings aren&#8217;t the same thing as leading zero integers in terms of PowerShell constructs, it figures out how to deal with those for file names and looping iterations. Basically when you go from &#8220;009&#8221; files in a scenery download to &#8220;010&#8221;, without dealing with that, stuff breaks. I think most MegaSceneryEarth sceneries have more than 9 files, but some have fewer (Delaware has 1). The script deals with both kinds. That I know of, MegaSceneryEarth doesn&#8217;t have a scenery with more than 99 .zip files but version 2.1 of the script will accommodate that.</li>
<li>It figures out the correct file name to save on the local hard drive for each scenery file downloaded. This may sound trivial, but as far as I know the <strong>Invoke-WebRequest</strong> cmdlet requires the output file name and doesn&#8217;t automatically assume the file name should be the same as the file name being downloaded via HTTP the way a copy command to a folder would.</li>
</ol>
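<p>The parsing steps above can be illustrated outside of PowerShell too. Here&#8217;s a small shell sketch of the same idea; the URL and file names are made up for illustration only:</p>

```shell
# Derive the base path, name prefix, and zero-padded total from the
# first download link (URL below is fabricated for the example).
first="http://example.com/dl/MSE_MN_001_of_025.zip"
base="${first%/*}/"                               # static HTTP path
file="${first##*/}"                               # first .zip file name
prefix="${file%%001_of_*}"                        # static part of the name
total="${file#*001_of_}"; total="${total%.zip}"   # zero-padded count, "025"
count=$(expr "$total" + 0)                        # numeric count, 25
echo "base=$base prefix=$prefix count=$count"

# %03d restores the leading zeroes when rebuilding each file name.
for i in 1 9 10; do
  printf '%s%03d_of_%s.zip\n' "$prefix" "$i" "$total"
done
```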
<p>Generally speaking, for automation to work properly, it relies on input consistency so that all assumptions it makes are 100% correct. Translated, this script relies on consistent file/path naming conventions used by MegaSceneryEarth. Not only within sceneries, but across sceneries as well.</p>
<p>And so here it is. Copy the PowerShell script below and execute in your favorite PowerShell environment. Most of the time I simply use Windows PowerShell ISE. The code isn&#8217;t signed so you may need to temporarily relax your <a href="https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy?view=powershell-6">PowerShell ExecutionPolicy</a> to <strong>Bypass</strong> . It&#8217;s there to protect you by default.</p>
<pre class="brush: powershell; title: ; notranslate">
######################################################################################################################
# MegaSceneryEarth.ps1 version 2.1 7/31/18
#
# This PowerShell script automates unattended HTTP download of up to 999 MegaSceneryEarth scenery files
# Relies on MegaSceneryEarth naming conventions current as of version date above
# Tested successfully with four MegaSceneryEarth downloads
# As with anything on the internet, use at your own risk, no warranty provided by the author
#
# Update the three variables below as necessary for your specific scenery and download parameters
#
# Jason Boche
# http://boche.net
# jason@boche.net
######################################################################################################################

# Provide the first file download link copied from the MegaSceneryEarth .htm link provided with license key email
# Internet Explorer: Right click the download link and choose &quot;Copy shortcut&quot;
# Google Chrome: Right click the download link and choose &quot;Copy link address&quot;
# Then paste the link between the quotation marks

$firstfile1 = &quot;paste first file download link here.zip&quot;

# Provide the path on your hard drive where the files will temporarily be saved.
# This will usually be where you have the MegaSceneryEarthInstallManager.exe file for subsequent installation
# Syntax requires a trailing backslash (\)

$dlpath = &quot;D:\junk\&quot;

# Provide the starting file count. Usually we start with 1 unless you're starting the download somewhere after 1

$startcount = 1

# Do not edit anything below this line

######################################################################################################################

# Get the HTTP path
$firstfile2 = split-path $firstfile1
$firstfile3 = $firstfile2.Replace(&quot;\&quot;,&quot;/&quot;)
$httppath = $firstfile3 + &quot;/&quot;

# Get the .zip file name
$firstfile4 = split-path $firstfile1 -leaf

# Get the static part of the .zip file name
$firstfile5 = $firstfile4 -split &quot;001_of_&quot;
$firstfile6 = $firstfile5&#x5B;0]
$fileprefix = $firstfile6

# Get the total number of scenery files in string format with and without .zip extension
$firstfile7 = $firstfile5&#x5B;1] -split &quot;.zip&quot;
$firstfile8 = $firstfile5&#x5B;1]
# Convert total number of scenery files in string format to number format
$total = &#x5B;decimal]::Parse($firstfile7)

if ($startcount -gt $total)
{
Write-Host Error: Starting file specified is greater than total number of files. Please check the `$startcount variable
pause
exit
}

# If there are 1-9 scenery files, download those
if ($total -ge 1)
{
$leadingzero = &quot;00&quot;
while ($startcount -le $total)
{
if ($startcount -gt 9){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 to $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

$startcount ++

}
}

# If there are 10-99 scenery files, download those also
if ($total -ge 10)
{
$leadingzero = &quot;0&quot;
while ($startcount -le $total)
{
if ($startcount -gt 99){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 to $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

$startcount ++

}
}

# If there are 100-999 scenery files, download those also
if ($total -ge 100)
{
$leadingzero = &quot;&quot;
while ($startcount -le $total)
{
if ($startcount -gt 999){break}
Write-Host Downloading $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 to $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

Invoke-WebRequest -Uri $httppath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8 -OutFile $dlpath$fileprefix$leadingzero$startcount&quot;_of_&quot;$firstfile8

$startcount ++

}
}
</pre>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>VMware Horizon Share Folders Issue with Windows 10</title>
		<link>http://www.boche.net/blog/index.php/2017/06/12/vmware-horizon-share-folders-issue-with-windows-10/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Tue, 13 Jun 2017 00:13:19 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[View]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<category><![CDATA[Windows]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5212</guid>

					<description><![CDATA[I spent some time the last few weekends making various updates and changes to the lab. Too numerous and not all that paramount to go into detail here, with the exception of one issue I did run into. I created a new VMware Horizon pool consisting of Windows 10 Enterprise, Version 1703 (Creators Update). The VM has 4GB [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>I spent some time the last few weekends making various updates and changes to the lab. Too numerous and not all that paramount to go into detail here, with the exception of one issue I did run into. I created a new VMware Horizon pool consisting of <strong>Windows 10 Enterprise, Version 1703 (Creators Update)</strong>. The VM has <strong>4GB RAM</strong> and <strong>VMware Horizon Agent 7.1.0.5170901</strong> is installed. This is all key information contributing to my new problem which is the Shared Folders feature seems to have stopped functioning.</p>
<p>That is to say, when launching my virtual desktop from the Horizon Client, there are no shared folders or drives being passed through from where I launched the Horizon Client. Furthermore, the <strong>Share Folders</strong> menu item is completely missing from the blue Horizon Client pulldown menu.</p>
<p>I threw something out on Twitter and received a quick response from a very helpful VMware Developer by the name of Adam Gross (<a href="https://twitter.com/grossag">@grossag</a>).</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c1.staticflickr.com/5/4264/35232549916_df51bedd0f_o.jpg" alt="" width="246" height="328" /></p>
<p>Adam went on to explain that the issue stems from a registry value defining an amount of memory which is less than the amount of RAM configured in the VM.</p>
<p>The registry key is <strong>HKLM\SYSTEM\CurrentControlSet\Control\</strong> and the value configured for <strong>SvcHostSplitThresholdInKB</strong> is <strong>3670016</strong> (380000 in Hex). The value is expressed in KB, which comes out to 3.5GB. The default Windows 10 VM configuration is deployed with 4GB of RAM, which is what I did this past weekend. Since 3.5GB is less than 4GB, the bug rears its head.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c1.staticflickr.com/5/4259/35141952341_0b5e89e316.jpg" alt="" width="500" height="222" /></p>
<p>Adam mentioned the upcoming 7.2 agent will configure this value at 32GB on Windows 10 virtual machines (that&#8217;s 33554432, or 2000000 in Hex), and perhaps even larger in some future release of the agent, because the reality is that some day 32GB won&#8217;t be large enough. Adam went on to explain that the maximum amount of RAM supported by Windows 10 x64 is 2TB, which comes out to 2147483648 expressed in KB, or 80000000 in Hex. Therefore, it is guaranteed safe (at least to avoid this issue) to set the registry value to 80000001 (in Hex) or higher for any vRAM configuration.</p>
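<p>The unit conversions above are easy to sanity-check with a bit of shell arithmetic. A quick sketch, using the KB values quoted above:</p>

```shell
# SvcHostSplitThresholdInKB values quoted above, all expressed in KB
[ $((0x380000)) -eq 3670016 ] && echo "3.5GB threshold = 3670016 KB (0x380000)"
[ $((32 * 1024 * 1024)) -eq $((0x2000000)) ] && echo "32GB = 33554432 KB (0x2000000)"
[ $((2 * 1024 * 1024 * 1024)) -eq $((0x80000000)) ] && echo "2TB = 2147483648 KB (0x80000000)"
```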
<p>To move on, the value needs to be tweaked manually in the registry. I&#8217;ll set mine to 32GB as I&#8217;ll likely never have a VDI desktop deployed between now and when the 7.2 agent ships and is installed in my lab.</p>
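<p>For what it&#8217;s worth, the manual tweak can also be scripted. A minimal sketch from an elevated Windows command prompt, assuming the 32GB value discussed above (this is plain <strong>reg.exe</strong> syntax, nothing Horizon-specific):</p>

```shell
# Sketch: raise SvcHostSplitThresholdInKB to 32GB (0x2000000 KB); run elevated
reg add "HKLM\SYSTEM\CurrentControlSet\Control" /v SvcHostSplitThresholdInKB /t REG_DWORD /d 0x2000000 /f
# A reboot is required before the change takes effect
```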
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c1.staticflickr.com/5/4232/35141952191_51fa6c9c1e_o.jpg" alt="" width="331" height="199" /></p>
<p>And the result for posterity.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c1.staticflickr.com/5/4204/34884770570_cc787641f4.jpg" alt="" width="500" height="222" /></p>
<p>I found a reboot of the Windows 10 VM was required before the registry change made the positive impact I was looking for. After all was said and done, my shared folders came back, as did the <strong>Share Folders</strong> item on the blue Horizon Client pulldown menu. An easy fix for a rather obscure issue. Once again, my thanks to Adam Gross for providing the solution.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>VMware Tools causes virtual machine snapshot with quiesce error</title>
		<link>http://www.boche.net/blog/index.php/2016/07/30/vmware-tools-causes-virtual-machine-snapshot-with-quiesce-error/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Sat, 30 Jul 2016 20:07:26 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[ESXi]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[Storage]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[VMware Tools]]></category>
		<category><![CDATA[vSphere]]></category>
		<category><![CDATA[Windows]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5209</guid>

					<description><![CDATA[Last week I was made aware of an issue a customer in the field was having with a data protection strategy using array-based snapshots which were in turn leveraging VMware vSphere snapshots with VSS quiesce of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>Last week I was made aware of an issue a customer in the field was having with a data protection strategy using array-based snapshots which were in turn leveraging VMware vSphere snapshots with VSS quiesce of Windows VMs. The problem began after installing VMware Tools version 10.0.0 build-3000743 (reported as version 10240 in the vSphere Web Client) which I believe is the version shipped in <a href="https://kb.vmware.com/selfservice/search.do?cmd=displayKC&amp;docType=kc&amp;docTypeID=DT_KB_1_1&amp;externalId=2143832">ESXi 6.0 Update 1b (reported as version 6.0.0, build 3380124 in the vSphere Web Client)</a>.</p>
<p>The issue is that creating a VMware virtual machine snapshot with VSS integration fails. The virtual machine disk configuration is simply two .vmdks on a VMFS-5 datastore but I doubt the symptoms are limited only to that configuration.</p>
<p>The failure message shown in the vSphere Web Client is &#8220;<strong>Cannot quiesce this virtual machine because VMware Tools is not currently available.</strong>&#8221;  The vmware.log file for the virtual machine also shows the following:</p>
<blockquote><p>2016-07-29T19:26:47.378Z| vmx| I120: SnapshotVMX_TakeSnapshot start: &#8216;jgb&#8217;, deviceState=0, lazy=0, logging=0, quiesced=1, forceNative=0, tryNative=1, saveAllocMaps=0 cb=1DE2F730, cbData=32603710<br />
2016-07-29T19:26:47.407Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for &#8216;/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02-000001.<br />
2016-07-29T19:26:47.408Z| vmx| I120: DISKLIB-LIB_CREATE : DiskLibCreateCreateParam: vmfsSparse grain size is set to 1 for &#8216;/vmfs/volumes/51af837d-784bc8bc-0f43-e0db550a0c26/rmvm02/rmvm02_1-00000<br />
2016-07-29T19:26:47.408Z| vmx| I120: SNAPSHOT: SnapshotPrepareTakeDoneCB: Prepare phase complete (The operation completed successfully).<br />
2016-07-29T19:26:56.292Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.<br />
2016-07-29T19:27:07.790Z| vcpu-0| I120: Tools: Tools heartbeat timeout.<br />
2016-07-29T19:27:11.294Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.<br />
2016-07-29T19:27:17.417Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.<br />
2016-07-29T19:27:17.417Z| vmx| I120: Msg_Post: Warning<br />
2016-07-29T19:27:17.417Z| vmx| I120: [msg.snapshot.quiesce.rpc_timeout] A timeout occurred while communicating with VMware Tools in the virtual machine.<br />
2016-07-29T19:27:17.417Z| vmx| I120: &#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;-<br />
2016-07-29T19:27:17.420Z| vmx| I120: Vigor_MessageRevoke: message &#8216;msg.snapshot.quiesce.rpc_timeout&#8217; (seq 10949920) is revoked<br />
2016-07-29T19:27:17.420Z| vmx| I120: ToolsBackup: changing quiesce state: IDLE -&gt; DONE<br />
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Done with snapshot &#8216;jgb&#8217;: 0<br />
2016-07-29T19:27:17.420Z| vmx| I120: SnapshotVMXTakeSnapshotComplete: Snapshot 0 failed: Failed to quiesce the virtual machine (31).<br />
2016-07-29T19:27:17.420Z| vmx| I120: VigorTransport_ServerSendResponse opID=ffd663ae-5b7b-49f5-9f1c-f2135ced62c0-95-ngc-ea-d6-adfa seq=12848: Completed Snapshot request.<br />
2016-07-29T19:27:26.297Z| vmx| I120: GuestRpcSendTimedOut: message to toolbox timed out.</p></blockquote>
<p>After performing some digging, I found VMware had released <a href="https://my.vmware.com/group/vmware/details?downloadGroup=VMTOOLS1009&amp;productId=491">VMware Tools version 10.0.9</a> on June 6, 2016. The <a href="https://pubs.vmware.com/Release_Notes/en/vmwaretools/1009/vmware-tools-1009-release-notes.html#resolvedissues">release notes</a> indicate the root cause has been identified and resolved.</p>
<blockquote><p><span style="text-decoration: underline;"><strong>Resolved Issues</strong></span></p>
<p>Attempts to take a quiesced snapshot in a Windows Guest OS fails<br />
Attempts to take a quiesced snapshot after booting a Windows Guest OS fails</p></blockquote>
<p>After downloading and upgrading VMware Tools version 10.0.9 build-3917699 (reported as version 10249 in the vSphere Web Client), the customer&#8217;s problem was resolved. Since the faulty version of VMware Tools was embedded in the customer&#8217;s templates used to deploy virtual machines throughout the datacenter, there were a number of VMs needing their VMware Tools upgraded, as well as the templates themselves.</p>
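<p>Tracking down which VMs still carry the faulty build can be tedious by hand. One way to spot them from an ESXi shell is with <strong>vim-cmd</strong>. This is a rough sketch only: the <strong>10240</strong> string is the Tools version reported for the faulty build, and the parsing assumes the usual <strong>toolsVersion = &#8220;10240&#8221;</strong> line in the <strong>vmsvc/get.guest</strong> output.</p>

```shell
# Sketch: list VMs on this host whose reported Tools version is the faulty 10240
for vmid in $(vim-cmd vmsvc/getallvms | awk 'NR>1 {print $1}'); do
  ver=$(vim-cmd vmsvc/get.guest "$vmid" | awk -F'"' '/toolsVersion/ {print $2; exit}')
  [ "$ver" = "10240" ] && echo "vmid $vmid still runs VMware Tools 10240"
done
```

<p>Run per host; at larger scale, PowerCLI against vCenter would be the more practical route.</p>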
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>vCenter Server 6 Appliance fsck failed</title>
		<link>http://www.boche.net/blog/index.php/2016/04/04/vcenter-server-6-appliance-fsck-failed/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Mon, 04 Apr 2016 19:37:42 +0000</pubDate>
				<category><![CDATA[Virtualization]]></category>
		<category><![CDATA[vCenter Server]]></category>
		<category><![CDATA[VMware]]></category>
		<category><![CDATA[vSphere]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5192</guid>

					<description><![CDATA[A vCenter Server Appliance (vSphere 6.0 Update 1b) belonging to me was bounced and for some reason was unbootable. The trouble during the boot process begins with /dev/sda3 contains a file system with errors, check forced. At approximately 27% of the way through, the process terminates with fsck failed. Please repair manually and reboot. Unable [&#8230;]]]></description>
										<content:encoded><![CDATA[<p>A vCenter Server Appliance (vSphere 6.0 Update 1b) belonging to me was bounced and for some reason was unbootable. The trouble during the boot process begins with <strong>/dev/sda3 contains a file system with errors, check forced</strong>. At approximately 27% of the way through, the process terminates with <strong>fsck failed. Please repair manually and reboot</strong>.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c2.staticflickr.com/2/1672/26212090616_a832073407.jpg" alt="" width="500" height="375" /></p>
<p>Unable to access a bash# prompt from the current state of the appliance, I followed VMware KB 2069041 <a href="https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=2069041">VMware vCenter Server Appliance 5.5 and 6.0 root account locked out after password expiration</a>, particularly the latter portion, which provides the steps to modify a kernel option in the GRUB bootloader to obtain a root shell (and subsequently run the <strong>e2fsck -y /dev/sda3</strong> repair command).</p>
<p>The steps are outlined in VMware KB 2069041 and are simple to follow.</p>
<ol>
<li>Reboot the VCSA</li>
<li>Be quick about highlighting the VMware vCenter Server appliance menu option (the KB article recommends hitting the space bar to stop the default countdown)</li>
<li>p (to enter the root password and continue with additional commands in the next step)</li>
<li>e (to edit the boot command)</li>
<li>Append <strong>init=/bin/bash</strong> (followed by Enter to return to the GRUB menu)</li>
<li>b (to start the boot process)</li>
</ol>
<p>This is where <strong>e2fsck -y /dev/sda3</strong> is executed to repair file system errors on /dev/sda3 and allow the VCSA to boot successfully.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c2.staticflickr.com/2/1574/26145890042_da15811610.jpg" alt="" width="500" height="375" /></p>
<p>When the process above completes, reboot the VCSA and that should be all there is to it.</p>
<p><strong>Update 10/9/17:</strong> I ran into a similar issue with VCSA 6.5 Update 1 where the appliance wouldn&#8217;t boot and I was left at an emergency mode prompt. Following the steps above isn&#8217;t so straightforward here, in part due to the Photon OS splash screen and the lack of visibility to the GRUB bootloader (see <a href="https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&amp;cmd=displayKC&amp;externalId=2081464">VMware KB 2081464</a>). Instead, I executed <strong>fsck /dev/sda3</strong> at the emergency mode prompt, answering yes to all prompts. After a reboot, I found this did not resolve all of the issues, though I was able to log in by providing the root password twice. The <strong>journalctl</strong> command revealed a problem with <strong>/dev/mapper/log_vg-log</strong>, so I next ran <strong>fsck /dev/mapper/log_vg-log</strong>, again answering yes to all prompts. When that finished, the appliance was rebooted and came up operational.</p>
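<p>For my future self, the 6.5 recovery boils down to this sequence from the emergency mode prompt (a sketch of the commands above, with <strong>-y</strong> standing in for answering yes to every prompt; not a general-purpose fix):</p>

```shell
# At the VCSA 6.5 emergency mode prompt
fsck -y /dev/sda3                 # first pass: repair the root filesystem
reboot
# If problems persist after reboot, log in and inspect the journal
journalctl -xb
fsck -y /dev/mapper/log_vg-log    # repair the log volume flagged by journalctl
reboot
```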
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Error 1603: Java Update did not complete</title>
		<link>http://www.boche.net/blog/index.php/2016/02/09/error-1603-java-update-did-not-complete/</link>
		
		<dc:creator><![CDATA[jason]]></dc:creator>
		<pubDate>Wed, 10 Feb 2016 02:48:17 +0000</pubDate>
				<category><![CDATA[General]]></category>
		<category><![CDATA[3rd Party Apps]]></category>
		<category><![CDATA[Microsoft]]></category>
		<guid isPermaLink="false">http://www.boche.net/blog/?p=5180</guid>

					<description><![CDATA[3 Billion Devices Run Java. Unfortunately (or fortunately depending on how you look at it) my workstation isn&#8217;t one of them. I&#8217;m no stranger to Java and I&#8217;m more than willing to share my bitter experiences with it but rarely does my dissatisfaction stem from the installation process itself (that is if you don&#8217;t count [&#8230;]]]></description>
										<content:encoded><![CDATA[<p><img decoding="async" loading="lazy" class=" alignright" src="https://c2.staticflickr.com/2/1679/24301571293_c57e239c20_n.jpg" alt="" width="320" height="243" /></p>
<p>3 Billion Devices Run Java.</p>
<p>Unfortunately (or fortunately depending on how you look at it) my workstation isn&#8217;t one of them.</p>
<p>I&#8217;m no stranger to Java and I&#8217;m more than willing to share my bitter experiences with it but rarely does my dissatisfaction stem from the installation process itself (that is if you don&#8217;t count the update frequency). This blog post is for my future self as I&#8217;m bound to run into it again.</p>
<p>I encountered a never before seen by me problem installing the Windows Offline (64-bit) version of Java Version 8 Update 73 on Windows 7.</p>
<p><strong>Java Update did not complete</strong></p>
<p><strong>Error Code: 1603</strong></p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c2.staticflickr.com/2/1703/24300860373_6a1ce7c2aa_n.jpg" alt="" width="320" height="226" /></p>
<p>The usual tricks did not work: uninstalling the old version first, running as administrator, directory and registry scrubbing, a system reboot, trying an older version. In fact, all version 8 builds exhibited this behavior. It wasn&#8217;t until I rolled all the way back to a version 7 build that the installation was successful.</p>
<p>For posterity, <a href="https://community.spiceworks.com/topic/656482-java-update-did-not-complete-error-code-1603?page=1">this spiceworks thread</a> covers a lot of ground; there seems to be a variety of fixes for an equal number of root causes which all yield the same Java installation failure. The <a href="https://java.com/en/download/help/error_1603.xml">Java knowledgebase covering this issue</a> offers some basic workarounds under Option 1 such as a system reboot, offline installer, uninstalling old versions, all of which I had already tried without success.</p>
<p>The fix in my particular instance was Option 2: Disable Java content (in the web browser) through the Java Control Panel. You&#8217;ll find the Java applet in the Windows Control Panel when a 32-bit version of Java is installed.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c2.staticflickr.com/2/1447/24297146894_53eac63cf0_n.jpg" alt="" width="320" height="237" /></p>
<p>On the Security tab, temporarily deselect the &#8220;Enable Java content in the browser&#8221; checkbox.</p>
<p><img decoding="async" loading="lazy" class="alignnone" src="https://c2.staticflickr.com/2/1512/24297126224_f19b68fd45_n.jpg" alt="" width="306" height="320" /></p>
<p>Complete the 64-bit Java installation (launch the installer using &#8220;<strong>Run as Administrator</strong>&#8221;) and don&#8217;t forget to re-check &#8220;Enable Java content in the browser&#8221; when finished.</p>
<p><strong>Update 5-12-16:</strong> If you&#8217;re receiving this error and &#8220;Enable Java content in the browser&#8221; is already deselected, check the box, Apply, OK, uncheck the box, Apply, OK.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
