Configuring VMware HCX in my home lab to migrate VM’s between two VMware vCenter clusters was my goal this week. HCX simplifies application mobility and migration between clouds. Last week I successfully paired both sites and I was ready to extend the network.
I discovered that my target site was inaccessible on Monday morning. I was disappointed since this worked last week. The troubleshooting process pointed to my tp-link T1700G-28TQ switch in my home lab as a possible culprit. After ping failures, I unplugged the Ethernet cable connected to my target site router and to my surprise the link light stayed on instead of going out. Quickly I discovered that the management plane of the switch crashed but the data plane was still switching some but not all traffic. I rebooted the switch and the networking problem was solved. I successfully logged into the HCX target site but I started to feel the heat from the sun melt the wax in my wings.
I didn’t expect I would run into new problems at the source site that after I solved the target site networking problem. The management UI for both NSX-T and vCenter Server at the source site weren’t accessible. I started to loose altitude from some feathers coming off my wing once I saw the dreaded write failures from their Linux console on both VMs. My home lab uses both VMware vSAN and NFSv3 on a QNAP NAS for storage. These critical VM’s were stored on the QNAP NAS. This NAS has one network path through the failed switch. I wouldn’t of have any issues if I stored these VM’s on vSAN since these servers are connected to two switches for redundancy in case of a single failure. After rebooting both management VM’s I saw that the file systems were corrupted and the VM’s were halted.
I knew I wouldn’t crash and drown in the ocean below like Icarus when I was able to successfully boot the VM and access the vCenter Server UI after cleaning the filesystem. I followed VMware knowledge base article 2149838 which described the recommended approach with e2fsck.
Prior to taking an in-depth enterprise Linux class I would have been anxious editing the grub loader to change the boot target and clean the file system. However these steps were now second nature to me since I had to do these steps by memory to pass the associated hands-on Linux certification from the class.
I haven’t managed my home lab like an enterprise environment by taking shortcuts to save time and money. I was lucky that fsck worked since I didn’t have a vCenter or a distributed virtual switch (dvs) backup. Due to this hard lesson I configured a vCenter backup schedule and exported the dvs configuration. My next blog will go over the steps I took to recover the NSX management console and VM.