DEV Community: Arosh Wijepala

Automating VMware Tools Upgrades with PowerShell and N-Central

Arosh Wijepala — Mon, 14 Apr 2025 11:59:18 +0000

As a systems engineer supporting a wide range of clients with all kinds of VMware environments—from vSphere to standalone ESXi hosts—I kept running into the same challenge: VMware Tools versions were often out of date. Manual upgrades were inconsistent, time-consuming, and prone to being forgotten.

We already had a well-established patch management process using N-Central RMM, and I figured: Why not piggyback off the same maintenance window to upgrade VMware Tools, too? That would eliminate the manual effort and ensure every machine stayed updated and supported, automatically.

This blog is a story of how a simple PowerShell script helped streamline this process—and how you can use it too.

The Problem

VMware Tools is essential for good VM performance, better guest OS integration, and clean VM shutdowns. But in environments with dozens or hundreds of virtual machines, keeping it updated across different versions of ESXi is messy.

N-Central, our Remote Monitoring and Management (RMM) tool, allowed us to schedule tasks during patching windows. So I decided to build a PowerShell script that:

Checks the latest VMware Tools version online
Compares it with the version installed locally
Downloads and silently installs the update—only if needed
Logs everything along the way

The PowerShell Script

Here’s the full script, which is also available on my GitHub. I’ll explain it part by part in a moment.

$URL = "https://packages.vmware.com/tools/esx/latest/windows/x64/"
$LogFilePath = "C:\temp\VMwareToolsUpdateScript.log"

(Get-Date).ToString() + " :  Script Initiated" >> $logfilepath

$PSversion = $PSVersionTable.PSVersion.ToString()
(Get-Date).ToString() + " : PowerShell version = " + $PSversion >> $logfilepath

$QueryVMWareToolsVersion = Invoke-WebRequest $URL -UseBasicParsing
$VMWareToolsSetupName = $QueryVMWareToolsVersion.Links.HREF | Select -Skip 1

[string]$VMWareToolsNewestVersion = $VMWareToolsSetupName -replace ".*VMware-tools-" -replace "-.*"
$VMWareToolsInstalledVersion = Get-WmiObject Win32_Product -Filter "Name like 'VMware Tools'" | Select-Object -ExpandProperty Version
[string]$VMWareToolsInstalledVersion = $VMWareToolsInstalledVersion.Substring(0,$VMWareToolsInstalledVersion.LastIndexOf('.'))

If ([version]$VMWareToolsInstalledVersion -lt [version]$VMWareToolsNewestVersion) {
    $DownloadURL = $URL + $VMWareToolsSetupName
    try {
        Invoke-WebRequest -Uri $DownloadURL -OutFile "C:\temp\$VMWareToolsSetupName"
        (Get-Date).ToString() + " : Download Finished!" >> $logfilepath
    } catch {
        (Get-Date).ToString() + " : Download Failed" >> $logfilepath
    }

    $ArgumentList = "/S /v " + '"/qn REBOOT=R ADDLOCAL=ALL"' + " /l C:\temp\VMwareToolsSetup.log"
    $FilePath = "C:\temp\" + $VMWareToolsSetupName

    try {
        # -Wait blocks until the installer exits, so the log entry below is accurate
        Start-Process -FilePath $FilePath -ArgumentList $ArgumentList -Wait
        (Get-Date).ToString() + " : Installation Finished!" >> $logfilepath
    } catch {
        (Get-Date).ToString() + " : Installation Failed" >> $logfilepath
    }
} else {
    (Get-Date).ToString() + " : VMware latest version is already installed!" >> $logfilepath
}

(Get-Date).ToString() + " : Script executed successfully" >> $logfilepath

How It Works

Let’s break this down step by step:

1. Logging and PowerShell Version Detection

(Get-Date).ToString() + " : Script Initiated" >> $logfilepath
$PSversion = $PSVersionTable.PSVersion.ToString()

We start with timestamped logging and note which PowerShell version is running. This helps in troubleshooting if the script fails due to version incompatibility (some older versions have quirks with Invoke-WebRequest).

2. Check for the Latest VMware Tools Version

$QueryVMWareToolsVersion = Invoke-WebRequest $URL -UseBasicParsing
$VMWareToolsSetupName = $QueryVMWareToolsVersion.Links.HREF | Select -Skip 1

We query VMware’s official tools repository and skip the first link on the page (usually ../). What remains should be the .exe installer for the latest version.

[string]$VMWareToolsNewestVersion = $VMWareToolsSetupName -replace ".*VMware-tools-" -replace "-.*"

Using a little regex magic, we extract the version string from the filename.
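
For readers who want to test this parsing logic away from a Windows guest, here is a rough equivalent sketched in Python rather than PowerShell. It is an illustration only: the sample listing and the installer filename in it are made up, and instead of skipping the first link it picks the first href that ends in .exe.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def latest_installer(listing_html):
    """Return (installer filename, version string) from a directory listing."""
    parser = LinkCollector()
    parser.feed(listing_html)
    installers = [h for h in parser.hrefs if h.lower().endswith(".exe")]
    if not installers:
        raise ValueError("no .exe link found in the listing")
    name = installers[0]
    match = re.search(r"VMware-tools-([0-9.]+)-", name)
    if match is None:
        raise ValueError("unexpected installer name: " + name)
    return name, match.group(1)


# Hypothetical listing; the real page may be laid out differently.
SAMPLE_LISTING = (
    '<a href="../">../</a>\n'
    '<a href="VMware-tools-12.5.0-24276846-x64.exe">VMware-tools-12.5.0-24276846-x64.exe</a>'
)

name, version = latest_installer(SAMPLE_LISTING)
print(name, version)
```

Selecting by file extension is a little more defensive than relying on link position, at the cost of a few extra lines.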

3. Detect the Currently Installed Version

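$VMWareToolsInstalledVersion = Get-WmiObject Win32_Product -Filter "Name like 'VMware Tools'" | Select-Object -ExpandProperty Version
[string]$VMWareToolsInstalledVersion = $VMWareToolsInstalledVersion.Substring(0,$VMWareToolsInstalledVersion.LastIndexOf('.'))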

We use WMI to get the currently installed version of VMware Tools. Since VMware’s versioning can include build numbers (e.g. 12.2.5.45654), we trim that last segment for a cleaner version comparison.

4. Version Comparison and Conditional Upgrade

If ([version]$VMWareToolsInstalledVersion -lt [version]$VMWareToolsNewestVersion)

If the installed version is older, we build the download URL and pull the installer.
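
The [version] cast matters because it compares the numbers segment by segment rather than as text. A minimal Python sketch of the same trim-and-compare idea follows. The function names are mine, and "12.5.0" is a placeholder for whatever the repository currently offers.

```python
def trim_build(installed_version):
    """Drop the last dotted segment, e.g. '12.2.5.45654' -> '12.2.5'."""
    return installed_version.rsplit(".", 1)[0]


def version_key(version):
    """Turn '12.2.5' into (12, 2, 5) so segments compare as numbers."""
    return tuple(int(part) for part in version.split("."))


def needs_upgrade(installed_version, newest_version):
    """True when the installed version (minus build number) is older."""
    return version_key(trim_build(installed_version)) < version_key(newest_version)


print(needs_upgrade("12.2.5.45654", "12.5.0"))
# Plain string comparison would get this pair wrong, which is why we convert:
print("12.10.0" < "12.9.0", version_key("12.10.0") < version_key("12.9.0"))
```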

5. Silent Install and Logging

$ArgumentList = "/S /v " + '"/qn REBOOT=R ADDLOCAL=ALL"' + " /l C:\temp\VMwareToolsSetup.log"
Start-Process -FilePath $FilePath -ArgumentList $ArgumentList -Wait

The installer runs silently (/S for the setup wrapper, /qn for the MSI it launches), suppresses any automatic reboot (REBOOT=R, short for ReallySuppress), and installs every feature (ADDLOCAL=ALL). The log file helps diagnose any install issues.

Why This Approach Worked

No reliance on vCenter or ESXi APIs — just plain HTTP and PowerShell
Works on any Windows guest OS that VMware Tools supports (the script targets the x64 installer)
No unnecessary upgrades — it installs only when there’s a newer version
Built-in logging makes it easy to audit and troubleshoot

We scheduled the script to run during our regular patch maintenance window, so every machine was updated without manual intervention. It’s a small thing, but it shaved hours off our monthly maintenance efforts and ensured better consistency across customer environments.

Download the Script

👉Download the full script on GitHub

Final Thoughts

Automation doesn't always need to be fancy. Sometimes, it's just about using the tools you already have in a smarter way. If you're running VMware environments and want to keep things simple, give this approach a try—and let me know how it works for you!

Feel free to fork, improve, or suggest changes to the script. I’d love to hear how others are handling VMware Tools upgrades in complex environments.

My Cloud Resume Challenge Journey: Embracing Serverless

Arosh Wijepala — Mon, 07 Apr 2025 11:44:21 +0000

I’ve spent nearly 14 years in the tech industry, working my way up from providing end-user support to managing full-scale corporate infrastructure. For a good portion of my career, I worked with physical servers running virtualization—systems I could walk up to, plug into, and manage directly, with on-prem infrastructure that I understood inside and out.

As cloud computing grew in popularity, I naturally found myself migrating infrastructure to the cloud: deploying virtual servers, moving data from various platforms to Microsoft 365, and earning certifications like AZ-104 and AZ-305 along the way. I understood Infrastructure-as-a-Service (IaaS) and had real-world experience using it. But there was still something I hadn’t really touched: serverless. I’d read about it and played with it here and there on my own time, but never had a proper project to truly dive in.

Still, the curiosity was always there. I found myself drawn to the idea of serverless, of building things without thinking about the underlying machines. I’d read The Phoenix Project and The Unicorn Project, and those books stayed with me. They planted seeds—seeds of understanding how DevOps, automation, and cloud-native technologies were transforming how we deliver technology.

Then one day, I stumbled upon the Cloud Resume Challenge.

The Start of Something New

The challenge wasn’t just a list of tasks. It was an experience designed to stretch you. It gave you a direction, but no step-by-step instructions. And that’s exactly what I needed. I already had the certifications (AZ-900, AZ-104, AZ-305) and some real-world Azure experience under my belt, but this was a chance to apply all that knowledge, and then some.

It started off easy. I had some basic experience with HTML and CSS (though I wasn’t exactly passionate about front-end development). So I leaned on AI to generate my resume web page, uploaded it to Azure Storage as a static site, and pointed my Cloudflare DNS to it. That part was smooth.

Getting Uncomfortable

Things got harder when I tried to set up Azure Front Door. I knew what it was—a global load balancer—from my certification studies. But putting theory into practice is another thing entirely. Understanding how frontends, backends, routes, and origins all fit together took me down a rabbit hole of documentation. I spent evenings reading, testing, and debugging. Eventually, it clicked. It was a small win, but a satisfying one.

The real turning point came when I got to creating the API with Azure Functions. I had never worked with them before. I knew Python well enough, so writing a basic HTTP-triggered function wasn’t too intimidating. But integrating it with Cosmos DB? That was tough. I wanted to develop it locally in VS Code, and setting everything up took longer than expected. The biggest hurdle was understanding how bindings worked to connect the function to the database.

It was one of those moments where frustration starts to creep in. But I reminded myself—this is why I took on the challenge in the first place. With enough reading, trial and error, reviewing Azure Application Insights and a little help from ChatGPT, I started to get the hang of it. The sense of progress was incredibly rewarding.

Stepping into Infrastructure as Code

Next, I took on Terraform. I’d used it before, but only briefly. This time, I committed to really learning it. I installed it, started writing out my .tf files, explored tfvars, and learned about state files and variable assignments.

What really tripped me up was how detailed and nested Azure resources could be. Configuring something like Front Door in Terraform meant understanding all of its sub-resources and dependencies. But I stuck with it, and eventually managed to define the full infrastructure in code—including monitoring and alerting with Azure Monitor.

I also automated the DNS records in Cloudflare using its Terraform provider. Seeing everything come together—the infrastructure, the code, the monitoring—was incredible. It started to feel like I was really building something robust and production-ready.

Adding the Real-World Touches

Since I had experience using PagerDuty in previous roles, I decided to bring it into this project as well. I set up a team, schedules, and escalation policies. Then I wired Azure Monitor to send alerts via webhook, so I’d get a ping on my phone if something broke. It gave the project a sense of realism—it wasn’t just a lab experiment anymore.

Learning CI/CD from Scratch

The final mountain to climb was CI/CD using GitHub Actions. Until this point, I hadn’t worked with pipelines, and it felt like a completely different world. I started reading up on YAML workflows, learning how to trigger deployments with code pushes, how to use Git in VS Code, and how to structure commits and branches.

One thing that really stuck with me during this phase was the importance of security. As I prepared to make the project public, I had to learn how to manage secrets safely. I used GitHub’s encrypted secrets for things like my Azure credentials and Cloudflare API tokens. I also configured a backend in Azure Storage for Terraform’s state file, ensuring I could work on the project from multiple machines without losing consistency.

All of this took time. There were late nights. There were moments of confusion. But each hurdle taught me something meaningful—and that’s what made the experience so valuable.

Final Thoughts

Completing the Cloud Resume Challenge wasn’t just about putting a resume online. It was about stepping out of my comfort zone and immersing myself in the technologies I’d always been curious about but hadn’t truly explored.

This challenge taught me more than just how to use serverless resources or write Terraform code. It taught me how to break down problems, how to keep learning even when things feel difficult, and how to build something real from scratch.

Even with years of experience in IT, this project gave me a new perspective on what the cloud can do—and how far I’ve come in my own journey. It reminded me why I got into tech in the first place: to keep building, to keep learning, and to keep pushing forward.

If you're someone who's cloud-curious, or even if you've been in tech a long time but want to expand your skills, I wholeheartedly recommend taking on the Cloud Resume Challenge. It might just change the way you see your career.

Want to check out my project? https://github.com/ph4n7om2000/cloudresumechallenge

Proactive Web Application Monitoring and Automated Recovery with Selenium and Python

Arosh Wijepala — Wed, 02 Apr 2025 09:53:30 +0000

Introduction

Ensuring the availability and reliability of critical web applications is a challenge for any organization. During my tenure at one of the largest education providers in the world, I encountered a recurring issue with a secure file transfer platform that frequently became unavailable due to database deadlocks. As part of my role, I researched, developed, and implemented an automated monitoring and remediation solution using Selenium and Python to address this challenge.

The Problem

The secure file transfer platform supported hundreds of concurrent users, making its uptime crucial. However, we repeatedly faced an issue where the application became unresponsive due to database deadlocks, causing the database connection to become unavailable. The only known workaround was restarting the SQL Server services, followed by restarting the application services—or vice versa, depending on the situation.

A major issue was that we had no proactive way of detecting downtime. We only became aware of failures when users reported them. While working with vendor support for a long-term fix, we needed an interim solution that could monitor the application, detect downtime, and apply the necessary workarounds automatically.

Choosing a Solution

Due to budget constraints, commercial monitoring solutions like New Relic were not an option. After thorough research, I determined that Selenium, a web automation framework, could be used to automate periodic login attempts and verify application availability. Selenium allowed us to interact with the web application just as a user would, making it an ideal choice.

Tools Used

Python: Scripting language for automation

Chrome Headless: Chrome running without a graphical interface, driven from the command line

Selenium: Web automation framework

PsTools (Sysinternals): pskill.exe terminates processes on a remote machine, while psService.exe starts and restarts remote services. Both are needed because the database and application services are hosted in a Windows Server environment.

Download the Script

You can download the full monitoring script along with the required files from GitHub: 🔗 Download from GitHub

Implementing the Monitoring Function

Importing modules:

selenium: navigates web pages and interacts with web elements.

smtplib and email.message: build and send notification emails.

datetime: timestamps each entry in the log file.

time: pauses the script for a set amount of time before the next step.

os: creates and deletes the reboot_flag file that controls the order of service restarts.

subprocess: runs the PsTools executables that restart the services.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import smtplib
from email import message
from datetime import datetime
import time
import os
import subprocess

Setting up log file: The script opens a log file named uat-runlog in append mode to record the execution logs:

log_object = open('uat-runlog', 'a')
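
Every log entry in the script follows the same "event at: timestamp" pattern, so a small helper can keep that format in one place. This is a hypothetical refactor, not part of the original script. The write_log name and its optional now parameter are my own additions, the latter only there to make the function easy to test.

```python
import io
from datetime import datetime


def write_log(log_file, event, now=None):
    """Write '<event> at: <timestamp>' to an open file object and return the line."""
    stamp = (now or datetime.now()).strftime("%m/%d/%Y, %H:%M:%S")
    line = event + " at: " + stamp + "\n"
    log_file.write(line)
    return line


# Any object with a write() method works, including the file opened above.
buffer = io.StringIO()
write_log(buffer, "Login page loaded", datetime(2025, 4, 2, 9, 53, 30))
print(buffer.getvalue(), end="")
```

Because the timestamp is taken inside the helper, each entry reflects the moment it was written.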

Email notification: The following code defines a function send_failure_email() that sends an email notification when a login check fails, meaning the web application is down. The function sets the email parameters, creates a message object using message.Message(), and sets its headers and payload.

The function then connects to the SMTP server using smtplib.SMTP() with the server address and port number as arguments. The ehlo() method is called before and after starttls(): once to open the SMTP conversation and again to re-identify the client over the encrypted connection. The function then logs in to the SMTP server using the login() method with from_addr and yoursmtppassword as arguments.

Finally, the function sends the message to the specified to_addr and to_addr2 email addresses using the server.send_message() method, passing the message object, from_addr and to_addrs as arguments.

def send_failure_email():
    from_addr = 'sftalert@yourdomain.com'
    to_addr = 'team1@yourdomain.com'
    to_addr2 = 'team2@yourdomain.com'
    subject = 'SFT Alert!'
    body = 'Login check has failed. Check application availability ASAP!'
    msg = message.Message()
    msg.add_header('from', from_addr)
    msg.add_header('to', to_addr)
    msg.add_header('subject', subject)
    msg.set_payload(body)
    server = smtplib.SMTP('smtp.yoursmtpserver.com', 587)
    server.ehlo()
    server.starttls()
    server.ehlo()
    server.login(from_addr, 'yoursmtppassword')
    # One call delivers to both recipients; quit() then closes the SMTP session
    server.send_message(msg, from_addr=from_addr, to_addrs=[to_addr, to_addr2])
    server.quit()
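
As a possible variation, the message can be built with email.message.EmailMessage so that both recipients appear in the To header. This is a sketch of an alternative, not what the original script does. The build_failure_message helper is hypothetical, and it only constructs the message without contacting any server.

```python
from email.message import EmailMessage


def build_failure_message(from_addr, to_addrs, subject, body):
    """Build an alert email addressed to every recipient in to_addrs."""
    msg = EmailMessage()
    msg["From"] = from_addr
    msg["To"] = ", ".join(to_addrs)
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


msg = build_failure_message(
    "sftalert@yourdomain.com",
    ["team1@yourdomain.com", "team2@yourdomain.com"],
    "SFT Alert!",
    "Login check has failed. Check application availability ASAP!",
)
print(msg["To"])
```

When to_addrs is omitted, smtplib's send_message() takes the recipients from the message headers, so a message built this way can be handed to server.send_message(msg) directly.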

Monitoring: The code snippet below outlines the fundamental purpose of the script - to log in to the web application, click and verify the loading of specific elements.

First, the code configures ChromeDriver through Selenium's Options and Service classes so that the script can control the Chrome browser.

options.add_argument("--headless"): This line sets the "headless" option to run the Chrome browser in headless mode, meaning the browser will run without a graphical user interface.

options.add_argument("--no-sandbox"): This line sets the "no-sandbox" option to disable the Chrome browser sandbox, which is a security feature that isolates browser tabs and prevents them from affecting each other.

s = Service("chromedriver"): This line creates an instance of the Service class which specifies the path to the ChromeDriver executable.

The line url = "https://uat-sft.yourdomain.com/" sets the URL of the web application that the script interacts with, and driver.get(url) instructs the Chrome browser to navigate to it.

The next part of the code is a try-except block that attempts to login to the web application and perform certain checks. If the login and checks are successful, the code returns up and writes logs of its activities. If the login and checks fail, the script catches the exception, writes logs of the failure, sends a failure email using the send_failure_email() function, and returns down.

Here is a breakdown of the code:

The script first waits for a maximum of 10 seconds for the presence of the HTML username element with EC.presence_of_element_located((By.ID, "username")) and writes a log if the login page is loaded successfully.

The script then finds the HTML elements username and password, enters the login credentials, and clicks the sign-in button with the following code:

driver.find_element(By.ID, "username").send_keys("platformuser@yourdomain.com")
driver.find_element(By.ID, "password").send_keys("platformuserpassword")
driver.find_element(By.ID, "signinButton").click()

Once logged in, the first page contains an HTML element called compose-delivery-link, which is the compose button for a secure delivery. The script waits for a maximum of 10 seconds for the presence of compose-delivery-link with EC.presence_of_element_located((By.ID, "compose-delivery-link")) and writes a log if the login is successful.

Then the script clicks the compose button and waits for a maximum of 10 seconds for the presence of the divSecureMessage element, which is the body of the secure message window. It writes logs if the checks are passed, logs out by calling driver.get(logouturl), and writes logs of the successful logout.

If the checks fail, the script catches the exception, writes logs of the failure, sends a failure email using the send_failure_email() function, and returns down. If the checks are successful, the script creates a reboot_flag file if it does not exist and returns up. Finally, the script closes the web driver and the log file.

def monitor():
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    s = Service("chromedriver")
    url = "https://uat-sft.yourdomain.com/"
    driver = webdriver.Chrome(options=options, service=s)
    driver.get(url)
    try:
        usernameelement = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "username"))
            )

        time.sleep(10)
        if usernameelement.is_displayed() == True:
                print ("Login page loaded!")
                now = datetime.now()
                log_object.write("Login page loaded at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")

        driver.find_element(By.ID, "username").send_keys("platformuser@yourdomain.com")
        driver.find_element(By.ID, "password").send_keys("platformuserpassword")
        now = datetime.now()
        log_object.write("Attempting login at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        driver.find_element(By.ID, "signinButton").click()

        time.sleep(10)
        composebuttonelement = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "compose-delivery-link"))
            )
        if composebuttonelement.is_displayed() == True:
                now = datetime.now()
                log_object.write("Successfully logged in at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")

        now = datetime.now()
        log_object.write("Opening compose delivery page at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        driver.find_element(By.ID, "compose-delivery-link").click()

        time.sleep(10)
        divSecureMessageelement = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.ID, "divSecureMessage"))
            )
        if divSecureMessageelement.is_displayed() == True:
                now = datetime.now()
                log_object.write("Compose delivery page loaded at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
                now = datetime.now()
                log_object.write("All checks passed at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")

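        logouturl = "https://uat-sft.yourdomain.com/bds/Logout.do"
        # Log out first, then record the logout
        driver.get(logouturl)
        now = datetime.now()
        log_object.write("Successfully logged out at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        log_object.write("------------------------------------------------\n")
        log_object.close()
        # Create the flag file if it is missing and release its handle right away
        if os.path.exists('reboot_flag') == False:
            open('reboot_flag', 'x').close()
        # quit() ends the whole browser session, not just the current window
        driver.quit()
        return "up"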

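    except Exception:
        now = datetime.now()
        log_object.write("SFT health check failed at: " + now.strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        # Shut the browser down on failure too, so headless Chrome processes do not pile up
        driver.quit()
        send_failure_email()
        return "down"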
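
The WebDriverWait(...).until(...) calls in monitor() are explicit waits: they poll a condition until it returns something truthy or the timeout expires. The sketch below shows that polling idea in plain Python so it can run without a browser. It is a simplified illustration, not Selenium's implementation, and wait_until is a hypothetical name.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.2f seconds" % timeout)
        time.sleep(poll)


# Stand-in for a page element that only appears on the third check.
attempts = []

def element_ready():
    attempts.append(1)
    return "username-field" if len(attempts) >= 3 else None

print(wait_until(element_ready, timeout=1.0, poll=0.01))
```

Since an explicit wait returns as soon as the condition holds, the fixed time.sleep(10) calls in monitor() mainly add a safety margin on top of it.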

Corrective Actions:

In the following code, appstatus = monitor() assigns the output of the monitor() function to appstatus. monitor() checks the status of the application and returns up or down.

If appstatus is down, the code checks for the existence of a file called reboot_flag. If the file exists, it restarts the SQL services by calling psService.exe through subprocess.call(), then restarts the Tomcat server, starting with pskill.exe. The reason for killing the process with pskill.exe instead of stopping it with psService.exe is that the Tomcat service takes a significant amount of time to shut down gracefully. To avoid that wait, we forcefully kill the process and then start the service again with psService.exe. Finally, the code deletes the reboot_flag file using os.remove("reboot_flag").

Once the reboot_flag file is deleted, the script runs again from the beginning to check whether the application is up and running. If the check still fails, the os.path.exists('reboot_flag') test is now false, so the else branch runs: it restarts the application services first and then the SQL services, and finally creates the reboot_flag file again. This is how the reboot flag is used to alternate the service restart sequence.
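
The alternation can be hard to picture from prose alone, so here is a small simulation of just the flag logic. It is a sketch under assumptions: next_restart_order and demo are hypothetical names, the flag lives in a temporary directory, and no services are touched.

```python
import tempfile
from pathlib import Path


def next_restart_order(flag):
    """Return the restart order for this failure and flip the flag for the next one."""
    if flag.exists():
        flag.unlink()
        return ["sql", "application"]
    flag.touch()
    return ["application", "sql"]


def demo():
    """Simulate three consecutive failed health checks."""
    with tempfile.TemporaryDirectory() as workdir:
        flag = Path(workdir) / "reboot_flag"
        flag.touch()  # a successful health check leaves the flag in place
        return [next_restart_order(flag) for _ in range(3)]


for order in demo():
    print(" then ".join(order))
```

Each failure uses the opposite order from the previous one, which matches the "or vice versa" workaround described earlier.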

appstatus = monitor()

# Each PsTools argument is a separate list item; raw strings keep the leading \\ intact
mssql_arguments = [r'\\sqlserver.yourdomain.com', 'restart', 'mssqlserver']
tomcat_kill_arguments = [r'\\applicationserver.yourdomain.com', 'Tomcat9']
tomcat_start_arguments = [r'\\applicationserver.yourdomain.com', 'start', 'Tomcat9']

if appstatus == "down":
    # datetime.now() is called per entry so the waits are reflected in the log

    if os.path.exists('reboot_flag') == True:
        # Flag present: restart SQL first, then the application
        log_object.write("Initiated SQL services restart at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        subprocess.call(['psService.exe', *mssql_arguments])
        time.sleep(60)
        log_object.write("SQL services have been restarted at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")

        log_object.write("Initiated application services restart at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        subprocess.call(['pskill.exe', *tomcat_kill_arguments])
        time.sleep(60)

        subprocess.call(['psService.exe', *tomcat_start_arguments])
        log_object.write("Application services have been restarted at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        os.remove("reboot_flag")
        log_object.close()

    else:
        # Flag absent: restart the application first, then SQL
        log_object.write("Initiated application services restart at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        subprocess.call(['pskill.exe', *tomcat_kill_arguments])
        time.sleep(60)

        subprocess.call(['psService.exe', *tomcat_start_arguments])
        time.sleep(60)
        log_object.write("Application services have been restarted at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")

        log_object.write("Initiated SQL services restart at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        subprocess.call(['psService.exe', *mssql_arguments])
        time.sleep(60)
        log_object.write("SQL services have been restarted at: " + datetime.now().strftime("%m/%d/%Y, %H:%M:%S") + "\n")
        # Recreate the flag and release its handle right away
        open('reboot_flag', 'x').close()
        log_object.close()
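
One detail worth calling out when shelling out to PsTools from Python: the remote computer name must start with two backslashes, and a normal Python string literal turns '\\' into a single one. The hypothetical helpers below build the command as a list with one argument per item. They only construct the list and do not run anything.

```python
def psservice_command(host, action, service):
    """Build a psService.exe command; action is e.g. 'start' or 'restart'."""
    return ["psService.exe", r"\\" + host, action, service]


def pskill_command(host, process):
    """Build a pskill.exe command that targets a process on a remote host."""
    return ["pskill.exe", r"\\" + host, process]


command = psservice_command("sqlserver.yourdomain.com", "restart", "mssqlserver")
print(command)
# The result can be passed straight to subprocess.call(command).
```

Passing a list with separate items lets subprocess handle the quoting, so each value reaches the tool as its own argument.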

Conclusion

Through research and development, I was able to build a proactive monitoring solution using Selenium and Python that significantly reduced downtime. This automation eliminated the need for manual intervention, allowing engineers to focus on higher-priority tasks. Eventually, after months of investigation, the vendor provided a script to clean unnecessary data, permanently resolving the deadlock issue. However, during that time, our automation saved countless hours and prevented service disruptions.

By leveraging Selenium, Python, and system administration tools, we successfully implemented an automated recovery system that ensured seamless application availability with minimal human intervention.