Convergence of Data, Cloud, and Infrastructure

Thoughts on AI Data Infrastructure Field Day #1

David — Mon, 11 Nov 2024 00:07:47 +0000

On October 2nd and 3rd, I took part in an amazing event called AI Data Infrastructure Field Day as a delegate. If you’ve never heard of this event before, it’s part of a series of tech events from the folks over at Tech Field Day where different tech companies present on cutting-edge technologies. The delegates get to ask the hard questions during the presentations to try to understand the tech and determine if it really lives up to the marketing. It’s a blast, and the presentations are on some serious next-level gear.

The companies presenting at this event were Google Cloud, HPE, Infinidat, MinIO, Pure Storage, and Solidigm. The scale is just incredible with all of these groups. The petabyte is the new gigabyte, and exabytes were thrown around like I drink glasses of tea. The price of storage is secondary to the price of the GPU, so the goal is to make the most of the GPU by maximizing everything underneath.

Now, what is ‘AI data infrastructure’ you might ask? I had to ask that question a few times, as my background is largely relational data. OLTP databases require ultra-low latency, high IOPS, and lots of concurrency, but with AI data, the needs are different.

From some companies’ perspectives, it’s delivering the fastest and best scale-out storage that money can buy so you can scale-out read faster than anyone, and then checkpoint to disk your AI training data to keep the GPUs busy. The faster you can read, the more data you can feed the GPUs. The faster you can checkpoint, the more the AI training can return to the GPU. The more you lean on the GPU, the more efficient the platform is, and the business gains a competitive edge.

So, we need massive scale-out capabilities. We need ultra-high throughput since these data objects can be quite large. We also need insanely low latency so we can checkpoint to disk – and NOW.

Most of the presentations fell into this category, but each took a different twist on the offerings.

MinIO makes software that allows for object storage as containers on both public and private clouds. Deploy container-based scale-out S3-compatible object storage endpoints quickly and easily. They’ve become a standard for object storage, even though the basic offering is open source and free. They demonstrated that this is a go-to for this sort of storage, and did a great job demonstrating their efficacy at providing this level of storage, no matter the choice in the underlying platform. I love the flexibility their offerings deliver, mostly because they are built on a standardized storage protocol (S3) and one platform that can scale as low or as high as required.

Google Cloud then presented on their object storage offerings. Google shouldn’t be a stranger at this point. I find it interesting that they have numerous distinct offerings, based on your needs – Cloud Storage, Parallelstore, Storage Scale, Filestore, and Hyperdisk ML. It also felt like there was significant overlap in the offerings. They have so many offerings they had a decision tree matrix, and even then, it felt a bit complex. The platform would certainly scale and work very well, if your AI workload were located on the Google public cloud. Native offerings on a given cloud provider should just work. I would just really need to study to determine which is the right initial platform to choose, and as a workload scaled, determine the best method to move the data to the next level as needed.

Infinidat came next to discuss their on-premises massive storage arrays. I’ve been a big fan of their arrays for years for relational database workloads, as we have many clients on their storage with great success stories. Their presentation was largely focused on the premise of ‘you need fast storage for AI, so here’s what we can do.’ Storage companies are always focused on faster, larger, and easier, and they hammered on these points hard. I just didn’t see any features within the array that could contribute to the AI initiative, other than being great at scaling out and being fast.

Solidigm took a different approach that I found refreshing. Solidigm makes NVMe storage rather than arrays or protocols. They focused on the Total Cost of Ownership (TCO) rather than a more operational approach of faster, bigger, etc. The presentation focused on how their drives provide more efficient storage, which saves costs by reducing factors such as power, rack space, cooling, maintenance, disposal, etc. I could have spent the rest of the day just drilling into the actual calculations on the cost drivers and price optimization that businesses can leverage from utilizing their storage.

Pure Storage also discussed their scale-out FlashBlade storage platforms. I’ve also been a huge fan of Pure since the beginning, have numerous clients happily using their arrays, and have actually done significant work with their internal teams on the SQL Server side of the house. The presentation felt similar to Infinidat in that it focused on ‘we are insanely fast and can scale out’. Both did a great job presenting their storage, but could have done a better job of talking specifics on optimizing the AI workloads rather than focusing on ‘faster means better AI.’

From a different perspective, rather than being a foundation for improved efficiency in the GPUs, other offerings want to be more than just storage. To me, HPE stood out in this area, and I feel that the presentation was the most impressive of the entire event. HPE focused not just on storage, but wanted to provide the end-to-end foundation for enabling businesses to get started with AI. Organizations could have storage or compute ready to go, but may need tooling to feed their existing data into AI platforms. They have to wire in each data source manually, determine the data model, figure out a way to get the data into a central location (and store it), then figure out how to best utilize it. Too many organizations, IMHO, are too busy putting out fires each day to put the energy into these layers of the AI story, but these items are required before the data can be even considered for AI modeling.

They have released a platform that they call “HPE Private Cloud AI”. It’s an end-to-end solution that contains the storage, compute, GPUs, AND the software components. The included software allows organizations to integrate and connect to the various data sources, and it includes the AI-specific packages as well so you can quickly get up and running. The platform bundles numerous open-source packages to provide a ready-to-go endpoint that will act as the cornerstone that allows businesses to just ‘use’ the platform to start the journey into AI.

I think this is one of the best ‘private cloud’-like stories that I’ve ever seen. Many companies treat private cloud as a virtual platform with administrators on the other end of the support ticket manually performing tasks like deploying and integrating, but the automation aspect just means more bodies. Few have what I’d consider good automation in-house, and fewer have good integration of data sources. This platform allows the organization to skip the lengthy journey into automation and integration, and to just start using it. Save the stress, time, and re-work, and just get going.

Check out the sessions on YouTube here. I’m proud to have been a part of this event, even if I was a bit quieter than I’d have liked. It was a lot of great data to absorb, and I’m still digesting a lot of the content here weeks later. I can’t wait for the next one!

Know Thy Platform

David — Thu, 22 Feb 2024 15:34:00 +0000

IT and data professionals, I implore you – know thy platform. All of it. Not just the layer your job is tasked with. Modern public cloud platforms (or any infrastructure, for that matter) have more to them than just one layer.

Here’s an example from just recently. A long-term client called up and said one of my favorite phrases – “our test system is running slow.” The test system is on Azure IaaS.

Once I got further details, it became a bit clearer. On the test system, they were modifying an integration with an external system, and something hiccupped on a test run and overwrote some data incorrectly. No biggie – they recovered the database from the previous night’s backups and fixed the problem. But – about ten minutes after the restore, the system just slowed to a crawl, and stayed that way for three days.

After digging into it, we found a continuous loop of full database backups being sent to Azure blob storage. Azure was not handling the refresh appropriately. Instead of doing one full backup and then resuming log backups, it was just looping through a full backup each time.

The symptoms were clear:

Sustained disk reads over 200MB/s
Storage latency was over 640ms for reads
Significant CPU utilization
Network transmit throughput over 900Mb/s
sp_WhoIsActive reporting an active database backup to Azure blob URL

The backup loop was saturating the VM and either the vDisk or VM’s scale was imposing a storage speed ceiling, capping the overall performance and tremendously slowing down read access. An application login would normally take five to six seconds, but now took a whopping six to seven minutes to complete.

We ended up killing the backup SPID and adjusting the backup policy on the Azure VM from automatic to manual, with one full backup being performed at night and log backups running every five minutes. Immediately, the problem disappeared, app performance went back to normal, and everything stabilized.

Taken individually, these symptoms and signs would not tell you much, but together they painted a pretty clear picture of what was going on. The folks there are seasoned pros, sharp, and well-intentioned, but hadn’t communicated the symptoms from the various layers with each other, so it didn’t make sense.

So – please – spend time digging in and understanding the basics of each layer at and underneath the data. Knowledge of the architecture and components, how they tie together, awareness of any performance limits at that layer, and being able to spot when one of the artificial caps is triggered, are all critical to daily management of these platforms.

Digging up undiscovered savings

David — Sun, 21 Jan 2024 18:20:14 +0000

You might be able to shave off thousands – or more – in your monthly cloud bills for your critical SQL Servers, all while maintaining or even improving performance.

Public cloud providers charge organizations for everything they deploy, and while a few items in the cloud are based on a pay-by-consumption model, most of the services that really stack up, namely SQL Server licensing, are charged on a pay-by-allocation model. If you provision a SQL Server VM with eight cores and promptly forget about it, you will be shocked at the end of the month when you receive your cloud subscription bill. If you provision a 32-core VM, and your workload only uses four cores, you are still paying for all 32 cores, regardless of the low utilization rates. The same goes for memory and storage. A large memory footprint and lots of high-speed attached managed disks all add up.

It’s time to be objective and ask the question: “How are my servers using the resources currently allocated?” I find that most organizations do not have a solid resource consumption baseline for their servers, so understanding where cost optimizations can be made becomes a guessing game, which is the exact opposite of what we need, and usually backfires spectacularly.

A few tools exist to help with this process, but be careful and weigh the output carefully. Highly granular data helps immensely, as the lack of granularity can either skew the output from the law of averages, or can weight the results differently. For example, a 30-second CPU spike from a time-sensitive job running at midnight might not even appear in the telemetry if the telemetry reports only the average of each 30-minute interval. A utility that reports just the minimum, average, and maximum CPU usage in a day might show a max at 95% and base its recommendations on this spike. But if this spike occurs at 2:00am because of routine database index maintenance that finishes in an hour when we have three hours for it to safely run, and the business-day CPU consumption peak is just 15%, the lack of context in the results can cause you to spend too much.
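To make the averaging problem concrete, here is a minimal sketch. The workload is made up for illustration (a server idling at 10% with one 30-second burst at 95%), and `summarize` is just a hypothetical helper, not part of any monitoring tool.

```python
from statistics import mean

# One CPU sample per second over a 30-minute window (1,800 samples).
# Hypothetical workload: idle at 10% except for a 30-second burst at 95%.
samples = [10.0] * 1800
for second in range(600, 630):
    samples[second] = 95.0


def summarize(values, bucket_seconds):
    """Average the per-second samples into fixed-size buckets."""
    return [
        mean(values[start:start + bucket_seconds])
        for start in range(0, len(values), bucket_seconds)
    ]


coarse = summarize(samples, 1800)  # a single 30-minute average
fine = summarize(samples, 30)      # sixty 30-second averages

print(f"30-minute average: {coarse[0]:.1f}%")
print(f"Peak of the 30-second averages: {max(fine):.1f}%")
```

The coarse view reports roughly 11%, while the fine-grained view still shows the 95% burst. Which one matters depends on whether the job behind the burst is time sensitive.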

I’ve even seen some folks provision 16-32TB of storage just to get a high IOPS rate for VMs, only to find that the metrics they were basing the estimate on were from database backups over a weekend, and were in no way even close to consumption rates during the week. (They were also hitting the VM’s I/O limits well before maximizing the available speed on the virtual disks.)

Usually, I find that retrieving the data yourself with built-in tools such as Windows’ Perfmon and Linux’s sar can provide as granular data as you need to help you build a better resource consumption baseline. Take the raw data and view the in-depth metrics, such as CPU consumption by-core, the memory usage of your application/database engine (and not just the OSE, as many applications like databases hide their memory usage stats from the underlying OSE), disk read and write demand, etc. Every bit of data helps paint a more complete picture of a day in the life of this server.

Next, understand the workload itself from a business perspective. A database server underneath a Monday-to-Friday 8:00am to 5:00pm business has a far different resource consumption footprint than an around-the-clock, three-shift-a-day manufacturing facility. What are the normal operating windows that have time-sensitive tasks? It could be limited to just normal business hours, or could be constant or scattered across the week.

If you see spikes or dips in the telemetry, what task or process corresponds to the change? A sharp CPU spike at 2:00am from index maintenance that runs in 20 minutes could be perfectly normal and acceptable to run a few minutes longer. Review the big items to understand if they are time sensitive or not, and what sort of impact it has on everything running concurrently.

Now, map out consumption rates versus resource allocations. You’ll be shocked to find that in almost all cases, your servers are oversized. Oversized means you are paying too much. Use the baseline to estimate how much you could reduce your VM’s resource allocations without impacting performance. (In a few cases, you might find some of your servers are actually undersized and starved for resources, and you can adjust the resource allocations accordingly to regain your performance.)
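As a rough sketch of that mapping, the snippet below takes hourly CPU samples, keeps only the business window, and estimates how many cores would serve the observed peak. The sample data, the 8:00am to 5:00pm window, and the 60% target utilization are all assumptions for illustration, and both function names are hypothetical. Your own baseline should drive the real numbers.

```python
import math


def business_hours_peak(samples, start_hour=8, end_hour=17):
    """Return the highest CPU percentage seen inside the business window."""
    in_window = [pct for hour, pct in samples if start_hour <= hour < end_hour]
    return max(in_window)


def cores_needed(allocated_cores, peak_pct, target_pct=60.0):
    """Cores required to serve the observed peak at the target utilization."""
    return math.ceil(allocated_cores * peak_pct / target_pct)


# (hour of day, CPU %) pairs: a 2:00am maintenance spike and a quiet workday.
samples = [(2, 95.0), (9, 12.0), (11, 15.0), (14, 13.0), (16, 9.0)]

peak = business_hours_peak(samples)
print(f"Business-hours peak: {peak}%")
print(f"Cores suggested for a 32-core VM: {cores_needed(32, peak)}")
```

The 2:00am spike is deliberately left out of the sizing, so confirm that the off-hours jobs can tolerate a longer runtime before acting on a number like this.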

In some cases, reducing the allocated resources could actually improve the performance of the VM. Every single compute resource allocated to a VM must pass through the hypervisor’s scheduling queues, and cloud is nothing more than someone else’s datacenter with automation on top of some hypervisor. The fewer resources that you must push through the queues, the less overhead the hypervisor could impose on your workload, and that means it could get faster.

An example could be a 16-core VM with SQL Server Standard edition installed. An Azure E16ads v5 VM, with 16 vCPUs and 128GB of RAM, lists at $2,470 a month without managed storage. If your workload is only using three or four CPUs of actual consumption, why not test downsizing it? Chopping the VM scale in half to an E8ads v5 reduces the monthly cost to $1,235. Your utilization percentage goes up, but if your server is running in a ‘safe’ range and not constantly maxed out, you just saved your organization a significant amount of money with just a couple of clicks while maintaining the speed your users expect.

This example covered just one server. What if you had a hundred like this? Or a thousand? How much of your monthly cloud budget can you reclaim? DBAs today must be as much of a cost optimizer as they are a performance and availability engineer, so use this technique to review your resource allocations and save your organization a lot of money!
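The fleet-wide arithmetic is easy to sketch. The two monthly prices are the ones quoted in the example above. The fleet sizes, and the assumption that every server can be halved the same way, are purely illustrative.

```python
current_monthly = 2470    # E16ads v5 price quoted above, USD per month
downsized_monthly = 1235  # E8ads v5 price quoted above, USD per month


def annual_savings(server_count):
    """Yearly savings if every server in the fleet is downsized the same way."""
    monthly = (current_monthly - downsized_monthly) * server_count
    return monthly * 12


for count in (1, 100, 1000):
    print(f"{count:>5} servers: ${annual_savings(count):,} saved per year")
```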

SQL Server query runtime is not everything

David — Wed, 18 Oct 2023 20:14:09 +0000

SQL Server query developers, listen up! Query execution time is not everything you should be worried about. You need to examine the parse and compilation time for each of your queries too.

When you run a query, the SQL Server engine executes it. You get a runtime in SSMS or whichever tool you’re developing in. Take this “fine” example of a dashboard query from an ERP system.

If you go to the Messages pane to look for more details, it’s pretty sparse by default.

This runtime is not the complete picture of what it is doing while processing your query. Run the command “SET STATISTICS TIME, IO ON”. It provides a greater level of information in the Messages pane in SSMS. Run that command, then re-select and execute your query.

Now open the Messages pane, and you are presented with much more information.

Do you see how it now breaks out parse and compilation time from execution time? In this example, this query took 3.4 seconds to parse and compile. That is an eternity.

We can also see that the query went parallel. The total execution time was 7.7 seconds, but we have approximately four times that of CPU time. We can estimate that our actual max degree of parallelism was across four CPUs.
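That estimate is just CPU time divided by elapsed time. Here is a minimal sketch. The millisecond figures are hypothetical values shaped like the example (7.7 seconds elapsed, about four times that in CPU), not the actual output from this query.

```python
def estimated_parallelism(cpu_ms, elapsed_ms):
    """Rough effective parallelism: total CPU time divided by elapsed time."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return cpu_ms / elapsed_ms


# Hypothetical figures: 7.7 s elapsed and roughly four times that in CPU time.
ratio = estimated_parallelism(cpu_ms=30_800, elapsed_ms=7_700)
print(f"Estimated parallelism: about {round(ratio)} CPUs")
```

Treat the ratio as a hint only, since waits and uneven work distribution across threads can pull it well below the configured degree of parallelism.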

You might even have some queries that spend an eternity compiling and then the execution time is very brief!

What is interesting is that when we look at our execution plan via SSMS, it does not directly display any item related to compilation in the root node of the plan. (It does show a lot of other garbage, but that’s a topic for a different day.)

It does show that our degree of parallelism is four, just as we estimated. (There are some execution plan tools that do show this, but shall remain nameless.)

Edit the plan’s raw XML, however, and we can see some more details missing in the GUI.

At the top of the plan XML under a section called QueryPlan is an entry for CompileTime, measured in milliseconds.
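If you save a plan’s XML, that attribute can be pulled out with a few lines of script. This is a minimal sketch against a trimmed, made-up plan fragment. A real plan carries many more elements and attributes, and `compile_times_ms` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by SQL Server showplan XML documents.
NS = {"sp": "http://schemas.microsoft.com/sqlserver/2004/07/showplan"}

# Trimmed, hypothetical plan fragment for illustration only.
plan_xml = """<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan">
  <BatchSequence><Batch><Statements>
    <StmtSimple StatementId="1">
      <QueryPlan DegreeOfParallelism="4" CompileTime="3400" />
    </StmtSimple>
  </Statements></Batch></BatchSequence>
</ShowPlanXML>"""


def compile_times_ms(xml_text):
    """Return the CompileTime (ms) of every QueryPlan element in the plan."""
    root = ET.fromstring(xml_text)
    return [
        int(plan.attrib["CompileTime"])
        for plan in root.iterfind(".//sp:QueryPlan", NS)
        if "CompileTime" in plan.attrib
    ]


print(compile_times_ms(plan_xml))
```

For a plan saved to disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring`.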

Now, in many cases where queries contain any number of suboptimal items, our compilation time can actually (sometimes greatly) exceed the runtime of the query! This list includes:

Significant amounts of business logic embedded in the code
Missing indexes or outdated statistics
Use of dynamic SQL
Querying views
Views joined to other views
UDFs and TVFs, especially in query predicates

The tough part is that as the query complexity grows, so does parse and compile time. The use of suboptimal constructs amplifies it. With that compilation time comes a spike in CPU and some memory consumed while it compiles. If the query parse and compile time reaches two seconds, an internal engine threshold, it hits a timeout, stops the process, and just goes with what it has at that time. It might be an optimal plan… or it might be a bad plan. (If the engine encounters some of these items listed above or a lot of other scenarios, it can recompile, which can cause the parse and compilation time to grow even higher.)

It gets worse. Many SQL Server monitoring tools usually do not capture the compilation time, but just capture the execution runtimes. This omission means that a significant portion of the actual runtime of the overall command could be completely missing from any monitoring telemetry you might have. With a diagnostics query, you can view the compilation time for all queries pulled from the execution plan cache. Thanks Jonathan for the query!

At a micro-level, review your query parse and compilation time as you develop your queries. If it suddenly spikes or is elevated to begin with, undo the last change and review before deploying the code to production.

At a macro-level, review the top compilation offenders. You might be surprised at what you find! At that point, start reworking the queries that you have under your control and cut that time down! An easy band-aid solution is to repackage the queries as a stored procedure, as you might just find your plans get re-used much more frequently, and that compilation time impact is cut to zero once a plan is cached. (It won’t fix the inefficient commands, but can at least reduce that compilation time.)

If high compilation times are encountered in third-party databases, take this raw detail and go to them to present the findings. They need to address these inefficiencies as soon as they can, or else you will eventually be needing to add more CPU cores to this server just to handle the inefficient code, all of which costs your organization money in the form of new SQL Server licensing.

Data Platform SLAs

David — Mon, 18 Sep 2023 14:41:00 +0000

Database professionals of the world – I have a question. Has your organization defined service level agreements (SLAs) for your data estate? I’m talking specifically the Recovery Point Objective (RPO) and Recovery Time Objective (RTO), and to have these defined not in an arbitrary number of nines, but in minutes or hours. If these aren’t defined from above, your business continuity plan is doomed to fail.

In basic terms, RPO is how much data, usually measured in time, your business is willing or able to lose if a system failure occurs. RTO is the amount of time that the business-critical systems can be offline before the outage causes a disaster for the organization. These metrics should be defined both for planned and unplanned outages. A planned outage could be that a critical system needs routine maintenance, such as software updates or operating system patching. Planned outages are sometimes not factored into system designs, so core systems can go unmaintained and lead to security issues or platform instability. Unplanned outages can be small and limited in scope, such as an operating system that freezes and needs to be restarted, or as large as a critical site outage caused by a centralized storage failure or a natural disaster.
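Both metrics can be measured for any single incident from three timestamps. The sketch below uses hypothetical timestamps and hypothetical targets (30 minutes of data loss, 24 hours of downtime) to show the comparison. `achieved` is an illustrative helper, not a feature of any product.

```python
from datetime import datetime, timedelta


def achieved(last_recoverable, failure, restored):
    """Return (data loss window, downtime) for a single incident."""
    return failure - last_recoverable, restored - failure


# Hypothetical targets for a larger incident.
rpo_target = timedelta(minutes=30)
rto_target = timedelta(hours=24)

# Hypothetical incident: last recoverable point, failure, service restored.
data_loss, downtime = achieved(
    last_recoverable=datetime(2023, 9, 18, 2, 55),
    failure=datetime(2023, 9, 18, 3, 10),
    restored=datetime(2023, 9, 18, 9, 40),
)

print(f"Data loss window: {data_loss} (within RPO: {data_loss <= rpo_target})")
print(f"Downtime: {downtime} (within RTO: {downtime <= rto_target})")
```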

When you design a business continuity strategy, these SLAs must be defined not by a given database availability feature that you might want to use, but by the C-level in your business. The business must be on board with these metrics from the top-down. If the business hasn’t (or won’t) define these two metrics, the unwritten expectations in the minds of the leaders are that any outages will result in no data loss and have near immediate recoverability. They might tell you best effort, or give vague requirements, but without the formal SLAs having been defined, an outage will bring out the ‘best’ in people when the systems are offline longer than the business can sustain.

In some cases, then, failure to define the SLAs means the business continuity strategy is left to the implementers in IT without clear design targets. At that point, it becomes best effort. The constraints of designing a modern data platform, be it in the cloud or on-premises, mean the availability and disaster recovery options are limited by the budget of the IT organization. This budget rarely allows for a design that meets or exceeds these unwritten expectations.

In other cases, members of IT want experience with a certain continuity feature, such as Microsoft SQL Server availability groups. But, without the defined SLAs as targets, designing a continuity architecture by starting with the features rarely gives the ‘right’ level of SLA, and usually results in overcomplicating the architecture. Overcomplicating the platform almost always results in additional outages, or longer outages, at the end of a given calendar year, which defeats the purpose of the architecture.

So, let’s say your organization has now properly defined the SLAs. Examples of SLAs that we usually see in the field are RPOs for no data loss for minor localized incidents and 30 minutes for larger incidents, and an RTO example of 30 minutes for smaller-scale incidents and 24 hours for a larger scale outage. At this point, we can then evaluate our options and start to select certain techniques and features to meet (and hopefully exceed) the SLAs.

Three main items now need to be planned – a backup and restoration solution, high availability, and disaster recovery. These tend to blur some lines and overlap a bit, depending on the solution. For more complex platforms like database servers, the OSE and the databases need to be considered separately. If a backup solution can accommodate both to the same degree to meet or exceed the SLAs, single solutions are better than multiple solutions that might collide or compete.

At this point in the process, I always ask a lot of questions now that the SLAs are defined. First, how advanced or “seasoned” are the staff members managing these servers? Would a stand-alone server with no fancy HA or DR configurations meet the SLAs? Can the staff support a more complex architecture, and if not, are they willing (or able) to learn? That question is much harder than it sounds, as most won’t admit if they cannot support a solution that they might want more experience managing. If not, consider the architectural options carefully, and consider bringing in outside help to help engineer and train on the solution.

Start mapping out available features and solutions, based on what the staff can support (or external folks to help engineer and train). Let’s assume that the staff are fairly seasoned and can support a more complex environment. Let’s also assume that these servers are ones in my wheelhouse, namely SQL Server VMs. Map out the features available for HA and DR and their SLAs.

Layer	Feature	RPO	RTO	Note
SQL Server	Availability Groups (synchronous)	Zero	Less than 30 seconds	Assuming not located on same storage device and that device does not fail
SQL Server	Availability Groups (asynchronous)	Low	Less than 30 seconds	“Low” RPO is based on rate of data change and bandwidth to replicate data to secondary replica, and is usually less than one minute during business hours
SQL Server	Failover Cluster Instance	Zero	Less than five minutes	Assuming shared storage failure does not occur, if present
SQL Server	Log shipping	Low	Low	“Low” RPO based on log replication timing and available bandwidth. “Low” RTO is based on staff availability to perform destination promotion to live and time to change app connection config
Infrastructure	Backup replication	Varies	Varies	Varies based on backup software features and platform speed for recovery
Infrastructure	Storage replication	Low	Varies	Subject to SAN LUN-level replication windows and platform/application changes required to promote replicated copy to active
Infrastructure	VM-level replication	Zero to low	Varies	Synchronous or asynchronous, depends on bandwidth, but could lack point-in-time recoverability, and RTO varies greatly
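A first-pass filter over this table can be sketched in a few lines. The numeric ceilings below come from the rows that state a number. Treating the asynchronous availability group RPO as 60 seconds is an assumption based on the “usually less than one minute” note, and every “Low” or “Varies” entry is left as unknown rather than guessed. The caveats in the Note column are not modeled, and `classify` is a hypothetical helper.

```python
# (RPO ceiling, RTO ceiling) in seconds. None means the table says "Low" or
# "Varies", so the value has to be measured on your own platform.
FEATURES = {
    "Availability Groups (synchronous)": (0, 30),
    "Availability Groups (asynchronous)": (60, 30),  # assumed from the note
    "Failover Cluster Instance": (0, 300),
    "Log shipping": (None, None),
    "Backup replication": (None, None),
    "Storage replication": (None, None),
    "VM-level replication": (None, None),
}


def classify(rpo_target_s, rto_target_s):
    """Split features into those meeting the targets and those left unknown."""
    meets, unknown = [], []
    for name, (rpo, rto) in FEATURES.items():
        if rpo is None or rto is None:
            unknown.append(name)
        elif rpo <= rpo_target_s and rto <= rto_target_s:
            meets.append(name)
    return meets, unknown


# Minor-incident example from earlier: no data loss, 30 minutes to recover.
meets, unknown = classify(rpo_target_s=0, rto_target_s=30 * 60)
print("Meets targets:", meets)
print("Needs measurement:", unknown)
```

The unknown list is the useful part, because those are the options you have to test in your own environment before ruling them in or out.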

Map out your available options carefully, and note which layer is responsible for the availability action. Knowing which layer is responsible means you can train the people involved, as a system administrator might not be comfortable restoring a SQL Server Availability Group architecture without the help of a DBA. Planning now for the personnel required will help speed up the recovery time in the event of an actual emergency.

Understand the nuances that accompany your specific platform, such as available bandwidth, endpoint latency, or dependencies on other items in the environment. Just know that a database being online and available doesn’t mean the application can connect, so the data platform is still down for the users. Document and ramp up the level of staff involvement for manual processes, such as application-level changes that must be made to get the application talking to the database, or public IP addresses updated for a web site to fire up. Add variables such as complexity of the architecture, processes if key staff are unavailable, and expected time to recovery for various scenarios impacting the design. Document and identify common situations that might take down a system or site, such as small-scale events like a bad OS-level patch, or larger events such as a hurricane hitting your primary datacenter and taking out the power for a week.

Build your prototype platform. Validate the architecture with failover and fail-back testing. Rarely do organizations factor in the fail-back portion of a business continuity strategy, and that usually only becomes apparent during the fallout from an actual emergency.

Finally, test. Test. TEST. And then test some more. And don’t just test until it finally works. Plan to test periodically throughout each year after you take this design to production, as data platforms are a fluid architecture and subject to constant change. A working BC test today might be completely trashed if an undocumented router change is not replicated to the DR equivalent. Incomplete testing almost always results in some setting being missed, which results in a horrendous experience if a failover for an emergency is required. Most successful business continuity strategies actually plan to do a full failover and run FROM the DR site for a portion of the year so that it is confirmed that the end-to-end details of a failover are proven to work.

Proper recoverable backups are the foundation to any business continuity strategy. If you can’t recover a backup, the rest of the platform is almost useless. You must start here, because recovering the data is arguably the most critical step. Your availability architecture flows from this step based on your SLAs.

SQL Server on VMware Accelerator now free!

David — Fri, 07 Jul 2023 15:44:03 +0000

My SQL Server on VMware Accelerator boot camp video series is now live on YouTube! There are no strings attached and no price of entry, so there’s no reason why you can’t join me in this adventure to learn more about how to tune the performance and availability of your SQL Server on VMware data estate.

This immersive video series is designed to accelerate your ability to improve the performance and availability of your enterprise SQL Server databases on VMware. Drawing on over twenty years of hands-on operational experience in the topic, this guide is packed full of tips and tricks designed to help administrators of all levels make the most of their virtualized data estate.

Module 1 – Why SQL Server on VMware?
Module 2 – Enterprise Storage
Module 3 – Networking & Interconnects
Module 4 – Physical Host
Module 5 – VMware ESXi
Module 6 – Host Clusters & Operations
Module 7 – SQL Server Virtual Machine
Module 8 – Windows Operating System
Module 9 – Linux Operating System
Module 10 – SQL Server Instance & Databases
Module 11 – Performance Tuning
Module 12 – Availability Tuning
Module 13 – SQL Server on VMware Licensing
Module 14 – Operations
Module 15 – Wrap Up

Architecting MS SQL Server on VMware vSphere 8.0 released

David — Thu, 23 Feb 2023 23:48:03 +0000

I’m thrilled to announce that the “Architecting Microsoft SQL Server on VMware vSphere” 8.0 best practices guide has been released! I’m pleased to have been able to contribute to this document. This document covers some very positive and significant changes VMware has made to the 8.0 release that apply to your enterprise SQL Server workloads, especially around vCPU presentation to SQL Server virtual machines. Read this as soon as you can before you embark on your 8.0 upgrade journey. Let me know if you have any questions!

VMware & SQL Server 823-824 alerts

David — Wed, 18 May 2022 14:10:42 +0000

We’ve been tracking a weird state with SQL Server virtual machines on VMware and possible warnings on database corruption while VM backups are running, largely centered around (but not isolated to) the tempdb database.

TLDR: We’ve now got a VMware KB article on this situation that you and your VM admins should read if you hit the condition and fall into the specifics listed below. Reference VMware KB 88201 for more details.

Here are the specifics.

VMware vSphere 7.0 Update 2 introduced something called a bloom filter to boost storage performance. When a snapshot is present on a virtual machine and the VM is relatively active, we’ve very occasionally seen 823/824 alerts, largely centered on but not limited to the tempdb database. It’s largely during the time when a VM is getting backed up. It’s very rare, but it does happen and we’ve seen our fair share of these over the last two months.

DBAs across the world have been pinging me on this for a few months now as their VM admins update to the latest VMware patches.

VMware has a short-term fix for this to disable the bloom filter, but it must be re-disabled upon ESXi host reboots at this point. They’re working on an improved resolution path on this issue.

If you encounter this, please let me know. We’re looking for more sites that can repeatedly reproduce this state, as it’s quite rare. I’ll keep you all in the loop as new updates are released.

UPDATE: VMware just released 7.0 Update 3f, which officially disables the bloom filter. I’m not sure if VMware intends to re-enable this in the future, but the root cause of the data integrity issue is still being investigated. Apply this update as soon as you can if you are experiencing the issue detailed above!

Add Disk Performance Counters to Windows Task Manager

David — Tue, 26 Apr 2022 14:22:00 +0000

Have you ever wondered why Windows Server doesn’t show disk performance metrics in Task Manager, but your Windows 10/11 OS does? It’s a really silly difference.

Let’s fix that. Open PowerShell or the command prompt with administrative permissions. Run “diskperf -y”.

Close and re-open Task Manager. Click the Performance tab again. Voila! Apply this to your templates and your existing servers and never be without some critical performance counters again.

SQL Server on Linux Storage Secrets on Microsoft Channel 9

David — Thu, 29 Jul 2021 14:37:00 +0000

I’m thrilled to have recently released a video with Microsoft’s Data Exposed on Channel 9 where I discuss setting up Linux storage appropriate for on-demand disk expansion for your production SQL Server on Linux workloads. I’ve been working with Linux operating systems since before I knew Windows, and while Linux is an amazing operating system to run SQL Server on, it’s just simply different than Windows in a lot of ways. When people put SQL Server on Linux into production, and then need to grow a virtual disk sometime in the future, normally, well, you can’t expand it. If this is due to an out-of-space issue, this is a major challenge. Come learn with me how to set up your disks for on-demand expansion, similar to how Windows does it, for your production SQL Server on Linux deployments!