Eric Sloof

Kong on vSphere Kubernetes Service – What does this white paper cover?

esloof@ntpro.nl (Eric Sloof) — Sun, 07 Jun 2026 09:18:00 +0000

Broadcom published a technical white paper in February 2026 covering the integration of Kong API Gateway with vSphere Kubernetes Service (VKS) within VMware Cloud Foundation. The paper describes a reference architecture for organizations looking to combine Kubernetes workloads with enterprise-grade API governance.

Why Kong on VKS?

As Kubernetes environments scale, the challenge shifts from raw throughput to governance: how do you ensure consistent security, predictable latency, and auditable traffic management across all your microservices? Kong acts as the intelligent "front door" of the VKS cluster, filling exactly the gap that standard ingress controllers leave behind.

Two deployment models

The white paper fully details two architectures, complete with all accompanying YAML and Helm commands:

On-premises – Both the Control Plane and Data Plane run locally within the VKS cluster. This model is particularly suited for high-compliance environments where the management layer must remain on-premises — a familiar requirement in public sector environments.
Hybrid (Kong Konnect) – The Control Plane is a SaaS service, while the Data Plane remains local. Operational management is centralized in the cloud, but all API traffic is processed within your own infrastructure boundary.

Technical environment

The validation was performed on VKS version 3.5, Kubernetes v1.34.2, Ubuntu 24.04, and vSAN ESA with RAID-5 as the storage policy. The Kong Operator (v1.0.2) was installed via Helm, supplemented by cert-manager for automated mTLS certificate rotation between the Control Plane and Data Plane.

Relevance for VCF infrastructure design

For infrastructure architects working on VCF designs, this paper is particularly interesting because it demonstrates how Kubernetes-native tooling — Gateway API, Cluster API, GitOps via Argo CD — integrates seamlessly with existing vSphere workflows. NSX handles network micro-segmentation, the vSphere CSI driver transparently exposes vSAN storage policies as Kubernetes storage classes, and the entire stack is declaratively manageable — including certificate lifecycle and routing policies as code.

Source: VMware by Broadcom Technical White Paper – Kong on vSphere Kubernetes Service

Chatting with 9,000 Pages of VMware VCF 9.1 Documentation — Locally, Free, and Private

esloof@ntpro.nl (Eric Sloof) — Mon, 25 May 2026 15:54:00 +0000

Imagine having the entire VMware Cloud Foundation 9.1 documentation right at your fingertips. We're talking about over 9,000 pages of dense technical content. Finding a specific answer usually means endless searching through PDFs. But what if you could just chat with it?

That’s exactly what I’ve set up. And the best part? It runs entirely locally on a MacBook Pro. No subscription costs, no cloud processing, and absolutely zero data leaving your machine.

What Did We Build?

We built a local RAG (Retrieval-Augmented Generation) pipeline. In simple terms, it's an AI system that understands your questions, instantly retrieves the most relevant sections from the VCF documentation, and delivers a clear, accurate answer. It's like having a VMware expert sitting right next to you.

Here’s the stack:

Ollama: The engine running AI models locally on your Mac.
Mistral 7B: The language model that answers your questions.
AnythingLLM: The user interface for chatting with your documents.
MacBook Pro M2 Max (64 GB): More than enough power to run this setup smoothly.

Why Not Train a Custom Model?

When people think about combining AI with their own documents, they often jump straight to training a custom model. Let me be direct: that’s unnecessary and, for this use case, completely the wrong approach.

Approach	Cost	Time	Results
Training a Model	€5,000 – €50,000	Weeks	Mediocre for specific docs
RAG with AnythingLLM	€0	One afternoon	Excellent and highly accurate

RAG works differently. Instead of baking knowledge into a model's weights, the system retrieves the relevant document sections at query time and passes them to the language model. It's faster, cheaper, and far more accurate for domain-specific documentation like VCF.
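To make that retrieve-then-generate flow concrete, here is a deliberately tiny shell sketch. It is a toy under stated assumptions: it ranks text chunks by plain keyword hits, whereas a real RAG setup like AnythingLLM ranks them by embedding similarity. The function names best_chunk and build_prompt and the directory of .txt chunk files are hypothetical.

```shell
#!/bin/sh
# Toy retrieval step: print the .txt file in a directory that contains the
# most case-insensitive occurrences of a keyword. Keyword counting only
# stands in for the embedding search a real RAG system performs.
best_chunk() {
  dir=$1
  keyword=$2
  best=""
  best_hits=0
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue
    hits=$(grep -o -i -- "$keyword" "$f" | wc -l | tr -d ' ')
    if [ "$hits" -gt "$best_hits" ]; then
      best=$f
      best_hits=$hits
    fi
  done
  if [ -n "$best" ]; then
    printf '%s\n' "$best"
  fi
}

# Toy generation step: wrap the retrieved chunk and the question in a prompt
# that would then be sent to the language model.
build_prompt() {
  chunk_file=$1
  question=$2
  printf 'Answer using only this context:\n%s\n\nQuestion: %s\n' \
    "$(cat "$chunk_file")" "$question"
}
```

The point is the order of operations: find the relevant text first, then hand it to the model together with the question. Nothing about the model itself changes.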

Step 1: Install Ollama

Ollama is the engine that runs AI models locally on your Mac. You can install it easily via Homebrew:

brew install ollama

Next, download the Mistral 7B model. It’s a powerful open-source model that performs exceptionally well on technical documentation:

ollama pull mistral

Mistral is about 4.4 GB and runs flawlessly on an M2 Max. If you have less memory, llama3.2 (2 GB) is a solid, faster alternative, though slightly less accurate.
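If you want to check the model outside of any chat UI, you can talk to Ollama's local REST API directly. The sketch below assumes the default port 11434 and the /api/generate endpoint with streaming turned off. The helper names ollama_payload and ask_ollama are my own, and the escaping only covers single-line prompts.

```shell
#!/bin/sh
# Build the JSON body for Ollama's /api/generate endpoint.
# Escapes backslashes and double quotes; assumes a single-line prompt.
ollama_payload() {
  model=$1
  prompt=$(printf '%s' "$2" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"model":"%s","prompt":"%s","stream":false}' "$model" "$prompt"
}

# Send a prompt to the local Ollama server. The second argument is the
# model name and defaults to mistral.
ask_ollama() {
  curl -s http://127.0.0.1:11434/api/generate \
    -H "Content-Type: application/json" \
    -d "$(ollama_payload "${2:-mistral}" "$1")"
}

# Example (needs a running Ollama server with the model pulled):
# ask_ollama "Summarize what vSAN ESA is in two sentences."
```

The reply is a JSON object with the generated text in its response field. If you have jq installed, piping the output through jq -r .response prints just the answer.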

Step 2: Download and Install AnythingLLM

AnythingLLM is a free desktop app that lets you upload documents and chat with them directly.

Head over to anythingllm.com/download and download the Apple Silicon version.
Open the DMG file and drag the app to your Applications folder.

Pro Tip: If macOS warns you about an "unidentified developer," simply remove the quarantine flag via Terminal:

xattr -cr /Applications/AnythingLLM.app

Step 3: Configure AnythingLLM

Launch AnythingLLM and follow the setup wizard. Use these settings:

LLM Provider: Ollama
Ollama Base URL: http://127.0.0.1:11434
Chat Model: mistral:latest
Embedding Provider: Native (built into AnythingLLM)
Vector Database: LanceDB (default)

Troubleshooting: If you encounter the error model 'qwen3-vl:4b-instruct' not found, you'll need to edit the configuration file directly. Open ~/Library/Application Support/anythingllm-desktop/storage/.env and replace the LLM settings with:

LLM_PROVIDER='ollama'
OLLAMA_BASE_PATH='http://127.0.0.1:11434'
OLLAMA_MODEL_PREF='mistral:latest'
OLLAMA_MODEL_TOKEN_LIMIT=4096

Restart the app, and you're good to go.

Step 4: Upload the VMware VCF 9.1 PDF

Create a new Workspace in AnythingLLM (e.g., "VCF 9.1 Docs").
Click Upload Document.
Drag your 9,000-page PDF into the upload window.

AnythingLLM will automatically process the PDF: extracting text, splitting it into chunks, calculating embeddings, and storing everything in the vector database. For a document of this size, it takes a few minutes. Once it's done, you're ready to chat.
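To picture the chunking step, here is a small shell sketch that splits text into overlapping word windows. It is only an illustration: AnythingLLM's actual splitter and chunk sizes aren't covered here, so the function name chunk_words and the sizes used are hypothetical choices.

```shell
#!/bin/sh
# Split text from stdin into chunks of SIZE words, where consecutive chunks
# share OVERLAP words. Prints one chunk per line.
chunk_words() {
  size=$1
  overlap=$2
  tr -s '[:space:]' '\n' | awk -v size="$size" -v overlap="$overlap" '
    NF { words[++n] = $0 }          # collect every non-empty word
    END {
      step = size - overlap
      if (step < 1) step = 1        # always move forward
      for (start = 1; start <= n; start += step) {
        line = ""
        for (i = start; i < start + size && i <= n; i++)
          line = line (line == "" ? "" : " ") words[i]
        print line
        if (start + size > n) break # the last window reached the end
      }
    }'
}

# Example: windows of 4 words that overlap by 1 word
printf 'a b c d e f g' | chunk_words 4 1
```

The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk. Each chunk is then embedded and stored in the vector database.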

Step 5: Chat with Your Documentation

Now, you can ask questions just like you're talking to a colleague:

"What are the minimum hardware requirements for VCF 9.1?"
"How do I configure NSX in a VCF 9.1 environment?"
"What's new in VCF 9.1 compared to 9.0?"

The system retrieves the relevant sections and generates an answer, even including references to the source pages it used.

Why This Works So Well on a Mac

Apple Silicon (M1/M2/M3/M4) has a massive advantage: unified memory. The CPU and GPU share the same memory pool, meaning a 7B parameter model fits entirely in RAM on a Mac with 32 GB or more. An M2 Max with 64 GB can even run 13B models locally without breaking a sweat.

Mac Model & Memory	Recommended Model
M1/M2 (16 GB)	llama3.2:3b (fast, compact)
M1/M2 (32 GB)	mistral:7b
M2 Max (64 GB)	mistral:7b or a 13B model such as llama2:13b
M2 Ultra / M3 Max	Larger models easily supported
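If you'd rather let the machine pick, the table can be turned into a small helper. This is a sketch with a hypothetical function name, recommend_model. The thresholds simply mirror the table rows and are not hard limits.

```shell
#!/bin/sh
# Map unified memory (in whole GB) to the model suggested in the table.
recommend_model() {
  mem_gb=$1
  if [ "$mem_gb" -ge 64 ]; then
    echo "mistral:7b (or a larger model)"
  elif [ "$mem_gb" -ge 32 ]; then
    echo "mistral:7b"
  else
    echo "llama3.2:3b"
  fi
}

# On macOS, hw.memsize reports physical memory in bytes, so this line
# would feed the helper with the memory of the current machine:
# recommend_model "$(( $(sysctl -n hw.memsize) / 1073741824 ))"
```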

Privacy & Security

Everything runs 100% locally. Your VMware documentation, your questions, and the answers never leave your MacBook. No cloud, no subscription, no data sharing. This makes it the perfect setup for handling confidential internal documentation.

Conclusion

With Ollama and AnythingLLM, you can build a powerful AI assistant in a single afternoon that effortlessly navigates 9,000 pages of VMware VCF 9.1 documentation. It’s local, free, and completely private.

The technology behind this—RAG—is exactly what large enterprises use for their AI applications. You just get to skip the enterprise price tag.

Next step? Try combining multiple documents—release notes, best practices, and architecture guides—all in one workspace. That's when you truly build a comprehensive VMware knowledge base right on your laptop.

The Shadow AI Imperative | Transitioning to Private AI

esloof@ntpro.nl (Eric Sloof) — Tue, 19 May 2026 06:13:00 +0000

Public AI platforms are increasingly used by employees to accelerate daily work — but sensitive business data is flowing into them outside IT visibility and governance. This phenomenon, known as "Shadow AI," is now one of the biggest data sovereignty challenges facing enterprise and government organizations.

The answer isn't to block AI — employees will keep using it because the productivity gains are real. The answer is to bring AI inside your own infrastructure boundaries. VMware Private AI Foundation with NVIDIA, deployed on VCF 9.1, delivers exactly that: GPU-accelerated workload domains, VKS clusters with the NVIDIA GPU Operator, and on-prem inference and RAG pipelines — all under the same governance, lifecycle management, and operational model your team already uses for the rest of the private cloud.

The result: full data sovereignty, no external data retention, and AI workloads that run where your data already lives.

Conquering the VMware Cloud Foundation 9.1 Documentation with Google NotebookLM

esloof@ntpro.nl (Eric Sloof) — Thu, 14 May 2026 08:23:00 +0000

As enterprise IT environments become increasingly complex, so does the documentation required to manage them. VMware Cloud Foundation (VCF) 9.1 is a prime example. The release brings significant advancements in architecture and management for modern private cloud infrastructure, but it also comes with a massive, multi-thousand-page technical manual. Navigating this extensive guide to find specific instructions for deployment, lifecycle upgrades, or the new VCF Management Services can be a daunting task for any engineer.

Fortunately, there is a highly effective way to tame this colossal document: combining the full PDF download of the VCF 9.1 documentation with the AI-powered analytical capabilities of Google NotebookLM.

The Challenge of Large-Scale Technical Documentation

The VCF 9.1 documentation is a comprehensive authority on VMware's software-defined data center solutions. It covers everything from core platform components like vSphere, vSAN, and NSX, to specific design blueprints for application modernization and AI-driven workloads. Furthermore, it details critical administrative tasks such as license management, network configuration, identity services, and troubleshooting protocols using VCF Operations.

While this level of detail is necessary, it presents a practical challenge. When an engineer needs to quickly understand how zero-touch provisioning simplifies ESX host installation, or exactly what role VCF Management Services plays in the 9.1 architecture, manually searching through thousands of pages is inefficient.

The Broadcom TechDocs portal offers a crucial feature that enables our solution: the ability to download the entire documentation suite as a single PDF file. This single file contains all the information, but reading it end-to-end is impossible. This is where Google NotebookLM steps in.

Enter Google NotebookLM: Your AI Research Assistant

Google NotebookLM is an AI-powered research and content organization tool designed to analyze large bodies of information. Unlike general-purpose AI chatbots that might hallucinate or pull outdated information from the open web, NotebookLM grounds its answers strictly in the sources you provide.

By uploading the massive VCF 9.1 PDF into NotebookLM, you effectively create a private, highly intelligent search engine dedicated solely to that specific version of the documentation.

How to Set Up Your VCF 9.1 Notebook

The process is straightforward and transforms how you interact with technical manuals:

Download the PDF: Navigate to the Broadcom TechDocs page for VMware Cloud Foundation 9.1 and locate the "Download PDF" button. Save the comprehensive manual to your local machine.
Create a New Notebook: Open Google NotebookLM and create a new project dedicated to VCF 9.1.
Upload the Source: Add the downloaded PDF as a source in your new notebook. NotebookLM supports large PDF documents, making it ideal for this use case.
Start Querying: Once the document is processed, you can begin asking complex, natural language questions.

Extracting Insights Efficiently

With the document loaded, you can leverage NotebookLM to extract precise information without endless scrolling. Here are practical examples of how this combination accelerates workflows:

Understanding Architectural Changes

Instead of hunting for the section on new architecture, you can ask NotebookLM: "Explain the role of VCF Management Services in the architecture."

NotebookLM will parse the document and provide a synthesized answer based on the text. For instance, it will highlight that VCF Management Services serves as a unified platform for centralized lifecycle management, relying on a modular suite of components like Fleet Lifecycle, Identity Broker, and Salt RaaS. It will also explain the deployment models, distinguishing between the first VCF instance and subsequent instances.

Troubleshooting and Operations

If you encounter an issue or need to configure a specific feature, you can ask targeted questions such as: "How does zero-touch provisioning simplify ESX host installation?" or "What are the required steps for network configuration in VCF 9.1?"

NotebookLM will extract the relevant steps and explanations directly from the manual, providing citations so you can easily verify the information in the original text if needed.

Generating Summaries and Study Materials

Beyond simple Q&A, NotebookLM can transform the technical content into other formats. You can ask it to generate a summary of the new security enhancements, create an FAQ document for your team based on the VCF 9.1 release notes, or even compile a study guide for the new features.

Conclusion

The release of VMware Cloud Foundation 9.1 brings powerful new capabilities to the private cloud, documented in an exhaustive technical manual. By taking advantage of the PDF download option on the Broadcom TechDocs site and pairing it with Google NotebookLM, engineers and architects can bypass the friction of traditional document searching. This approach turns a massive, static PDF into an interactive, intelligent knowledge base, allowing you to extract exactly the information you need, exactly when you need it.

Running a Local Private AI Stack on Apple Silicon with llama.cpp and Open WebUI

esloof@ntpro.nl (Eric Sloof) — Fri, 08 May 2026 06:39:00 +0000

VMware Private AI Foundation with NVIDIA is the enterprise platform for running generative AI workloads on your own infrastructure. But what about your local development environment? What if you want to experiment with models, test prompts, or build familiarity with private AI concepts before deploying the full stack?

That is exactly where llama.cpp comes in.

What is llama.cpp?

llama.cpp is an open-source inference engine written in C++ by Georgi Gerganov. It was originally built to run Meta's LLaMA models locally, but has since grown into one of the most versatile and widely used inference runtimes available. It supports dozens of model families in GGUF format, runs on CPU, NVIDIA CUDA, AMD ROCm, and Apple Metal — and it is fast.

On Apple Silicon, llama.cpp is particularly well-suited. The unified memory architecture of the M-series chips means that the CPU and GPU share the same memory pool. On a MacBook Pro M2 Max with 64 GB of unified memory, you can comfortably run quantized 70B parameter models locally — something that would require a high-end discrete GPU on any other platform.

Why This Matters for Private AI

The core idea behind VMware Private AI Foundation with NVIDIA is that your data stays inside your own infrastructure. No public cloud, no external API calls, no data leaving your environment. llama.cpp brings that same principle to your local machine.

This makes it an ideal companion for anyone working with Private AI Foundation:

Development and prototyping — test prompts, RAG pipelines, and agent logic locally before deploying to VCF
Disconnected environments — llama.cpp works fully offline, with no dependency on external model registries or APIs
Understanding model behaviour — build intuition for how models respond before you integrate them into production workloads via Private AI Services
Cost-efficient experimentation — no GPU cluster required for development work

Setting Up llama.cpp on Apple Silicon

Install via Homebrew

This is the easiest way to get started. The Homebrew package comes with Metal GPU acceleration enabled on Apple Silicon.

brew install llama.cpp

Or Build from Source

If you want the latest features or full control over the build:

git clone https://github.com/ggerganov/llama.cpp

cd llama.cpp

cmake -B build -DGGML_METAL=ON

cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

Downloading a Model

llama.cpp uses the GGUF model format. Models in this format are available on Hugging Face. Install the Hugging Face CLI first:

pip install huggingface-hub

Then download a model. A good starting point:

# Lightweight and fast, good for quick tests
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF \
  --include "*Q8_0*" \
  --local-dir ~/models

# Serious model that fits in 64 GB unified memory
huggingface-cli download bartowski/Meta-Llama-3.1-70B-Instruct-GGUF \
  --include "*Q4_K_M*" \
  --local-dir ~/models

The Q4_K_M suffix refers to the quantization level — a good balance between model quality and memory footprint.
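A quick back-of-the-envelope calculation shows why quantization matters here. The sketch below multiplies the parameter count by the bits per weight and divides by eight. The 4.8 bits per weight for Q4_K_M is an approximation I'm assuming for illustration (the format mixes block types), and estimate_gb is a hypothetical helper name.

```shell
#!/bin/sh
# Rough size of the model weights in GB:
#   parameters (in billions) x bits per weight / 8
# The KV cache and runtime overhead come on top of this figure.
estimate_gb() {
  awk -v params_b="$1" -v bits="$2" \
    'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

estimate_gb 70 4.8   # 70B at roughly 4.8 bits per weight: prints 42.0
estimate_gb 70 16    # the same model in 16-bit precision: prints 140.0
```

At around 42 GB the quantized 70B model leaves room for the context window inside 64 GB of unified memory, while the 16-bit version would not come close to fitting.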

Starting the Inference Server

llama.cpp includes a built-in server with an OpenAI-compatible REST API. This means any tool that works with the OpenAI API will work with your local llama.cpp instance — no code changes needed.

llama-server \
  --model ~/models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 8192

The --n-gpu-layers 99 flag offloads all layers to the Metal GPU. On an M2 Max this makes a significant difference in inference speed compared to CPU-only mode.
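A 70B model takes a while to load, so it helps to wait until the server reports that it is ready before sending requests. Here is a minimal sketch that polls the server's /health endpoint. It assumes the host and port from the command above. The function name wait_for_server and the retry count are my own choices.

```shell
#!/bin/sh
# Poll the /health endpoint until it answers successfully, or give up
# after MAX_TRIES attempts spaced one second apart.
wait_for_server() {
  url=${1:-http://localhost:8080/health}
  max_tries=${2:-120}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    # -f makes curl exit non-zero on HTTP errors, such as the 503 the
    # server returns while the model is still loading
    if curl -sf -o /dev/null "$url"; then
      echo "server ready"
      return 0
    fi
    tries=$((tries + 1))
    sleep 1
  done
  echo "server not ready after $max_tries tries" >&2
  return 1
}

# Example: wait_for_server && echo "safe to send requests"
```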

Testing the API

Once the server is running, test it with a simple curl request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "Explain VMware Private AI Foundation in one paragraph."}
    ]
  }'

You now have a fully local, OpenAI-compatible inference endpoint running on your own hardware.

Adding Open WebUI

For a proper chat interface, Open WebUI is the go-to option. It runs in Docker and connects directly to your llama.cpp server.

docker run -d \
  --add-host=host.docker.internal:host-gateway \
  -p 3000:8080 \
  ghcr.io/open-webui/open-webui:main

Open your browser at http://localhost:3000, add your llama.cpp endpoint (http://host.docker.internal:8080/v1) as a custom OpenAI-compatible connection, and you have a full ChatGPT-like interface running entirely on your local machine, with no internet connection required.

The Bigger Picture

llama.cpp and Open WebUI give you a lightweight, fully local private AI stack that you can run anywhere — on your MacBook, in a lab VM, or even in an air-gapped environment. It is not a replacement for VMware Private AI Foundation with NVIDIA in production, but it is an excellent way to build familiarity with models, experiment with RAG pipelines, and develop the intuition you need to design production-grade private AI platforms on VCF 9.

For anyone working on VMware Private AI Foundation deployments, having this kind of local stack running is simply good practice.