Kubernetes Audit Logging: Policy, Fluent Bit, and Alerting

2026-04-13T07:37:23+00:00

An incident happens. A secret is read, a ClusterRoleBinding is modified, someone runs kubectl exec into a production pod. You start the post-mortem and reach for the audit trail — and it is either missing, incomplete, or buried under so much noise that the relevant events are invisible. That is the exact situation audit logging is supposed to prevent, and it is surprisingly common because most teams configure it as an afterthought.

Kubernetes audit logging is built into kube-apiserver and gives you a structured JSON record of every API call made against your cluster: who did what, to which resource, when, and what the server returned. Done right, it is the forensic backbone of your cluster security posture. Done wrong, it either floods your log storage with garbage or silently drops the events you actually care about.

This post covers the full picture: how the audit pipeline works, how to write a production policy that suppresses noise first and captures high-value events at maximum fidelity, and how to ship those logs with Fluent Bit to Elasticsearch or Loki so your SIEM can alert on them.

How Kubernetes Audit Logging Works

Every request to the Kubernetes API server moves through a defined lifecycle. The audit subsystem emits one event per stage that is relevant to your policy:

RequestReceived — emitted the moment kube-apiserver receives the request, before any authorization or processing
ResponseStarted — emitted when the response headers are sent but before the body is streamed (relevant mainly for watch calls)
ResponseComplete — emitted when the full response is sent; this is the stage with the most useful context
Panic — emitted when kube-apiserver encounters an internal error handling the request

The audit subsystem supports two backends simultaneously: a log file backend (--audit-log-path) that writes newline-delimited JSON to a file on the control-plane node, and a webhook backend (--audit-webhook-config-file) that POSTs events to an external HTTP endpoint. Most production setups use the file backend as the primary and ship from there.

The policy file (--audit-policy-file) controls what gets recorded and at what verbosity. Without a policy file, nothing is logged. The policy is evaluated top-to-bottom and the first matching rule wins, which is why rule order matters enormously.
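The first-match behaviour can be sketched in a few lines of Python. This is a simplified model, not the API server's implementation: it only understands the users, verbs, and resources selectors, and the function names and sample rules are made up for illustration.

```python
def rule_matches(rule: dict, user: str, verb: str, group: str, resource: str) -> bool:
    """Return True if every selector the rule specifies matches the request.

    A selector that is absent from the rule matches anything, which is how a
    bare catch-all rule works.
    """
    if "users" in rule and user not in rule["users"]:
        return False
    if "verbs" in rule and verb not in rule["verbs"]:
        return False
    if "resources" in rule:
        return any(
            entry.get("group", "") == group
            and (not entry.get("resources") or resource in entry["resources"])
            for entry in rule["resources"]
        )
    return True


def audit_level(rules: list, user: str, verb: str, group: str, resource: str) -> str:
    """Walk the rules top to bottom and return the level of the first match."""
    for rule in rules:
        if rule_matches(rule, user, verb, group, resource):
            return rule["level"]
    return "None"  # no rule matched: the request is not logged


RULES = [
    {"level": "None", "users": ["system:kube-proxy"], "verbs": ["watch"],
     "resources": [{"group": "", "resources": ["endpoints", "services"]}]},
    {"level": "RequestResponse",
     "resources": [{"group": "", "resources": ["secrets"]}]},
    {"level": "Metadata"},  # catch-all
]

print(audit_level(RULES, "system:kube-proxy", "watch", "", "services"))
print(audit_level(RULES, "alice", "get", "", "secrets"))
print(audit_level(RULES, "alice", "get", "apps", "deployments"))
```

Moving the catch-all to the top of the list would make every request resolve to Metadata, which is the ordering mistake the sketch is meant to make visible.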

The data flow looks like this:

 kubectl / CI pipeline / controller
           |
           v
    kube-apiserver
           |
     Audit pipeline
           |
    Policy evaluation
    (first match wins)
           |
      +----+----+
      |         |
  File backend  Webhook backend
  (audit.log)   (external endpoint)
      |
  Fluent Bit (DaemonSet on control-plane)
      |
  +---+---+
  |       |
  ES     Loki

Each audit event is a JSON object. The fields you will query most in a security context are:

verb — the HTTP verb mapped to a Kubernetes action: get, list, watch, create, update, patch, delete
user.username — the authenticated identity; for service accounts this is system:serviceaccount:<namespace>:<name>
objectRef.resource — the resource type being acted on: secrets, pods, clusterrolebindings, etc.
objectRef.name — the specific object name
sourceIPs — the originating IP addresses
responseStatus.code — the HTTP response code; 401 and 403 are particularly useful for security alerting
stage — which pipeline stage emitted this event
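To make the field list concrete, here is a small Python sketch that pulls those fields out of one line of the audit log. The sample event is illustrative: every value in it is invented, and real events carry additional fields.

```python
import json

# Illustrative audit event; all values are made up for the example.
SAMPLE_LINE = json.dumps({
    "kind": "Event",
    "apiVersion": "audit.k8s.io/v1",
    "level": "RequestResponse",
    "stage": "ResponseComplete",
    "verb": "get",
    "user": {"username": "system:serviceaccount:payments:api"},
    "sourceIPs": ["10.0.4.17"],
    "objectRef": {"resource": "secrets", "namespace": "payments", "name": "db-credentials"},
    "responseStatus": {"code": 200},
    "requestReceivedTimestamp": "2026-04-13T07:30:00.000000Z",
})


def summarize(line: str) -> dict:
    """Extract the fields most often queried in a security investigation."""
    event = json.loads(line)
    ref = event.get("objectRef") or {}          # absent for non-resource URLs
    status = event.get("responseStatus") or {}  # absent on early stages
    return {
        "who": (event.get("user") or {}).get("username"),
        "verb": event.get("verb"),
        "resource": ref.get("resource"),
        "namespace": ref.get("namespace"),
        "name": ref.get("name"),
        "from": event.get("sourceIPs", []),
        "code": status.get("code"),
        "stage": event.get("stage"),
    }


print(summarize(SAMPLE_LINE))
```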

Audit Policy Levels

The policy file assigns one of four recording levels to each matched request. Choosing the right level per resource type is the difference between a useful audit trail and a storage bill you cannot explain.

Level	What is recorded	When to use it
None	Nothing	High-volume noise: health checks, watch loops, controller heartbeats
Metadata	Request metadata only (verb, user, resource, timestamp)	Routine operations where you need the who-did-what but not the payload
Request	Metadata + request body	Mutations where you want to see exactly what was sent
RequestResponse	Full request + full response body	Secret reads, RBAC changes, exec — anything where the payload itself is evidence

The RequestResponse level on a resource like configmaps with a list verb will include the full response body for every list call, which means every value in every ConfigMap in the response ends up in your audit log. That is both a storage problem and a security problem if the audit log destination is not properly secured. Be precise about which verbs you apply RequestResponse to.

Writing a Production Audit Policy

The right approach is noise suppression first. Start by silencing the internal system traffic that would otherwise dominate your log volume — API server self-calls, kube-proxy watch loops, node status reconciliation, controller manager polls — and then escalate the recording level only for resources that carry security significance.

Here is the full production policy:

apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
rules:
  # --- Noise suppression ---
  - level: None
    users: ["system:apiserver"]
    verbs: ["get"]
    resources:
      - group: ""
        resources: ["endpoints"]
  - level: None
    users: ["system:kube-proxy"]
    verbs: ["watch"]
    resources:
      - group: ""
        resources: ["endpoints", "services"]
  - level: None
    userGroups: ["system:nodes"]
    verbs: ["get"]
    resources:
      - group: ""
        resources: ["nodes"]
  - level: None
    users: ["system:kube-controller-manager", "system:kube-scheduler"]
    verbs: ["get", "list", "watch"]
    resources:
      - group: ""
        resources: ["endpoints", "configmaps"]
  - level: None
    nonResourceURLs: ["/healthz*", "/readyz*", "/livez*", "/metrics", "/version"]

  # --- High-fidelity security captures ---
  - level: RequestResponse
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["secrets"]
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "clusterroles", "rolebindings", "clusterrolebindings"]
  - level: RequestResponse
    verbs: ["create"]
    resources:
      - group: ""
        resources: ["serviceaccounts/token"]
  - level: RequestResponse
    verbs: ["create", "get"]  # SPDY sessions arrive as POST (create), WebSocket sessions as GET (get)
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach", "pods/portforward"]

  # --- Mutation capture ---
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: ""
        resources: ["configmaps"]
  - level: RequestResponse
    verbs: ["create", "delete"]
    resources:
      - group: ""
        resources: ["namespaces"]
  - level: Request
    verbs: ["create", "update", "patch", "delete"]
    resources:
      - group: "apps"
        resources: ["deployments", "statefulsets", "daemonsets", "replicasets"]

  # --- Metadata-level for pod lifecycle ---
  - level: Metadata
    verbs: ["create", "delete"]
    resources:
      - group: ""
        resources: ["pods"]

  # --- Catch-all ---
  - level: Metadata
    omitStages:
      - "ResponseStarted"

A few design decisions worth explaining.

The omitStages: [RequestReceived] at the top of the policy applies globally. RequestReceived is emitted before the request is processed, so it carries no response status and, for most requests, simply duplicates the ResponseComplete event that follows. Omitting it cluster-wide is the single most impactful thing you can do for audit log volume.

The noise suppression rules at the top silence internal system identities doing routine reconciliation work. Without them, system:kube-proxy watch calls and system:kube-controller-manager list operations generate tens of thousands of events per hour on a busy cluster.

Secrets get RequestResponse on all verbs including reads. This is intentional. If an attacker or a misconfigured service account reads a secret, you want the full response in the log — including the base64-encoded values — so you can confirm exactly which credentials were exposed. This means the audit log destination must be treated as a sensitive data store, not a general-purpose logging endpoint.

The RBAC section captures create, update, patch, and delete on all four RBAC resource types. Privilege escalation via RBAC is one of the most common lateral movement techniques in compromised clusters, and you want full request and response fidelity when it happens.

The catch-all Metadata rule at the bottom ensures that any API group or resource not explicitly matched by an earlier rule still gets recorded at the metadata level. Without this rule, new custom resource types or API extensions introduced to your cluster would be silently dropped from the audit log.

Applying the Policy

On a kubeadm-managed cluster, place the policy file on the control-plane node and reference it in the kube-apiserver static pod manifest:

# Create the directories and copy the policy to the control-plane node
sudo mkdir -p /etc/kubernetes/audit /var/log/kubernetes/audit
sudo cp audit-policy.yaml /etc/kubernetes/audit/audit-policy.yaml

# Add these flags to /etc/kubernetes/manifests/kube-apiserver.yaml
# under spec.containers[0].command:
#   - --audit-policy-file=/etc/kubernetes/audit/audit-policy.yaml
#   - --audit-log-path=/var/log/kubernetes/audit/audit.log
#   - --audit-log-maxage=30
#   - --audit-log-maxbackup=10
#   - --audit-log-maxsize=100

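Because kube-apiserver runs as a container, the same manifest also needs hostPath volumes and matching volumeMounts for /etc/kubernetes/audit (read-only) and /var/log/kubernetes/audit. Without them the API server cannot read the policy or write the log, and it will fail to start. The kubelet will restart kube-apiserver automatically when it detects a change to the static pod manifest. Verify the API server restarted cleanly and picked up the policy: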

kubectl get pods -n kube-system -l component=kube-apiserver
kubectl logs -n kube-system kube-apiserver-<control-plane-node-name> | grep -i audit

On managed Kubernetes services (AKS, EKS, GKE), the control plane is not directly accessible. Each provider exposes audit logs through its own mechanism: AKS via Azure Monitor / Log Analytics, EKS via CloudWatch Logs, GKE via Cloud Logging. The policy configuration interface varies by provider, and managed audit log delivery can lag 5–15 minutes on AKS — it is not a real-time feed.

Shipping Logs with Fluent Bit

Now that kube-apiserver is writing structured JSON to /var/log/kubernetes/audit/audit.log on your control-plane nodes, you need to get those logs into a queryable destination. Fluent Bit is the right tool here: it is lightweight, runs as a DaemonSet with tolerations, and has native output plugins for both Elasticsearch and Loki.

The key constraint is that audit logs only exist on control-plane nodes. Your Fluent Bit DaemonSet needs tolerations for the control-plane taint and a nodeSelector to target those nodes specifically.

Fluent Bit ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-audit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Daemon        Off
        Log_Level     info
        Parsers_File  parsers.conf

    [INPUT]
        Name              tail
        Path              /var/log/kubernetes/audit/audit.log
        Parser            json
        Tag               kube.audit
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [FILTER]
        Name   record_modifier
        Match  kube.audit
        Record cluster prod-cluster-01
        Record log_type kubernetes_audit

    [OUTPUT]
        Name            es
        Match           kube.audit
        Host            elasticsearch.logging.svc.cluster.local
        Port            9200
        Index           kubernetes-audit
        Type            _doc
        Logstash_Format On
        Logstash_Prefix kubernetes-audit
        Retry_Limit     5

  parsers.conf: |
    [PARSER]
        Name        json
        Format      json
        Time_Key    requestReceivedTimestamp
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ

If you are forwarding to Loki instead of Elasticsearch, replace the [OUTPUT] block:

    [OUTPUT]
        Name            loki
        Match           kube.audit
        Host            loki.logging.svc.cluster.local
        Port            3100
        Labels          job=kubernetes-audit,cluster=prod-cluster-01
        Label_Keys      $verb,$user['username'],$objectRef['resource']
        Retry_Limit     5

Fluent Bit DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit-audit
  namespace: logging
  labels:
    app: fluent-bit-audit
spec:
  selector:
    matchLabels:
      app: fluent-bit-audit
  template:
    metadata:
      labels:
        app: fluent-bit-audit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.2
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              cpu: 200m
              memory: 256Mi
          volumeMounts:
            - name: audit-log
              mountPath: /var/log/kubernetes/audit
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc
      volumes:
        - name: audit-log
          hostPath:
            path: /var/log/kubernetes/audit
            type: DirectoryOrCreate
        - name: config
          configMap:
            name: fluent-bit-audit-config
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluent-bit
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluent-bit-audit
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluent-bit-audit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluent-bit-audit
subjects:
  - kind: ServiceAccount
    name: fluent-bit
    namespace: logging

Apply the ConfigMap and DaemonSet to your cluster:

kubectl create namespace logging --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f fluent-bit-audit-config.yaml
kubectl apply -f fluent-bit-audit-daemonset.yaml

Verify that the DaemonSet pods land only on control-plane nodes and are reading the audit log:

kubectl get pods -n logging -l app=fluent-bit-audit -o wide
kubectl logs -n logging daemonset/fluent-bit-audit | grep -E "audit|flush|chunk"

You should see one pod per control-plane node and log lines indicating it is tailing the audit log file.

What to Alert On

Collecting audit logs is only half the work. The value comes from the alerts you build on top of them. Here are the five security patterns that should have active alerts in any production cluster.

1. Secret reads and lists

Any access to secrets outside your expected service accounts deserves investigation. In Elasticsearch:

# Kibana KQL
objectRef.resource: "secrets" AND verb: ("get" OR "list") AND NOT user.username: system\:serviceaccount\:*

In Loki (LogQL):

{job="kubernetes-audit"} | json | objectRef_resource="secrets" and verb=~"get|list" | line_format "{{.user_username}} accessed secret {{.objectRef_name}} in {{.objectRef_namespace}}"
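The same idea can be expressed as a plain predicate over parsed events, which is handy for testing alert logic offline. This is a sketch rather than a replacement for the queries: instead of excluding all service accounts, it assumes a hypothetical allowlist of identities that are expected to read secrets, and the names in it are invented.

```python
# Hypothetical allowlist: identities that are expected to read secrets.
EXPECTED_READERS = frozenset({
    "system:serviceaccount:payments:api",
    "system:serviceaccount:kube-system:external-secrets",
})


def is_unexpected_secret_read(event: dict, expected=EXPECTED_READERS) -> bool:
    """True for a get or list on secrets by an identity not on the allowlist."""
    ref = event.get("objectRef") or {}
    username = (event.get("user") or {}).get("username", "")
    return (
        ref.get("resource") == "secrets"
        and event.get("verb") in {"get", "list"}
        and username not in expected
    )


events = [
    {"verb": "get", "user": {"username": "system:serviceaccount:payments:api"},
     "objectRef": {"resource": "secrets", "name": "db-credentials"}},
    {"verb": "list", "user": {"username": "bob@example.com"},
     "objectRef": {"resource": "secrets"}},
    {"verb": "get", "user": {"username": "bob@example.com"},
     "objectRef": {"resource": "pods", "name": "web-0"}},
]

for e in events:
    print(e["user"]["username"], e["verb"], is_unexpected_secret_read(e))
```

An allowlist catches a compromised or misconfigured service account reading secrets it has no business touching, which a blanket service-account exclusion would miss.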

2. Pod exec, attach, and portforward

An exec into a running pod is either legitimate debugging or active intrusion, and either way you want to know about it. A response code of 101 (Switching Protocols) indicates the connection was upgraded and the session established:

# KQL
objectRef.resource: "pods" AND objectRef.subresource: ("exec" OR "attach" OR "portforward") AND responseStatus.code: 101

3. RBAC mutations

Any create, update, patch, or delete against ClusterRoleBindings or RoleBindings in sensitive namespaces should alert immediately. Granting a controlled identity broader RBAC permissions is a common post-compromise privilege escalation path:

# KQL
objectRef.resource: ("clusterrolebindings" OR "rolebindings") AND verb: ("create" OR "update" OR "patch" OR "delete")

Correlate these events with the requestObject field, which at RequestResponse level will contain the full binding definition including the subject being granted access.
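As a sketch of that correlation step, the following Python function turns a binding mutation into a one-line summary of who granted what to whom. It assumes the event was recorded at Request level or above so that requestObject is present; the function name and the sample event are illustrative.

```python
def describe_binding_change(event: dict) -> str:
    """Summarize a RoleBinding or ClusterRoleBinding mutation from an audit event."""
    ref = event.get("objectRef") or {}
    body = event.get("requestObject") or {}  # only present at Request level or above
    role = body.get("roleRef") or {}
    subjects = body.get("subjects") or []
    granted = ", ".join(
        f"{s.get('kind')}:{s.get('namespace', '-')}/{s.get('name')}" for s in subjects
    ) or "(no subjects recorded)"
    actor = (event.get("user") or {}).get("username", "unknown")
    return (
        f"{actor} {event.get('verb')} {ref.get('resource')}/{ref.get('name')}: "
        f"{role.get('kind')}/{role.get('name')} -> {granted}"
    )


# Illustrative event: all names are invented.
event = {
    "verb": "create",
    "user": {"username": "alice@example.com"},
    "objectRef": {"resource": "clusterrolebindings", "name": "temp-admin"},
    "requestObject": {
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": [{"kind": "ServiceAccount", "namespace": "default", "name": "builder"}],
    },
}

print(describe_binding_change(event))
```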

4. Failed authentication and authorization

A burst of 401 or 403 responses is either a misconfigured service account or credential scanning. Either warrants investigation:

# KQL — rate alert: more than 10 in 5 minutes from the same sourceIP
responseStatus.code: (401 OR 403) AND sourceIPs: *

Set this as a count-based alert in your SIEM rather than alerting on individual events — some 403s in normal operations are expected. A spike is the signal.
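The counting logic behind such a rate alert is a sliding window per source IP. A minimal sketch in Python, assuming timestamps use the microsecond format shown in requestReceivedTimestamp and using the threshold from the comment above (more than 10 in 5 minutes); the function and helper names are illustrative.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def noisy_sources(events, threshold=10, window=timedelta(minutes=5)) -> set:
    """Return source IPs with more than `threshold` 401/403 responses inside `window`."""
    recent = defaultdict(deque)  # ip -> timestamps of recent denials
    flagged = set()
    ordered = sorted(events, key=lambda e: e["requestReceivedTimestamp"])
    for event in ordered:
        code = (event.get("responseStatus") or {}).get("code")
        if code not in (401, 403):
            continue
        now = datetime.strptime(event["requestReceivedTimestamp"], TS_FORMAT)
        for ip in event.get("sourceIPs", []):
            hits = recent[ip]
            hits.append(now)
            while now - hits[0] > window:  # drop entries older than the window
                hits.popleft()
            if len(hits) > threshold:
                flagged.add(ip)
    return flagged


def denied(ip: str, minute: int, second: int = 0) -> dict:
    """Build an illustrative 403 event (values are invented)."""
    return {
        "responseStatus": {"code": 403},
        "sourceIPs": [ip],
        "requestReceivedTimestamp": f"2026-04-13T07:{minute:02d}:{second:02d}.000000Z",
    }


burst = [denied("203.0.113.9", 0, s) for s in range(11)]   # 11 denials in 11 seconds
slow = [denied("10.0.0.8", m) for m in range(0, 22, 2)]    # 11 denials over 20 minutes

print(noisy_sources(burst + slow))
```

Only the burst crosses the threshold; the slow trickle of 403s never has more than a few entries inside any five-minute window.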

5. Anonymous requests

Any request authenticated as system:anonymous should be treated as a configuration error at minimum and a probing attempt at worst:

# KQL
user.username: "system:anonymous"

If anonymous authentication is disabled on your cluster (--anonymous-auth=false on the API server), this alert should never fire. If it does, something is wrong.

Best Practices

Always omit RequestReceived globally. This stage fires before authorization and carries no additional information over ResponseComplete for security purposes. Keeping it doubles your audit log volume without any investigation value. Set it in omitStages at the top of your policy file, not per rule.

Never apply RequestResponse to list verbs on high-volume resources. A RequestResponse audit event for a list call includes the full response body; for a list secrets call, that is every secret in the namespace in base64. On a namespace with 50 secrets being listed every 30 seconds by a controller, that is a significant storage and security exposure. The production policy above accepts that cost for secrets deliberately, which is why the destination has to be locked down. For every other resource, scope RequestResponse to specific verbs (get, create, update, patch, delete) and use Metadata or Request for list and watch.

Treat the audit log destination as a sensitive data store. At RequestResponse level, audit events for secret reads contain base64-encoded secret values. Your Elasticsearch index or Loki stream for audit logs needs the same access controls as the secrets themselves. Restrict read access, enable encryption at rest, and do not route audit events through a general-purpose logging pipeline with broad access.

Always include a catch-all rule at the bottom of your policy. Without it, any API group or resource not explicitly matched by your rules is silently dropped. Custom resource definitions, new API groups added by operators, and future Kubernetes API additions all fall through the gap. The Metadata catch-all at the bottom of the production policy above ensures nothing is silently ignored.

Account for managed Kubernetes audit log latency. On AKS, audit logs delivered through Azure Monitor can lag 5–15 minutes. This means your audit-based alerts are not real-time — they are delayed. Design your incident response process with this in mind and do not rely on audit log alerts as your only detection layer for active incidents. Complement them with runtime security tools like Falco for real-time detection.

Rotate and archive audit log files on the control-plane node. The --audit-log-maxage, --audit-log-maxbackup, and --audit-log-maxsize flags on kube-apiserver control local rotation. Set them explicitly: 30 days retention, 10 backup files, 100MB per file is a reasonable starting point. Without these flags, a single audit log file can grow until it fills the control-plane root volume, which will crash kube-apiserver.
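As a rough sanity check on those numbers, the local disk ceiling is approximately the active file plus the retained backups. A minimal sketch, assuming rotated files are not compressed and ignoring any small overshoot before a rotation triggers; the function name is illustrative.

```python
def max_audit_disk_mb(maxsize_mb: int, maxbackup: int) -> int:
    """Approximate upper bound on local audit log usage, in megabytes."""
    return maxsize_mb * (maxbackup + 1)  # backups plus the file currently being written


print(max_audit_disk_mb(100, 10))
```

With the starting values above, that is about 1.1 GB per control-plane node, a figure worth comparing against the free space on the volume that holds /var/log.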

Conclusion

Kubernetes audit logging is not a checkbox. Without a thoughtful policy, you either have silence where you need evidence or noise that makes the evidence unreachable. The approach in this post — suppress system traffic first, escalate to RequestResponse only for resources that carry security value, ship with Fluent Bit to a secured destination, and alert on the five patterns that actually indicate malicious activity — gives you a forensic trail you can actually use.

The policy and the Fluent Bit configuration are both starting points. Your first week of running them in production will surface internal system accounts you need to suppress and resources you want to escalate. Tune from there. Version your policy file in git alongside the rest of your cluster configuration.

Happy scripting!

OCI Vault: Secrets Management with Terraform

2026-04-06T09:00:00+00:00

If you’ve ever opened a Terraform repository and found something like db_password = "Sup3rS3cr3t!" hardcoded in a .tfvars file — or worse, in main.tf itself — you already know exactly what problem we’re talking about. Hardcoded credentials are one of the most common vulnerabilities in infrastructure-as-code projects, and the risk doesn’t stop there: even when secrets are passed correctly as variables, certain Terraform data sources write the secret value directly into the state file, which often lives in an S3 bucket or a remote backend without additional encryption.

OCI Vault solves this problem at the root. It’s Oracle Cloud’s managed service for key and secret storage, backed by HSM, with granular access control via IAM and native support in the Terraform provider for OCI. In this post we’ll build the complete infrastructure from scratch: vault, master encryption key, secrets with expiration and rotation rules, IAM policies for teams and for workloads via Instance Principal, and the verification commands to confirm everything works before trusting the system in production.

We’ll also be explicit about the state file problem and how to avoid it, because it’s the most dangerous gotcha when working with secrets in Terraform.

Architecture and key concepts

OCI Vault has an architecture with two separate planes:

Management Endpoint (control plane): used for key management operations such as creating, rotating, and disabling keys. This is the endpoint that Terraform’s oci_kms_key resource is pointed at.
Cryptographic Endpoint (data plane): used for actual cryptographic operations — encrypt, decrypt, sign. Applications that need direct encryption point here.

This separation is not cosmetic. It means you can restrict access to the data plane independently from the control plane, which is relevant for IAM policy design.

┌─────────────────────────────────────────────────────────────┐
│  OCI Tenancy                                                │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Compartment: production                              │  │
│  │                                                       │  │
│  │  ┌─────────────────┐    ┌───────────────────────────┐ │  │
│  │  │   OCI Vault     │    │   Compute Instance        │ │  │
│  │  │  ┌───────────┐  │    │   (Instance Principal)    │ │  │
│  │  │  │  MEK Key  │  │    │                           │ │  │
│  │  │  └─────┬─────┘  │    │   oci secrets             │ │  │
│  │  │        │ encrypts│   │   secret-bundle get ───►  │ │  │
│  │  │  ┌─────▼─────┐  │◄───┤                           │ │  │
│  │  │  │  Secrets  │  │    │                           │ │  │
│  │  │  └───────────┘  │    └───────────────────────────┘ │  │
│  │  └─────────────────┘                                  │  │
│  │         ▲                                             │  │
│  │   Management Endpoint   Cryptographic Endpoint        │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Vault types: an irreversible decision

This is the first point where you need to think before running terraform apply, because the vault type cannot be changed after creation. The options are:

Type	HSM	Key auto-rotation	Cost	Recommended use
`DEFAULT`	Shared	No	Lower	Development, staging
`VIRTUAL_PRIVATE`	Dedicated	Yes (GA since Feb 2024)	Higher	Production
`EXTERNAL`	External key manager (HYOK)	No	Variable	Strict regulations

For production, VIRTUAL_PRIVATE is the right answer: dedicated HSM, support for automatic key rotation, and full isolation. For development and testing environments, DEFAULT works well and is considerably more economical.

In this post we’ll use DEFAULT to keep the example deployable in any tenancy, but in the best practices section we’ll look at when and how to migrate to VIRTUAL_PRIVATE.

Keys: AES for secrets, RSA/ECDSA for signing

OCI Vault supports symmetric keys (AES) and asymmetric keys (RSA, ECDSA). The important constraint: only AES keys can encrypt secrets. RSA and ECDSA keys are for signing and asymmetric encryption, not for the vault secrets service. If you try to associate an RSA key with a secret, the operation fails.

The Terraform gotcha that bites almost everyone the first time: key length is specified in bytes, not bits. For AES-256, set length = 32. If you set length = 256 you’re requesting a 2048-bit key, which isn’t a valid AES size.
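The bytes-versus-bits conversion is easy to check before it reaches a plan. A small Python sketch of that check; the helper name is illustrative and it validates only the three standard AES key sizes.

```python
VALID_AES_KEY_BYTES = {16: "AES-128", 24: "AES-192", 32: "AES-256"}


def describe_key_length(length_bytes: int) -> str:
    """Translate a key length in bytes into the AES variant it requests."""
    bits = length_bytes * 8
    name = VALID_AES_KEY_BYTES.get(length_bytes)
    if name is None:
        raise ValueError(f"{length_bytes} bytes = {bits} bits is not a valid AES key size")
    return f"{name} ({bits} bits)"


print(describe_key_length(32))

try:
    describe_key_length(256)
except ValueError as err:
    print(err)
```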

Prerequisites

To follow this post you’ll need:

OCI CLI installed and configured (oci setup config or API key in ~/.oci/config)
Terraform >= 1.3
Provider oracle/oci >= 8.0
A compartment OCID where you have manage vaults, manage keys, and manage secret-family permissions
Your tenancy OCID (needed to create Dynamic Groups, which are tenancy-level)

Verify access before starting:

# Verify the CLI is configured correctly
oci iam user get --user-id $(oci iam user list --query 'data[0].id' --raw-output)

# Verify you have access to the compartment
oci iam compartment get --compartment-id $COMPARTMENT_ID

# Verify the provider version in your project
terraform providers

Step-by-step implementation

Provider configuration

We start with the provider configuration block. Nothing special here, but it’s important to pin the provider version because the OCI Vault API changed between major versions:

terraform {
  required_providers {
    oci = {
      source  = "oracle/oci"
      version = "~> 8.0"
    }
  }
}

provider "oci" {
  region = var.region
}

The variables we’ll need throughout the example:

variable "region" {
  description = "OCI region"
  type        = string
}

variable "compartment_id" {
  description = "OCID of the compartment where resources are deployed"
  type        = string
}

variable "tenancy_ocid" {
  description = "OCID of the tenancy (required for Dynamic Groups)"
  type        = string
}

variable "db_password" {
  description = "Database admin password"
  type        = string
  sensitive   = true
}

Creating the Vault and Master Encryption Key

The vault and key are created with two separate resources. The relationship between them is that oci_kms_key requires the management_endpoint of the vault: not a hardcoded endpoint, but a reference to the vault resource’s attribute. That reference already gives Terraform an implicit dependency, so the key is never scheduled before the vault exists; an explicit depends_on is redundant and at most documents the ordering for human readers.

resource "oci_kms_vault" "app_vault" {
  compartment_id = var.compartment_id
  display_name   = "app-production-vault"
  vault_type     = "DEFAULT"

  freeform_tags = {
    "Environment" = "production"
    "ManagedBy"   = "terraform"
  }
}

resource "oci_kms_key" "app_key" {
  compartment_id      = var.compartment_id
  display_name        = "app-secrets-key"
  management_endpoint = oci_kms_vault.app_vault.management_endpoint

  key_shape {
    algorithm = "AES"
    length    = 32   # 32 bytes = AES-256 (Terraform uses bytes, not bits)
  }

  protection_mode = "HSM"

  depends_on = [oci_kms_vault.app_vault]
}

Two important decisions in this block:

protection_mode = "HSM" means the key material never leaves the HSM — OCI cannot export it and neither can you. If you use protection_mode = "SOFTWARE", the key can be exported, which expands the attack surface. For production, always HSM.

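The explicit depends_on is optional. Referencing oci_kms_vault.app_vault.management_endpoint already creates the dependency, so Terraform orders the two resources correctly without it, and depends_on does not add any extra wait. The vault may still take a few seconds to become operational after the API reports the resource as created; if key creation fails with an unavailable endpoint error, re-running terraform apply once the endpoint is reachable typically succeeds.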

Creating the secret with expiration rules

Now for the secret itself. The secret content must be base64-encoded — OCI Vault does not accept plain text in the API. Terraform has the base64encode() function that does exactly that:

resource "oci_vault_secret" "db_password" {
  compartment_id = var.compartment_id
  vault_id       = oci_kms_vault.app_vault.id
  key_id         = oci_kms_key.app_key.id
  secret_name    = "app-db-password"
  description    = "Database admin password for app-production"

  secret_content {
    content_type = "BASE64"
    content      = base64encode(var.db_password)
    stage        = "CURRENT"
  }

  secret_rules {
    rule_type                                     = "SECRET_EXPIRY_RULE"
    secret_version_expiry_interval                = "P90D"
    is_secret_content_retrieval_blocked_on_expiry = true
  }

  secret_rules {
    rule_type                              = "SECRET_REUSE_RULE"
    is_enforced_on_deleted_secret_versions = true
  }
}

The secret_rules are the component most often overlooked in basic implementations and that makes the biggest difference in production:

SECRET_EXPIRY_RULE with P90D makes the secret expire after 90 days. The critical part is is_secret_content_retrieval_blocked_on_expiry = true. By default this field is false, meaning that even when the secret expires, applications can still read it. That makes expiration decorative. With true, OCI blocks access to the secret bundle once it expires, forcing real rotation.

SECRET_REUSE_RULE with is_enforced_on_deleted_secret_versions = true prevents reuse of a previous secret value, even in deleted versions. This is a compliance control relevant in regulated environments.
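Two details from this section can be checked outside Terraform: what base64 encoding does to the value, and which date a P90D interval lands on. A minimal Python sketch, assuming the interval counts from the moment the secret version is created and handling only the day-based ISO 8601 form used here; both function names are illustrative.

```python
import base64
import re
from datetime import datetime, timedelta, timezone


def encode_secret(plaintext: str) -> str:
    """Standard base64 over the UTF-8 bytes of the value."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def expiry_from_interval(created: datetime, interval: str) -> datetime:
    """Add a day-based ISO 8601 duration such as P90D to a creation time."""
    match = re.fullmatch(r"P(\d+)D", interval)
    if match is None:
        raise ValueError(f"unsupported interval: {interval!r}")
    return created + timedelta(days=int(match.group(1)))


encoded = encode_secret("Sup3rS3cr3t!")
# Encoding is not encryption: anyone holding the string can reverse it.
print(base64.b64decode(encoded).decode("utf-8"))

created = datetime(2026, 1, 1, tzinfo=timezone.utc)
print(expiry_from_interval(created, "P90D").date())
```

The round trip is the point of the first half: base64 satisfies the API's content format and provides no confidentiality on its own.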

Outputs for later reference

Outputs are important both for verification and so that other Terraform modules can reference these resources:

output "vault_id" {
  description = "Vault OCID"
  value       = oci_kms_vault.app_vault.id
}

output "vault_management_endpoint" {
  description = "Vault management endpoint (required for key operations)"
  value       = oci_kms_vault.app_vault.management_endpoint
}

output "key_id" {
  description = "Master encryption key OCID"
  value       = oci_kms_key.app_key.id
}

output "db_secret_id" {
  description = "Database secret OCID"
  value       = oci_vault_secret.db_password.id
}

The state file problem

This is the point where most projects fail silently, and it’s worth pausing.

OCI Vault exposes two data sources for reading secrets in Terraform:

oci_vault_secret: returns only metadata about the secret: OCID, name, state, dates. The secret value never appears in the state.
oci_secrets_secretbundle: returns the actual content of the secret, base64-encoded but otherwise unprotected. This value is stored in the state file.

Terraform does not encrypt the state file itself, and sensitive = true only redacts values from CLI output. If your backend is an AWS S3 bucket or an OCI Object Storage bucket without additional encryption, the secret sits in the state in readable form, and anyone with access to the backend has access to the secret. The same applies to the content argument of the oci_vault_secret resource created earlier: base64encode(var.db_password) is written to the state too.

The safest pattern is to never read the secret value from Terraform. Applications should retrieve it at runtime using the OCI SDK or CLI with Instance Principal, not during apply. If you need to reference a secret’s OCID in another resource, use oci_vault_secret (metadata only) or directly reference the output of the resource that created it.

If for some operational reason you need to read the bundle in Terraform, there are three mitigations:

1. KMS-encrypted backend. If you use OCI Object Storage as a Terraform backend, you can configure it with an OCI Vault key so the state file is encrypted at rest. The secret is still in the state, but the state is encrypted with a key whose access you control with IAM.

2. Automatic secret generation. Some secret types support enable_auto_generation = true in oci_vault_secret. In that case, OCI generates the value internally and it never goes through Terraform — the state only contains the OCID, never the value. This is ideal for database passwords that you don’t need to know yourself, only the application does.

3. Separate provisioning. The vault and keys are managed with Terraform. Secret values are loaded with the CLI or a separate pipeline with limited access. Terraform manages the infrastructure, not the sensitive data.

The recommended posture: use Terraform to create the vault infrastructure (vault, key, secret resource with a placeholder value or with auto-generation), and leave injecting the real value for a separate step outside the Terraform state.

IAM Policies: granular access control

This is the component that’s hardest to get right, because OCI IAM has a verb matrix that’s not immediately obvious.

The verb matrix for secrets

Verb	Operation	Who needs it
`read secret-bundles`	GetSecretBundle — retrieve the secret value	App workloads, production instances
`read secrets`	GetSecret — view secret metadata	Audit, CI/CD pipelines that only reference OCIDs
`use secrets`	ListSecretVersions and rotation operations	Automated rotation tools
`manage secret-family`	Full control — create, delete, rotate, modify	Security administrators only

The golden rule: never grant manage secret-family to an application workload. With that verb, the application can delete secrets, create versions with arbitrary values, and modify expiration rules. The blast radius if the application is compromised extends to the entire vault.
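The golden rule is easy to check mechanically before a policy is applied. The sketch below is one possible lint, not an OCI tool: it assumes statements follow the "Allow <subject-type> <subject> to <verb> <resource-type> ..." shape used in this post, and the function name risky_statements is hypothetical.

```python
import re

# Matches the leading part of "Allow group|dynamic-group <name> to <verb> <resource-type>".
STATEMENT_RE = re.compile(
    r"^allow\s+(?P<subject_type>group|dynamic-group)\s+"
    r"(?P<subject>.+?)\s+to\s+(?P<verb>inspect|read|use|manage)\s+"
    r"(?P<resource>[\w-]+)",
    re.IGNORECASE,
)

# Resource types that cover secrets when combined with the manage verb.
SECRET_RESOURCES = {"secret-family", "secrets", "all-resources"}


def risky_statements(statements: list[str]) -> list[str]:
    """Return statements that grant manage on secrets to a dynamic group."""
    flagged = []
    for stmt in statements:
        m = STATEMENT_RE.match(stmt.strip())
        if not m:
            continue  # Unrecognized shape: leave it to a human reviewer.
        if (
            m["subject_type"].lower() == "dynamic-group"
            and m["verb"].lower() == "manage"
            and m["resource"].lower() in SECRET_RESOURCES
        ):
            flagged.append(stmt)
    return flagged


statements = [
    "Allow group SecurityAdmins to manage secret-family in compartment id ocid1.x",
    "Allow dynamic-group app-compute-instances to read secret-bundles in compartment id ocid1.x",
    "Allow dynamic-group app-compute-instances to manage secret-family in compartment id ocid1.x",
]
print(risky_statements(statements))
```

Only the third statement is flagged: administrators may hold manage, workloads may not.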

Policies for the administrator team

resource "oci_identity_policy" "vault_admin_policy" {
  compartment_id = var.compartment_id
  name           = "vault-admin-policy"
  description    = "Allow SecurityAdmins group to fully manage vault resources"

  statements = [
    "Allow group SecurityAdmins to manage vaults in compartment id ${var.compartment_id}",
    "Allow group SecurityAdmins to manage keys in compartment id ${var.compartment_id}",
    "Allow group SecurityAdmins to manage secret-family in compartment id ${var.compartment_id}",
  ]
}

Dynamic Groups for Instance Principal

Dynamic Groups are OCI’s mechanism for compute instances to authenticate with IAM without static credentials. An instance becomes a member when it satisfies the group’s matching rule (in this post, membership in a specific compartment), and it then holds whatever permissions your policies grant to that group.

An important operational detail: Dynamic Groups are created at the tenancy level, not the compartment level. The compartment_id of the oci_identity_dynamic_group resource must be the tenancy OCID, even if the matching rule filters instances from a specific compartment. If you use a child compartment OCID, the OCI provider will return an error.

resource "oci_identity_dynamic_group" "app_instances" {
  compartment_id = var.tenancy_ocid   # Always tenancy, not compartment
  name           = "app-compute-instances"
  description    = "Compute instances in the app production compartment"
  matching_rule  = "All {instance.compartment.id = '${var.compartment_id}'}"
}

resource "oci_identity_policy" "instance_secret_policy" {
  compartment_id = var.compartment_id
  name           = "instance-vault-access-policy"
  description    = "Allow app instances to retrieve secrets from vault"

  statements = [
    "Allow dynamic-group app-compute-instances to read secret-bundles in compartment id ${var.compartment_id}",
  ]
}

If you want to narrow access to a specific secret rather than the entire compartment, OCI supports conditions in IAM statements:

resource "oci_identity_policy" "instance_specific_secret_policy" {
  compartment_id = var.compartment_id
  name           = "instance-specific-secret-policy"
  description    = "Allow app instances to retrieve only the db password secret"

  statements = [
    "Allow dynamic-group app-compute-instances to read secret-bundles in compartment id ${var.compartment_id} where target.secret.name = 'app-db-password'",
  ]
}

This granularity is particularly useful in multi-application environments where different services need access to different secrets within the same compartment.

Testing and verification

With the infrastructure applied, verification has three levels: vault and key state, secret retrieval from your local machine, and retrieval from an instance using Instance Principal.

Verify vault and key state

# Verify the vault is ACTIVE
oci kms management vault get \
  --vault-id "$(terraform output -raw vault_id)" \
  --query 'data."lifecycle-state"' --raw-output

# Verify the key is ENABLED
oci kms management key get \
  --key-id "$(terraform output -raw key_id)" \
  --endpoint "$(terraform output -raw vault_management_endpoint)" \
  --query 'data."lifecycle-state"' --raw-output

The expected vault state is ACTIVE. The expected key state is ENABLED. If the vault is in CREATING or PROVISIONING, wait a few seconds and query again.
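The "wait and query again" step can be scripted as a small polling loop. This is a sketch under assumptions: get_state is any callable you supply that returns the current lifecycle state (for example a wrapper around the CLI calls above), and the function name and default timings are illustrative.

```python
import time


def wait_for_state(get_state, target, pending=("CREATING", "PROVISIONING"),
                   timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll get_state() until it returns target; stop early on unexpected states."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == target:
            return state
        if state not in pending:
            # Anything other than a known transitional state will not resolve by waiting.
            raise RuntimeError(f"unexpected lifecycle state: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {state} after {timeout}s")
        sleep(interval)


# Simulated vault that becomes ACTIVE on the third query.
states = iter(["CREATING", "CREATING", "ACTIVE"])
print(wait_for_state(lambda: next(states), "ACTIVE", sleep=lambda s: None))
```

The same helper works for the key by passing "ENABLED" as the target.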

Retrieve and verify the secret

# Retrieve the secret and decode the base64
oci secrets secret-bundle get \
  --secret-id "$(terraform output -raw db_secret_id)" \
  --query 'data."secret-bundle-content".content' \
  --raw-output | base64 --decode

If the output matches the value you passed in var.db_password, the complete cycle works: Terraform created the secret, OCI encrypted it with the MEK, and the CLI retrieved it correctly.

Verify access from an instance with Instance Principal

From an instance that belongs to the compartment configured in the Dynamic Group’s matching rule:

# On the compute instance — no static credentials needed
oci secrets secret-bundle get \
  --secret-id "ocid1.vaultsecret.oc1.xxx" \
  --auth instance_principal \
  --query 'data."secret-bundle-content".content' --raw-output | base64 --decode

If this command returns the secret value without needing API keys configured on the instance, Instance Principal is working correctly. If it returns an authorization error, verify that the instance is in the correct compartment and that the Dynamic Group has the appropriate matching rule.
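An application would normally make the same call through the SDK rather than the CLI. The following is a minimal sketch using the OCI Python SDK's instance principal signer. The function name read_secret and the optional client parameter are choices made for this example, and the stub at the bottom only imitates the response shape so the snippet can run off-instance.

```python
import base64
from types import SimpleNamespace


def read_secret(secret_id, client=None):
    """Return the current version of a secret as text."""
    if client is None:
        # Lazy import: only needed on a real instance with the oci package installed.
        import oci

        signer = oci.auth.signers.InstancePrincipalsSecurityTokenSigner()
        client = oci.secrets.SecretsClient(config={}, signer=signer)
    bundle = client.get_secret_bundle(secret_id=secret_id).data
    # The content arrives base64-encoded, as with the CLI output above.
    return base64.b64decode(bundle.secret_bundle_content.content).decode("utf-8")


# Offline check with a stub standing in for SecretsClient (not a real API response).
stub = SimpleNamespace(
    get_secret_bundle=lambda secret_id: SimpleNamespace(
        data=SimpleNamespace(
            secret_bundle_content=SimpleNamespace(
                content=base64.b64encode(b"s3cr3t").decode("ascii")
            )
        )
    )
)
print(read_secret("ocid1.vaultsecret.oc1.xxx", client=stub))
```

On a real instance you would call read_secret with only the secret OCID, and the value never touches disk or the Terraform state.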

Verify expiration rules

# View secret metadata including rules and expiration date
oci vault secret get \
  --secret-id "$(terraform output -raw db_secret_id)" \
  --query 'data.{name:"secret-name", state:"lifecycle-state", rules:"secret-rules"}'
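If you manage many secrets, the expiry check can be automated by parsing the unfiltered JSON from oci vault secret get (without the --query option). This sketch assumes the rule fields appear in the CLI output as rule-type and is-secret-content-retrieval-blocked-on-expiry, which is the kebab-case form of the Terraform attribute names. Verify those keys against your own output before relying on it.

```python
import json


def unenforced_expiry_rules(cli_output: str) -> list[dict]:
    """Return expiry rules that do not block retrieval once the secret expires."""
    data = json.loads(cli_output)["data"]
    return [
        rule
        for rule in data.get("secret-rules") or []
        if rule.get("rule-type") == "SECRET_EXPIRY_RULE"
        # Key name assumed from the Terraform attribute; a missing key counts as not enforced.
        and not rule.get("is-secret-content-retrieval-blocked-on-expiry", False)
    ]


# Illustrative output shape, not captured from a real tenancy.
sample = json.dumps({
    "data": {
        "secret-name": "app-db-password",
        "lifecycle-state": "ACTIVE",
        "secret-rules": [
            {
                "rule-type": "SECRET_EXPIRY_RULE",
                "is-secret-content-retrieval-blocked-on-expiry": False,
            }
        ],
    }
})
print(len(unenforced_expiry_rules(sample)))
```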

Best Practices

Never use VIRTUAL_PRIVATE in the same apply as the secrets if you’re just starting. The VIRTUAL_PRIVATE vault takes several minutes to provision its dedicated HSM. If Terraform tries to create keys and secrets before the vault is fully operational, the apply fails. Separating vault creation into its own module with a prior terraform apply avoids this problem.

Use protection_mode = "HSM" in production, always. With SOFTWARE, the key material can be exported. That means with the right permissions, someone can extract the key from the vault. With HSM, the material never leaves the hardware. The additional cost of HSM is marginal compared to the risk of an exportable key.

The vault type is immutable after creation. If you need to migrate from DEFAULT to VIRTUAL_PRIVATE, the process is: create a new VIRTUAL_PRIVATE vault, create new keys, rotate all secrets to the new vault, and delete the old one. There’s no in-place upgrade. Plan your vault type before the first deploy.

Enable is_secret_content_retrieval_blocked_on_expiry = true in all expiration rules. The default is false, which turns expiration into a toothless alert. With true, OCI blocks access to the secret once it expires, forcing rotation. Without this, a secret “expired” six months ago is still accessible.

Separate key management from secret management. Keys (MEK) are the responsibility of the security team. Individual secrets can be the responsibility of application teams, with the constraint that they can only use pre-approved keys. This is modeled in IAM by separating groups and policies: SecurityAdmins has manage keys, application teams have use keys and manage secret-family in their compartment.

Use encrypted backends for Terraform state. If your Terraform backend is in OCI Object Storage, configure server-side encryption with an OCI Vault key. This doesn’t eliminate the risk of secrets being in the state, but adds a layer of at-rest protection with auditable access control.

Prefer auto-generation or separate provisioning over reading secrets in Terraform. The safest pattern is for Terraform to never know the actual value of the secrets it manages. For database passwords, enable enable_auto_generation. For secrets you need to control, load them with the CLI in a separate pipeline step with reduced permissions.

Conclusion

We built the complete OCI Vault infrastructure with Terraform: vault with correctly configured type and protection mode, AES-256 master encryption key in HSM, secrets with expiration rules that actually block access, and the correct IAM policies for both administrators and workloads via Instance Principal.

The most important point is not the code itself, but the gotchas you need to know before going to production: the vault type is irreversible, key length is in bytes, oci_secrets_secretbundle writes the value to the state, and is_secret_content_retrieval_blocked_on_expiry is false by default. With that clear, the rest is configuration.

The natural next step is integrating this vault with CI/CD pipelines using OCI DevOps or GitHub Actions with OIDC, so pipelines retrieve secrets at runtime without static credentials. That’s material for another post.

Happy scripting!

Oracle Cloud Security Zones: Custom Recipes, Terraform, and Day-2 Operations

2026-03-25T01:18:34+00:00

Part 1 of this series covered the conceptual foundation of OCI Security Zones: what they are, how they enforce policy by denying API calls outright, the relationship with Cloud Guard and Security Advisor, and what the Maximum Security Recipe actually blocks. If you haven’t read it, start there.

This post picks up where Part 1 left off. It answers the next set of questions practitioners ask after they understand the concept: How do I build a custom recipe that fits my workload? How do I automate this with Terraform instead of clicking through the console? And what are the operational surprises waiting on day two?

Custom Recipes vs. Maximum Security: A Decision Framework

The Maximum Security Recipe is Oracle’s nuclear option — it enables every available Security Zone policy simultaneously and cannot be modified. In practice, most production workloads cannot tolerate it without significant architectural changes, because it blocks things like internet gateways, NAT gateways, public load balancers, volume detachment, instance termination, and OKE cluster operations.

Custom recipes let you select which Oracle-authored policies to include. You cannot write your own policy logic — the library is curated by Oracle — but you can assemble a policy set appropriate to each environment.

The most useful mental model for custom recipe construction comes from OCI’s own Landing Zone framework, which maps policies to CIS Benchmark levels:

CIS Level	Policies Included	Best For
Level 1	Deny public buckets, public subnets, internet gateway; deny databases without backup; require customer-managed encryption keys	Most production workloads
Level 2	All Level 1 + data confinement policies, Oracle-approved configurations, port restriction	Regulated data: PHI, PCI, classified

Start with CIS Level 1 for standard production compartments. Apply Level 2 selectively to compartments holding your most sensitive data. Avoid Maximum Security unless you are in a greenfield environment specifically designed around its constraints — or you are onboarding to OCI via a Landing Zone that has already pre-validated compatibility.

The categories you most frequently need to reason about when customizing:

Deny Public Access — the most operationally impactful category. Blocking internet_gateway, NAT_gateway, and public_subnets means your VCN topology must be private-only. For environments that legitimately need outbound internet access (to pull container images, reach OCI service endpoints, etc.), this either requires a shared services VCN with a NAT gateway outside the zone, or removing the NAT gateway policy from your custom recipe.

Require Customer-Managed Encryption Keys — the four vault key policies (deny block_volume_without_vault_key, deny boot_volume_without_vault_key, deny file_system_without_vault_key, deny buckets_without_vault_key) require OCI Vault to be set up and a Master Encryption Key provisioned before applying the zone. Vault is not part of Always Free — you need a standard or virtual private vault. The vault should be in the same zone or a parent compartment to avoid key access itself violating zone policies.

Oracle-Approved Configurations — this category includes policies that block compute instance termination (deny terminate_instance), volume detachment (deny detach_volume), and OKE operations (deny manage_oke_service). These are frequently too restrictive for teams that use autoscaling or perform routine maintenance. Exclude them from custom recipes unless you have a specific operational reason to include them.

Building a Custom Recipe via CLI

The CLI workflow has three steps: list available policies and collect the OCIDs you want, create a recipe from those OCIDs, then create the zone with the recipe.

Step 1: List and filter available policies

# List all available security policies in the tenancy
oci cloud-guard security-policy-collection list-security-policies \
  --compartment-id $COMPARTMENT_ID \
  --all

# Filter to find a specific policy's OCID
oci cloud-guard security-policy-collection list-security-policies \
  --compartment-id $COMPARTMENT_ID \
  --display-name "deny public_buckets" \
  --lifecycle-state ACTIVE

Important: Policy OCIDs are region-specific. You must look them up in the target tenancy and region — you cannot hardcode them from documentation or another environment.

Step 2: Create the recipe

oci cloud-guard security-recipe create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "prod-cis-level-1-recipe" \
  --security-policies '["ocid1.securityzonepolicy.oc1..aaa...xyz", "ocid1.securityzonepolicy.oc1..aaa...abc"]'
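Because the OCIDs have to be looked up per region, it helps to generate the --security-policies argument from policy names instead of pasting OCIDs by hand. The sketch below assumes the list command's JSON output holds the policies under data.items with display-name and id keys. The function name policy_ocids is hypothetical.

```python
import json


def policy_ocids(list_output: str, wanted: list[str]) -> str:
    """Build the JSON array for --security-policies from policy display names."""
    data = json.loads(list_output)["data"]
    # Accept either a collection object with "items" or a bare list.
    items = data["items"] if isinstance(data, dict) else data
    by_name = {p["display-name"]: p["id"] for p in items}
    missing = [name for name in wanted if name not in by_name]
    if missing:
        # Fail loudly instead of silently creating a weaker recipe.
        raise KeyError(f"policies not found in this region: {missing}")
    return json.dumps([by_name[name] for name in wanted])


# Illustrative listing with placeholder OCIDs.
listing = json.dumps({
    "data": {
        "items": [
            {"display-name": "deny public_buckets", "id": "ocid1.securityzonepolicy.oc1..a"},
            {"display-name": "deny public_subnets", "id": "ocid1.securityzonepolicy.oc1..b"},
        ]
    }
})
print(policy_ocids(listing, ["deny public_buckets", "deny public_subnets"]))
```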

Step 3: Create the zone

oci cloud-guard security-zone create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "production-security-zone" \
  --security-zone-recipe-id $RECIPE_OCID

To update an existing zone to use a different recipe:

oci cloud-guard security-zone update \
  --security-zone-id $ZONE_OCID \
  --security-zone-recipe-id $NEW_RECIPE_OCID

Terraform Automation

The CLI workflow above does not scale. For any environment managed as code, there are two Terraform paths: the Oracle Landing Zone module, or native provider resources.

Path 1: Oracle Landing Zone Security Module

Oracle publishes and maintains a terraform-oci-modules-security module (github.com/oci-landing-zones/terraform-oci-modules-security) that abstracts away the policy OCID lookup problem. You specify a cis_level and it selects the appropriate policies automatically:

module "security_zones" {
  source       = "github.com/oci-landing-zones/terraform-oci-modules-security//security-zones"
  tenancy_ocid = var.tenancy_ocid

  security_zones_configuration = {
    reporting_region = "us-ashburn-1"

    recipes = {
      CIS-L1-RECIPE = {
        name           = "prod-cis-level-1-recipe"
        description    = "CIS Level 1 recipe for production workloads"
        compartment_id = var.compartment_id
        cis_level      = "1"
      }
      CIS-L2-RECIPE = {
        name           = "sensitive-cis-level-2-recipe"
        description    = "CIS Level 2 recipe for sensitive data compartments"
        compartment_id = var.compartment_id
        cis_level      = "2"
      }
    }

    security_zones = {
      PROD-ZONE = {
        name           = "production-security-zone"
        compartment_id = var.prod_compartment_id
        recipe_key     = "CIS-L1-RECIPE"
      }
      SENSITIVE-ZONE = {
        name           = "sensitive-data-security-zone"
        compartment_id = var.sensitive_compartment_id
        recipe_key     = "CIS-L2-RECIPE"
      }
    }
  }
}

Prerequisites before applying:

Terraform >= 1.3.0
Cloud Guard must be enabled in the tenancy
IAM policy: allow group <group-name> to manage cloud-guard-family in tenancy

This module approach is the recommended path for teams using OCI Core or Zero Trust Landing Zones, because the module is tested against Oracle’s own reference architectures.

Path 2: Native Provider Resources

For teams that prefer direct resource control without the module abstraction:

# Fetch policy OCIDs from the tenancy — required because they are region-specific
data "oci_cloud_guard_security_policies" "all_policies" {
  compartment_id = var.tenancy_ocid
}

# Locals to extract specific policy OCIDs by name
locals {
  policy_map = {
    for p in data.oci_cloud_guard_security_policies.all_policies.security_policy_collection[0].items :
    p.display_name => p.id
  }
}

resource "oci_cloud_guard_security_recipe" "custom_recipe" {
  compartment_id = var.compartment_id
  display_name   = "custom-security-recipe"
  description    = "Custom recipe for production workloads"
  security_policies = [
    local.policy_map["deny public_buckets"],
    local.policy_map["deny public_subnets"],
    local.policy_map["deny internet_gateway"],
    local.policy_map["deny block_volume_without_vault_key"],
    local.policy_map["deny boot_volume_without_vault_key"],
    local.policy_map["deny database_without_backup"],
  ]
}

resource "oci_cloud_guard_security_zone" "production_zone" {
  compartment_id          = var.compartment_id
  display_name            = "production-security-zone"
  description             = "Security zone for production workloads"
  security_zone_recipe_id = oci_cloud_guard_security_recipe.custom_recipe.id
}

Using the data source and locals to map policy names to OCIDs avoids hardcoded OCID strings that would break across regions and environments.

The OCI Console also provides a “Save as Stack” button during recipe and zone creation wizards. This exports a Terraform configuration to Oracle Resource Manager — useful for teams bootstrapping an IaC workflow from a console-based starting point.

Operational Lifecycle: What Happens After Day One

The Cloud Guard Target Side Effect

This is the most commonly missed operational detail: when you create a security zone on a compartment, OCI deletes any existing Cloud Guard target for that compartment and replaces it with a security zone target.

If you had a manually configured Cloud Guard target with custom detector recipes on that compartment, those configurations are gone. The replacement target gets the default Oracle-managed detector recipe.

Audit your Cloud Guard targets before applying security zones to existing compartments. If you have custom detector configuration you want to preserve, document it before creating the zone and reapply it to the new target afterward.

Subcompartment Hierarchy

When a security zone is applied to a parent compartment, all subcompartments are automatically included. Subcompartments can have their own separate security zones (which creates a distinct Cloud Guard target for the subcompartment). A subcompartment can also be removed from the parent zone entirely, either in the Security Zones console or with the CLI:

oci cloud-guard security-zone remove \
  --security-zone-id $ZONE_OCID \
  --compartment-id $SUBCOMPARTMENT_OCID

The hard constraint remains: each compartment can belong to exactly one security zone. You cannot layer multiple recipes on a single compartment. If your workload needs different policy profiles within the same parent, the answer is separate child compartments with separate zones.
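When zone assignments are planned in a spreadsheet or a variables file, the one-zone-per-compartment rule is worth checking before anything is applied. A small sketch, assuming the plan is a mapping from zone name to the compartments it should cover (all names here are hypothetical):

```python
from collections import defaultdict


def conflicting_assignments(zones: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each compartment claimed by more than one zone to the zones claiming it."""
    owners = defaultdict(list)
    for zone, compartments in zones.items():
        for compartment in compartments:
            owners[compartment].append(zone)
    return {c: z for c, z in owners.items() if len(z) > 1}


plan = {
    "production-security-zone": ["prod-app", "prod-data"],
    "sensitive-data-security-zone": ["prod-data"],
}
print(conflicting_assignments(plan))
```

Here prod-data is claimed twice, so one of the two zones needs its own child compartment instead.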

You also cannot move a compartment using the standard IAM console once it is part of a security zone. Use the Security Zones console for compartment operations.

Existing Resources and Policy Violations

Applying a security zone to a compartment that already contains non-compliant resources does not delete or modify those resources. Cloud Guard detects and reports the violations, but remediation is the operator’s responsibility.

The key constraint: if the recipe includes movement-restriction policies, you cannot move a non-compliant resource out of the compartment, because the zone denies the move itself. You must bring the resource into compliance in place (e.g., encrypting an unencrypted block volume with a Vault key) before the zone will treat it as fully compliant.

For the same reason, you cannot move a non-compliant resource into a security zone compartment. All policies must be satisfied before the move is permitted.

Name Immutability

Once a security zone is created, its name cannot be changed. Only the description and recipe assignment can be updated. Establish a naming convention before deployment (a pattern such as <environment>-<workload>-security-zone works well) and document it. Renaming requires deleting and recreating the zone, which resets the Cloud Guard target again.

Common Gotchas

Root compartment warning. Oracle’s documentation explicitly cautions against assigning a security zone to the root (tenancy) compartment. Doing so applies zone policies to every resource across the entire tenancy, which blocks a wide range of routine administrative operations. Apply zones at the workload compartment level, not the root.

Database compatibility. Not all database configurations are compatible with Security Zones. Incompatible with Maximum Security Recipe: Always Free Autonomous Databases and Autonomous Database with public endpoints. Compatible (paid, private endpoint configurations): Autonomous AI Database, Bare Metal DB systems, Virtual Machine DB systems, and Exadata Cloud DB systems. Data Guard associations must be within the same security zone compartments — cross-zone Data Guard is blocked.

Vault must exist before encryption policies apply. The four deny *_without_vault_key policies will cause resource creation to fail unless you have an OCI Vault with a Master Encryption Key already provisioned and accessible. If you include encryption policies in your recipe, provision the vault as part of the same Terraform apply (with correct ordering) or as a prerequisite stack. The vault should be in the same or a parent compartment to avoid key access itself triggering zone violations.

Policy OCIDs are region-specific. Do not copy OCID values from one region’s recipe to another. Always look up policy OCIDs in the target region, either via CLI or the data source in Terraform. The module approach avoids this problem entirely by resolving OCIDs internally.

What to Cover in Part 3

The natural next topic in this series is Cloud Guard in depth: how detector recipes work, when to use the Oracle-managed recipe vs. a custom one, how auto-remediation is configured, and how to interpret Cloud Guard’s risk score output in the context of Security Zone policy violations. Zero Trust Packet Routing (ZPR) — Oracle’s newer, attribute-based network control layer — is also worth its own post as a complement to Security Zones for teams building on OCI’s security architecture.

References: Security Zone Policies — Oracle Docs · terraform-oci-modules-security · Safeguard Your Tenancy With Custom Security Zones — Oracle A-Team · oci_cloud_guard_security_zone — Terraform Registry

Happy scripting!

Building a Compliance-as-Code agent

2025-12-30T06:15:31+00:00

Manual compliance reviews are the bottleneck nobody talks about. Your infrastructure code sits in a pull request, waiting for someone to verify naming conventions, check security policies, and ensure resource configurations align with company standards. Hours or even days pass before deployment can proceed.

There’s a better way: intelligent automation that understands your policies and validates infrastructure code before it ever reaches production.

The Challenge with Traditional Compliance

Most organizations handle infrastructure compliance through one of two approaches, both flawed:

Manual code reviews consume significant engineering time and introduce human error. Reviewers might miss subtle violations or apply policies inconsistently across teams.

Static linting tools catch syntax issues but lack contextual understanding. They can’t interpret nuanced business rules or explain why something violates policy.

What we need is something that combines the intelligence of human review with the consistency and speed of automation.

Enter: Policy-Aware AI Agents

The solution leverages an AI agent grounded in your organization’s compliance documentation. Rather than relying on generic best practices, this agent evaluates infrastructure code against your actual internal policies.

Here’s what makes this approach powerful:

Context-aware analysis - The agent understands not just Terraform syntax, but your specific requirements around naming, tagging, regions, and resource configurations.

Structured output - Every compliance check returns a clear verdict with detailed violation descriptions and policy references.

Fewer hallucinations - By constraining the AI to reference only the provided documentation through RAG (retrieval-augmented generation), you sharply reduce unreliable suggestions based on general internet knowledge.

Architecture Overview

The system consists of two primary components working together:

The Compliance Agent

Built on Microsoft Foundry, this agent serves as your automated auditor. It receives Terraform code as input and returns structured compliance verdicts.

The agent’s behavior is controlled through a carefully designed system prompt that:

Defines its role as a compliance auditor
Restricts it to only using provided policy documents
Enforces a specific JSON output format
Handles edge cases like invalid input or missing rules

Here’s a sample of what the policy documentation might include:

Resource Naming Standards:
Format: <type>-<name>-<environment>
Example: rg-webapp-prod

Required Tags:
- Environment: must be dev, stg, or prod
- Cost-center: must match approved list

Approved Regions:
- Primary: eastus
- Secondary: westus

The CI/CD Integration

The agent plugs directly into your deployment pipeline. When developers push Terraform code, the pipeline automatically:

Extracts the infrastructure definitions
Sends them to the compliance agent
Receives a structured verdict
Blocks or approves the deployment based on results

This happens in seconds, providing immediate feedback to developers while maintaining consistent policy enforcement.

Implementation Walkthrough

Setting Up the AI Agent

Start by creating a new AI agent project in Microsoft Foundry, then select a language model. More capable models work better for code analysis, though gpt-4.1 suffices for simpler use cases.

The critical step is crafting your system prompt. This prompt must be explicit about:

What constitutes valid input
How to structure responses
What to do when rules are ambiguous
How to cite policy violations

Your prompt should enforce a consistent output schema. Something like:

{
  "verdict": "COMPLIANT | NON-COMPLIANT | UNKNOWN | INVALID_INPUT",
  "analysis": "detailed explanation here",
  "violations": [
    {
      "description": "what's wrong",
      "policy_source": "which rule was violated"
    }
  ]
}
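A model can still return something that breaks the schema, so the pipeline should validate the reply before acting on it. The sketch below checks the fields shown above. The function name parse_verdict and the rule that a NON-COMPLIANT verdict must list at least one violation are assumptions added for this example.

```python
import json

VERDICTS = {"COMPLIANT", "NON-COMPLIANT", "UNKNOWN", "INVALID_INPUT"}


def parse_verdict(raw: str) -> dict:
    """Parse the agent's reply and reject anything that breaks the schema."""
    result = json.loads(raw)
    if not isinstance(result, dict):
        raise ValueError("reply must be a JSON object")
    if result.get("verdict") not in VERDICTS:
        raise ValueError(f"unexpected verdict: {result.get('verdict')!r}")
    if not isinstance(result.get("analysis"), str):
        raise ValueError("analysis must be a string")
    violations = result.get("violations", [])
    if not isinstance(violations, list):
        raise ValueError("violations must be a list")
    for item in violations:
        if not isinstance(item, dict) or not {"description", "policy_source"} <= item.keys():
            raise ValueError("each violation needs description and policy_source")
    # Assumed rule: a failing verdict with no cited violation is not actionable.
    if result["verdict"] == "NON-COMPLIANT" and not violations:
        raise ValueError("NON-COMPLIANT verdict without violations")
    return result


reply = json.dumps({
    "verdict": "NON-COMPLIANT",
    "analysis": "Location is not approved.",
    "violations": [{"description": "location is centralus", "policy_source": "Required location"}],
})
print(parse_verdict(reply)["verdict"])
```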

Connecting Policy Documents

The agent needs access to your compliance documentation. Azure AI Search provides the infrastructure for this through RAG implementation.

Upload your policy documents—security guidelines, naming conventions, network topology requirements—to Azure AI Search. These become the knowledge base the agent queries when evaluating code.

The beauty of RAG is that updating policies is straightforward. Add new documents or modify existing ones, and once the index is refreshed the agent uses those changes without any model retraining or prompt rewrite.

You can upload files directly from the tools section. For this example, we can use the following content as a policy:

1. Naming convention for resources

All resources must follow this format: `<type>-<name>-<environment>`
Examples:
rg-core-dev (Resource Group for development)
sa-1234-prod (Storage Account for production)
Supported types:
- rg: Resource Group
- sa: Storage Account
- vnet: Virtual Network
- sn: Subnet
- vm: Virtual Machine
- nic: Network Interface
Supported environments:
- dev, stg, prod

2. Tags must be applied

All resources must include this tag in Terraform:
tags = {
  env = "dev" | "stg" | "prod"
}

3. Required location

All resources must be deployed to: `eastus`
Example:
location = "eastus"

Testing and Validation

Before integrating into production pipelines, thoroughly test your agent. Create Terraform examples that intentionally violate various policies:

resource "azurerm_storage_account" "example" {
  name = "storageaccount123"
  resource_group_name = "default-rg"
  location = "centralus"
  account_tier = "Standard"
  account_replication_type = "LRS"
}

The agent should catch:

Naming convention violations
Incorrect region usage
Missing mandatory tags

Verify that it correctly references your specific policies in its violation descriptions.
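For the test suite itself, a deterministic baseline gives you something to compare the agent's verdicts against. The sketch below encodes the three example rules directly. It assumes the name has three hyphen-separated parts (type, a lowercase alphanumeric middle segment, environment), which is inferred from the examples rather than stated in the policy, and it takes already-parsed attributes instead of raw Terraform.

```python
import re

# type prefix, assumed lowercase alphanumeric middle segment, environment suffix
NAME_RE = re.compile(r"^(rg|sa|vnet|sn|vm|nic)-[a-z0-9]+-(dev|stg|prod)$")


def expected_violations(attrs: dict) -> list[str]:
    """Return the policy areas a resource breaks: naming, location, tags."""
    found = []
    if not NAME_RE.match(attrs.get("name", "")):
        found.append("naming")
    if attrs.get("location") != "eastus":
        found.append("location")
    if (attrs.get("tags") or {}).get("env") not in {"dev", "stg", "prod"}:
        found.append("tags")
    return found


# The intentionally non-compliant storage account from the example above.
bad = {"name": "storageaccount123", "location": "centralus"}
good = {"name": "sa-1234-prod", "location": "eastus", "tags": {"env": "prod"}}
print(expected_violations(bad), expected_violations(good))
```

If the agent's violations do not cover the same areas as this baseline for a given fixture, the prompt or the indexed policy needs another look.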

Pipeline Integration

Add a compliance stage to your Azure DevOps pipeline:

- stage: InfrastructureCompliance
  jobs:
  - job: ValidateCompliance
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.x'
    
    - script: |
        pip install azure-ai-projects requests
      displayName: 'Install dependencies'
    
    - script: |
        python scripts/run_compliance_check.py \
          --terraform-path ./infrastructure
      displayName: 'Execute compliance validation'
    
    - script: |
        RESULT=$(cat compliance_result.json | jq -r '.verdict')
        if [ "$RESULT" != "COMPLIANT" ]; then
          echo "Compliance check failed"
          exit 1
        fi
      displayName: 'Evaluate verdict'

This stage runs before any actual infrastructure deployment, catching issues early.
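The pipeline refers to scripts/run_compliance_check.py without showing it. One possible shape for that script is sketched below. The agent call is deliberately left as an injected ask_agent callable, a hypothetical placeholder for whatever client code you use to reach your Foundry agent, so the file handling and exit-code logic can be tested on their own.

```python
import json
from pathlib import Path


def collect_terraform(root: str) -> str:
    """Concatenate every .tf file under root, labelled by relative path."""
    base = Path(root)
    parts = []
    for path in sorted(base.rglob("*.tf")):
        parts.append(f"# file: {path.relative_to(base).as_posix()}\n{path.read_text()}")
    return "\n\n".join(parts)


def run_check(root: str, ask_agent, out_file: str = "compliance_result.json") -> int:
    """Send the code to the agent, persist the verdict, return a process exit code."""
    reply = ask_agent(collect_terraform(root))
    try:
        result = json.loads(reply)
        if not isinstance(result, dict):
            raise ValueError("reply is not a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Treat an unparseable reply as UNKNOWN so the stage fails closed.
        result = {
            "verdict": "UNKNOWN",
            "analysis": "agent reply was not a valid JSON object",
            "violations": [],
        }
    Path(out_file).write_text(json.dumps(result, indent=2))
    return 0 if result.get("verdict") == "COMPLIANT" else 1
```

The script always writes compliance_result.json, so the "Evaluate verdict" step in the pipeline has a file to read even when the agent misbehaves.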

Getting Started

If you’re interested in implementing something similar:

Start small with a single, well-defined policy (like resource naming)
Test extensively with both compliant and non-compliant examples
Integrate into a non-production pipeline first
Gather feedback from developers
Gradually expand to additional policies

The goal isn’t perfection from day one, but rather continuous improvement of your infrastructure governance.

Closing Thoughts

Infrastructure compliance doesn’t have to be a manual slog. By combining AI capabilities with structured policy documentation and automated pipelines, you can create a system that’s both more reliable and more efficient than traditional approaches.

The technology is mature enough for production use today. The real challenge is organizational: clearly documenting your policies, building trust in automated systems, and changing team workflows to embrace this new approach.

For teams shipping infrastructure changes daily, this investment pays dividends quickly. The alternative—scaling manual review processes—simply doesn’t work at modern deployment velocities.

Happy scripting!

GCP Agent Development Kit: Automated Compliance Reporter

2025-12-15T21:34:29+00:00

Compliance audits have a standard ritual: gather evidence, review it for policy violations, classify the findings, write a report, and hand it to someone who will ask you to do it again next quarter. The gather-and-review steps are the bottleneck nobody talks about. In a GCP environment with half a dozen services generating audit events, even a 24-hour window can produce thousands of log entries. Manually grepping Cloud Logging for SetIamPolicy calls, checking whether secretmanager.googleapis.com data access logs are even enabled, and then formatting the results into something an auditor can read is the kind of work that takes an afternoon, introduces human error, and is immediately out of date by the time it is filed. The GCP Agent Development Kit gives you a better option.

ADK lets you wire a Gemini model to plain Python functions and give it a compliance mandate as natural language instructions. The agent reasons over what it needs to do, calls your tools to pull audit data, analyzes the results, and produces a structured JSON report, all autonomously. This post builds exactly that: a compliance reporter that queries Cloud Audit Logs, classifies IAM mutations, secret access events, and auth failures by risk level, and delivers the report to Pub/Sub or Cloud Storage. Deployable to Cloud Run and schedulable with Cloud Scheduler, it runs daily without touching a console.

What is GCP Agent Development Kit

ADK was announced at Google Cloud Next 2025 as an open-source Python framework for building agents powered by Gemini. The core idea is deliberately minimal: you write plain Python functions, document them with docstrings, and pass them to an Agent. The framework handles the Gemini API calls, the tool dispatch loop, and session state. You focus on what the agent should do and what tools it has access to.

The four concepts you need to hold in your head are:

Agent — the central object. It holds the model name, a natural language instruction that defines its role and behaviour, and the list of tools it can call.
Tools — ordinary Python functions. The function’s name and docstring become the tool schema that Gemini uses to decide when and how to call the function. No decorators, no registration step.
Runner — executes the agent against a session. InMemoryRunner is the development-time runner; for production you would use a persistent session backend.
Sessions — track conversation state across multiple turns. For a compliance reporter that runs a single audit pass, one session per run is all you need.

Install the framework with:

pip install google-adk

The package includes the google.adk.agents, google.adk.runners, and supporting modules. It depends on google-genai for the underlying model calls, which is pulled in automatically.

Cloud Audit Logs as a compliance data source

Cloud Audit Logs are the authoritative record of who did what in your GCP environment and when. There are four log types, and understanding which are enabled by default is the first compliance gap to close.

Admin Activity logs record API calls that create, modify, or delete resources — CreateBucket, SetIamPolicy, CreateServiceAccountKey. They are always on, cannot be disabled, and carry no additional cost. These are your primary source for IAM and RBAC mutation events.

Data Access logs record API calls that read resource configurations or metadata, and calls that create, modify, or read user-provided data. They are disabled by default for every service except BigQuery. This is the most common compliance gap: teams assume that secret reads or KMS decrypt operations are logged, but they are not unless Data Access logging has been explicitly enabled for each service. Enabling them generates significant log volume and cost on active projects, so you enable them selectively for high-value services such as Secret Manager and KMS.

System Event logs record GCP system actions that modify resources, such as live migration of a VM. They are always on, generated by Google systems rather than user activity, and are rarely the primary focus of a compliance audit.

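Policy Denied logs record when a Google Cloud service denies access to a user or service account because of a security policy violation, for example a VPC Service Controls perimeter. They are generated by default and cannot be disabled. In this post they are the primary source for denied and unauthorized access events.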

Key fields in an audit log entry that matter for compliance analysis:

protoPayload.methodName — the API method that was called (e.g., google.iam.v1.IAMPolicy.SetIamPolicy)
protoPayload.authenticationInfo.principalEmail — the identity that made the call
protoPayload.resourceName — the full resource path the call targeted
protoPayload.status.code — a non-zero value indicates a failed or denied call
protoPayload.requestMetadata.callerIp — the source IP address
timestamp — when the event occurred
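To make the field paths concrete, here is a minimal sketch that pulls those six values out of a log entry represented as a plain dict. The sample entry and the helper name extract_compliance_fields are illustrative assumptions, not output from a real project.

```python
from typing import Any, Optional


def extract_compliance_fields(entry: dict[str, Any]) -> dict[str, Optional[Any]]:
    """Flatten the compliance-relevant fields of one audit log entry."""
    proto = entry.get("protoPayload") or {}
    return {
        "method_name": proto.get("methodName"),
        "principal_email": (proto.get("authenticationInfo") or {}).get("principalEmail"),
        "resource_name": proto.get("resourceName"),
        # A missing status block is treated as code 0, matching the
        # "non-zero means failed or denied" convention described above.
        "status_code": (proto.get("status") or {}).get("code", 0),
        "caller_ip": (proto.get("requestMetadata") or {}).get("callerIp"),
        "timestamp": entry.get("timestamp"),
    }


# Hypothetical entry, shaped like the JSON view of an audit log entry.
sample_entry = {
    "timestamp": "2025-12-15T06:00:00Z",
    "protoPayload": {
        "methodName": "google.iam.v1.IAMPolicy.SetIamPolicy",
        "authenticationInfo": {"principalEmail": "alice@example.com"},
        "resourceName": "projects/example-project",
        "status": {"code": 7},
        "requestMetadata": {"callerIp": "203.0.113.10"},
    },
}

fields = extract_compliance_fields(sample_entry)
print(fields["method_name"], fields["status_code"])
```

Flattening early keeps every later step (filtering, classification, reporting) working on the same small set of keys.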

Architecture

The compliance reporter follows a straightforward data flow. The ADK agent sits at the center, using Gemini to reason over which audit log queries to run, what the results mean, and where to send the report.

Cloud Audit Logs (Logging API)
         |
         | query_audit_logs()
         v
  +--------------------+
  |   ADK Agent        |
  |   (Gemini 2.5      |
  |    Flash)          |
  |                    |
  |  - query logs      |
  |  - analyze         |
  |  - classify risk   |
  |  - build report    |
  +--------------------+
         |
    +---------+
    |         |
    v         v
 Pub/Sub    Cloud
 Topic      Storage
            (GCS)

The agent issues multiple tool calls in sequence: query activity logs for IAM mutations, query data_access logs for secret access events, query policy logs for auth failures. After each query it accumulates findings. When all queries are complete it calls the delivery tool to publish the report. Gemini drives the sequencing — the instruction defines the compliance checks, not the orchestration code.

Prerequisites

You will need:

Python 3.11 or later
gcloud CLI installed and authenticated to the target project
The Cloud Logging, Pub/Sub, Cloud Storage, and Vertex AI APIs enabled
A GCP project with audit log data to query

Verify your active project and authentication:

gcloud config get-value project
gcloud auth application-default login

Enable the required APIs:

gcloud services enable \
  logging.googleapis.com \
  pubsub.googleapis.com \
  storage.googleapis.com \
  aiplatform.googleapis.com \
  --project=PROJECT_ID

IAM roles

Create a dedicated service account with minimum required permissions:

# Create the service account
gcloud iam service-accounts create compliance-reporter-sa \
  --project=PROJECT_ID

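# Assign minimum roles
for role in roles/logging.privateLogViewer roles/pubsub.publisher roles/storage.objectUser roles/aiplatform.user; do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:compliance-reporter-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="${role}"
done

roles/logging.privateLogViewer includes everything in roles/logging.viewer and adds read access to Data Access audit logs, which the secret access check depends on; roles/logging.viewer alone covers only Admin Activity, System Event, and Policy Denied logs. roles/pubsub.publisher allows publishing to a topic. roles/storage.objectUser allows writing objects to a GCS bucket. roles/aiplatform.user allows the agent to call Gemini through Vertex AI.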

Enabling Data Access logs for Secret Manager

By default, Secret Manager does not log data access events. To add secret access to your compliance audit coverage, enable Data Access logs for the service. The cleanest approach is to export the project IAM policy, add the audit configuration, and re-apply it:

gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json

Add the following auditConfigs block to policy.json alongside the existing bindings array:

{
  "auditConfigs": [
    {
      "service": "secretmanager.googleapis.com",
      "auditLogConfigs": [
        { "logType": "DATA_READ" },
        { "logType": "DATA_WRITE" }
      ]
    }
  ]
}

Apply the updated policy:

gcloud projects set-iam-policy PROJECT_ID policy.json

Secret Manager data access events will now appear in cloudaudit.googleapis.com/data_access logs within a few minutes of the policy change taking effect.
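If you would rather not hand-edit policy.json, the merge can be scripted. The following is a minimal sketch that works on the exported policy as a plain dict; the function name add_data_access_audit_config and the sample policy are assumptions for illustration, and a real exported policy also carries fields such as etag that should be left untouched.

```python
import copy
import json


def add_data_access_audit_config(policy: dict, service: str) -> dict:
    """Return a copy of an IAM policy dict with DATA_READ and DATA_WRITE enabled for `service`."""
    updated = copy.deepcopy(policy)
    audit_configs = updated.setdefault("auditConfigs", [])

    # Reuse the existing entry for this service if there is one.
    for config in audit_configs:
        if config.get("service") == service:
            break
    else:
        config = {"service": service, "auditLogConfigs": []}
        audit_configs.append(config)

    # Add only the log types that are not already present.
    log_configs = config.setdefault("auditLogConfigs", [])
    existing = {item.get("logType") for item in log_configs}
    for log_type in ("DATA_READ", "DATA_WRITE"):
        if log_type not in existing:
            log_configs.append({"logType": log_type})
    return updated


# Hypothetical exported policy with a single binding.
policy = {"bindings": [{"role": "roles/viewer", "members": ["user:alice@example.com"]}]}
updated = add_data_access_audit_config(policy, "secretmanager.googleapis.com")
print(json.dumps(updated["auditConfigs"], indent=2))
```

Because the function only adds what is missing, running it twice yields the same policy, which makes it safe to use in a repeatable setup script.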

Building the compliance tools

ADK tools are plain Python functions. Gemini reads the function name, the parameter names and types, and the docstring to understand when to call the function and what arguments to pass. The pattern requires nothing beyond a well-written docstring — no decorators, no schema definitions.

query_audit_logs

This is the primary data-gathering tool. It wraps the Cloud Logging Python client and returns a normalized list of audit entries:

from google.cloud import logging as gcp_logging
from datetime import datetime, timedelta, timezone


def query_audit_logs(
    project_id: str,
    log_type: str,
    hours_back: int = 24,
    max_results: int = 200,
    service_name: str = None,
) -> dict:
    """Query Cloud Audit Logs for compliance-relevant events.

    Args:
        project_id: GCP project ID to query.
        log_type: One of 'activity', 'data_access', 'system_event', or 'policy'.
        hours_back: How many hours back to search from now.
        max_results: Maximum number of log entries to return.
        service_name: Optional GCP service name to filter on (e.g.,
            'secretmanager.googleapis.com').

    Returns:
        dict with 'status', 'entry_count', and 'entries' list.
    """
    client = gcp_logging.Client(project=project_id)
    end_time = datetime.now(timezone.utc)
    start_time = end_time - timedelta(hours=hours_back)

    log_name = (
        f"projects/{project_id}/logs/cloudaudit.googleapis.com%2F{log_type}"
    )
    filter_parts = [
        f'logName="{log_name}"',
        f'timestamp>="{start_time.strftime("%Y-%m-%dT%H:%M:%SZ")}"',
        f'timestamp<="{end_time.strftime("%Y-%m-%dT%H:%M:%SZ")}"',
    ]
    if service_name:
        filter_parts.append(f'protoPayload.serviceName="{service_name}"')

    entries = []
    for entry in client.list_entries(
        resource_names=[f"projects/{project_id}"],
        filter_=" AND ".join(filter_parts),
        order_by="timestamp desc",
        max_results=max_results,
    ):
        proto = entry.payload or {}
        entries.append({
            "timestamp": entry.timestamp.isoformat() if entry.timestamp else None,
            "method_name": proto.get("methodName"),
            "principal_email": proto.get("authenticationInfo", {}).get("principalEmail"),
            "resource_name": proto.get("resourceName"),
            "status_code": proto.get("status", {}).get("code", 0),
            "caller_ip": proto.get("requestMetadata", {}).get("callerIp"),
        })

    return {"status": "success", "entry_count": len(entries), "entries": entries}

The log_type parameter maps directly to the audit log path segment: activity, data_access, system_event, or policy. The payload is a proto_struct dict when the log entry carries a protoPayload, which all audit log entries do. The function normalizes the fields Gemini will reason over into a flat dict per entry, which keeps the context window clean.
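The filter construction is pure string logic, so it can be pulled into its own function and tested without credentials. The sketch below mirrors the filter built inside query_audit_logs; the helper name build_audit_filter and the explicit log_type check are additions for illustration, not part of the original tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ALLOWED_LOG_TYPES = {"activity", "data_access", "system_event", "policy"}
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_audit_filter(
    project_id: str,
    log_type: str,
    end_time: datetime,
    hours_back: int = 24,
    service_name: Optional[str] = None,
) -> str:
    """Build the Cloud Logging filter string for one audit log type."""
    # Reject unexpected values early; the model chooses this argument.
    if log_type not in ALLOWED_LOG_TYPES:
        raise ValueError(f"unsupported log_type: {log_type!r}")

    start_time = end_time - timedelta(hours=hours_back)
    log_name = f"projects/{project_id}/logs/cloudaudit.googleapis.com%2F{log_type}"
    parts = [
        f'logName="{log_name}"',
        f'timestamp>="{start_time.strftime(TIME_FORMAT)}"',
        f'timestamp<="{end_time.strftime(TIME_FORMAT)}"',
    ]
    if service_name:
        parts.append(f'protoPayload.serviceName="{service_name}"')
    return " AND ".join(parts)


# A fixed end time keeps the example deterministic.
fixed_now = datetime(2025, 12, 15, 6, 0, 0, tzinfo=timezone.utc)
audit_filter = build_audit_filter(
    "example-project",
    "data_access",
    fixed_now,
    service_name="secretmanager.googleapis.com",
)
print(audit_filter)
```

Passing end_time in as an argument, rather than calling datetime.now inside the function, is what makes the output reproducible in a unit test.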

publish_report_to_pubsub

import json
from google.cloud import pubsub_v1


def publish_report_to_pubsub(
    project_id: str,
    topic_id: str,
    report: dict,
) -> dict:
    """Publish a compliance report as a JSON message to a Pub/Sub topic.

    Args:
        project_id: GCP project ID that owns the topic.
        topic_id: Pub/Sub topic ID (not the full resource name).
        report: The compliance report dict to publish.

    Returns:
        dict with 'status' and 'message_id'.
    """
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    data = json.dumps(report).encode("utf-8")
    future = publisher.publish(topic_path, data=data)
    message_id = future.result(timeout=30)
    return {"status": "published", "message_id": message_id}

upload_report_to_gcs

import json
from datetime import datetime, timezone
from google.cloud import storage


def upload_report_to_gcs(
    bucket_name: str,
    report: dict,
    prefix: str = "compliance-reports",
) -> dict:
    """Upload a compliance report as a JSON object to Cloud Storage.

    Args:
        bucket_name: Name of the GCS bucket.
        report: The compliance report dict to upload.
        prefix: Object path prefix inside the bucket.

    Returns:
        dict with 'status' and 'gcs_uri'.
    """
    client = storage.Client()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    blob_name = f"{prefix}/{timestamp}-report.json"
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_string(
        json.dumps(report, indent=2),
        content_type="application/json",
    )
    gcs_uri = f"gs://{bucket_name}/{blob_name}"
    return {"status": "uploaded", "gcs_uri": gcs_uri}

The beauty of the ADK tool pattern is that Gemini sees these three functions as a complete toolkit: one for gathering data, two for delivering results. It will call them in whatever order the instruction demands, passing arguments it infers from context — the project ID comes from the user message, the topic ID comes from the same message, the report dict is constructed from its own analysis of the log entries.

Defining the agent

Now that we have the tools, the agent definition is the interesting part. The instruction field is not just a description — it is the compliance mandate that drives the agent’s multi-step reasoning:

from google.adk.agents import Agent

COMPLIANCE_INSTRUCTION = """
You are a GCP compliance auditing agent. Your job is to:
1. Query Cloud Audit Logs for compliance-relevant events.
2. Analyze the retrieved entries for policy violations.
3. Produce a structured JSON report with HIGH/MEDIUM/LOW risk findings.
4. Deliver the report to Pub/Sub or Cloud Storage.

Compliance checks to perform:
- IAM/RBAC mutations: query 'activity' logs for SetIamPolicy and
  CreateServiceAccountKey method names.
- Secret access: query 'data_access' logs filtering on
  service_name='secretmanager.googleapis.com'.
- Auth failures: query 'policy' logs for entries with status_code != 0.

Report format: JSON with two top-level keys:
- 'summary': object with counts keyed by risk level
  (HIGH, MEDIUM, LOW, total_entries_reviewed)
- 'findings': array of objects, each with
  'risk_level', 'category', 'description', 'principal_email',
  'resource_name', 'timestamp'

Risk classification:
- HIGH: CreateServiceAccountKey calls, Policy Denied events from external IPs,
  SetIamPolicy granting roles/owner or roles/editor
- MEDIUM: SetIamPolicy calls that do not match HIGH criteria,
  data access to secrets outside business hours
- LOW: all other audit events surfaced during the checks

Always complete all three compliance checks before building the report.
Deliver the report using the available delivery tool based on user instructions.
"""

root_agent = Agent(
    name="gcp_compliance_reporter",
    model="gemini-2.5-flash",
    description="Queries GCP Cloud Audit Logs and produces structured compliance reports.",
    instruction=COMPLIANCE_INSTRUCTION,
    tools=[query_audit_logs, publish_report_to_pubsub, upload_report_to_gcs],
)

A few things to notice here. The instruction specifies the sequence of checks explicitly — IAM mutations, then secret access, then auth failures — because Gemini will follow the ordering when it reasons about what to do next. The risk classification rules are concrete enough that the model can apply them consistently across runs. And the delivery instruction is left conditional (“based on user instructions”) so the runner message controls where the report goes without changing the agent definition.
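Because the classification rules are concrete, they can also be written as ordinary code and used to spot-check the model's output. This is a rough sketch under several assumptions: the function names are hypothetical, granted_roles is a hypothetical field (the normalized entries from query_audit_logs do not include the roles being granted), business hours are taken to be 09:00 to 17:00 in the timestamp's own timezone, and data_access entries are assumed to come from a query already filtered to Secret Manager.

```python
import ipaddress
from datetime import datetime
from typing import Optional

PRIVILEGED_ROLES = {"roles/owner", "roles/editor"}


def is_external_ip(caller_ip: Optional[str]) -> bool:
    """True for globally routable addresses; False for private or non-IP values."""
    if not caller_ip:
        return False
    try:
        return ipaddress.ip_address(caller_ip).is_global
    except ValueError:
        return False  # callerIp is not always a literal IP address


def outside_business_hours(timestamp: Optional[str], start_hour: int = 9, end_hour: int = 17) -> bool:
    """True if the ISO 8601 timestamp falls outside [start_hour, end_hour)."""
    if not timestamp:
        return False
    parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    return not (start_hour <= parsed.hour < end_hour)


def classify(entry: dict, log_type: str) -> str:
    """Apply the HIGH/MEDIUM/LOW rules from the instruction to one entry."""
    method = (entry.get("method_name") or "").rsplit(".", 1)[-1]
    if method == "CreateServiceAccountKey":
        return "HIGH"
    if log_type == "policy" and is_external_ip(entry.get("caller_ip")):
        return "HIGH"
    if method == "SetIamPolicy":
        granted = set(entry.get("granted_roles") or [])  # hypothetical field
        return "HIGH" if granted & PRIVILEGED_ROLES else "MEDIUM"
    if log_type == "data_access" and outside_business_hours(entry.get("timestamp")):
        return "MEDIUM"
    return "LOW"


print(classify({"method_name": "google.iam.admin.v1.CreateServiceAccountKey"}, "activity"))
```

Comparing a sample of the agent's findings against a function like this is one way to notice when an instruction change has shifted how the model applies the rules.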

gemini-2.5-flash is a reasonable model choice for this workload. It handles long context windows efficiently, which matters when you are passing hundreds of log entries into the reasoning loop. It is also one of the faster and cheaper models in the Gemini 2.5 family, which keeps per-run latency and cost reasonable for a scheduled compliance job.

Running the agent

The runner wires the agent to a session and drives the async event loop. InMemoryRunner is suitable for development and single-instance deployments:

from google.adk.runners import InMemoryRunner
from google.genai import types
import asyncio


async def run_compliance_check(
    project_id: str,
    pubsub_topic_id: str = None,
    gcs_bucket: str = None,
) -> str:
    runner = InMemoryRunner(
        agent=root_agent,
        app_name="gcp_compliance_reporter",
    )
    session = await runner.session_service.create_session(
        app_name="gcp_compliance_reporter",
        user_id="scheduler",
    )

    delivery_instruction = ""
    if pubsub_topic_id:
        delivery_instruction = f" Publish to Pub/Sub topic: {pubsub_topic_id}."
    elif gcs_bucket:
        delivery_instruction = f" Upload to GCS bucket: {gcs_bucket}."

    message = types.Content(
        role="user",
        parts=[types.Part.from_text(
            text=(
                f"Run a full compliance audit for project {project_id} "
                f"covering the last 24 hours."
                + delivery_instruction
            )
        )],
    )

    async for event in runner.run_async(
        user_id="scheduler",
        session_id=session.id,
        new_message=message,
    ):
        if event.is_final_response() and event.content and event.content.parts:
            return event.content.parts[0].text

    return ""


if __name__ == "__main__":
    import os
    result = asyncio.run(
        run_compliance_check(
            project_id=os.environ["GCP_PROJECT_ID"],
            pubsub_topic_id=os.environ.get("PUBSUB_TOPIC_ID"),
            gcs_bucket=os.environ.get("GCS_BUCKET"),
        )
    )
    print(result)

runner.run_async returns an async iterator of events. Most events are intermediate — tool call requests, tool call results, model tokens. event.is_final_response() is true only on the last event, which carries the agent’s final text output. For a compliance reporter this is either a confirmation that the report was delivered, or an error explanation if something failed.

To test locally before containerizing:

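export GCP_PROJECT_ID=your-project-id
export PUBSUB_TOPIC_ID=compliance-reports
export GOOGLE_CLOUD_PROJECT=your-project-id
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=true
python main.py

With GOOGLE_GENAI_USE_VERTEXAI=true, Application Default Credentials handle authentication locally for the model calls as well as for the Cloud Logging and Pub/Sub clients, so no API key or service account JSON file is needed in development. Without that variable, ADK uses the Gemini Developer API and expects a GOOGLE_API_KEY.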

Testing and Validation

Before deploying to Cloud Run, validate that each tool works in isolation and that the agent’s reasoning produces the expected report structure.

Testing tool functions directly

from main import query_audit_logs

# Test log query — should return a dict with 'status' and 'entries'
result = query_audit_logs(
    project_id="your-project-id",
    log_type="activity",
    hours_back=1,
    max_results=10,
)
print(f"entry_count: {result['entry_count']}")
for entry in result["entries"][:3]:
    print(entry)

If entry_count is 0 for activity logs, either no admin activity occurred in the time window or the service account running the script lacks roles/logging.viewer. If data_access queries consistently return 0 entries, either Data Access logs are not enabled for the target service or the caller lacks roles/logging.privateLogViewer, which is required to read them.

Inspecting the agent’s reasoning trace

ADK events expose the intermediate reasoning steps. Add a loop that prints every event to see the full tool call sequence:

# Assumes InMemoryRunner, types, and root_agent are imported or defined
# in this module, as in main.py above.
async def run_with_trace(project_id: str) -> None:
    runner = InMemoryRunner(agent=root_agent, app_name="gcp_compliance_reporter")
    session = await runner.session_service.create_session(
        app_name="gcp_compliance_reporter", user_id="debug"
    )
    message = types.Content(
        role="user",
        parts=[types.Part.from_text(
            text=f"Run a full compliance audit for project {project_id} "
                 f"covering the last 1 hour."
        )],
    )
    async for event in runner.run_async(
        user_id="debug", session_id=session.id, new_message=message
    ):
        print(f"[{event.author}] is_final={event.is_final_response()}")
        if not event.content:
            continue
        # parts can be None on some events, so fall back to an empty list
        for part in event.content.parts or []:
            if part.text:
                print(f"  text: {part.text[:200]}")
            if part.function_call:
                print(f"  tool_call: {part.function_call.name}({part.function_call.args})")
            if part.function_response:
                print(f"  tool_result: {part.function_response.name}")

Running this against a project with recent audit activity shows you the exact sequence of tool calls Gemini chose — which log types it queried, in what order, and what arguments it passed. This is the fastest way to catch instruction ambiguities before they appear in a production report.
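The report structure itself can be checked mechanically. Below is a minimal sketch of a validator for the format this post asks the agent to produce (a summary object plus a findings array); the function name validate_report and the sample reports are assumptions for illustration, and it could be called inside a delivery tool before anything is published.

```python
from collections import Counter

RISK_LEVELS = ("HIGH", "MEDIUM", "LOW")
REQUIRED_FINDING_KEYS = {
    "risk_level", "category", "description",
    "principal_email", "resource_name", "timestamp",
}


def validate_report(report: dict) -> list[str]:
    """Return a list of structural problems; an empty list means the report looks valid."""
    problems: list[str] = []
    summary = report.get("summary")
    findings = report.get("findings")
    if not isinstance(summary, dict):
        problems.append("'summary' must be an object")
    if not isinstance(findings, list):
        problems.append("'findings' must be an array")
        return problems

    counts: Counter = Counter()
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            problems.append(f"findings[{index}] must be an object")
            continue
        missing = REQUIRED_FINDING_KEYS - finding.keys()
        if missing:
            problems.append(f"findings[{index}] is missing {sorted(missing)}")
        level = finding.get("risk_level")
        if level in RISK_LEVELS:
            counts[level] += 1
        else:
            problems.append(f"findings[{index}] has unknown risk_level {level!r}")

    # Summary counts, where present, should agree with the findings list.
    if isinstance(summary, dict):
        for level in RISK_LEVELS:
            if level in summary and summary[level] != counts[level]:
                problems.append(f"summary[{level!r}] is {summary[level]}, found {counts[level]}")
    return problems


good_report = {
    "summary": {"HIGH": 1, "MEDIUM": 0, "LOW": 0, "total_entries_reviewed": 12},
    "findings": [{
        "risk_level": "HIGH",
        "category": "iam_mutation",
        "description": "Service account key created",
        "principal_email": "alice@example.com",
        "resource_name": "projects/example-project",
        "timestamp": "2025-12-15T06:00:00Z",
    }],
}
print(validate_report(good_report))
```

Returning a list of problems instead of raising on the first one gives you the full picture in a single test run.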

Deploying to Cloud Run and Cloud Scheduler

Package the agent as a Cloud Run job. Jobs are the right Cloud Run primitive for batch workloads: they run to completion, exit cleanly, and integrate with Cloud Scheduler for recurring execution.

Create a Dockerfile at the project root:

cat > Dockerfile << 'EOF'
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
EOF

And a requirements.txt:

cat > requirements.txt << 'EOF'
google-adk>=1.28.0
google-cloud-logging>=3.10.0
google-cloud-pubsub>=2.21.0
google-cloud-storage>=2.17.0
EOF

Build and push the container image, then create the Cloud Run job:

export PROJECT_ID=your-project-id
export REGION=us-central1

# Build and push
gcloud builds submit \
  --tag="gcr.io/${PROJECT_ID}/compliance-reporter:latest" \
  --project="${PROJECT_ID}"

# Create the Cloud Run job
gcloud run jobs create compliance-reporter \
  --image="gcr.io/${PROJECT_ID}/compliance-reporter:latest" \
  --service-account="compliance-reporter-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --set-env-vars="GCP_PROJECT_ID=${PROJECT_ID},PUBSUB_TOPIC_ID=compliance-reports,GOOGLE_GENAI_USE_VERTEXAI=true,GOOGLE_CLOUD_LOCATION=${REGION}" \
  --region="${REGION}" \
  --project="${PROJECT_ID}"

GOOGLE_GENAI_USE_VERTEXAI=true switches the ADK backend from the Gemini Developer API (API key) to Vertex AI (ADC). In Cloud Run the job’s service account identity is used automatically — no API key, no secret management overhead for the model credentials.

Run a manual execution to validate the deployment:

gcloud run jobs execute compliance-reporter \
  --region="${REGION}" \
  --project="${PROJECT_ID}"

# List executions and their status
gcloud run jobs executions list \
  --job=compliance-reporter \
  --region="${REGION}" \
  --project="${PROJECT_ID}"

Once the manual execution completes successfully, create the Cloud Scheduler job for daily execution:

gcloud scheduler jobs create http compliance-reporter-daily \
  --location="${REGION}" \
  --schedule="0 6 * * *" \
  --uri="https://${REGION}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/compliance-reporter:run" \
  --oauth-service-account-email="compliance-reporter-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --project="${PROJECT_ID}"

The schedule 0 6 * * * runs at 06:00 UTC daily. Adjust to suit your team’s working hours — running it before the business day starts means the report is waiting in Pub/Sub or GCS when people begin work. The service account needs roles/run.invoker to trigger the job via the Cloud Run API:

gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
  --member="serviceAccount:compliance-reporter-sa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/run.invoker"

Best Practices

Data Access logs are disabled by default for almost every service, and that is your biggest blind spot. Admin Activity and Policy Denied logs are always on, but if you are not explicitly enabling Data Access logs for services such as Secret Manager and KMS, you have no record of secret reads or key decryption operations. (BigQuery is the exception: its Data Access logs are on by default.) The compliance reporter will query for those events and return 0 entries, which in a report looks exactly the same as genuine zero-activity. Enable Data Access logs for your high-value services and document which services have coverage. A finding of “0 data access events for secretmanager.googleapis.com” is only meaningful if you know logging is on.
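One way to tell "no events" apart from "logging is off" is to check the project IAM policy for a matching audit configuration before trusting a zero count. The sketch below inspects a policy dict such as the JSON produced by gcloud projects get-iam-policy; the function name is hypothetical, and the check deliberately ignores exempted members and any audit configuration inherited from a folder or organization.

```python
def data_read_logging_enabled(policy: dict, service: str) -> bool:
    """True if the policy enables DATA_READ audit logs for `service` or for allServices."""
    for config in policy.get("auditConfigs", []):
        if config.get("service") not in (service, "allServices"):
            continue
        log_types = {item.get("logType") for item in config.get("auditLogConfigs", [])}
        if "DATA_READ" in log_types:
            return True
    return False


# Hypothetical project policy with Data Access logging enabled for one service.
policy = {
    "auditConfigs": [
        {
            "service": "secretmanager.googleapis.com",
            "auditLogConfigs": [{"logType": "DATA_READ"}, {"logType": "DATA_WRITE"}],
        }
    ]
}

for service in ("secretmanager.googleapis.com", "cloudkms.googleapis.com"):
    print(service, data_read_logging_enabled(policy, service))
```

A result of False for a service means a zero-entry finding for that service should be reported as "no coverage" rather than "no activity".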

Do not use InMemoryRunner for multi-instance deployments. InMemoryRunner stores session state in process memory. If you scale the Cloud Run job to more than one instance, or if the job is retried after a failure, sessions are not shared across instances and state is lost on restart. For production use a persistent session service, such as ADK’s DatabaseSessionService backed by a SQL database like Cloud SQL, or VertexAiSessionService. ADK’s session service interface is pluggable; you pass the session service to the Runner as a constructor argument.

Require google-adk>=1.28.0 in your requirements. ADK 1.28.0 included a fix for a prompt injection vulnerability in tool docstring handling. Requiring this version or later ensures that a malicious string in an audit log entry cannot manipulate the tool schema seen by Gemini. This is particularly relevant for compliance workloads where the agent processes untrusted log data as part of its reasoning context.

Use Vertex AI with Application Default Credentials in production, not API keys. The GOOGLE_GENAI_USE_VERTEXAI=true environment variable routes model calls through Vertex AI, which uses the service account’s ADC identity rather than a static API key. This means no secret to rotate, no risk of the key being logged, and IAM-based access control over which identities can invoke the Gemini models. On Cloud Run this is zero-configuration — the job’s service account identity is used automatically.

Scope the service account to minimum roles. The compliance reporter needs roles/logging.privateLogViewer to read audit logs including Data Access logs (roles/logging.viewer is enough if you only read Admin Activity and Policy Denied logs), roles/pubsub.publisher to publish reports, roles/storage.objectUser to write to GCS, and roles/aiplatform.user when the model runs on Vertex AI. It does not need any broader project-level permissions, and it does not need any IAM mutation capabilities. A service account with owner or editor permissions running an agent that processes audit log data is a significant risk surface: if the agent’s reasoning is manipulated, it could take destructive actions. Keep the service account scoped to exactly what the tools require.

Conclusion

What we built is an autonomous compliance reporter that replaces an afternoon of manual log review with a scheduled agent run. Cloud Audit Logs provide the raw evidence — Admin Activity for IAM mutations, Data Access for secret operations, Policy Denied for auth failures. ADK connects Gemini’s reasoning to plain Python tool functions, letting the agent drive the query sequence, analyze the results, classify findings by risk level, and deliver a structured JSON report to Pub/Sub or GCS. Cloud Run jobs and Cloud Scheduler handle the operational side: containerized, daily, no console required.

The agent instruction is where the compliance logic lives, which means extending coverage is a matter of adding new checks to the instruction and a new tool if the data source requires one. Adding KMS key usage analysis or VPC firewall mutation detection follows the same pattern: describe the check, describe the risk classification, add it to the instruction. The delivery infrastructure does not change.

Happy scripting!

Azure Private DNS zone fallback to internet

2025-09-28T21:34:27+00:00

When working with Azure Private Endpoints across multiple regions, you’ve likely encountered a common problem: how do you access a resource with a private endpoint from a different region that isn’t interconnected with your current virtual network? Microsoft’s “Fallback to Internet” feature for Azure Private DNS zones solves this challenge elegantly.

Understanding the Challenge

Private Endpoints provide secure, private connectivity to Azure services by mapping them to private IP addresses within your virtual network. This works seamlessly within a single region or interconnected networks. However, in multi-region scenarios with isolated networks, DNS resolution fails when trying to access a private endpoint from a different region.

How Private Endpoint DNS Resolution Works

When you create a Private Endpoint for an Azure resource (like a Key Vault or Storage Account), the DNS resolution flow typically works like this:

A DNS query for resource-name.vault.azure.net reaches Azure’s DNS service
The query resolves to a CNAME: resource-name.privatelink.vaultcore.azure.net
The Private DNS zone resolves this to the private IP address (e.g., 10.0.0.7)
The client receives the private IP and connects through the private endpoint

This works perfectly when the client and the private endpoint are in the same region or connected networks. But what happens when they’re not?

The Problem: Isolated Multi-Region Architectures

Consider this scenario:

Region A has a virtual network with a Private DNS zone for Key Vault
Region B has a separate virtual network with its own Private DNS zone
A VM in Region B needs to access a Key Vault in Region A that’s behind a private endpoint
The networks are not interconnected (due to security policies, overlapping IP ranges, or architectural decisions)

Without Fallback to Internet, the DNS resolution in Region B fails because:

The query reaches the Private DNS zone in Region B
No record exists for the Key Vault in Region A
The DNS query returns empty (NXDOMAIN)
Access is blocked

Previously, you’d need to implement complex solutions like cross-region VNet peering or custom DNS forwarding. The Fallback to Internet feature provides a much simpler alternative.

Introducing Fallback to Internet

The Fallback to Internet feature adds a new DNS resolution policy: NxDomainRedirect. When enabled on a Virtual Network Link in your Private DNS zone, it changes the behavior when a DNS query doesn’t find a match:

Without Fallback: DNS query fails → NXDOMAIN error → Access denied
With Fallback: DNS query fails → Fallback to public DNS → Resolves to public endpoint → Access via internet (if allowed by firewall)

This allows you to:

Keep private endpoint connectivity for resources in the same region
Allow public endpoint access (via firewall rules) for cross-region scenarios
Avoid complex network peering infrastructure
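The difference between the two policies can be illustrated with a toy resolver. This is a simplified model rather than how Azure DNS is implemented: the zone contents, hostnames, and IP addresses are made up, and the resolve function is hypothetical. Only the two policy names, Default and NxDomainRedirect, come from the feature itself.

```python
from typing import Optional

# Records held by the Private DNS zone linked to the Region B network.
PRIVATE_ZONE_REGION_B = {
    "kv-region-b.privatelink.vaultcore.azure.net": "10.1.0.7",
}

# What public DNS would answer for the same names (illustrative addresses).
PUBLIC_DNS = {
    "kv-region-a.privatelink.vaultcore.azure.net": "203.0.113.20",
    "kv-region-b.privatelink.vaultcore.azure.net": "203.0.113.21",
}


def resolve(name: str, private_zone: dict, public_dns: dict, resolution_policy: str = "Default") -> Optional[str]:
    """Return an IP address, or None to represent an NXDOMAIN answer."""
    if name in private_zone:
        return private_zone[name]  # private endpoint wins when a record exists
    if resolution_policy == "NxDomainRedirect":
        return public_dns.get(name)  # fall back to public resolution
    return None  # Default policy: the private zone's NXDOMAIN is final


remote_vault = "kv-region-a.privatelink.vaultcore.azure.net"
print(resolve(remote_vault, PRIVATE_ZONE_REGION_B, PUBLIC_DNS))
print(resolve(remote_vault, PRIVATE_ZONE_REGION_B, PUBLIC_DNS, "NxDomainRedirect"))
```

Note that names with a record in the linked private zone still resolve to the private IP under either policy; the fallback only applies when the private zone has no answer.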

Real-World Use Case

Let’s say you have:

A centralized Key Vault in Region A with sensitive secrets
Application VMs in multiple isolated regions (B, C, D) that need occasional access
Security requirements that prevent full network interconnection

With Fallback to Internet:

Configure Private Endpoint for the Key Vault in Region A
Enable Fallback to Internet on Private DNS zones in Regions B, C, D
Whitelist the public IP/NAT Gateway IPs from Regions B, C, D on the Key Vault firewall
VMs in Region A access via private endpoint (secure, no internet)
VMs in other regions access via public endpoint (controlled by firewall)
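Once DNS falls back to the public endpoint, the Key Vault firewall becomes the only gate for cross-region clients, so it helps to be precise about which egress addresses are allowed. The sketch below models an IP allowlist with a default-deny rule using Python's ipaddress module; the function name and the addresses are illustrative assumptions, not values from a real deployment.

```python
import ipaddress


def is_allowed_by_firewall(client_ip: str, ip_rules: list[str]) -> bool:
    """Default deny: allow only if the client IP falls inside one of the rules."""
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(rule, strict=False)
        for rule in ip_rules
    )


# Hypothetical NAT Gateway egress addresses for Regions B, C, and D.
ip_rules = ["198.51.100.10/32", "203.0.113.0/28"]

print(is_allowed_by_firewall("198.51.100.10", ip_rules))
print(is_allowed_by_firewall("203.0.113.40", ip_rules))
```

Keeping the allowlist limited to known NAT Gateway addresses means a VM in an isolated region reaches the vault over the internet only from an egress IP you control.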

Implementation with Terraform

Let’s implement a simple example using Terraform with the AzAPI provider (since the AzureRM provider doesn’t support this feature yet).

Prerequisites

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    azapi = {
      source  = "azure/azapi"
      version = "~> 1.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

provider "azapi" {}

Step 1: Create the Private DNS Zone

resource "azurerm_private_dns_zone" "keyvault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = azurerm_resource_group.main.name
}

Step 2: Create Virtual Network Link with Fallback Enabled

Here’s where we use the AzAPI provider to enable the Fallback to Internet feature:

resource "azapi_resource" "vnet_link" {
  type      = "Microsoft.Network/privateDnsZones/virtualNetworkLinks@2024-06-01"
  name      = "vnet-link-with-fallback"
  parent_id = azurerm_private_dns_zone.keyvault.id
  location  = "global"

  body = jsonencode({
    properties = {
      registrationEnabled = false
      resolutionPolicy    = "NxDomainRedirect"  # Enable fallback
      virtualNetwork = {
        id = azurerm_virtual_network.main.id
      }
    }
  })
}

Step 3: Create the Private Endpoint

resource "azurerm_private_endpoint" "keyvault" {
  name                = "pe-keyvault"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-keyvault"
    private_connection_resource_id = azurerm_key_vault.main.id
    is_manual_connection           = false
    subresource_names              = ["vault"]
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.keyvault.id]
  }
}

Complete Example

Here’s a minimal working example:

# Resource Group
resource "azurerm_resource_group" "main" {
  name     = "rg-dns-fallback-demo"
  location = "East US"
}

# Virtual Network
resource "azurerm_virtual_network" "main" {
  name                = "vnet-demo"
  address_space       = ["10.0.0.0/16"]
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
}

# Subnet for Private Endpoints
resource "azurerm_subnet" "private_endpoints" {
  name                 = "snet-private-endpoints"
  resource_group_name  = azurerm_resource_group.main.name
  virtual_network_name = azurerm_virtual_network.main.name
  address_prefixes     = ["10.0.1.0/24"]
}

# Key Vault
resource "azurerm_key_vault" "main" {
  name                       = "kv-demo-${random_string.suffix.result}"
  location                   = azurerm_resource_group.main.location
  resource_group_name        = azurerm_resource_group.main.name
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "standard"
  
  # Allow public access with firewall rules
  public_network_access_enabled = true
  
  network_acls {
    bypass         = "AzureServices"
    default_action = "Deny"
    ip_rules       = ["YOUR_PUBLIC_IP/32"]  # Add your IPs here
  }
}

# Private DNS Zone
resource "azurerm_private_dns_zone" "keyvault" {
  name                = "privatelink.vaultcore.azure.net"
  resource_group_name = azurerm_resource_group.main.name
}

# Virtual Network Link with Fallback
resource "azapi_resource" "vnet_link" {
  type      = "Microsoft.Network/privateDnsZones/virtualNetworkLinks@2024-06-01"
  name      = "vnet-link-fallback"
  parent_id = azurerm_private_dns_zone.keyvault.id
  location  = "global"

  body = jsonencode({
    properties = {
      registrationEnabled = false
      resolutionPolicy    = "NxDomainRedirect"
      virtualNetwork = {
        id = azurerm_virtual_network.main.id
      }
    }
  })
}

# Private Endpoint
resource "azurerm_private_endpoint" "keyvault" {
  name                = "pe-keyvault"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  subnet_id           = azurerm_subnet.private_endpoints.id

  private_service_connection {
    name                           = "psc-keyvault"
    private_connection_resource_id = azurerm_key_vault.main.id
    is_manual_connection           = false
    subresource_names              = ["vault"]
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.keyvault.id]
  }
}

# Helper resources
data "azurerm_client_config" "current" {}

resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

Testing the Configuration

To verify the Fallback to Internet is working:

# From a VM in the same VNet (should resolve to private IP)
nslookup kv-demo-xxxxx.vault.azure.net

# From a different region/VNet with fallback enabled (should resolve to public IP)
nslookup kv-demo-xxxxx.vault.azure.net

You can also use dig for more detailed DNS information:

dig kv-demo-xxxxx.vault.azure.net
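If you want to script the same check, a small Python helper can resolve the name and label each answer as private or public. This is a minimal sketch: the function names are hypothetical, and it relies on Python's ipaddress module, whose is_private test covers more than the RFC 1918 ranges.

```python
import ipaddress
import socket


def classify_address(ip: str) -> str:
    """Return 'private' when Python's ipaddress module treats the address as
    private (this includes ranges such as 10.0.0.0/8), otherwise 'public'."""
    return "private" if ipaddress.ip_address(ip).is_private else "public"


def resolve_and_classify(hostname: str) -> dict:
    """Resolve a hostname with the local resolver and classify each IPv4 answer."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               proto=socket.IPPROTO_TCP)
    addresses = sorted({info[4][0] for info in infos})
    return {ip: classify_address(ip) for ip in addresses}
```

Run resolve_and_classify("kv-demo-xxxxx.vault.azure.net") from each VNet, replacing the name with your own vault: a VM next to the private endpoint should report a private address, and a VNet that relies on the fallback should report a public one.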

Important Considerations

Security: Always configure firewall rules on your resources when using Fallback to Internet. The feature allows DNS resolution to succeed, but network access still needs to be explicitly allowed.
Cost: Traffic going through the public endpoint may incur data transfer costs, unlike private endpoint traffic within the same region.
Preview Feature: As of this writing, this feature is still in preview. Check Microsoft’s documentation for GA status before using in production.

Conclusion

The Fallback to Internet feature for Azure Private DNS zones provides an elegant solution for multi-region scenarios where full network interconnection isn’t feasible or desired. By allowing DNS resolution to fall back to public endpoints when private resolution fails, it maintains the security benefits of Private Endpoints while providing flexibility for cross-region access patterns. This feature is particularly valuable when:

Operating isolated regions with centralized resources
Dealing with overlapping IP address spaces
Simplifying network architecture without compromising security
Implementing gradual migrations to fully private networking

Combined with proper firewall configuration, it offers a practical middle ground between fully private and fully public access patterns.

Happy scripting!

OCI Cloud Guard: Excepting with custom tags [English]

2025-08-31T23:14:33+00:00

In Oracle Cloud Infrastructure (OCI) environments, it’s common to encounter scenarios where public datasets are hosted in Object Storage to facilitate access for researchers, open-source communities, or partners. However, Cloud Guard, OCI’s automated security service, can generate constant alerts about these intentionally public buckets, creating noise in the monitoring system and making it difficult to identify real threats.

The Challenge: Security vs. Public Access

Cloud Guard is designed to identify insecure configurations and potential vulnerabilities in your OCI tenancy. One of its most sensitive detectors identifies Object Storage buckets with public access, as these represent a potential risk of sensitive data exposure. But what happens when your buckets must be public by design?

The Right Solution: Modify Detection Rules

Of the available options to handle this scenario, the correct answer is to modify the Cloud Guard Detection Rules configuration to exclude known public buckets from security scans.

Why is this the best option?

Granularity: Allows you to keep Cloud Guard active for all other resources
Security: Doesn’t compromise the overall security posture of your tenancy
Flexibility: You can apply specific exceptions without disabling important protections
Scalability: Easy to maintain as your infrastructure grows

Why aren’t the other options suitable?

Disable Cloud Guard completely: Would remove protection from your entire tenancy
Convert buckets to private: Contradicts the purpose of sharing data publicly
Create a separate compartment without Cloud Guard: Leaves a segment of your infrastructure unmonitored, creating a security blind spot

Implementing Exceptions with Custom Tags

The most elegant and maintainable way to implement this solution is by using custom tags combined with Detection Rules configuration.

Step 1: Create a Tag Namespace and Tag

First, create a tag namespace and a specific tag to identify authorized public resources:

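# Create Tag Namespace
oci iam tag-namespace create \
  --compartment-id COMPARTMENT_OCID \
  --name "SecurityExceptions" \
  --description "TAG_NAMESPACE_DESCRIPTION"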

# Create Tag
oci iam tag create \
  --tag-namespace-id TAG_NAMESPACE_OCID \
  --name "exception" \
  --description "Indicates that the resource is an authorized exception"

Step 2: Apply the Tag to Public Buckets

Tag the buckets that are intentionally public. In our case, we’ll tag the publicBucket in the ocilabs compartment:

oci os bucket update \
  --bucket-name publicBucket \
  --namespace OBJECT_STORAGE_NAMESPACE \
  --freeform-tags '{"exception":"true"}'

Or using Defined Tags:

oci os bucket update \
  --bucket-name publicBucket \
  --namespace OBJECT_STORAGE_NAMESPACE \
  --defined-tags '{"SecurityExceptions":{"exception":"true"}}'

Step 3: Configure Cloud Guard Detection Rules

Now comes the crucial part: modifying the Detection Rule that detects public buckets so it ignores those with the appropriate tag.

Option A: Using the OCI Console

Navigate to Security → Cloud Guard → Configuration
Select the Detector Recipe you’re using
Find the “Public Bucket” rule (typically OBJECT_STORE_PUBLIC_BUCKET)
Click Edit Rule
In the Condition section, add a condition to exclude resources with the tag:

resource.type = 'Bucket' 
AND resource.publicAccessType IN ('ObjectRead', 'ObjectReadWithoutList')
AND NOT (resource.freeformTags.exception = 'true')
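The logic of that condition can be sketched in a few lines of Python, which is handy for checking your expectations against a bucket inventory before touching the recipe. This is an illustration only: the should_flag function and the dictionary layout are hypothetical and are not Cloud Guard's evaluation engine.

```python
PUBLIC_ACCESS_TYPES = {"ObjectRead", "ObjectReadWithoutList"}


def should_flag(bucket: dict) -> bool:
    """Flag a bucket when it is public and does not carry exception=true."""
    is_public = bucket.get("publicAccessType") in PUBLIC_ACCESS_TYPES
    tags = bucket.get("freeformTags") or {}
    is_exception = tags.get("exception") == "true"
    return is_public and not is_exception


# Hypothetical inventory entries
buckets = [
    {"name": "publicBucket", "publicAccessType": "ObjectRead",
     "freeformTags": {"exception": "true"}},
    {"name": "leaky-bucket", "publicAccessType": "ObjectRead", "freeformTags": {}},
    {"name": "private-bucket", "publicAccessType": "NoPublicAccess"},
]
flagged = [b["name"] for b in buckets if should_flag(b)]
```

Only leaky-bucket ends up in flagged: the tagged public bucket is treated as an approved exception and the private bucket never matched in the first place.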

Option B: Using OCI CLI

# Get the current detector recipe
oci cloud-guard detector-recipe get \
  --detector-recipe-id DETECTOR_RECIPE_OCID \
  > detector-recipe.json

# Edit the detector-recipe.json file to update the condition
# Then update the detector recipe
oci cloud-guard detector-recipe update \
  --detector-recipe-id DETECTOR_RECIPE_OCID \
  --from-json file://detector-recipe.json

Practical Example: Configuring publicBucket in ocilabs

Let’s walk through a complete example for the publicBucket in the ocilabs compartment:

Tag the Bucket

# First, get your namespace
export NAMESPACE=$(oci os ns get --query 'data' --raw-output)

# Tag the publicBucket
oci os bucket update \
  --bucket-name publicBucket \
  --namespace $NAMESPACE \
  --freeform-tags '{"exception":"true"}' \
  --compartment-id COMPARTMENT_OCID

Verify the Tag

oci os bucket get \
  --bucket-name publicBucket \
  --namespace $NAMESPACE \
  --query 'data."freeform-tags"'

Expected output:

{
  "exception": "true"
}

Update Cloud Guard Detector Recipe

# List detector recipes to find the one you're using
oci cloud-guard detector-recipe list \
  --compartment-id COMPARTMENT_OCID \
  --lifecycle-state ACTIVE

Clone the Oracle-managed recipe if you haven’t already

oci cloud-guard detector-recipe create \
  --compartment-id COMPARTMENT_OCID \
  --display-name "Custom Detector Recipe - Public Buckets Exception" \
  --source-detector-recipe-id ORACLE_MANAGED_RECIPE_OCID

Best Practices

Documentation

Maintain a record of all resources tagged as exceptions:

# Cloud Guard Exceptions

| Resource | Compartment | Tag | Justification | Date | Approved by |
|----------|-------------|-----|---------------|------|-------------|
| publicBucket | ocilabs | exception:true | Public datasets for research | 2025-10-31 | Security Team |

Periodic Review

Implement a quarterly review process to validate that exceptions are still necessary:

# List all buckets with the exception tag
oci search resource structured-search \
  --query-text "query bucket resources where (freeformTags.key = 'exception' && freeformTags.value = 'true')"
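For the review itself, it helps to compare what is actually tagged against what is written down. The Python sketch below assumes you have already collected two lists of bucket names, one from the search above and one from your exceptions record. The function name and the sample names are hypothetical.

```python
def review_exceptions(tagged_buckets, documented_buckets):
    """Compare buckets carrying the exception tag with the documented register."""
    tagged = set(tagged_buckets)
    documented = set(documented_buckets)
    return {
        # Tagged in OCI but never approved on paper: investigate first.
        "undocumented": sorted(tagged - documented),
        # Approved on paper but no longer tagged: the register entry is stale.
        "stale": sorted(documented - tagged),
    }


report = review_exceptions(
    tagged_buckets=["publicBucket", "research-data", "scratch-share"],
    documented_buckets=["publicBucket", "research-data", "open-datasets"],
)
```

Here scratch-share shows up as undocumented and open-datasets as stale, which are the two lists a quarterly review should drive to empty.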

Custom Alerts

Configure alerts when new public buckets are created without the appropriate tag:

resource "oci_events_rule" "public_bucket_without_tag" {
  compartment_id = var.compartment_ocid
  display_name   = "Alert on untagged public bucket"
  is_enabled     = true
  
  condition = <<-EOT
    {
      "eventType": ["com.oraclecloud.objectstorage.createbucket", "com.oraclecloud.objectstorage.updatebucket"],
      "data": {
        "additionalDetails": {
          "publicAccessType": ["ObjectRead", "ObjectReadWithoutList"]
        }
      }
    }
  EOT
  
  actions {
    actions {
      action_type = "ONS"
      is_enabled  = true
      topic_id    = oci_ons_notification_topic.security_alerts.id
      description = "Notify security team of untagged public bucket"
    }
  }
}

# Create ONS topic for alerts
resource "oci_ons_notification_topic" "security_alerts" {
  compartment_id = var.compartment_ocid
  name           = "security-alerts"
  description    = "Security alerts for Cloud Guard exceptions"
}

# Subscribe to the topic
resource "oci_ons_subscription" "security_team_email" {
  compartment_id = var.compartment_ocid
  endpoint       = "security-team@example.com"
  protocol       = "EMAIL"
  topic_id       = oci_ons_notification_topic.security_alerts.id
}

Principle of Least Privilege

Ensure that only authorized users can apply the exception tag:

resource "oci_identity_policy" "tag_management" {
  compartment_id = var.tenancy_ocid
  name           = "security-exceptions-tag-policy"
  description    = "Control security exception tags"

  statements = [
    "Allow group SecurityAdmins to manage buckets in compartment ocilabs where request.user.name != 'unauthorized-user'",
    "Allow group SecurityAdmins to use tag-namespaces in tenancy where target.tag-namespace.name='SecurityExceptions'"
  ]
}

Automation Script

Create a script to automate the tagging process for multiple buckets:

#!/bin/bash
# tag-public-buckets.sh

NAMESPACE=$(oci os ns get --query 'data' --raw-output)
COMPARTMENT_ID="COMPARTMENT_OCID"

# Array of public buckets that should be tagged
PUBLIC_BUCKETS=("publicBucket" "research-data" "open-datasets")

for bucket in "${PUBLIC_BUCKETS[@]}"; do
  echo "Tagging bucket: $bucket"
  oci os bucket update \
    --bucket-name "$bucket" \
    --namespace "$NAMESPACE" \
    --freeform-tags '{"exception":"true"}' \
    --compartment-id "$COMPARTMENT_ID" \
    --force
  
  if [ $? -eq 0 ]; then
    echo "✓ Successfully tagged $bucket"
  else
    echo "✗ Failed to tag $bucket"
  fi
done

Monitoring and Auditing

Implement logging to track changes to Detection Rules and tagged resources:

# Enable Cloud Guard logging
oci logging log create \
  --display-name "cloudguard-config-changes" \
  --log-group-id LOG_GROUP_OCID \
  --log-type SERVICE \
  --configuration '{
    "source": {
      "sourceType": "OCISERVICE",
      "service": "cloudguard",
      "resource": "",
      "category": "write"
    },
    "archiving": {
      "isEnabled": true
    }
  }'

# Enable Object Storage logging for bucket updates
oci logging log create \
  --display-name "bucket-modification-logs" \
  --log-group-id LOG_GROUP_OCID \
  --log-type SERVICE \
  --configuration '{
    "source": {
      "sourceType": "OCISERVICE",
      "service": "objectstorage",
      "resource": "publicBucket",
      "category": "write"
    }
  }'

Conclusion

Modifying Cloud Guard Detection Rules to exclude specific resources through custom tags is the most professional and maintainable way to manage intentionally public buckets in OCI. This strategy allows you to:

Maintain a robust security posture
Reduce false positive alert noise
Scale your infrastructure without compromising security
Maintain visibility and control over exceptions

For the specific case of publicBucket in the ocilabs compartment with the exception:true tag, this approach ensures that your research datasets remain accessible while Cloud Guard continues to protect the rest of your infrastructure. Remember that security is an ongoing process. Establish clear procedures for exception management, document all decisions, and regularly review your configuration to ensure it remains aligned with your organization’s needs.

Happy scripting!

GCP Chronicle SIEM Detection Rules with YARA-L 2.0

2025-08-11T10:00:00+00:00

When working with GCP at any meaningful scale, you quickly realize that logs are not your problem — you have plenty of them. Cloud Audit Logs capture every IAM mutation, every API call to Secret Manager, every VPC firewall change. VPC Flow Logs record every accepted and rejected connection. Cloud Armor logs tell you what traffic was blocked at the edge. The problem is that none of those logs, on their own, will tell you that something is wrong. A single SetIamPolicy event binding roles/owner to an external email address looks exactly like a routine administrative change unless something correlates it, evaluates it against a policy, and raises an alert.

That gap between raw log data and actionable detection is where a SIEM lives. GCP’s native answer is Chronicle — a petabyte-scale security analytics platform built on Google’s infrastructure, with a normalized data model and a purpose-built detection rule language called YARA-L 2.0. This post covers the full path: ingesting GCP logs into Chronicle, understanding the Unified Data Model (UDM) that normalizes those logs, and writing three concrete detection rules that cover the most common GCP security incidents.

What Chronicle Is and How It Differs

Chronicle started as a security company spun out of Alphabet’s X lab, with a first product called Backstory, before it was folded into Google Cloud and became a generally available service. It is not a log aggregation tool with a search UI bolted on; it is built specifically for security analytics, and that design decision shows up in a few important ways.

Unified Data Model (UDM) is the normalization layer at the center of everything. When a Cloud Audit Log entry arrives in Chronicle, it is parsed and mapped to a standardized schema. An IAM change becomes a USER_RESOURCE_UPDATE_PERMISSIONS event with a principal, a target, and structured security_result fields. A network connection becomes a NETWORK_CONNECTION event with network.ip_protocol, principal.ip, and target.port fields. Every event type, regardless of the originating product, maps to the same field names. This is what makes detection rules portable and readable — you write rules against UDM fields, not against the raw JSON structure of a specific product’s log format.

Petabyte-scale retention at a flat rate is the other structural difference. Chronicle’s default retention is one year with no per-GB ingestion cost for a set of natively supported log types, including Cloud Audit Logs. The cost model is per-user rather than per-volume, which changes the calculus around what you can afford to keep searchable.

Google Threat Intelligence is built in. Chronicle can automatically correlate IOCs — IPs, domains, file hashes — against Google’s threat intelligence feed without a separate connector or add-on license.

For readers coming from other SIEMs, here is a quick orientation:

Feature	Chronicle	Splunk	Microsoft Sentinel
Query language	YARA-L 2.0 (rules) + UDM Search	SPL	KQL
Data model	UDM (normalized)	Raw + CIM	ASIM + raw
Retention	1 year default, petabyte-scale	License-dependent	Log Analytics workspace
GCP log integration	Native	Via HEC/syslog	Via connector
Threat intel	Google TI built-in	ThreatIntelligence add-on	MDTI connector

YARA-L 2.0 is closer in feel to a structured rule language (like Sigma) than to a query language (like SPL or KQL). You declare what events you are looking for, define how to group them over time, and specify the condition under which the rule fires. If you have written Sigma rules or Snort/Suricata rules before, the pattern will be familiar. If you come from Splunk, the shift from “search and transform” to “declare and match” takes a little adjustment but makes rules easier to audit and version-control.

Architecture: Getting GCP Logs into Chronicle

The ingestion path from Cloud Logging to Chronicle has two options. The newer path is a direct Chronicle export configured in the Chronicle UI under Settings > Feeds, using Google Cloud Pub/Sub as the transport. The older (and still fully supported) path uses a Log Router sink to push to Pub/Sub, and then a Chronicle Pub/Sub feed pulls from that topic. Both paths land log data in Chronicle as UDM events within a few minutes of the original API call.

Cloud Logging (Cloud Audit Logs, VPC Flow Logs, Cloud Armor)
         |
         | Log Router sink
         v
     Pub/Sub topic (chronicle-gcp-logs)
         |
         | Chronicle Pub/Sub feed
         v
    Chronicle (UDM normalization + retention)
         |
         +--- YARA-L 2.0 detection rules
         |
         +--- Alerts / Findings

The Log Router sink approach gives you the most control over which log entries flow to Chronicle, because the sink filter is a full Cloud Logging filter expression. You can scope it to specific services, specific resource types, or specific severity levels, and you can tune it later without touching Chronicle’s configuration.

Prerequisites

You will need:

A GCP project with Owner or Security Admin access
gcloud CLI installed and authenticated
A Chronicle tenant provisioned (Chronicle is a separate license — contact your Google Cloud rep or check the Chronicle trial program)
Cloud Audit Logs enabled for the services you want to monitor (Admin Activity is always on; Data Access must be explicitly enabled)

Verify your active project and authentication:

gcloud config get-value project
gcloud auth list

Enable the required APIs:

gcloud services enable \
  logging.googleapis.com \
  pubsub.googleapis.com \
  cloudresourcemanager.googleapis.com \
  secretmanager.googleapis.com \
  compute.googleapis.com \
  --project=PROJECT_ID

Setting Up Log Ingestion

Creating the Pub/Sub Topic and Log Router Sink

The sink filter below covers the three services we will write detection rules for: IAM (via Cloud Resource Manager), Secret Manager, and Compute Engine (for VPC firewall rules). You can extend the filter to include additional services as you add rules.

# Create the Pub/Sub topic that Chronicle will pull from
gcloud pubsub topics create chronicle-gcp-logs \
  --project=PROJECT_ID

# Create the Log Router sink with a filter scoped to the services we care about
gcloud logging sinks create chronicle-sink \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/chronicle-gcp-logs \
  --log-filter='protoPayload.serviceName=("cloudresourcemanager.googleapis.com" OR "secretmanager.googleapis.com" OR "compute.googleapis.com")' \
  --project=PROJECT_ID
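As the list of monitored services grows, it is easy to break the quoting in that filter by hand. A small helper can generate it instead. This is a minimal sketch with a hypothetical function name; it reproduces the filter shape used above and nothing more.

```python
def build_sink_filter(services):
    """Build a Cloud Logging filter matching audit entries for the given services."""
    if not services:
        raise ValueError("at least one service name is required")
    quoted = " OR ".join(f'"{service}"' for service in services)
    return f"protoPayload.serviceName=({quoted})"


sink_filter = build_sink_filter([
    "cloudresourcemanager.googleapis.com",
    "secretmanager.googleapis.com",
    "compute.googleapis.com",
])
```

The resulting string can be passed to --log-filter when creating the sink, or to gcloud logging sinks update when you add a service later.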

The sink creates a dedicated service account (serviceAccount:...@gcp-sa-logging.iam.gserviceaccount.com) that needs publish rights on the topic. Retrieve it and grant the permission:

# Get the sink's writer identity
SINK_SA=$(gcloud logging sinks describe chronicle-sink \
  --project=PROJECT_ID \
  --format='value(writerIdentity)')

echo "Sink service account: ${SINK_SA}"

# Grant publish rights on the topic
gcloud pubsub topics add-iam-policy-binding chronicle-gcp-logs \
  --member="${SINK_SA}" \
  --role="roles/pubsub.publisher" \
  --project=PROJECT_ID

Configuring the Chronicle Pub/Sub Feed

With the topic receiving log data, open the Chronicle UI and navigate to Settings > Feeds > Add Feed. Select Google Cloud Pub/Sub as the source type, choose Google Cloud Audit Logs as the log type, and enter your project ID and the topic name chronicle-gcp-logs. Chronicle will use its own service account to subscribe to the topic — copy the Chronicle service account email shown in the UI and grant it the subscriber role:

# Replace CHRONICLE_SA with the service account shown in the Chronicle feed UI
gcloud pubsub topics add-iam-policy-binding chronicle-gcp-logs \
  --member="serviceAccount:CHRONICLE_SA" \
  --role="roles/pubsub.subscriber" \
  --project=PROJECT_ID

Once the feed is saved and active, Cloud Audit Log entries will begin arriving in Chronicle within a few minutes. You can validate ingestion in the Chronicle UI via UDM Search — search for metadata.product_name = "Cloud Audit Logs" and confirm events are appearing.

YARA-L 2.0 Detection Rules

Now that we have log data flowing, let’s implement the detection logic. YARA-L 2.0 rules have a fixed structure with five sections. Understanding each section before looking at full rules makes the syntax click faster.

Rule Structure

rule rule_name {
  meta:
    author = "Victor Silva"
    description = "What this rule detects"
    severity = "HIGH"      // CRITICAL, HIGH, MEDIUM, LOW, INFORMATIONAL
    priority = "HIGH"
    type = "ALERT"         // ALERT fires in the Alerts view
                           // RULE_TYPE_UNSPECIFIED creates informational findings

  events:
    // UDM field predicates; all must match for the rule to consider an event
    $e.metadata.event_type = "USER_RESOURCE_UPDATE_PERMISSIONS"
    $e.target.resource.type = "GCP_IAM_POLICY"
    // Bind the grouping key to a placeholder variable so match can use it
    $e.principal.user.userid = $user

  match:
    // Optional; used for multi-event rules to define the grouping key and
    // time window. For single-event rules this section is omitted.
    $user over 1h

  outcome:
    // Variables available in the alert details. These surface in the alert
    // and can be used for triage without opening the raw log. Outcome comes
    // before condition, and rules with a match section aggregate event fields.
    $risk_score = max(85)
    $principal_email = array_distinct($e.principal.user.userid)

  condition:
    // Specifies when the rule fires. "$e" means "at least one matching event".
    // For multi-event rules you can write "#e > 5" or combine variables.
    $e
}

A few UDM field namespaces you will use constantly:

$e.metadata — event type, product name, log type, timestamps
$e.principal — who initiated the action (user, service account, IP)
$e.target — what resource was acted on
$e.network — network connection details (protocol, ports, IPs)
$e.security_result — outcome, threat indicators, verdict

Single-event rules match on one event at a time: the match section is omitted and condition is just $e. Multi-event rules correlate multiple events within a time window, grouping them by a key. For example, matching on the user ID over 1h groups each user’s events into one-hour windows, and a condition such as #e > 5 then fires when a single user matches the event predicate more than five times in a window.
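If the grouping behaviour of multi-event rules is hard to picture, the following Python sketch imitates it on a list of (user, timestamp) pairs. It is a simplified model, not Chronicle's engine: the function name, the sliding-window approach, and the threshold of five are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def users_over_threshold(events, window=timedelta(hours=1), threshold=5):
    """Return users with more than `threshold` events inside any single window."""
    by_user = defaultdict(list)
    for user, timestamp in events:
        by_user[user].append(timestamp)

    flagged = set()
    for user, stamps in by_user.items():
        stamps.sort()
        start = 0
        for end in range(len(stamps)):
            # Shrink the window from the left until it spans less than `window`.
            while stamps[end] - stamps[start] >= window:
                start += 1
            if end - start + 1 > threshold:
                flagged.add(user)
                break
    return sorted(flagged)


base = datetime(2025, 8, 11, 10, 0)
sample = [("alice", base + timedelta(minutes=5 * i)) for i in range(6)]
sample += [("bob", base + timedelta(hours=2 * i)) for i in range(6)]
noisy_users = users_over_threshold(sample)
```

Both users have six events, but only alice produces them within one hour, so she is the only one the threshold catches.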

Rule 1: IAM Privilege Escalation

This is the highest-priority rule to have active. Granting roles/owner or roles/editor to any principal — especially an external one or a service account that should not have project-wide permissions — is one of the most reliable signals of a compromised account or an insider threat.

The rule fires on a single event, because even one such IAM binding change is worth immediate investigation.

rule gcp_iam_privilege_escalation_owner_editor {
  meta:
    author = "Victor Silva"
    description = "Detects when owner or editor role is granted to any principal"
    severity = "HIGH"
    priority = "HIGH"
    type = "ALERT"

  events:
    $e.metadata.event_type = "USER_RESOURCE_UPDATE_PERMISSIONS"
    $e.target.resource.type = "GCP_IAM_POLICY"
    (
      re.regex($e.target.resource.attribute.labels["role"], `roles/owner`) or
      re.regex($e.target.resource.attribute.labels["role"], `roles/editor`)
    )

  outcome:
    $principal_email = $e.principal.user.userid
    $project = $e.target.resource.name
    $role_granted = $e.target.resource.attribute.labels["role"]

  condition:
    $e
}

The re.regex() function is used here rather than a direct equality check because the role value in the UDM label may contain additional context in some log formats. re.regex() matches when the pattern occurs anywhere in the value, so the unanchored pattern roles/owner catches the binding regardless of surrounding characters.
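The difference between a substring match and an exact match is easy to verify outside Chronicle. The Python sketch below uses re.search as a stand-in for substring semantics and re.fullmatch for equality. The helper names and the sample label values are hypothetical, and Python's regex engine is not the one Chronicle uses, although simple patterns like these behave the same way.

```python
import re


def substring_match(pattern: str, value: str) -> bool:
    """True when the pattern occurs anywhere in the value."""
    return re.search(pattern, value) is not None


def exact_match(pattern: str, value: str) -> bool:
    """True only when the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


plain_label = "roles/owner"
decorated_label = "binding added: roles/owner (project level)"
```

The substring match fires on both labels, while the exact match fires only on the plain one. That is the trade-off: the unanchored form is resilient to extra context, but it will also match any value that merely contains the role name.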

The outcome variables $principal_email, $project, and $role_granted will appear directly in the Chronicle alert details, giving the analyst the three facts they need to start triage without having to dig into raw log data.

Rule 2: Secret Manager Anomalous Access

Secret Manager access patterns are a reliable detection surface. In a well-governed project, the set of service accounts that legitimately read secrets is small and known. Any access from outside that approved set warrants investigation — it could indicate a compromised application service account, lateral movement, or exfiltration of credentials.

This rule uses a not re.regex() predicate to implement a simple allowlist approach. You will customize the regex to match your project’s naming convention for approved service accounts.

rule gcp_secret_manager_anomalous_access {
  meta:
    author = "Victor Silva"
    description = "Detects secret access from service accounts not in the approved list"
    severity = "MEDIUM"
    priority = "MEDIUM"
    type = "ALERT"

  events:
    $e.metadata.product_name = "Secret Manager"
    $e.metadata.event_type = "USER_RESOURCE_ACCESS"
    re.regex($e.metadata.product_event_type, `AccessSecretVersion`)
    not re.regex($e.principal.user.userid, `approved-sa@my-project\.iam\.gserviceaccount\.com`)
    not re.regex($e.principal.user.userid, `another-approved-sa@my-project\.iam\.gserviceaccount\.com`)

  outcome:
    $principal_email = $e.principal.user.userid
    $secret_name = $e.target.resource.name

  condition:
    $e
}

A few implementation notes for this rule in practice. First, make sure Data Access logs are enabled for Secret Manager — Admin Activity logs do not capture AccessSecretVersion calls, only Data Access logs do. Enable them with:

# Export current IAM policy
gcloud projects get-iam-policy PROJECT_ID --format=json > policy.json

Add the Secret Manager Data Access audit config to policy.json and re-apply:

{
  "auditConfigs": [
    {
      "service": "secretmanager.googleapis.com",
      "auditLogConfigs": [
        { "logType": "DATA_READ" }
      ]
    }
  ]
}

gcloud projects set-iam-policy PROJECT_ID policy.json
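Because set-iam-policy replaces the whole policy, the edit to policy.json has to preserve the existing bindings and etag. If you would rather script the merge than edit the file by hand, a Python sketch along these lines works. The function name is hypothetical and the sample policy is made up for illustration.

```python
import copy


def add_data_read_audit(policy: dict, service: str) -> dict:
    """Return a copy of the policy with DATA_READ audit logging enabled for a service."""
    merged = copy.deepcopy(policy)  # leave the caller's policy untouched
    configs = merged.setdefault("auditConfigs", [])
    entry = next((c for c in configs if c.get("service") == service), None)
    if entry is None:
        entry = {"service": service, "auditLogConfigs": []}
        configs.append(entry)
    log_configs = entry.setdefault("auditLogConfigs", [])
    if not any(c.get("logType") == "DATA_READ" for c in log_configs):
        log_configs.append({"logType": "DATA_READ"})
    return merged


# Hypothetical exported policy
current = {
    "bindings": [{"role": "roles/viewer", "members": ["user:a@example.com"]}],
    "etag": "abc",
}
updated = add_data_read_audit(current, "secretmanager.googleapis.com")
```

Load policy.json with json.load, pass it through the function, and write the result back before running set-iam-policy. Running it twice is harmless because the entry is only added when it is missing.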

Second, the allowlist in the rule above is a starting point. As you expand the rule to cover multiple projects or a more complex service account naming scheme, consider using re.regex() with a pattern that matches your entire approved namespace (for example, ^(app-backend|app-worker)-sa@my-project\.iam\.gserviceaccount\.com$) rather than listing each approved account individually.
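Before pasting a namespace pattern into a rule, it is worth checking it against a handful of real and deliberately wrong identities. The Python sketch below does that with the example pattern from the paragraph above. The helper names and the test identities are hypothetical, and Python's re module stands in for Chronicle's regex engine.

```python
import re

APPROVED = re.compile(
    r"^(app-backend|app-worker)-sa@my-project\.iam\.gserviceaccount\.com$"
)


def is_approved(principal: str) -> bool:
    """True when the principal matches the approved service account namespace."""
    return APPROVED.search(principal) is not None


def unexpected_principals(principals):
    """Return the distinct principals that fall outside the allowlist."""
    return sorted({p for p in principals if not is_approved(p)})


seen = [
    "app-backend-sa@my-project.iam.gserviceaccount.com",
    "app-worker-sa@my-project.iam.gserviceaccount.com",
    "evil-app-worker-sa@my-project.iam.gserviceaccount.com",
    "app-backend-sa@other-project.iam.gserviceaccount.com",
]
to_investigate = unexpected_principals(seen)
```

The anchors matter here: without ^ and $, the look-alike account evil-app-worker-sa would slip through the allowlist.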

Rule 3: Overly Permissive VPC Firewall Rule

Firewall rules allowing ingress from 0.0.0.0/0 are a routine audit finding that rarely gets caught at creation time. By the time a security reviewer looks at the firewall configuration, the rule has been in place for weeks and removing it requires coordination with application teams. This rule catches the problem the moment the firewall rule is created.

rule gcp_vpc_firewall_open_ingress {
  meta:
    author = "Victor Silva"
    description = "Detects VPC firewall rules allowing ingress from 0.0.0.0/0"
    severity = "HIGH"
    priority = "HIGH"
    type = "ALERT"

  events:
    $e.metadata.event_type = "USER_RESOURCE_CREATION"
    $e.target.resource.type = "GCP_VPC_FIREWALL_RULE"
    $e.target.resource.attribute.labels["direction"] = "INGRESS"
    $e.target.resource.attribute.labels["source_ranges"] = "0.0.0.0/0"

  outcome:
    $principal_email = $e.principal.user.userid
    $firewall_rule = $e.target.resource.name
    $network = $e.target.resource.attribute.labels["network"]

  condition:
    $e
}

This rule matches only USER_RESOURCE_CREATION events, so it fires when a firewall rule is first created but not when an existing rule is later modified to open ingress; covering that case needs a second rule on the corresponding update event type. If your environment has existing open ingress rules that you want to detect in historical data, the retroactive search approach covered later in this post will surface them without waiting for a new creation event.

One refinement worth considering: if your environment legitimately uses 0.0.0.0/0 ingress for certain ports (like port 80/443 for public-facing load balancers), add a predicate to exclude those specific port combinations, or adjust the severity to MEDIUM and route it to an informational finding queue for human review rather than an automated alert.
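One way to reason about that refinement before encoding it in YARA-L is to write the decision down as plain logic. The Python sketch below is an illustration under assumptions: the function name, the choice of 80 and 443 as acceptable public ports, and the severity mapping are examples, not a recommendation for every environment.

```python
WEB_PORTS = {"80", "443"}


def open_ingress_severity(direction, source_ranges, ports):
    """Return None, 'MEDIUM', or 'HIGH' for a firewall rule description."""
    # Not an open-ingress rule at all: nothing to report.
    if direction != "INGRESS" or "0.0.0.0/0" not in source_ranges:
        return None
    # Open to the internet, but only on the expected public web ports.
    if ports and set(ports) <= WEB_PORTS:
        return "MEDIUM"
    # Open to the internet on other ports, or on all ports (empty list).
    return "HIGH"
```

A rule that opens 0.0.0.0/0 on port 443 is downgraded to MEDIUM for human review, while the same source range on port 22, or with no port restriction, stays HIGH.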

Deploying Rules in Chronicle

With the rule text ready, deploying to Chronicle takes a few steps in the UI.

Navigate to Detection Engine > Rules and click New Rule. Paste the rule text into the YARA-L editor. Chronicle validates the syntax inline — if any field names or function calls are incorrect, the editor highlights the error and shows the expected format. Fix any validation errors before saving.

Once the rule validates cleanly, configure two settings:

Alert vs. Informational: Rules with type = "ALERT" in the meta section create entries in the Alerts view, trigger notification integrations, and are tracked through Chronicle’s case management workflow. Rules with type = "RULE_TYPE_UNSPECIFIED" create informational findings that appear in the Rules view but do not create alerts. Start new rules as informational until you have validated them against real traffic, then promote them to alert.

Enabled vs. Disabled: Rules do not evaluate incoming events until they are explicitly enabled. After saving, toggle the rule to Enabled using the status switch in the Rules list.

Chronicle evaluates enabled rules against incoming UDM events in near-real-time — new events that match an enabled rule create findings within a few minutes of the original log event.

Testing Rules with Retroactive Search

One of Chronicle’s most practical features for detection engineering is the ability to run a rule against historical data. This lets you validate that a new rule would have fired on past events (useful for confirming it catches real threats) and estimate its alert volume before enabling it on live data.

To run a retroactive search, open the rule in the Rules editor and click Run Retroactive Search. Set the time range (up to the retention window — one year by default) and submit. Chronicle processes the historical UDM events against the rule and shows you a list of matches with timestamps and outcome variable values.

This workflow is where the outcome variables pay off. A retroactive search on the IAM privilege escalation rule will show you every $principal_email, $project, and $role_granted value from the past year — you can immediately see whether the rule would have caught real events or whether it is firing on expected administrative activity that needs to be excluded.

For the Secret Manager rule, run a retroactive search over a 30-day window and review the $principal_email values in the results. Any service account identity you do not recognize should be investigated; any known-good identity that appeared should be added to the allowlist in the rule before you enable it on live data.

Best Practices

Start with the UDM field reference, not trial and error. Chronicle’s documentation includes a complete UDM field reference that lists every available field, its type, and which event types populate it. Before writing a new rule, look up which UDM event type corresponds to the action you want to detect and which fields are populated for that event type. Writing rules against unpopulated fields produces rules that silently never match — the field predicate evaluates as false because the field does not exist in the event, not because events are not arriving.

Manage alert fatigue before it becomes a problem. A detection rule that fires 200 times per day for expected activity is worse than no rule at all — it trains analysts to ignore the alert queue. Before enabling a rule in alert mode, run a retroactive search over two weeks of historical data and count the matches. If the volume is too high, tune the rule with additional predicates, add an allowlist for known-good identities, or run it as an informational finding for a week to collect baseline data before deciding on the right threshold.
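
As a rough way to put a number on that volume, the matches can be grouped by day. The sketch below assumes a hypothetical export of the retroactive search results with one ISO-8601 detection timestamp per line; the file format and the function name matches_per_day are illustrative, not something Chronicle produces in this form.

```shell
#!/bin/sh
# Sketch: count retroactive-search matches per day.
# Assumes a hypothetical export with one ISO-8601 timestamp per line.
matches_per_day() {
  # Keep the YYYY-MM-DD prefix, then count identical dates.
  cut -c1-10 "$1" | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example run with made-up timestamps
tmp=$(mktemp)
printf '%s\n' \
  "2026-03-01T08:15:00Z" \
  "2026-03-01T17:42:10Z" \
  "2026-03-02T09:03:55Z" > "$tmp"
matches_per_day "$tmp"
rm -f "$tmp"
```

A day whose count is far above the rest is usually the first place to look for an allowlist candidate.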

Version-control your rules. YARA-L rule text is plain text — store it in a Git repository alongside your other infrastructure code. Chronicle’s API allows programmatic rule management (create, update, enable, disable) via the Chronicle REST API, so you can integrate rule deployment into a CI/CD pipeline with peer review and change tracking. Treat detection rules with the same engineering discipline as Terraform modules: they have the same blast radius when they go wrong.
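
A pipeline of that kind could start with a cheap text check before any API call is made. The sketch below is only a naive lint under stated assumptions: it assumes rules live in a hypothetical rules/ directory with a .yaral extension, and it only confirms that the meta, events, and condition sections are present. It does not replace Chronicle's own rule verification.

```shell
#!/bin/sh
# Sketch: naive pre-merge check that a rule file contains the sections
# a YARA-L 2.0 rule normally carries. Text check only.
check_rule_file() {
  file=$1
  for section in 'meta:' 'events:' 'condition:'; do
    if ! grep -q "^[[:space:]]*${section}" "$file"; then
      echo "$file: missing section ${section}" >&2
      return 1
    fi
  done
  return 0
}

# Hypothetical repository layout: rules/*.yaral
status=0
for f in rules/*.yaral; do
  [ -e "$f" ] || continue   # nothing to check if no rule files exist
  check_rule_file "$f" || status=1
done
echo "lint status: $status"
```

In a CI job, the final status would typically be turned into the job's exit code so that a failing rule blocks the merge.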

Use outcome variables to make alerts self-contained. Every field you expose in outcome appears directly in the Chronicle alert details without requiring the analyst to open the raw log. The more context you surface in outcomes — principal identity, resource name, affected project, IP address — the faster triage goes. Think of outcome variables as the executive summary of the alert.

Separate rule type from severity. A rule can have severity = "HIGH" and type = "RULE_TYPE_UNSPECIFIED" — high severity, informational mode. Use this during the tuning phase for rules that detect genuinely high-risk behaviors but have not yet been validated against your specific environment’s baseline. It gives you visibility into the events without generating alert noise while you tune.

Conclusion

What we built here is the foundation of a detection engineering practice on GCP. Cloud Audit Logs flowing through a Log Router sink into Chronicle give you a normalized, petabyte-scale searchable corpus of everything happening in your GCP environment. YARA-L 2.0 rules — one for IAM privilege escalation, one for Secret Manager anomalous access, one for overly permissive firewall creation — give you concrete detections for three of the most common GCP security incidents. Retroactive search lets you validate those rules against real historical data before they start generating live alerts.

The next step is expanding coverage. The same YARA-L pattern applies to Cloud Storage public bucket access, GKE workload identity escalation, service account key creation, and Cloud Run deployments from unverified container images. Each new rule follows the same structure: understand the UDM event type, identify the fields that distinguish the suspicious behavior from normal activity, write the predicate, validate with retroactive search, and enable.

If you are working on the underlying infrastructure security that feeds into these detections, the posts on GCP Secret Manager with Terraform and GCP VPC Service Controls with Terraform cover the preventive controls that reduce the surface area these detection rules are monitoring. For runtime threat detection at the workload layer, Falco runtime security for Kubernetes complements Chronicle’s API-level visibility with in-cluster syscall-level detection. For the GKE workload identity escalation scenario mentioned above, GCP Binary Authorization for GKE with Terraform provides the admission-time control that pairs with Chronicle’s post-deployment detection.

Happy scripting!

Oracle Cloud Security Zones [English]

2025-08-05T22:36:18+00:00

In today’s cloud-first world, security isn’t just about monitoring threats—it’s about preventing them from happening in the first place. Oracle Cloud Infrastructure (OCI) Security Zones provide exactly this capability: proactive, policy-driven security enforcement that prevents misconfigurations before they can become vulnerabilities. This comprehensive guide will walk you through implementing Security Zones with extensive code examples, Terraform configurations, CLI commands, and interactive demonstrations.

What Are OCI Security Zones? A Technical Overview

Security Zones in OCI are compartment-level security boundaries that enforce predefined security policies. They act as a “security firewall” for your infrastructure-as-code deployments, automatically validating every resource creation request against established security rules.

┌─────────────────────────────────────────────────────────┐
│      OCI Tenancy                                        │
│  ┌─────────────────────────────────────────────────┐    │
│  │      Compartment                                │    │
│  │  ┌─────────────────────────────────────────┐    │    │
│  │  │      Security Zone                      │    │    │
│  │  │  ┌─────────────────────────────────┐    │    │    │
│  │  │  │      Security Recipe            │    │    │    │
│  │  │  │       • Network Rules           │    │    │    │
│  │  │  │       • Storage Rules           │    │    │    │
│  │  │  │       • Compute Rules           │    │    │    │
│  │  │  │       • IAM Rules               │    │    │    │
│  │  │  └─────────────────────────────────┘    │    │    │
│  │  └─────────────────────────────────────────┘    │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘

Setting Up Your Development Environment

Before we dive into code examples, let’s set up the necessary tools:

OCI CLI installed and configured: To install the OCI CLI, follow the official documentation: Installing the CLI

Cloud Guard enabled in your OCI compartment:

# Check the status of Cloud Guard (supply your own compartment OCID)
oci cloud-guard configuration get --compartment-id <compartment-ocid>

You can obtain the compartment ID using the CLI:

# Replace the compartment name before running the command
COMPARTMENT_ID=$(oci iam compartment list \
                  --name "ocilabs" \
                  --query "data[?contains(\"id\",'compartment')].id | [0]" \
                  --raw-output)

First, let’s see all the policies available in the Security Zone. We can do this using the OCI CLI:

oci cloud-guard security-policy-collection list-security-policies \
  --compartment-id $COMPARTMENT_ID \
  --query 'data.items[*].{category:category,name:"display-name"}' \
  --output table

With the prerequisites in place, we can start building the Security Zone, which is a compartment that enforces security policies to ensure compliance with best practices. First, find the policy that denies public subnets:

oci cloud-guard security-policy-collection list-security-policies --compartment-id $COMPARTMENT_ID --query "data.items[?contains(\"display-name\", 'public_subnets')]"

Then extend the query so it returns only the policy ID:

DENY_PUBLIC_SUBNET_POLICY_ID=$(oci cloud-guard security-policy-collection list-security-policies \
  --compartment-id $COMPARTMENT_ID \
  --query "data.items[?contains(\"display-name\", 'public_subnets')].id | [0]")

The last step is to create a Security Recipe that uses the policy we just found. A Security Recipe is a collection of security policies that define the security posture for resources created within a Security Zone.

oci cloud-guard security-recipe create \
  --compartment-id $COMPARTMENT_ID \
  --display-name "fromCLI" \
  --security-policies '['$DENY_PUBLIC_SUBNET_POLICY_ID']'
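
The --security-policies argument expects a JSON array. The command above works because DENY_PUBLIC_SUBNET_POLICY_ID was captured without --raw-output, so the value still carries its JSON quotes. If the IDs are stored as plain strings instead, the array has to be quoted explicitly. A minimal sketch of that, where to_json_array is a hypothetical helper and the OCIDs are placeholders:

```shell
#!/bin/sh
# Sketch: build the JSON array expected by --security-policies
# from plain (unquoted) policy OCIDs. The OCIDs are placeholders.
to_json_array() {
  json="["
  sep=""
  for id in "$@"; do
    json="${json}${sep}\"${id}\""
    sep=","
  done
  printf '%s]\n' "$json"
}

to_json_array "ocid1.example.policy.a" "ocid1.example.policy.b"
# prints ["ocid1.example.policy.a","ocid1.example.policy.b"]
```

The result can then be passed as --security-policies "$(to_json_array "$POLICY_ID")", which also makes it easy to add more policies to the recipe later.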

Now, I’ll try to create a public subnet in the VCN.

# Get the Virtual Cloud Network (VCN) ID
VCN_ID=$(oci network vcn list \
          --compartment-id $COMPARTMENT_ID \
          --query "data[?contains(\"id\",'vcn')].id | [0]" \
          --raw-output)

# Try to create a public subnet
oci network subnet create \
  --cidr-block "10.0.1.0/24" \
  --compartment-id $COMPARTMENT_ID \
  --vcn-id $VCN_ID

After that, the CLI returns an error message.

Perfect! The command fails, but that is exactly the behavior we expect. The Security Zone prevents the creation of a public subnet because it violates the security policies defined in the Security Recipe.

The same action in the web portal is rejected with an error message as well.

In summary, Security Zones in OCI are a powerful feature that helps enforce security best practices across your cloud environment. By defining Security Recipes, you can ensure that resources created within a Security Zone comply with your organization’s security policies.

Happy scripting!

Azure Functions development on macOS [English]

2025-07-16T23:51:48+00:00

As cloud development continues to evolve, more developers are embracing cross-platform solutions. While Azure Functions traditionally felt more at home in Windows environments, macOS has become a first-class citizen for serverless development. Whether you’re a Mac user diving into Azure or a Windows developer switching platforms, this guide will get you up and running with Azure Functions on macOS.

The beauty of serverless computing lies in its platform-agnostic nature. With Azure Functions, you can write code in PowerShell, Python, C#, Java, and JavaScript, and deploy it without worrying about the underlying infrastructure. But what about the development experience on macOS? Let’s explore how to set up a productive Azure Functions development environment on your Mac.

Prerequisites: Setting up your Mac

Before we dive into Azure Functions, we need to ensure our environment is properly configured. The good news is that Microsoft has invested heavily in cross-platform tooling, making the experience quite seamless.

First, let’s install the essential tools:

Azure CLI: The Azure CLI is your gateway to managing Azure resources from the command line. Install it using Homebrew:

brew install azure-cli

PowerShell: Yes, PowerShell runs natively on macOS, and has done so since 2018:

brew install --cask powershell

Azure Functions Core Tools: This toolkit provides the runtime and templates for creating, debugging, and deploying Azure Functions:

brew tap azure/functions
brew install azure-functions-core-tools@4

Visual Studio Code: While not mandatory, VS Code provides an excellent development experience for Azure Functions:

brew install --cask visual-studio-code

After installation, add the Azure Functions extension for VS Code to enhance your development workflow.

Creating your first PowerShell Azure Function on macOS

Now that we have our tools ready, let’s create our first Azure Function. We’ll use PowerShell as our runtime since it’s particularly powerful for automation and Azure management tasks.

First, let’s authenticate with Azure:

# Connect to Azure (requires the Az PowerShell module: Install-Module -Name Az -Scope CurrentUser)
Connect-AzAccount

# List available subscriptions
Get-AzSubscription

# Select your target subscription
Select-AzSubscription -SubscriptionId "your-subscription-id"

Create a new function app locally:

# Create a new directory for our function and move into it
mkdir PoShFunction && cd PoShFunction

# Initialize a new function app with PowerShell runtime
func init --worker-runtime powershell

This command creates the basic structure for a PowerShell-based function app, including the host.json, local.settings.json, and other configuration files.

Let’s create our first HTTP-triggered function:

func new --name HttpTriggerDemo --template "HTTP trigger"

This generates a new folder called HttpTriggerDemo with the function code. Let’s examine and modify the generated PowerShell script:

# HttpTriggerDemo/run.ps1
using namespace System.Net

# Input bindings are passed in via param block.
param($Request, $TriggerMetadata)

# Write to the Azure Functions log stream.
Write-Host "PowerShell HTTP trigger function processed a request on macOS."

# Interact with query parameters or the request body
$name = $Request.Query.Name
if (-not $name) {
    $name = $Request.Body.Name
}

$body = "Hello, $name! This Azure Function was developed on macOS and is powered by PowerShell."

# Associate values to output bindings by calling 'Push-OutputBinding'.
Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = [HttpStatusCode]::OK
    Body = $body
})

Testing and debugging locally

One of the great advantages of the Azure Functions Core Tools is the ability to run and test functions locally. This works seamlessly on macOS:

# Start the function runtime locally
func start

You’ll see output similar to this:

Azure Functions Core Tools
Core Tools Version:       4.0.5030 Commit hash: N/A  (64-bit)
Function Runtime Version: 4.21.3.20404

Functions:
        HttpTriggerDemo: [GET,POST] http://localhost:7071/api/HttpTriggerDemo

Test your function using curl or your browser:

curl "http://localhost:7071/api/HttpTriggerDemo?name=MacOS"
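
The function also falls back to the request body when no query parameter is supplied, so that path is worth testing too. A minimal sketch, assuming the local runtime from func start is still listening on port 7071; name_payload and post_name are hypothetical helper names:

```shell
#!/bin/sh
# Sketch: exercise the request-body path of HttpTriggerDemo.
# Assumes the local runtime started by `func start` listens on port 7071.
FUNCTION_URL="http://localhost:7071/api/HttpTriggerDemo"

# Build a minimal JSON payload.
# Assumes the name contains no quotes or backslashes.
name_payload() {
  printf '{"Name":"%s"}' "$1"
}

# POST the payload to the local function.
# Defined but not called here, so sourcing the file has no side effects.
post_name() {
  curl -s -X POST "$FUNCTION_URL" \
    -H "Content-Type: application/json" \
    -d "$(name_payload "$1")"
}
```

With the runtime running, post_name "MacOS" should return the same greeting as the query-string request above.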

Advanced PowerShell scenarios

Let’s create a more practical example - a function that manages Azure resources using PowerShell. This showcases the real power of combining PowerShell with Azure Functions on macOS:

Create a new timer-triggered function:

func new --name ResourceMonitor --template "Timer trigger"

Here’s a more advanced PowerShell function that monitors resource group usage:

# ResourceMonitor/run.ps1
# Input bindings are passed in via param block.
param($Timer)

# Get the current universal time in the default string format.
$currentUTCtime = (Get-Date).ToUniversalTime()

# Write an information log with the current time.
Write-Host "PowerShell timer trigger function started at: $currentUTCtime"

try {
    # Connect using Managed Identity (when deployed) or local credentials
    if ($env:MSI_ENDPOINT) {
        Connect-AzAccount -Identity
    } else {
        # For local development, use stored credentials
        Write-Host "Using local Azure credentials for development"
    }

    # Get all resource groups
    $resourceGroups = Get-AzResourceGroup
    
    $report = @()
    
    foreach ($rg in $resourceGroups) {
        # Get resources in each resource group
        $resources = Get-AzResource -ResourceGroupName $rg.ResourceGroupName
        
        $rgInfo = [PSCustomObject]@{
            ResourceGroupName = $rg.ResourceGroupName
            Location = $rg.Location
            ResourceCount = $resources.Count
            CreatedTime = $rg.Tags.CreatedTime
            LastChecked = $currentUTCtime
        }
        
        $report += $rgInfo
    }
    
    # Log the summary
    Write-Host "Resource Group Summary:"
    $report | ForEach-Object {
        Write-Host "  - $($_.ResourceGroupName): $($_.ResourceCount) resources in $($_.Location)"
    }
    
    # In a real scenario, you might want to:
    # - Send this data to Azure Monitor
    # - Store it in a database
    # - Send alerts for specific conditions
    
} catch {
    Write-Error "Error monitoring resources: $($_.Exception.Message)"
    throw
}

# Take a fresh timestamp so the log shows the actual completion time
Write-Host "PowerShell timer trigger function completed at: $((Get-Date).ToUniversalTime())"

Deployment from macOS

Deploying your Azure Function from macOS is straightforward. First, create the necessary Azure resources using PowerShell:

# Variables for our deployment
$resourceGroupName = "rg-functions-macos-demo"
$functionAppName = "func-macos-demo-$(Get-Random)"
$location = "East US"
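# Storage account names are limited to 24 lowercase alphanumeric characters, so cap the random suffix
$storageAccountName = "stamacosdemfunc$(Get-Random -Maximum 999999)"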

# Create resource group
New-AzResourceGroup -Name $resourceGroupName -Location $location

# Create storage account (required for Azure Functions)
$storageParams = @{
    ResourceGroupName = $resourceGroupName
    Name = $storageAccountName
    Location = $location
    SkuName = "Standard_LRS"
    Kind = "StorageV2"
}
New-AzStorageAccount @storageParams

# Create the function app
$functionParams = @{
    ResourceGroupName = $resourceGroupName
    Name = $functionAppName
    StorageAccountName = $storageAccountName
    Location = $location
    Runtime = "PowerShell"
    RuntimeVersion = "7.2"
    FunctionsVersion = "4"
}
New-AzFunctionApp @functionParams
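
Storage account names must be 3 to 24 characters long and may contain only lowercase letters and digits, so a generated name with a long random suffix can be rejected at creation time. A small sketch for checking a candidate name from the terminal before deploying; valid_storage_name is a hypothetical helper and the candidate names are examples only:

```shell
#!/bin/sh
# Sketch: check a candidate storage account name before deploying.
# Azure requires 3 to 24 characters, lowercase letters and digits only.
valid_storage_name() {
  printf '%s' "$1" | grep -Eq '^[a-z0-9]{3,24}$'
}

for candidate in "stamacosdemfunc12345" "stamacosdemfunc2147483647" "Sta-Demo"; do
  if valid_storage_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: invalid"
  fi
done
```

The first candidate passes; the second is 25 characters long and the third contains an uppercase letter and a hyphen, so both are reported as invalid.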

Deploy your function using the Azure Functions Core Tools:

# Deploy to Azure
func azure functionapp publish $functionAppName

Whether you’re automating infrastructure tasks, building APIs, or creating scheduled jobs, Azure Functions on macOS with PowerShell gives you the flexibility to work in your preferred environment while leveraging the power of Azure’s serverless platform.

Ready to start building? The tools are installed, the examples are tested, and Azure is waiting for your next serverless creation!

Happy scripting!