<?xml version="1.0" encoding="utf-8"?> 
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <generator uri="https://gohugo.io/" version="0.119.0">Hugo</generator><title type="html"><![CDATA[MD/Blog]]></title>
    
    
    
            <link href="https://www.ducea.com/" rel="alternate" type="text/html" title="html" />
            <link href="https://www.ducea.com/rss.xml" rel="alternate" type="application/rss+xml" title="rss" />
            <link href="https://www.ducea.com/atom.xml" rel="self" type="application/atom+xml" title="Atom" />
    <updated>2025-03-31T19:18:39-05:00</updated>
    
    
    
    
        <id>https://www.ducea.com/</id>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo Install and Use Karpenter in EKS]]></title>
            <link href="https://www.ducea.com/2025/01/17/howto-install-and-use-karpenter-in-eks/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/?utm_source=atom_feed" rel="related" type="text/html" title="Running Graviton2 workloads on EKS clusters with Karpenter" />
                <link href="https://www.ducea.com/2024/09/20/managing-aws-resources-in-kubernetes-with-ack/?utm_source=atom_feed" rel="related" type="text/html" title="Managing AWS Resources in Kubernetes with ACK" />
                <link href="https://www.ducea.com/2024/02/18/howto-migrate-your-eks-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate your EKS cluster to Graviton2" />
                <link href="https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a managed AWS Elasticsearch cluster to graviton2" />
                <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances" />
            
                <id>https://www.ducea.com/2025/01/17/howto-install-and-use-karpenter-in-eks/</id>
            
            
            <published>2025-01-17T10:20:55+00:00</published>
            <updated>2025-01-17T10:20:55+00:00</updated>
            
            
<content type="html"><![CDATA[<p><strong><a href="https://karpenter.sh/">Karpenter</a></strong> is an open-source Kubernetes node autoscaler created by AWS, designed to improve efficiency and reduce costs by dynamically provisioning and de-provisioning nodes based on workload demand. I have been using Karpenter since its early versions, when it was quite limited in features, but since the <a href="https://aws.amazon.com/blogs/containers/announcing-karpenter-1-0/">v1.0 release</a> I believe it has become the major player in this field. This article is aimed at folks who have traditionally used Cluster Autoscaler, and shows how to install Karpenter and perform basic operations with it.</p>
<h3 id="but-first-why-karpenter-over-cluster-autoscaler">But first, why Karpenter over Cluster Autoscaler?</h3>
<p><strong>Cluster Autoscaler</strong> (CA) works by scaling node groups based on pod scheduling failures, but it has a few limitations:</p>
<ul>
<li><strong>Tied to ASGs</strong>: CA relies on AWS Auto Scaling Groups (ASGs), making scaling decisions slower.</li>
<li><strong>Fixed node types</strong>: Nodes in CA are predefined, limiting flexibility.</li>
<li><strong>Inefficient resource allocation</strong>: CA does not dynamically optimize instance selection, potentially leading to wasted resources.</li>
</ul>
<h3 id="advantages-of-karpenter">Advantages of Karpenter:</h3>
<ul>
<li><strong>Direct EC2 instance provisioning</strong>: Karpenter does not require ASGs and provisions instances directly through EC2.</li>
<li><strong>Faster scaling</strong>: Unlike CA, Karpenter reacts immediately to pending pods, minimizing scheduling delays.</li>
<li><strong>Flexible instance selection</strong>: It chooses the most efficient EC2 instance type based on workload requirements.</li>
<li><strong>Automatic node termination</strong>: When nodes are idle, Karpenter deprovisions them automatically, reducing costs.</li>
<li><strong>Better spot instance utilization</strong>: Karpenter supports mixed instance types and Spot Instances more effectively.</li>
</ul>
<h3 id="how-karpenter-works">How Karpenter Works</h3>
<p>Karpenter listens for unscheduled pods and provisions the best-fitting compute capacity in real-time:</p>
<ol>
<li><em>Pod watch</em>: It monitors the Kubernetes API for unscheduled pods.</li>
<li><em>Instance selection</em>: Karpenter selects the most cost-effective instance type based on constraints and requirements.</li>
<li><em>Instance provisioning</em>: It launches the instance directly using the EC2 API.</li>
<li><em>Node registration</em>: The new node joins the cluster, and Karpenter binds pods to it.</li>
<li><em>Node de-provisioning</em>: When nodes become unnecessary, Karpenter automatically removes them.</li>
</ol>
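<p>To make the pod-watch step concrete, here is a hypothetical deployment whose CPU requests exceed the cluster&rsquo;s free capacity; its Pending pods are exactly what Karpenter reacts to (the name, replica count, and request sizes are placeholders, and the image is just a pause container that consumes no real CPU):</p>

```yaml
# Hypothetical workload: if no existing node has 1 vCPU free for each
# replica, the pods stay Pending and Karpenter provisions capacity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 5
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1"
```

<p>Scaling such a deployment up and down is also a handy way to watch Karpenter add and remove nodes once it is installed.</p>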
<p>Next, let&rsquo;s look at how to install and use Karpenter. The first decision is where to run its controller: in a dedicated node group or on Fargate (hopefully AWS will soon offer a managed option so we don&rsquo;t have to worry about this at all). I prefer the Fargate approach because it is simple and doesn&rsquo;t require a node group just for Karpenter.</p>
<h3 id="install-karpenter-on-aws-fargate">Install Karpenter on AWS Fargate</h3>
<p>Karpenter&rsquo;s controller can run on AWS Fargate, so it doesn&rsquo;t need worker nodes of its own. Follow these steps to set it up:</p>
<h4 id="create-a-fargate-profile-for-karpenter">Create a Fargate Profile for Karpenter</h4>
<p>Ensure that your EKS cluster has a Fargate profile for the karpenter namespace:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">aws eks create-fargate-profile --cluster-name my-cluster --fargate-profile-name karpenter-profile \
</span></span><span class="line"><span class="cl">  --pod-execution-role-arn arn:aws:iam::ACCOUNT_ID:role/AmazonEKSFargatePodExecutionRole \
</span></span><span class="line"><span class="cl">  --selectors namespace=karpenter
</span></span></code></pre></div><h4 id="create-a-custom-valuesyaml-file">Create a Custom values.yaml File</h4>
<p>Create a file named karpenter-values.yaml with the following content:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">serviceAccount</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">create</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">controller</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">clusterName</span><span class="p">:</span><span class="w"> </span><span class="l">my-cluster</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">aws</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">defaultInstanceProfile</span><span class="p">:</span><span class="w"> </span><span class="l">KarpenterInstanceProfile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">interruptionQueue</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter-interruption-queue</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">nodeSelector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">eks.amazonaws.com/fargate-profile</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter-profile</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">settings</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">consolidation</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">enabled</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">ttlSecondsAfterEmpty</span><span class="p">:</span><span class="w"> </span><span class="m">300</span><span class="w">
</span></span></span></code></pre></div><p>Replace my-cluster with your EKS cluster name and karpenter-profile with your Fargate profile name.</p>
<h4 id="install-karpenter-using-helm-with-the-custom-valuesyaml">Install Karpenter Using Helm with the Custom values.yaml</h4>
<p>Add the Karpenter Helm repository (note: this classic repository only serves older chart versions; recent releases, including v1.x, are published to the OCI registry at oci://public.ecr.aws/karpenter/karpenter instead):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">helm repo add karpenter https://charts.karpenter.sh
</span></span><span class="line"><span class="cl">helm repo update
</span></span></code></pre></div><h4 id="install-karpenter-using-the-custom-valuesyaml-file">Install Karpenter using the custom values.yaml file:</h4>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">helm install karpenter karpenter/karpenter --namespace karpenter --create-namespace -f karpenter-values.yaml
</span></span></code></pre></div><h3 id="customizing-bin-packing-behavior">Customizing Bin Packing Behavior</h3>
<p>There are many values that control how Karpenter behaves (see the chart&rsquo;s default values for the full list); here I only set the ones that enable consolidation and define how long an idle node should wait before being deleted. Customize them to fit your needs.</p>
<ul>
<li>settings.consolidation.enabled: Default is false. When set to true, Karpenter bin packs workloads by terminating underutilized nodes and rescheduling pods.</li>
<li>settings.ttlSecondsAfterEmpty: Default is 30. This defines how long an empty node should remain before being deprovisioned.</li>
<li>settings.limits.resources.cpu: No default limit. This should be set based on your cluster&rsquo;s resource constraints.</li>
</ul>
<p>Once installed, Karpenter starts working immediately: it watches Kubernetes events for unschedulable pods and provisions or removes nodes as needed.</p>
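<p>Note that Karpenter only launches nodes that match a provisioning configuration you define: a Provisioner plus AWSNodeTemplate in older releases, or a NodePool plus EC2NodeClass since the v1 APIs. The following is a minimal sketch using the v1 API; the cluster name, node role, and discovery tags are placeholders you must adapt to your setup:</p>

```yaml
# Minimal sketch of a v1 NodePool and its EC2NodeClass; all names,
# the role, and the tag selectors below are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"          # cap total provisioned CPU
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  role: KarpenterNodeRole-my-cluster
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

<p>Apply it with kubectl apply; once pending pods appear, Karpenter will launch instances that satisfy these requirements.</p>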
<h3 id="troubleshooting-karpenter">Troubleshooting Karpenter</h3>
<p>To diagnose scaling issues, work through the following checks.
Check the Karpenter controller logs:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
</span></span></code></pre></div><p>Inspect the provisioner status (Provisioners were replaced by NodePools in newer Karpenter APIs; use kubectl get nodepools there):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get provisioners -o wide
</span></span></code></pre></div><p>Check pending pods:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get pods --field-selector=status.phase=Pending -A
</span></span></code></pre></div><p>Inspect event logs for scheduling failures:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl describe pod &lt;POD_NAME&gt;
</span></span></code></pre></div><p>Check for insufficient capacity errors:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get events --sort-by=.metadata.creationTimestamp | grep -i karpenter
</span></span></code></pre></div><p>That&rsquo;s it. You have successfully installed and configured Karpenter in your Amazon EKS cluster. With Karpenter, your cluster will automatically scale nodes dynamically based on workload demand, improving cost efficiency and performance.</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                             
                                <category scheme="https://www.ducea.com/categories/k8s" term="k8s" label="K8s" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/k8s" term="k8s" label="k8s" />
                             
                                <category scheme="https://www.ducea.com/tags/karpenter" term="karpenter" label="karpenter" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Managing AWS Resources in Kubernetes with ACK]]></title>
            <link href="https://www.ducea.com/2024/09/20/managing-aws-resources-in-kubernetes-with-ack/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2024/02/18/howto-migrate-your-eks-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate your EKS cluster to Graviton2" />
                <link href="https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/?utm_source=atom_feed" rel="related" type="text/html" title="Running Graviton2 workloads on EKS clusters with Karpenter" />
                <link href="https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a managed AWS Elasticsearch cluster to graviton2" />
                <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances" />
                <link href="https://www.ducea.com/2009/08/26/amazon-introduces-virtual-private-cloud-amazon-vpc/?utm_source=atom_feed" rel="related" type="text/html" title="Amazon Introduces Virtual Private Cloud (Amazon VPC)" />
            
                <id>https://www.ducea.com/2024/09/20/managing-aws-resources-in-kubernetes-with-ack/</id>
            
            
            <published>2024-09-20T11:20:55+00:00</published>
            <updated>2024-09-20T11:20:55+00:00</updated>
            
            
<content type="html"><![CDATA[<p>Kubernetes apps often require a number of supporting resources such as databases, message queues, and object stores. AWS provides managed services for these, but provisioning them and integrating them with Kubernetes is usually complex and time-consuming. In the past I used Terraform for this, but that approach was disconnected from the application and required careful orchestration and ordering (run Terraform first, then deploy the app, then run Terraform again, and so on).</p>
<p><a href="https://github.com/aws-controllers-k8s/community"><strong>ACK (AWS Controllers for Kubernetes)</strong></a> is an open-source project from AWS that was launched a few years ago to solve exactly this problem. Besides ACK, there are a few other solutions in this space:</p>
<ul>
<li><strong>Crossplane</strong>: An open-source Kubernetes project that allows you to manage cloud infrastructure and services declaratively. Unlike ACK, Crossplane provides a more cloud-agnostic approach and supports multiple cloud providers.</li>
<li><strong>Kubernetes service catalog</strong>: This is a native Kubernetes extension that enables integration with cloud services using the Open Service Broker API. However, it requires cloud providers to implement a broker.</li>
<li><strong>Terraform Kubernetes Operator</strong>: This approach uses Terraform to manage cloud resources through Kubernetes. It provides more flexibility but requires Terraform state management.</li>
</ul>
<p>These are all good solutions, but let&rsquo;s see where <strong>ACK</strong> shines:</p>
<ul>
<li><em>Tightly integrated with AWS</em>: ACK is developed and maintained by AWS, ensuring compatibility and updates aligned with AWS services.</li>
<li><em>Declarative Kubernetes approach</em>: Resources are managed using Kubernetes manifests, aligning with Kubernetes-native workflows.</li>
<li><em>IAM permissions management</em>: Handles permissions automatically via IAM, making it secure and manageable.</li>
</ul>
<p>Next, we are going to look at how to install and use ACK in an existing Kubernetes cluster. We are going to use the Helm-based installation, so make sure you have Helm available if you don&rsquo;t already.</p>
<h2 id="install-ack-controllers">Install ACK Controllers</h2>
<p>ACK provides controllers for multiple AWS services. We can either install them one by one (only the ones we intend to use), or install all available controllers. Here are the basic steps:</p>
<p>Add the ACK Helm repository:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">helm repo add ack https://aws-controllers-k8s.github.io/helm-charts
</span></span><span class="line"><span class="cl">helm repo update
</span></span></code></pre></div><p>Install all available ACK controllers:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">helm install ack-all ack/ack-all --namespace ack-system --create-namespace
</span></span></code></pre></div><p>This will deploy controllers for all supported AWS services within the ack-system namespace.</p>
<p>If you want to install only specific controllers, you can replace ack-all with the name of the specific controller. For example, to install both the S3 and SQS controllers:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">helm install ack-s3 ack/s3-controller --namespace ack-system --create-namespace
</span></span><span class="line"><span class="cl">helm install ack-sqs ack/sqs-controller --namespace ack-system --create-namespace
</span></span></code></pre></div><p>ACK currently supports controllers for many AWS services:</p>
<ul>
<li>Amazon S3</li>
<li>Amazon SQS</li>
<li>Amazon SNS</li>
<li>Amazon RDS</li>
<li>Amazon DynamoDB</li>
<li>Amazon ElastiCache</li>
<li>etc.</li>
</ul>
<p>To see the full list of supported services, see: <a href="https://aws-controllers-k8s.github.io/community/docs/community/services/">https://aws-controllers-k8s.github.io/community/docs/community/services/</a></p>
<p>Next, let&rsquo;s verify the installation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get pods -n ack-system
</span></span></code></pre></div><p>and ensure all controllers you installed are up and running.</p>
<p>Great, let&rsquo;s use ACK to create some resources to see how easy it is to do that:</p>
<h3 id="deploy-an-sqs-queue">Deploy an SQS Queue</h3>
<p>Create a YAML manifest for the SQS queue:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">sqs.services.k8s.aws/v1alpha1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Queue</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">test-queue</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">queueName</span><span class="p">:</span><span class="w"> </span><span class="l">test-queue</span><span class="w">
</span></span></span></code></pre></div><p>Apply the configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl apply -f test-queue.yaml
</span></span></code></pre></div><p>Verify the queue creation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get queues.sqs.services.k8s.aws
</span></span></code></pre></div><p>You should see test-queue in the output.</p>
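<p>Your application will typically need the queue URL that AWS generated. ACK writes it back into the resource&rsquo;s status, and, assuming the FieldExport CRD that ships with the ACK controllers is installed, you can copy it into a ConfigMap for pods to consume (the names below are placeholders):</p>

```yaml
# Hypothetical FieldExport: copies the generated queue URL from the
# Queue resource's status into a ConfigMap named test-queue-config.
apiVersion: services.k8s.aws/v1alpha1
kind: FieldExport
metadata:
  name: test-queue-url
spec:
  from:
    path: ".status.queueURL"
    resource:
      group: sqs.services.k8s.aws
      kind: Queue
      name: test-queue
  to:
    name: test-queue-config
    kind: configmap
```

<p>Pods can then read the URL from the ConfigMap via an environment variable instead of having it hard-coded or injected by an external pipeline.</p>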
<h3 id="deploy-an-s3-bucket">Deploy an S3 Bucket</h3>
<p>Create a YAML manifest for the S3 bucket:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">s3.services.k8s.aws/v1alpha1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Bucket</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">test-bucket-unique-name</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">test-bucket-unique-name</span><span class="w">
</span></span></span></code></pre></div><p>Apply the configuration:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl apply -f test-bucket.yaml
</span></span></code></pre></div><p>Verify the bucket creation:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get buckets.s3.services.k8s.aws
</span></span></code></pre></div><p>You should see test-bucket-unique-name in the output.</p>
<p>And that&rsquo;s it. It is as simple as that.</p>
<p><em>Note: normally these definitions would live alongside your application&rsquo;s Helm chart (in the same repository) and be applied by the same pipeline that deploys the application (using Argo CD or similar).</em></p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                             
                                <category scheme="https://www.ducea.com/categories/k8s" term="k8s" label="K8s" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/k8s" term="k8s" label="k8s" />
                             
                                <category scheme="https://www.ducea.com/tags/terraform" term="terraform" label="terraform" />
                             
                                <category scheme="https://www.ducea.com/tags/ack" term="ack" label="ack" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo Migrate your EKS cluster to Graviton2]]></title>
            <link href="https://www.ducea.com/2024/02/18/howto-migrate-your-eks-cluster-to-graviton2/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/?utm_source=atom_feed" rel="related" type="text/html" title="Running Graviton2 workloads on EKS clusters with Karpenter" />
                <link href="https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a managed AWS Elasticsearch cluster to graviton2" />
                <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances" />
                <link href="https://www.ducea.com/2009/08/26/amazon-introduces-virtual-private-cloud-amazon-vpc/?utm_source=atom_feed" rel="related" type="text/html" title="Amazon Introduces Virtual Private Cloud (Amazon VPC)" />
                <link href="https://www.ducea.com/2009/08/17/using-instance-specific-metadata-in-eucalyptus/?utm_source=atom_feed" rel="related" type="text/html" title="Using instance-specific metadata in Eucalyptus" />
            
                <id>https://www.ducea.com/2024/02/18/howto-migrate-your-eks-cluster-to-graviton2/</id>
            
            
            <published>2024-02-18T10:52:55+00:00</published>
            <updated>2024-02-18T10:52:55+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>In this article, I&rsquo;m going to show you how you can migrate an existing <a href="https://aws.amazon.com/eks/">EKS</a> cluster to <a href="https://aws.amazon.com/ec2/graviton/">Graviton2</a> workers in a few simple steps. I&rsquo;m going to assume you already have an existing EKS cluster working on an Intel based infrastructure (Intel and/or AMD instance types).</p>
<h3 id="step-1-add-graviton2-worker-nodes-in-your-cluster">Step 1: add Graviton2 worker nodes in your cluster.</h3>
<p>This depends on how your EKS cluster is configured; you could have one of the following scenarios:</p>
<ul>
<li><strong>NodeGroup workers</strong>: this is a typical EKS install where you have worker nodes deployed in a nodegroup.</li>
<li><strong>No node groups</strong>: worker nodes are autoscaled by something like <strong>Karpenter</strong>, in which case we need to tell Karpenter to provision the Graviton2 nodes.</li>
</ul>
<h4 id="nodegoup-setup">NodeGroup setup</h4>
<p>In this case, you will have to add a new node group that supports Graviton2 instances; the exact method depends on how you created your existing node groups. As an example, here is how to do it with the open-source Terraform <a href="https://github.com/terraform-aws-modules/terraform-aws-eks">EKS module</a>; this is a snippet of our cluster definition:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl"><span class="k">module</span> <span class="s2">&#34;eks_cluster&#34;</span> {
</span></span><span class="line"><span class="cl"><span class="n">  source</span>  <span class="o">=</span> <span class="s2">&#34;terraform-aws-modules/eks/aws&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="n">  cluster_name</span>      <span class="o">=</span> <span class="s2">&#34;my-cluster&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">  cluster_version</span>   <span class="o">=</span> <span class="s2">&#34;1.28&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">  subnets</span>           <span class="o">=</span> <span class="p">[</span><span class="s2">&#34;subnet-xxxxxxx&#34;, &#34;subnet-yyyyyyy&#34;</span><span class="p">]</span> 
</span></span><span class="line"><span class="cl">  <span class="n">vpc_id</span>            <span class="o">=</span> <span class="s2">&#34;vpc-xxxxxxx&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">  manage_aws_auth</span>   <span class="o">=</span> <span class="kt">true</span>
</span></span><span class="line"><span class="cl"><span class="n">  node_groups</span> <span class="o">=</span> <span class="p">[</span>
</span></span><span class="line"><span class="cl">    {
</span></span><span class="line"><span class="cl"><span class="n">      desired_capacity</span> <span class="o">=</span> <span class="m">2</span>
</span></span><span class="line"><span class="cl"><span class="n">      max_capacity</span>     <span class="o">=</span> <span class="m">10</span>
</span></span><span class="line"><span class="cl"><span class="n">      min_capacity</span>     <span class="o">=</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl"><span class="n">      instance_type</span>    <span class="o">=</span> <span class="s2">&#34;m6i.xlarge&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">      name</span>             <span class="o">=</span> <span class="s2">&#34;intel-node-group&#34;</span>
</span></span><span class="line"><span class="cl">    }<span class="p">,</span>
</span></span><span class="line"><span class="cl"><span class="p">...</span>
</span></span><span class="line"><span class="cl">  <span class="p">]</span>
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>In this case, if we want to add a graviton2 nodegroup, we just have to define a new one similar to the intel one, but with graviton2 instances. For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-hcl" data-lang="hcl"><span class="line"><span class="cl">    {
</span></span><span class="line"><span class="cl"><span class="n">      desired_capacity</span> <span class="o">=</span> <span class="m">2</span>
</span></span><span class="line"><span class="cl"><span class="n">      max_capacity</span>     <span class="o">=</span> <span class="m">10</span>
</span></span><span class="line"><span class="cl"><span class="n">      min_capacity</span>     <span class="o">=</span> <span class="m">1</span>
</span></span><span class="line"><span class="cl"><span class="n">      instance_type</span>    <span class="o">=</span> <span class="s2">&#34;m6g.xlarge&#34;</span>
</span></span><span class="line"><span class="cl"><span class="n">      name</span>             <span class="o">=</span> <span class="s2">&#34;arm64-node-group&#34;</span>
</span></span><span class="line"><span class="cl">    }<span class="p">,</span>
</span></span></code></pre></div><p>and re-apply the Terraform for the cluster definition.</p>
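<p>Rolling this out is the standard Terraform workflow; a sketch, assuming the cluster module lives in the current directory:</p>

```shell
# Preview the change first; the plan should only show the new
# arm64 node group being added, with nothing destroyed
terraform plan

# Apply to create the graviton2 node group
terraform apply
```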
<h4 id="karpenter">Karpenter</h4>
<p>If your EKS cluster uses Karpenter to manage worker nodes, then it will not use nodegroups, and we will need to tell Karpenter to add graviton2 nodes to the cluster instead. I&rsquo;ve written a dedicated article about this use case that you can follow <a href="https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/">here</a>.</p>
<h3 id="step-2-convert-your-workloads-to-graviton2">Step 2: Convert your workloads to graviton2.</h3>
<p>Now that we have graviton2 worker nodes in the cluster, we can convert all our workloads to arm64. There are two categories of workloads we need to handle:</p>
<ul>
<li><strong>daemonsets</strong>: special k8s workloads that run on every single node. These are normally monitoring or logging agents, security agents, and other global infrastructure software that we usually don&rsquo;t write ourselves, but install from open source projects or commercial vendors.</li>
<li><strong>custom services</strong>: our own software that we have built and run in our clusters.</li>
</ul>
<h4 id="daemonsets">Daemonsets</h4>
<p>If any of your daemonsets doesn&rsquo;t have support for arm64, you will notice it immediately after you add your first graviton2 worker to the cluster, as its pods will fail to run there. Most of the time we install these with something like <strong>helm charts</strong> from an open source repository or vendor.</p>
<p>For this to work, all the containers used by the daemonset must be available for both intel and arm64. You can check this on the image source in Docker Hub, Amazon ECR, etc. If there is no arm64 support, we will probably need to work with the vendor or the open source maintainers to have them build multi-arch containers and make them available.</p>
<p>Assuming the container registry supports multi-arch and the arm64 image is available, EKS nodes automatically pull the image matching the architecture they are running on, without any special configuration. <em>Very cool!</em> This is the happy path: upstream already has support, or we can get it added upstream, and we don&rsquo;t have to change anything on our end.</p>
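<p>One quick way to check whether an image is already multi-arch is to inspect its manifest list; a sketch, using a hypothetical <code>project/agent</code> image:</p>

```shell
# List the platforms published for an image; a multi-arch image
# shows one manifest entry per architecture (amd64, arm64, ...)
docker manifest inspect docker.io/project/agent:latest \
  | grep '"architecture"'
```

<p>If <code>arm64</code> shows up in the output, no changes are needed on our end.</p>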
<p>Unfortunately, this doesn&rsquo;t always work. If that is the case, you will have to build the multi-arch containers yourself. Hopefully the project publishes its Dockerfile, in which case it is not too difficult to rebuild the image for multi-arch with something like <a href="https://github.com/docker/buildx">buildx</a>. Once we have that, we can push it to an internal repo and use it in the helm chart instead of the official image. Most helm charts allow us to replace the image in use with our own custom image; look for that in the <code>values.yaml</code>. This would look something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">image</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># -- image registry</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">registry</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;docker.io&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># -- Image repository.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">repository</span><span class="p">:</span><span class="w"> </span><span class="l">project/agent</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># -- image tag: latest or defined.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tag</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># -- image pull policy.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">pullPolicy</span><span class="p">:</span><span class="w"> </span><span class="l">IfNotPresent</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="c"># -- Optional set of image pull secrets.</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">pullSecrets</span><span class="p">:</span><span class="w"> </span><span class="p">[]</span><span class="w">
</span></span></span></code></pre></div><p>Here we would replace the registry with our custom one (maybe a private ECR) and the repository with our own project.</p>
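<p>If we end up building the image ourselves, a minimal <code>buildx</code> invocation might look like the following; the registry address and image name are placeholders for your own private repo:</p>

```shell
# One-time setup: create a builder capable of multi-platform builds
docker buildx create --name multiarch --use

# Build for both architectures and push a single multi-arch
# manifest to the private registry in one step
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t 123456789012.dkr.ecr.us-east-1.amazonaws.com/project/agent:latest \
  --push .
```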
<h4 id="custom-software">Custom software</h4>
<p>The rest of the deployments running in the EKS cluster should be our own custom software, which we fully control and can compile and build for multi-arch. While we are in the transition phase and don&rsquo;t yet have arm64 images for some deployments, we need to make sure we use a node selector to keep those on intel nodes; otherwise they will fail. We can do this with something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w">      </span><span class="nt">nodeSelector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">intent</span><span class="p">:</span><span class="w"> </span><span class="l">apps</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">kubernetes.io/arch</span><span class="p">:</span><span class="w"> </span><span class="l">amd64</span><span class="w">
</span></span></span></code></pre></div><p>inside the definition of the service. This will force it to run on the original intel worker nodes.</p>
<p>Once we have built the arm64 image we can just remove that selector and allow it to be deployed on any node type, or we can change it to <code>kubernetes.io/arch: arm64</code> and force it on graviton2 nodes.</p>
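<p>A simple way to verify where things actually landed is to list the nodes with their architecture label and check pod placement; the <code>app=my-service</code> label below is a placeholder for your own service:</p>

```shell
# Show each node's CPU architecture
kubectl get nodes -L kubernetes.io/arch

# Show which node (and therefore which architecture) each pod runs on
kubectl get pods -l app=my-service -o wide
```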
<p>Depending on how many services you have deployed in your cluster and how unique they are, this might be a tedious process until you can switch all of those to graviton2 nodes, but this should be worth it, bringing great performance and cost savings along the way.</p>
<p>Once you have migrated all your workloads to graviton2 you can just retire the intel nodegroups and after that, your EKS cluster will be running fully on graviton2. Boom!</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                             
                                <category scheme="https://www.ducea.com/categories/k8s" term="k8s" label="K8s" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/k8s" term="k8s" label="k8s" />
                             
                                <category scheme="https://www.ducea.com/tags/graviton2" term="graviton2" label="graviton2" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Running Graviton2 workloads on EKS clusters with Karpenter]]></title>
            <link href="https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a managed AWS Elasticsearch cluster to graviton2" />
                <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances" />
                <link href="https://www.ducea.com/2009/08/26/amazon-introduces-virtual-private-cloud-amazon-vpc/?utm_source=atom_feed" rel="related" type="text/html" title="Amazon Introduces Virtual Private Cloud (Amazon VPC)" />
                <link href="https://www.ducea.com/2009/08/17/using-instance-specific-metadata-in-eucalyptus/?utm_source=atom_feed" rel="related" type="text/html" title="Using instance-specific metadata in Eucalyptus" />
                <link href="https://www.ducea.com/2009/08/12/running-s3sync-in-parallel/?utm_source=atom_feed" rel="related" type="text/html" title="Running s3sync in parallel" />
            
                <id>https://www.ducea.com/2023/12/10/running-graviton2-workloads-on-eks-clusters-with-karpenter/</id>
            
            
            <published>2023-12-10T10:52:55+00:00</published>
            <updated>2023-12-10T10:52:55+00:00</updated>
            
            
            <content type="html"><![CDATA[<p><strong>Amazon Elastic Kubernetes Service</strong> (<a href="https://aws.amazon.com/eks/">EKS</a>) provides a managed Kubernetes service, allowing users to deploy, manage, and scale containerized applications using Kubernetes on AWS. With the introduction of <strong>Graviton2</strong> processors, AWS offers enhanced performance and cost savings.</p>
<p><strong>Karpenter</strong> is an open-source node lifecycle management project built for Kubernetes that was created by AWS as an alternative to the cluster autoscaler project.</p>
<p>In this article, we are going to look at the steps needed to run graviton2 (arm64) based workloads in an EKS cluster that is managed with Karpenter. I&rsquo;m going to assume you have a running EKS cluster with Karpenter properly configured; if you need help setting up a new cluster with Karpenter, follow the documentation at the <a href="https://karpenter.sh/">official site</a>.</p>
<h3 id="nodepool">NodePool</h3>
<p>The <a href="https://karpenter.sh/docs/concepts/nodepools/">NodePool</a> sets constraints on the nodes that can be created by Karpenter and the pods that can run on those nodes. The NodePool configures things like:</p>
<ul>
<li>Define taints to limit the pods that can run on nodes Karpenter creates</li>
<li>Define any startup taints to inform Karpenter that it should taint the node initially, but that the taint is temporary.</li>
<li>Limit node creation to certain zones, instance types, and <em>computer architectures</em> (like arm64 or amd64)</li>
</ul>
<p>You can get the active karpenter nodepools in your cluster with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl describe nodepool
</span></span></code></pre></div><p>Let&rsquo;s say that in our case this is driven by a configuration that looks like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter.sh/v1beta1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">NodePool</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">default </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">  
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">template</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">intent</span><span class="p">:</span><span class="w"> </span><span class="l">apps</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">requirements</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">kubernetes.io/arch</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;amd64&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;karpenter.k8s.aws/instance-cpu&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;4&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;8&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;16&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;32&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter.sh/capacity-type</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;spot&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;on-demand&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter.k8s.aws/instance-category</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&#34;c&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;m&#34;</span><span class="p">,</span><span class="w"> </span><span class="s2">&#34;r&#34;</span><span class="p">]</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">nodeClassRef</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">default</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">kubelet</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">containerRuntime</span><span class="p">:</span><span class="w"> </span><span class="l">containerd</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">systemReserved</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">100m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">100Mi</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">disruption</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">consolidationPolicy</span><span class="p">:</span><span class="w"> </span><span class="l">WhenUnderutilized</span><span class="w">
</span></span></span></code></pre></div><p>Here we can see that this nodepool only allows amd64 (intel or amd) instances. If we want to support graviton2 (arm64) instances, we need to either update this definition or create a new, separate nodepool. Let&rsquo;s just add support to the existing one by updating this key in the requirements:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">        - key: kubernetes.io/arch
</span></span><span class="line"><span class="cl">          operator: In
</span></span><span class="line"><span class="cl">          values: [&#34;amd64&#34;, &#34;arm64&#34;]
</span></span></code></pre></div><p>and then re-apply it with <code>kubectl apply</code>. Now our nodepool supports both intel and graviton2 instance types.</p>
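<p>We can confirm the nodepool now allows both architectures by reading back the applied requirements:</p>

```shell
# Print the arch requirement of the default nodepool; it should
# now list both amd64 and arm64
kubectl get nodepool default \
  -o jsonpath='{.spec.template.spec.requirements[?(@.key=="kubernetes.io/arch")].values}'
```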
<h3 id="nodeclass">NodeClass</h3>
<p>Another important concept for Karpenter is the <a href="https://karpenter.sh/docs/concepts/nodeclasses/">EC2NodeClass</a>. Node Classes enable configuration of AWS-specific settings. Each NodePool must reference an EC2NodeClass using <code>spec.template.spec.nodeClassRef</code>. Here we configure things like subnets, security groups, and what AMIs to use for the instances.</p>
<p>The configuration for this might look something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">karpenter.k8s.aws/v1beta1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">EC2NodeClass</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">default</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">role</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;${local.node_iam_role_name}&#34;</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">amiFamily</span><span class="p">:</span><span class="w"> </span><span class="l">AL2 </span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">securityGroupSelectorTerms</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">tags</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">karpenter.sh/discovery</span><span class="p">:</span><span class="w"> </span><span class="l">${local.name}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">subnetSelectorTerms</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span>- <span class="nt">tags</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">karpenter.sh/discovery</span><span class="p">:</span><span class="w"> </span><span class="l">${local.name}</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">tags</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">IntentLabel</span><span class="p">:</span><span class="w"> </span><span class="l">apps</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">KarpenterNodePoolName</span><span class="p">:</span><span class="w"> </span><span class="l">default</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">NodeType</span><span class="p">:</span><span class="w"> </span><span class="l">default</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">intent</span><span class="p">:</span><span class="w"> </span><span class="l">apps</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">karpenter.sh/discovery</span><span class="p">:</span><span class="w"> </span><span class="l">${local.name}</span><span class="w">
</span></span></span></code></pre></div><p>Where <code>${local.name}</code> is the name of the cluster and <code>${local.node_iam_role_name}</code> is the name of the IAM role used for the ec2 instances. A configuration like this, where we don&rsquo;t define any AMIs and only set <code>amiFamily: AL2</code> (Amazon Linux 2), will automatically detect and use the latest AMI for each architecture available in our nodepool; so we would not have to change anything in this case!!! ;)</p>
<p>You can see the compiled form with the actual AMIs using:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl describe ec2nodeclass
</span></span></code></pre></div><p>Still, in some cases, folks will prefer to control this and manually define AMIs, like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">status</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">amis</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l">ami-01234567890123456</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">custom-ami-amd64</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">requirements</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">kubernetes.io/arch</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="l">amd64</span><span class="w">
</span></span></span></code></pre></div><p>and if that is the case, we need to make sure we have a similar definition for a valid arm64 AMI:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="w">    </span>- <span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l">ami-01234567890123456</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">custom-ami-arm64</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">requirements</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span>- <span class="nt">key</span><span class="p">:</span><span class="w"> </span><span class="l">kubernetes.io/arch</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">operator</span><span class="p">:</span><span class="w"> </span><span class="l">In</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">values</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span>- <span class="l">arm64</span><span class="w">
</span></span></span></code></pre></div><p>That&rsquo;s it; it is as simple as that: we now have a nodepool that supports arm64 instances and a nodeclass that defines a proper AMI for them.</p>
<p>You can test this with a simple deployment like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yaml" data-lang="yaml"><span class="line"><span class="cl"><span class="nt">apiVersion</span><span class="p">:</span><span class="w"> </span><span class="l">apps/v1</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">kind</span><span class="p">:</span><span class="w"> </span><span class="l">Deployment</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">workload-graviton</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">replicas</span><span class="p">:</span><span class="w"> </span><span class="m">5</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">selector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">matchLabels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">workload-graviton</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">  </span><span class="nt">template</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">metadata</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">labels</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">app</span><span class="p">:</span><span class="w"> </span><span class="l">workload-graviton</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">    </span><span class="nt">spec</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">nodeSelector</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">intent</span><span class="p">:</span><span class="w"> </span><span class="l">apps</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">kubernetes.io/arch</span><span class="p">:</span><span class="w"> </span><span class="l">arm64</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span><span class="nt">containers</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">      </span>- <span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l">graviton2</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l">public.ecr.aws/eks-distro/kubernetes/pause:3.7</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">imagePullPolicy</span><span class="p">:</span><span class="w"> </span><span class="l">Always</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">        </span><span class="nt">resources</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">          </span><span class="nt">requests</span><span class="p">:</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">cpu</span><span class="p">:</span><span class="w"> </span><span class="l">512m</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w">            </span><span class="nt">memory</span><span class="p">:</span><span class="w"> </span><span class="l">512Mi </span><span class="w">
</span></span></span></code></pre></div><p>and apply it with kubectl:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl apply -f workload-graviton.yaml
</span></span></code></pre></div><p>Give it a couple of minutes and you can see the new node in the cluster:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl get nodes -L karpenter.sh/capacity-type,beta.kubernetes.io/instance-type,karpenter.sh/nodepool,topology.kubernetes.io/zone -l karpenter.sh/initialized=true
</span></span></code></pre></div><p>The output will look something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">NAME                          STATUS   ROLES    AGE   VERSION               CAPACITY-TYPE   INSTANCE-TYPE   NODEPOOL   ZONE
</span></span><span class="line"><span class="cl">ip-10-0-62-224.ec2.internal   Ready    &lt;none&gt;   60s   v1.28.5-eks-5e0fdde   spot            c6g.xlarge      default    us-east-1a
</span></span><span class="line"><span class="cl">ip-10-0-79-148.ec2.internal   Ready    &lt;none&gt;   87m   v1.28.5-eks-5e0fdde   on-demand       c6g.xlarge      default    us-east-1b
</span></span></code></pre></div><p>You can also check the karpenter logs with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">kubectl -n karpenter logs -l app.kubernetes.io/name=karpenter --all-containers=true -f
</span></span></code></pre></div><p><em>Note: if you played along to test this, please don&rsquo;t forget to clean up and delete the resources you no longer need.</em></p>
<p>Finally, I want to point out that because karpenter automatically chooses the most cost-effective instances for your configuration (on-demand vs spot, or graviton2 vs intel), your fleet may tilt towards graviton2 over time. You can still force a deployment onto amd64 instances (for example, if you don&rsquo;t have arm64 images available) using the kubernetes.io/arch node selector.</p>
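<p>As a sketch of that last point, pinning the example deployment above to x86 nodes only takes one label change (the <code>intent: apps</code> label is carried over from the earlier manifest):</p>

```yaml
spec:
  template:
    spec:
      nodeSelector:
        intent: apps
        # request x86 nodes explicitly instead of letting Karpenter pick arm64
        kubernetes.io/arch: amd64
```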
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                             
                                <category scheme="https://www.ducea.com/categories/k8s" term="k8s" label="K8s" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/k8s" term="k8s" label="k8s" />
                             
                                <category scheme="https://www.ducea.com/tags/graviton2" term="graviton2" label="graviton2" />
                             
                                <category scheme="https://www.ducea.com/tags/karpenter" term="karpenter" label="karpenter" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo Migrate a managed AWS Elasticsearch cluster to graviton2]]></title>
            <link href="https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances" />
                <link href="https://www.ducea.com/2009/08/26/amazon-introduces-virtual-private-cloud-amazon-vpc/?utm_source=atom_feed" rel="related" type="text/html" title="Amazon Introduces Virtual Private Cloud (Amazon VPC)" />
                <link href="https://www.ducea.com/2009/08/17/using-instance-specific-metadata-in-eucalyptus/?utm_source=atom_feed" rel="related" type="text/html" title="Using instance-specific metadata in Eucalyptus" />
                <link href="https://www.ducea.com/2009/08/12/running-s3sync-in-parallel/?utm_source=atom_feed" rel="related" type="text/html" title="Running s3sync in parallel" />
                <link href="https://www.ducea.com/2009/06/01/howto-update-dns-hostnames-automatically-for-your-amazon-ec2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo update DNS hostnames automatically for your Amazon EC2 instances" />
            
                <id>https://www.ducea.com/2023/01/14/howto-migrate-a-managed-aws-elasticsearch-cluster-to-graviton2/</id>
            
            
            <published>2023-01-14T10:52:55+00:00</published>
            <updated>2023-01-14T10:52:55+00:00</updated>
            
            
<content type="html"><![CDATA[<p>Amazon Web Services (AWS) offers managed Amazon Elasticsearch clusters (or OpenSearch as they call their fork of the open source Elasticsearch) for deploying and managing search applications. Lately, I&rsquo;ve been managing several self-managed clusters and have shown how I migrated those to <a href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/">graviton2</a>; for newer clusters, I&rsquo;ve started exploring the AWS managed solution for Elasticsearch. This has several advantages:</p>
<ul>
<li>Simplified management: With Amazon OpenSearch, AWS takes care of the infrastructure management tasks, such as deploying and configuring the Elasticsearch cluster, patching the underlying operating system, and handling backups and disaster recovery. This allows you to focus on using Elasticsearch to build your search applications, rather than worrying about the underlying infrastructure.</li>
<li>Scalability: Amazon OpenSearch makes it easy to scale your Elasticsearch cluster up or down to meet changing demands. You can add or remove nodes to the cluster with just a few clicks, or use features like auto-scaling to automatically adjust the size of the cluster based on workload patterns.</li>
<li>Availability: Amazon OpenSearch is designed for high availability, with built-in replication and failover capabilities. This helps ensure that your search application remains available even in the face of hardware failures or other issues.</li>
<li>Security: Amazon OpenSearch provides a range of security features to help protect your data, including encryption at rest and in transit, access controls, and integration with AWS Identity and Access Management (IAM) for fine-grained access control.</li>
<li>Integration with other AWS services: Amazon OpenSearch integrates with other AWS services, such as Amazon CloudWatch for monitoring, AWS CloudTrail for auditing, and AWS Identity and Access Management (IAM) for access control. This makes it easy to build end-to-end search applications on AWS.</li>
<li>Cost optimization: Using Amazon OpenSearch can help reduce costs by eliminating the need to manage and maintain your own Elasticsearch cluster infrastructure. Additionally, AWS offers Graviton2-based instances that are optimized for running Amazon OpenSearch and provide better performance and cost efficiency compared to traditional x86-based instances.</li>
</ul>
<p>In this post, I&rsquo;ll walk you through the process of upgrading an existing managed Amazon Elasticsearch cluster to Graviton2. We will use the same cluster to perform the upgrade, which means we will upgrade the cluster in place, without creating a new cluster.</p>
<h3 id="step-1-change-the-instance-types-to-graviton2">Step 1: Change the instance types to Graviton2</h3>
<p>This migration is much easier than the self-managed one; it requires only one step ;). We first need to figure out the graviton2 instance type we want to use for the migration. As with regular EC2 instances, AWS provides a wide range of instance types for Elasticsearch. The main ones are the &ldquo;m&rdquo; instances, which are general-purpose, and the &ldquo;r&rdquo; instances, which are memory-optimized. The size suffix (e.g. &ldquo;2xlarge&rdquo; or &ldquo;16xlarge&rdquo;) scales the number of vCPUs and the amount of memory available on the instance.</p>
<p>The available Amazon OpenSearch-optimized Graviton2 instances are:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">m6g.medium.elasticsearch
</span></span><span class="line"><span class="cl">m6g.large.elasticsearch
</span></span><span class="line"><span class="cl">m6g.xlarge.elasticsearch
</span></span><span class="line"><span class="cl">m6g.2xlarge.elasticsearch
</span></span><span class="line"><span class="cl">m6g.4xlarge.elasticsearch
</span></span><span class="line"><span class="cl">m6g.8xlarge.elasticsearch
</span></span><span class="line"><span class="cl">m6g.12xlarge.elasticsearch
</span></span><span class="line"><span class="cl">m6g.16xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.large.elasticsearch
</span></span><span class="line"><span class="cl">r6g.xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.2xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.4xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.8xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.12xlarge.elasticsearch
</span></span><span class="line"><span class="cl">r6g.16xlarge.elasticsearch
</span></span></code></pre></div><p>Note: we also need to make sure we run a version of the managed AWS Elasticsearch/OpenSearch service that supports graviton2 instances. For the older Elasticsearch engine, anything newer than 7.8 should work; if you are on the OpenSearch engine, any version works, as graviton2 support has been available since version 1.0.0. If you are running an older version, you will first need to upgrade to a supported version before moving forward.</p>
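<p>To double-check the engine version of an existing domain before making any changes, a quick aws cli query might look like this (assuming the classic &ldquo;es&rdquo; API; the newer &ldquo;opensearch&rdquo; subcommands have equivalent calls):</p>

```shell
# Print the current engine version of the domain (replace <domain-name>).
aws es describe-elasticsearch-domain --domain-name <domain-name> \
  --query 'DomainStatus.ElasticsearchVersion' --output text
```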
<p>The actual migration only requires us to change the instance type. This can be done in the AWS console, with the AWS cli, or with a tool like terraform. Since I use terraform to manage all my cloud assets, I will show how this is done there; it would look something like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">resource &#34;aws_opensearch_domain&#34; &#34;elasticsearch_domain&#34; {
</span></span><span class="line"><span class="cl">  domain_name           = &#34;search-domain&#34;
</span></span><span class="line"><span class="cl">  elasticsearch_version = &#34;7.10&#34;
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  cluster_config {
</span></span><span class="line"><span class="cl">    instance_type           = &#34;m6g.large.elasticsearch&#34; # this replaces the previous m5 type of instance we had
</span></span><span class="line"><span class="cl">    instance_count          = 3
</span></span><span class="line"><span class="cl">    dedicated_master_enabled = true
</span></span><span class="line"><span class="cl">    dedicated_master_count   = 3
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  ebs_options {
</span></span><span class="line"><span class="cl">    ebs_enabled = true
</span></span><span class="line"><span class="cl">    volume_type = &#34;gp3&#34;
</span></span><span class="line"><span class="cl">    volume_size = 1000
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">... # other elasticsearch cluster configs
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>I also want to point out that we are now able to use gp3 for the EBS volumes, which allows much better performance and a larger size per data node. This is a great optimization that can make the cluster much faster and reduce the need for extra data nodes (we were able to cut our node count in half from this combination: graviton2 for better performance and gp3 for higher storage capacity per node).</p>
<p>Once you run <code>terraform apply</code> with the new instance type, this kicks off the automatic blue-green deployment from AWS managed Elasticsearch, which spins up a new set of nodes and migrates the data to them; once this is done, the original nodes are automatically removed. Depending on the size of the data in your cluster this might take a long time, and terraform might time out (60m by default). If this happens, you can use the AWS console or cli to monitor the status of the migration.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">aws es describe-elasticsearch-domain --domain-name &lt;domain-name&gt; --query &#39;DomainStatus.Processing&#39;
</span></span></code></pre></div><p>returns <code>true</code> while the configuration change is still in progress and <code>false</code> once it has completed. You can also check the health of the cluster after the upgrade:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">curl -s &#39;https://&lt;domain-endpoint&gt;/_cluster/health?pretty&#39;
</span></span></code></pre></div><p>This command will return the current health status of the Elasticsearch cluster. If the upgrade has completed successfully, the cluster should have a green health status. If there are any issues with the upgrade, the cluster may have a yellow or red health status, indicating that there are problems that need to be addressed.</p>
<p>Note: theoretically there should be no downtime during the process, but the performance might be slightly impacted during the blue-green migration.</p>
<p>As you can see, there is a huge advantage in performing such a migration with a managed service, compared to the self-managed setup where we had to handle everything ourselves.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Upgrading a managed Amazon Elasticsearch cluster to Graviton2 is a straightforward process that can provide significant benefits. By upgrading to Graviton2 instances, you can improve performance, reduce costs, and increase the efficiency of your infrastructure. AWS offers several Graviton2 instance types optimized for Amazon OpenSearch, each with its own set of advantages.</p>
<p>In this post, I have walked you through the process of upgrading an existing managed Amazon OpenSearch cluster to Graviton2 instances. We have used the same cluster to perform the upgrade, which means we have upgraded the cluster in place, without creating a new cluster. I have also provided examples and command-line steps to help you through the process.</p>
<p>Overall, upgrading your managed Amazon OpenSearch cluster to Graviton2 instances is a great way to take advantage of the latest technology and improve the performance and cost efficiency of your search application.</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/elasticsearch" term="elasticsearch" label="elasticsearch" />
                             
                                <category scheme="https://www.ducea.com/tags/graviton2" term="graviton2" label="graviton2" />
                             
                                <category scheme="https://www.ducea.com/tags/terraform" term="terraform" label="terraform" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo Migrate a self managed Elasticsearch cluster to graviton2 instances]]></title>
            <link href="https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2009/08/26/amazon-introduces-virtual-private-cloud-amazon-vpc/?utm_source=atom_feed" rel="related" type="text/html" title="Amazon Introduces Virtual Private Cloud (Amazon VPC)" />
                <link href="https://www.ducea.com/2009/08/17/using-instance-specific-metadata-in-eucalyptus/?utm_source=atom_feed" rel="related" type="text/html" title="Using instance-specific metadata in Eucalyptus" />
                <link href="https://www.ducea.com/2009/08/12/running-s3sync-in-parallel/?utm_source=atom_feed" rel="related" type="text/html" title="Running s3sync in parallel" />
                <link href="https://www.ducea.com/2009/06/01/howto-update-dns-hostnames-automatically-for-your-amazon-ec2-instances/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo update DNS hostnames automatically for your Amazon EC2 instances" />
                <link href="https://www.ducea.com/2009/05/13/cloud-slam-09-the-1st-virtual-conference-on-cloud-computing/?utm_source=atom_feed" rel="related" type="text/html" title="Cloud Slam 09 - the 1st Virtual Conference on Cloud Computing" />
            
                <id>https://www.ducea.com/2022/11/12/howto-migrate-a-self-managed-elasticsearch-cluster-to-graviton2-instances/</id>
            
            
            <published>2022-11-12T10:52:55+00:00</published>
            <updated>2022-11-12T10:52:55+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>Elasticsearch is an open-source search engine that enables you to store, search, and analyze big data in real time. It is a distributed and scalable search engine that can be used to index and search large volumes of data across multiple nodes. I&rsquo;m currently managing several Elasticsearch clusters running on AWS EC2 instances. AWS offers EC2 instances powered by Graviton2 processors (their custom arm processors) that offer significant performance and cost benefits compared to traditional x86 instances (up to 40% based on AWS benchmarks, with 20% from pure cost savings and 20% from performance improvements compared to similar intel processors). In this blog post, I&rsquo;ll walk you through the process of how we migrated our Elasticsearch clusters to run on Graviton EC2 instances.</p>
<p>The first Elasticsearch version that added support for ARM processors was Elasticsearch 7.8. This version introduced official support for ARM64 architecture and was released on May 26, 2020. Before this release, Elasticsearch was only officially supported on x86-based platforms. So in our case, this required us to migrate to a supported version first. We were running an older version in the stable branch 7.x and we upgraded to 7.17 using the standard Elasticsearch rolling upgrade <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.17/rolling-upgrades.html">docs</a>.</p>
<p>Here are the steps needed for this migration:</p>
<h3 id="step-1-create-a-new-graviton2-based-ec2-instances">Step 1: Create new Graviton2-based EC2 instances</h3>
<p>The first step in the migration process is to create new Graviton2-based EC2 instances. You can do this using the AWS Management Console or the AWS CLI, or even better use terraform as I do. Various Linux distributions run on ARM, but I have chosen an Amazon Linux 2 AMI because it is very well supported by AWS. In the AWS console, set the &ldquo;Architecture&rdquo; filter to &ldquo;arm64&rdquo; to find the latest Amazon Linux 2 AMI, or use a simple aws cli command like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">aws ssm get-parameters --names /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2 --region us-east-1
</span></span></code></pre></div><p>This will return the Graviton2 AMI for the specific region we are using. We would use this in our terraform code to create the new Graviton2 instances; for ex:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"># Elasticsearch nodes
</span></span><span class="line"><span class="cl">resource &#34;aws_instance&#34; &#34;es_nodes&#34; {
</span></span><span class="line"><span class="cl">  count         = 3
</span></span><span class="line"><span class="cl">  ami           = &#34;ami-XXX&#34; # Replace with the AMI we found above
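</span></span><span class="line"><span class="cl">  # Note (sketch): instead of hardcoding the AMI id, terraform can resolve it
</span></span><span class="line"><span class="cl">  # from the same SSM parameter queried above, e.g. via a
</span></span><span class="line"><span class="cl">  # data &#34;aws_ssm_parameter&#34; &#34;al2_arm64&#34; lookup and then
</span></span><span class="line"><span class="cl">  # ami = data.aws_ssm_parameter.al2_arm64.value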
</span></span><span class="line"><span class="cl">  instance_type = &#34;c6g.large&#34;
</span></span><span class="line"><span class="cl">  security_groups = [aws_security_group.es_node_sg.name]
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  user_data = &lt;&lt;-EOF
</span></span><span class="line"><span class="cl">              #!/bin/bash
</span></span><span class="line"><span class="cl">              echo &#34;cluster.name: es-cluster&#34; &gt;&gt; /etc/elasticsearch/elasticsearch.yml
</span></span><span class="line"><span class="cl">              echo &#34;node.name: ${format(&#34;es-node-%02d&#34;, count.index+1)}&#34; &gt;&gt; /etc/elasticsearch/elasticsearch.yml
</span></span><span class="line"><span class="cl">              echo &#34;network.host: [_ec2_:privateIpv4_, _local_]&#34; &gt;&gt; /etc/elasticsearch/elasticsearch.yml
</span></span><span class="line"><span class="cl">              systemctl restart elasticsearch
</span></span><span class="line"><span class="cl">              EOF
</span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  tags = {
</span></span><span class="line"><span class="cl">    Name = &#34;es-node-${count.index+1}&#34;
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><h3 id="step-2-install-elasticsearch-on-the-new-instance">Step 2: Install Elasticsearch on the new instance</h3>
<p>Normally we would install Elasticsearch on the nodes using the user_data script, but during this migration, we went with a more manual method; you can install Elasticsearch using the RPM or DEB packages provided by Elasticsearch. Here is an example command to install Elasticsearch on an Amazon Linux 2 instance:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
</span></span><span class="line"><span class="cl">sudo tee /etc/yum.repos.d/elasticsearch.repo &lt;&lt;EOF
</span></span><span class="line"><span class="cl">[elasticsearch-7.x]
</span></span><span class="line"><span class="cl">name=Elasticsearch repository for 7.x packages
</span></span><span class="line"><span class="cl">baseurl=https://artifacts.elastic.co/packages/7.x/yum
</span></span><span class="line"><span class="cl">gpgcheck=1
</span></span><span class="line"><span class="cl">gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
</span></span><span class="line"><span class="cl">enabled=1
</span></span><span class="line"><span class="cl">autorefresh=1
</span></span><span class="line"><span class="cl">type=rpm-md
</span></span><span class="line"><span class="cl">EOF
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">sudo yum install -y elasticsearch
</span></span></code></pre></div><h3 id="step-3-configure-elasticsearch-on-the-new-instances">Step 3: Configure Elasticsearch on the new instances</h3>
<p>After installing Elasticsearch on the new Graviton2-based EC2 instance, the next step is to configure Elasticsearch to use the existing data and settings from the old Elasticsearch cluster. You can do this by copying the Elasticsearch configuration files from the old cluster to the new instance.</p>
<p>This might look something like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">rsync -avz --progress --delete /path/to/old/cluster/config/ ec2-user@new-instance-ip:/etc/elasticsearch/
</span></span></code></pre></div><h3 id="step-4-start-elasticsearch-on-the-new-instance">Step 4: Start Elasticsearch on the new instance</h3>
<p>Once the Elasticsearch configuration files are copied to the new Graviton2-based EC2 instance, the next step is to start Elasticsearch on it. Here is an example command to start Elasticsearch on the new instance:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">sudo service elasticsearch start
</span></span></code></pre></div><h3 id="step-5-verify-the-migration">Step 5: Verify the migration</h3>
<p>The final step in the migration process is to verify that the data and settings from the old Elasticsearch cluster have been successfully migrated to the new Graviton2-based EC2 instance. You can do this by checking the Elasticsearch logs and running some search queries on the new instance.</p>
<p>Here is an example command to check the Elasticsearch logs on the new instance:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">sudo tail -f /var/log/elasticsearch/elasticsearch.log
</span></span></code></pre></div><p>This command shows the Elasticsearch logs on the new instance, and you can use it to check if any errors or warnings are reported during the migration process.</p>
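<p>Beyond the logs, a couple of quick queries against the local node help confirm that the new instance joined the cluster and that the cluster is healthy (this assumes Elasticsearch is listening on the default port 9200):</p>

```shell
# Overall cluster health: status, node count, unassigned shards
curl -s 'http://localhost:9200/_cluster/health?pretty'

# List all nodes to verify the new Graviton2 instances have joined
curl -s 'http://localhost:9200/_cat/nodes?v'
```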
<h3 id="step-6-remove-original-nodes">Step 6: Remove original nodes</h3>
<p>After all the new Graviton2 instances are in sync in the cluster you can go ahead and remove the old intel instances one by one and allow the cluster to rebalance.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Migrating an Elasticsearch cluster to running on Graviton EC2 instances can provide significant performance and cost benefits. In this blog post, I walked you through the process of migrating an existing Elasticsearch cluster to new Graviton2-based EC2 instances. By following the steps outlined in this post, you can easily migrate your Elasticsearch cluster to Graviton2-based EC2 instances and take advantage of the cost/performance improvements they offer.</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/aws" term="aws" label="AWS" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/aws" term="aws" label="aws" />
                             
                                <category scheme="https://www.ducea.com/tags/elasticsearch" term="elasticsearch" label="elasticsearch" />
                             
                                <category scheme="https://www.ducea.com/tags/graviton2" term="graviton2" label="graviton2" />
                             
                                <category scheme="https://www.ducea.com/tags/terraform" term="terraform" label="terraform" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Speedup MySQL InnoDB shutdown]]></title>
            <link href="https://www.ducea.com/2014/01/19/speedup-mysql-innodb-shutdown/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2009/03/17/howto-get-a-small-sample-dataset-from-a-mysql-database-using-mysqldump/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo get a small sample dataset from a mysql database using mysqldump" />
                <link href="https://www.ducea.com/2009/03/13/innodb-plugin-version-103-released-enhances-concurrency-and-scalability-on-multi-core-systems/?utm_source=atom_feed" rel="related" type="text/html" title="InnoDB Plugin Version 1.0.3 released: enhances concurrency and scalability on multi-core systems" />
                <link href="https://www.ducea.com/2009/01/19/running-multiple-instances-of-mysql-on-the-same-machine/?utm_source=atom_feed" rel="related" type="text/html" title="Running multiple instances of MySQL on the same machine" />
                <link href="https://www.ducea.com/2008/02/13/mysql-skip-duplicate-replication-errors/?utm_source=atom_feed" rel="related" type="text/html" title="MySQL skip duplicate replication errors" />
                <link href="https://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo completely remove a file from Git history" />
            
                <id>https://www.ducea.com/2014/01/19/speedup-mysql-innodb-shutdown/</id>
            
            
            <published>2014-01-19T09:52:55+00:00</published>
            <updated>2014-01-19T09:52:55+00:00</updated>
            
            
<content type="html"><![CDATA[<p>Depending on the size of your InnoDB databases, a MySQL restart can be horribly slow. There are some tricks that can speed this up, but the most effective one that I&rsquo;ve found and use all the time in such situations is to pre-flush the dirty pages right before shutdown; this can be done like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">mysql&gt; set global innodb_max_dirty_pages_pct = 0;
</span></span></code></pre></div><p>You can check the number of dirty pages with the command:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">mysqladmin ext -i10 | grep dirty
</span></span></code></pre></div><p>Let the server run like this for a while; once you see the number of dirty pages settle close to zero, the restart (or stop) should be much faster.</p>
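<p>Putting the steps together, a minimal pre-shutdown sequence might look like this (the 10-second interval and the service name are assumptions for a typical Linux install):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"># tell InnoDB to flush dirty pages aggressively
</span></span><span class="line"><span class="cl">mysql -e &#39;set global innodb_max_dirty_pages_pct = 0;&#39;
</span></span><span class="line"><span class="cl"># watch Innodb_buffer_pool_pages_dirty drop towards zero, every 10s
</span></span><span class="line"><span class="cl">mysqladmin ext -i10 | grep -i dirty
</span></span><span class="line"><span class="cl"># once it settles, the actual stop is fast
</span></span><span class="line"><span class="cl">service mysql stop
</span></span></code></pre></div>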
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/linux" term="linux" label="Linux" />
                             
                                <category scheme="https://www.ducea.com/categories/tips-tricks" term="tips-tricks" label="Tips &amp; Tricks" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/mysql" term="mysql" label="mysql" />
                             
                                <category scheme="https://www.ducea.com/tags/innodb" term="innodb" label="innodb" />
                             
                                <category scheme="https://www.ducea.com/tags/tips" term="tips" label="tips" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo Migrate to Chef 11]]></title>
            <link href="https://www.ducea.com/2013/03/05/howto-migrate-to-chef-11/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2013/03/04/knife-backup/?utm_source=atom_feed" rel="related" type="text/html" title="knife-backup" />
                <link href="https://www.ducea.com/2013/02/26/knife-cleanup/?utm_source=atom_feed" rel="related" type="text/html" title="knife-cleanup" />
                <link href="https://www.ducea.com/2013/02/04/bay-area-chef-user-group-update-after-one-year/?utm_source=atom_feed" rel="related" type="text/html" title="Bay Area Chef User Group Update - After One Year" />
                <link href="https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/?utm_source=atom_feed" rel="related" type="text/html" title="ChefConf 2012 - San Francisco" />
                <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="related" type="text/html" title="First Chef Cookbook Contest Announced!" />
            
                <id>https://www.ducea.com/2013/03/05/howto-migrate-to-chef-11/</id>
            
            
            <published>2013-03-05T00:00:00+00:00</published>
            <updated>2013-03-05T00:00:00+00:00</updated>
            
            
<content type="html"><![CDATA[<p><a href="http://www.opscode.com/blog/2013/02/04/chef-11-released/"><strong>Chef 11</strong></a> was released earlier in February and it is awesome! Like most people, I love the new features: partial search, the chef-apply and knife-essentials inclusions, the nicely formatted output, etc. Of course, the open source chef 11 server was <a href="http://www.opscode.com/blog/2013/02/15/the-making-of-erchef-the-chef-11-server/">rewritten completely</a> in erlang, with postgresql/mysql support replacing the ruby/couchdb backend stack; solr and rabbitmq are still there ;)… There are many <a href="http://docs.opscode.com/breaking_changes_chef_11.html">breaking changes</a>, so you will want to make sure you fix your cookbooks before upgrading.</p>
<p>When you are ready to upgrade, you will notice that unfortunately there is no official migration path. This howto documents the process I&rsquo;ve used myself for such migrations and will hopefully help you if you are attempting a similar upgrade.</p>
<p><strong>Opscode</strong> has done an amazing job with the omnibus installers, and starting with Chef 11 the chef server supports them too: you can install a new chef server simply by installing the rpm or deb for your platform, and everything (ruby/gems, chef, rabbitmq, solr, erlang, postgresql, nginx) is installed for you. Just head over to <a href="http://www.opscode.com/chef/install/">http://www.opscode.com/chef/install/</a> and download the version for your OS from the chef-server tab.</p>
<p>In order to migrate to a new chef server we need to transfer the following objects from the old server:</p>
<ul>
<li>clients</li>
<li>nodes</li>
<li>roles</li>
<li>environments</li>
<li>data bags</li>
<li>cookbooks (with all the versions used in each environment)</li>
</ul>
<p>It is important to migrate all the clients together with their proper public keys; otherwise we would have to re-register each one of them.</p>
<p>Personally, I&rsquo;ve used this process to migrate several servers from open source chef 0.10.x to chef 11, but in theory it should work from any chef server implementation (hosted, private, etc.) because we download and upload the assets using the API calls.</p>
<h3 id="backup-the-data-from-the-existing-server">Backup the data from the existing server</h3>
<p>You can use my <a href="http://www.ducea.com/2013/03/04/knife-backup/">knife-backup</a> plugin for this. Once you install the gem you can just run it and it will backup all the objects from the existing server:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install knife-backup
</span></span><span class="line"><span class="cl">knife backup export
</span></span></code></pre></div><p>This might take a while depending on how many nodes, clients, cookbooks, etc. you have. Once completed, you will find all the needed files in <code>.chef/chef_server_backup</code>.</p>
<p><strong>Optional:</strong> if you have many unused cookbook versions you might want to clean them up before the backup. You can use my <a href="http://www.ducea.com/2013/02/26/knife-cleanup/">knife-cleanup</a> plugin for this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install knife-cleanup
</span></span><span class="line"><span class="cl">knife cleanup versions -D
</span></span></code></pre></div><h3 id="install-the-new-chef-11-server">Install the new Chef 11 server</h3>
<p>I would recommend setting up a brand new server, as this is the safest approach: if something doesn&rsquo;t work out, you don&rsquo;t have to mess with your current environment. As mentioned earlier, you can install the new server very easily with the <a href="http://www.opscode.com/chef/install/">omnibus installer</a>. For example, for Ubuntu 12.04 this would look like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">wget https://opscode-omnitruck-release.s3.amazonaws.com/ubuntu/12.04/x86_64/chef-server_11.0.6-1.ubuntu.12.04_amd64.deb
</span></span><span class="line"><span class="cl">dpkg -i chef-server*
</span></span><span class="line"><span class="cl">sudo chef-server-ctl reconfigure
</span></span></code></pre></div><p>You can also use the <a href="https://github.com/opscode-cookbooks/chef-server">chef-server cookbook</a> to install your new server if you prefer that.</p>
<p>Once you have the new chef server up and running, you will need to set up a new admin account and a new knife config. I recommend using a dedicated user for this so it does not interfere with the users we are about to import from the old server; I call it <code>transfer</code>. Run locally on the new server, this would look like:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">mkdir -p ~/.chef
</span></span><span class="line"><span class="cl">sudo cp /etc/chef-server/chef-webui.pem ~/.chef/
</span></span><span class="line"><span class="cl">sudo cp /etc/chef-server/chef-validator.pem ~/.chef/
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">marius@chef:~# knife configure -i
</span></span><span class="line"><span class="cl">WARNING: No knife configuration file found
</span></span><span class="line"><span class="cl">Where should I put the config file? [/marius/.chef/knife.rb]
</span></span><span class="line"><span class="cl">Please enter the chef server URL: [http://localhost:4000] https://localhost
</span></span><span class="line"><span class="cl">Please enter a clientname for the new client: [transfer]
</span></span><span class="line"><span class="cl">Please enter the existing admin clientname: [chef-webui]
</span></span><span class="line"><span class="cl">Please enter the location of the existing admin client&#39;s private key: [/etc/chef/webui.pem] ~/.chef/chef-webui.pem
</span></span><span class="line"><span class="cl">Please enter the validation clientname: [chef-validator]
</span></span><span class="line"><span class="cl">Please enter the location of the validation key: [/etc/chef/validation.pem] ~/.chef/chef-validator.pem
</span></span><span class="line"><span class="cl">Please enter the path to a chef repository (or leave blank):
</span></span><span class="line"><span class="cl">Creating initial API user…
</span></span><span class="line"><span class="cl">Created client[transfer]
</span></span><span class="line"><span class="cl">Configuration file written to /marius/.chef/knife.rb
</span></span></code></pre></div><p>Note: the default server keys are now located in <code>/etc/chef-server/</code> and not in <code>/etc/chef</code> like they used to be, which is definitely a welcome change. Also, knife still defaults the server URL to http on port 4000, but the chef 11 server sits behind an nginx load balancer and listens by default on the standard https port.</p>
<h3 id="restore-the-data-from-the-old-server">Restore the data from the old server</h3>
<p>Finally, we can now restore all the data from the old server. Copy the backup over and, for simplicity, drop it in your user&rsquo;s <code>.chef</code> folder under <code>.chef/chef_server_backup</code>; be sure to install the <strong>knife-backup</strong> gem on the server, and you should be able to run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">marius@chef:~# knife backup restore
</span></span><span class="line"><span class="cl">WARNING: This will overwrite existing data!
</span></span><span class="line"><span class="cl">WARNING: Backup is at least 1 day old
</span></span><span class="line"><span class="cl">Do you want to restore backup, possibly overwriting exisitng data? (Y/N) y
</span></span><span class="line"><span class="cl">Restoring clients
</span></span><span class="line"><span class="cl">...
</span></span></code></pre></div><p>This should restore all the data onto the new server. The final step is to regenerate the search indexes:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">chef-server-ctl reindex
</span></span></code></pre></div><p>Note: I want to point out that currently knife-backup will skip any clients that already exist on the server, as I could not find a way to overwrite them using the API calls. This means the validation key will almost certainly need to be changed, as that client is guaranteed to exist on the newly installed server.</p>
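<p>In practice this means distributing the new server&rsquo;s validation key to any hosts that still bootstrap with it; a minimal sketch (the hostname is a placeholder, and paths assume the omnibus defaults):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"># copy the new validator key over the old one on each bootstrapping host
</span></span><span class="line"><span class="cl">scp /etc/chef-server/chef-validator.pem node1.example.com:/etc/chef/validation.pem
</span></span></code></pre></div>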
<h3 id="final-touches">Final touches</h3>
<p>After the data migration is completed, you will probably just have to point your DNS alias to the new server. One issue I&rsquo;ve noticed is that the chef server, when installed, writes the local DNS name into various places in its config files. When working on a temporary server, this caused problems once the DNS was changed and the server activated: the chef server sends clients the links from which to download assets (cookbook parts, for example), and if the name was wrong at install time you will have to correct it to a DNS entry the clients can actually resolve; check it with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">grep s3_url /var/opt/chef-server/erchef/etc/app.config
</span></span></code></pre></div><p>and restart the chef server after correcting the s3_url:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">chef-server-ctl restart
</span></span></code></pre></div><p>Hopefully this post will help you migrate to <strong>Chef 11</strong>. Feel free to let me know in the comments below if you had any issues following this process, or if it worked without any problems. Also, if you find any problems with the tools used here, <a href="https://github.com/mdxp/knife-cleanup">knife-cleanup</a> or <a href="https://github.com/mdxp/knife-backup">knife-backup</a>, please open a ticket on github or submit a patch. Good luck!</p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/chef" term="chef" label="Chef" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/chef11" term="chef11" label="chef11" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                             
                                <category scheme="https://www.ducea.com/tags/knife" term="knife" label="knife" />
                             
                                <category scheme="https://www.ducea.com/tags/devops" term="devops" label="devops" />
                             
                                <category scheme="https://www.ducea.com/tags/github" term="github" label="github" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[knife-backup]]></title>
            <link href="https://www.ducea.com/2013/03/04/knife-backup/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2013/02/26/knife-cleanup/?utm_source=atom_feed" rel="related" type="text/html" title="knife-cleanup" />
                <link href="https://www.ducea.com/2013/02/04/bay-area-chef-user-group-update-after-one-year/?utm_source=atom_feed" rel="related" type="text/html" title="Bay Area Chef User Group Update - After One Year" />
                <link href="https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/?utm_source=atom_feed" rel="related" type="text/html" title="ChefConf 2012 - San Francisco" />
                <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="related" type="text/html" title="First Chef Cookbook Contest Announced!" />
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
            
                <id>https://www.ducea.com/2013/03/04/knife-backup/</id>
            
            
            <published>2013-03-04T00:00:00+00:00</published>
            <updated>2013-03-04T00:00:00+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>While working on migrating a chef server from 0.10.x to version 11, I ended up extending the <a href="https://github.com/stevendanna/knife-hacks/blob/master/plugins/backup_export.rb">BackupExport</a> and <a href="https://github.com/stevendanna/knife-hacks/blob/master/plugins/backup_restore.rb">BackupRestore</a> plugins written by <a href="https://github.com/stevendanna">Steven Danna</a> and <a href="https://github.com/jtimberman">Joshua Timberman</a> and added support for <strong>cookbooks</strong> and <strong>clients</strong>. Currently <a href="https://github.com/mdxp/knife-backup">knife-backup</a> has support for the following objects:</p>
<ul>
<li>clients</li>
<li>nodes</li>
<li>roles</li>
<li>environments</li>
<li>data bags</li>
<li>cookbooks and all their versions.</li>
</ul>
<p><strong>knife-backup</strong> will back up all cookbook versions available on the chef server. Cookbooks normally live in a repository and should be easy to re-upload from there, but if you are using various cookbook versions in each environment it might not be so trivial to find and upload them all back to the server; downloading them so they are ready to upload is simple and clean. If you have too many cookbook versions, you might want to clean them up first using something like <a href="https://github.com/mdxp/knife-cleanup">knife-cleanup</a>.</p>
<p>If you want to check it out, just install the gem:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install knife-backup
</span></span></code></pre></div><p>and then just point it to an existing chef server to backup all its objects with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">knife backup export
</span></span></code></pre></div><p>If you need to restore, it is as simple as:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">knife backup restore [-d DIR]
</span></span></code></pre></div><p>I hope you find this useful, and I look forward to your feedback.<br>
Patches are welcome: <a href="https://github.com/mdxp/knife-backup">knife-backup on github</a></p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/chef" term="chef" label="Chef" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                             
                                <category scheme="https://www.ducea.com/tags/knife" term="knife" label="knife" />
                             
                                <category scheme="https://www.ducea.com/tags/devops" term="devops" label="devops" />
                             
                                <category scheme="https://www.ducea.com/tags/github" term="github" label="github" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[knife-cleanup]]></title>
            <link href="https://www.ducea.com/2013/02/26/knife-cleanup/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2013/02/04/bay-area-chef-user-group-update-after-one-year/?utm_source=atom_feed" rel="related" type="text/html" title="Bay Area Chef User Group Update - After One Year" />
                <link href="https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/?utm_source=atom_feed" rel="related" type="text/html" title="ChefConf 2012 - San Francisco" />
                <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="related" type="text/html" title="First Chef Cookbook Contest Announced!" />
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
                <link href="https://www.ducea.com/2010/09/27/chef-intro-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Chef Intro @ SF Bay Area LSPE meetup" />
            
                <id>https://www.ducea.com/2013/02/26/knife-cleanup/</id>
            
            
            <published>2013-02-26T00:00:00+00:00</published>
            <updated>2013-02-26T00:00:00+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>I&rsquo;m working on many projects where we have a process that will make sure that every change we introduce in the cookbooks enters as a <strong>new version</strong> and where we use extensively <em>environments</em> to select what versions of cookbooks we want to use in each environment. This sounds like a great idea, and a workflow that I would highly recommend to anyone for sure. Still, after a while, the result is that you will end up with hundreds maybe even thousands of cookbook versions and most of them are <strong>unused</strong> (besides the few ones that you are referencing in each environment and maybe the latest ones). Normally I would not care about this and as long as it is not causing performance issues you should not care about it either. Still you must admit that when debugging any problems, it will make it more complex with all those versions everywhere; see bellow an example.</p>
<p><code>hadoop       0.1.118 0.1.116 0.1.115 0.1.114 0.1.113 0.1.111 0.1.109 0.1.108 0.1.106 0.1.105 0.1.104 0.1.103 0.1.102 0.1.101 0.1.99 0.1.98 0.1.97 0.1.96 0.1.95 0.1.94 0.1.93 0.1.92 0.1.91 0.1.90 0.1.89 0.1.88 0.1.87 0.1.86 0.1.85 0.1.84 0.1.83 0.1.82 0.1.81 0.1.80 0.1.79 0.1.78 0.1.77 0.1.76 0.1.75 0.1.74 0.1.73 0.1.72 0.1.71 0.1.70 0.1.69 0.1.68 0.1.67 0.1.66 0.1.65 0.1.64 0.1.63 0.1.62 0.1.61 0.1.60 0.1.59 0.1.58 0.1.57 0.1.56 0.1.55 0.1.54 0.1.53 0.1.52 0.1.51 0.1.50 0.1.49 0.1.48 0.1.47 0.1.46 0.1.45 0.1.44 0.1.43 0.1.42 0.1.41 0.1.40 0.1.39 0.1.38 0.1.37 0.1.36 0.1.35 0.1.34 0.1.33 0.1.32 0.1.31 0.1.30 0.1.29 0.1.28 0.1.25 0.1.24 0.1.23 0.1.22 0.1.21 0.1.20 0.1.19 0.1.18 0.1.17 0.1.16 0.1.15 0.1.13 0.1.12 0.1.11 0.1.10 0.1.9 0.1.8 0.1.7 0.1.6 0.1.5 0.1.4 0.1.3 0.1.2 0.1.0</code></p>
<p>(and this was the cookbook with the fewest versions that I could find to paste here).</p>
<p>While working on <a href="https://github.com/mdxp/knife-backup">knife-backup</a> I realized what a huge waste this was, and decided that I needed a way to clean these and keep on the server just the relevant ones.</p>
<p>To solve this problem I wrote <a href="https://github.com/mdxp/knife-cleanup">knife-cleanup</a>, and if you have similar needs you might find it useful. It will clean up all unused versions of the cookbooks on your chef server (whether the hosted opscode platform or the open source server). Before deleting anything, it backs up each version it touches (just in case).</p>
<p>If you want to check it out, just install the gem:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install knife-cleanup
</span></span></code></pre></div><p>and assuming you have a working knife config you can run it with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">knife cleanup versions
</span></span></code></pre></div><p>and this will output the versions it would delete.</p>
<p>If you are ready to delete, you can do that with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">knife cleanup versions -D
</span></span></code></pre></div><p>and you can find the backups of the versions deleted under <code>.cleanup/cookbook_name</code></p>
<p><em>Notes</em>: I&rsquo;ve seen various cases where it is impossible to download a cookbook version (and knife will error out). From my experience there is not much we can do about that, so the script will skip the backup but still delete the corrupt version. You might want to take a full chef server backup first (see <a href="https://github.com/mdxp/knife-backup">knife-backup</a>) just in case.
The way I&rsquo;m using this is with exact version pinning of cookbooks in environments (for more details see <a href="https://github.com/mdxp/chef-jenkins">chef-jenkins</a>); if you are using environments and cookbook versions in a different way, then this might not make sense for you.</p>
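<p>For reference, exact version pinning in an environment looks something like this (a hypothetical <code>production</code> environment in JSON form, pinning the hadoop cookbook from the example above):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">{
</span></span><span class="line"><span class="cl">  &quot;name&quot;: &quot;production&quot;,
</span></span><span class="line"><span class="cl">  &quot;json_class&quot;: &quot;Chef::Environment&quot;,
</span></span><span class="line"><span class="cl">  &quot;chef_type&quot;: &quot;environment&quot;,
</span></span><span class="line"><span class="cl">  &quot;cookbook_versions&quot;: {
</span></span><span class="line"><span class="cl">    &quot;hadoop&quot;: &quot;= 0.1.118&quot;
</span></span><span class="line"><span class="cl">  }
</span></span><span class="line"><span class="cl">}
</span></span></code></pre></div><p>Only the versions pinned this way (plus the latest ones) are worth keeping; everything else is a candidate for cleanup.</p>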
<p>I hope you find this useful, and I look forward to your feedback.<br>
Patches are welcome: <a href="https://github.com/mdxp/knife-cleanup">knife-cleanup on github</a></p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/chef" term="chef" label="Chef" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                             
                                <category scheme="https://www.ducea.com/tags/knife" term="knife" label="knife" />
                             
                                <category scheme="https://www.ducea.com/tags/devops" term="devops" label="devops" />
                             
                                <category scheme="https://www.ducea.com/tags/github" term="github" label="github" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Bay Area Chef User Group Update - After One Year]]></title>
            <link href="https://www.ducea.com/2013/02/04/bay-area-chef-user-group-update-after-one-year/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/?utm_source=atom_feed" rel="related" type="text/html" title="ChefConf 2012 - San Francisco" />
                <link href="https://www.ducea.com/2010/09/27/chef-intro-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Chef Intro @ SF Bay Area LSPE meetup" />
                <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="related" type="text/html" title="First Chef Cookbook Contest Announced!" />
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
                <link href="https://www.ducea.com/2010/09/01/nodejs-chef-cookbook-released/?utm_source=atom_feed" rel="related" type="text/html" title="NodeJS chef cookbook released" />
            
                <id>https://www.ducea.com/2013/02/04/bay-area-chef-user-group-update-after-one-year/</id>
            
            
            <published>2013-02-04T00:00:00+00:00</published>
            <updated>2013-02-04T00:00:00+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>It&rsquo;s been a little more than a year since I stepped up and became one of the organizers of the <a href="http://www.meetup.com/The-Bay-Area-Chef-User-Group/">Bay Area Chef user group</a>, trying to help my good friend <em><strong>Rob Berger</strong></em> as he was getting swamped with work and could not dedicate as much time to this, as he used to in the past. This post is meant to be a quick review on what happened during this time, what worked well and of course some ideas on how we can improve this in the future. I&rsquo;m also hoping to get <strong>feedback</strong> from our members on what we can do differently in the future to better serve them and make this an even better group.</p>
<p>One of the first things we did last year was to introduce the <strong>Chef Cafes</strong>. These are small events (we cap them at 10 people) held consistently at the same time (1st and 3rd Thursday of the month) at the best coffee shop in Mountain View (<a href="http://www.redrockcoffee.org/">Red Rock Coffee</a>), with the intent to facilitate interaction between people: give them a place where they can regularly meet and talk about Chef, ask questions and help other members in the spirit of the open source community. The first Chef Cafe was on <em>March 1st 2012</em> and it was just me and Rob (we had a good time planning future events and just catching up). After that, we held 16 Chef Cafes throughout the year, many with 10 or even more people, and each one was unique and special in its own way. At some of them, new Chef users brought various questions on how to use Chef, and we tried to help them resolve their blocks in understanding and getting up to speed. At others, really advanced users brainstormed about various unresolved problems and shared their take on things like cookbook testing, workflow or orchestration. Overall, I think it was a great success and allowed us to be more connected with members, and also more open and helpful to new Chef users.</p>
<p>In <strong>2013</strong> we look forward to your suggestions on how we can improve the Chef Cafes, and we will try to keep them going. We hope to move one to San Francisco and keep the other in the South Bay, as we have had various requests for that. <em>So if you are in the City and want to get involved, please ping me.</em></p>
<p>One other thing we tried to do was to bring consistency and have at least one meetup every month, with an awesome presentation on some hot topic in the Chef community. This ended up being a little too optimistic :(. Still, we had 6 cool meetups with speakers like:</p>
<ul>
<li><strong>Flip Kromer</strong> on Ironfan</li>
<li><strong>Jim Hopp</strong> on Test-Driven Development</li>
<li><strong>Jesse Robbins</strong> - Hacking culture &amp; Being a force for Awesome</li>
<li><strong>Daniel DeLeo</strong> on Whyrun mode</li>
<li><strong>Nati Shalom</strong> on Cloudify</li>
</ul>
<p>and we also had <strong>Aaron Peterson</strong> running an introductory <em>Chef Workshop</em>; considering the big and diverse audience, I think we did quite a good job with that.</p>
<p>With the experience we gained last year, we are more confident that this year we will be able to run one meetup every month, but we need your help: <em>we are always looking for great speakers and interesting topics; if you want to present at one of our meetups, or know someone we should invite to present, please let us know.</em></p>
<p>Most of our meetups last year were hosted by <strong>SurveyMonkey</strong> in Palo Alto and we can&rsquo;t thank them enough for their support (special thanks to <em>Tim Sabat</em> for making them possible). We also had one meetup in San Francisco hosted at the Scalr offices (thanks, Sebastian). This year we hope to diversify and run each meetup in a different place to make things more interesting, and hopefully hold more meetups in the City. <em>If you are interested in hosting and sponsoring one of our future meetups, please contact me privately and let me know.</em></p>
<p>During the last year, our group has grown a lot. We started with <strong>132 members</strong> on the first day of January 2012 and ended the year with more than <strong>400 members</strong>. This shows that the interest in Chef is obviously growing, and hopefully the events we have been organizing are helping grow our local Chef community.</p>
<p>If you have <em><strong>any</strong></em> suggestions on what you would like us to do in the future, please let us know. Use the comments below, send us a message, whatever works for you; we would love to hear from you and see how we can serve you better. Overall, 2012 was great and with your help we can make 2013 even better!</p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/meetups" term="meetups" label="Meetups" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/meetups" term="meetups" label="meetups" />
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                             
                                <category scheme="https://www.ducea.com/tags/devops" term="devops" label="devops" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Finally Migrated to Octopress]]></title>
            <link href="https://www.ducea.com/2012/11/14/finally-migrated-to-octopress/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2012/11/12/disqus-comments-not-visible-in-octopress/?utm_source=atom_feed" rel="related" type="text/html" title="Disqus comments not visible in Octopress" />
                <link href="https://www.ducea.com/2010/05/03/reloaded/?utm_source=atom_feed" rel="related" type="text/html" title="Reloaded" />
                <link href="https://www.ducea.com/2009/04/14/thoughts-on-twitter-and-blogging/?utm_source=atom_feed" rel="related" type="text/html" title="Thoughts on twitter and blogging" />
                <link href="https://www.ducea.com/2009/01/15/blogging-in-2008/?utm_source=atom_feed" rel="related" type="text/html" title="Blogging in 2008" />
                <link href="https://www.ducea.com/2008/12/24/merry-christmas-happy-new-year/?utm_source=atom_feed" rel="related" type="text/html" title="Merry Christmas &amp; Happy New Year!" />
            
                <id>https://www.ducea.com/2012/11/14/finally-migrated-to-octopress/</id>
            
            
            <published>2012-11-14T00:00:00+00:00</published>
            <updated>2012-11-14T00:00:00+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>For a while now, I have wanted to migrate my blog from WordPress to <a href="http://octopress.org/">Octopress</a>, but for some reason I kept putting it on the shelf (let&rsquo;s say because of all those client-related projects&hellip;). Last weekend I finally completed the migration, and I&rsquo;m really excited to get back to blogging. This post is meant to capture some of the issues I encountered during the migration and how to fix them. It is not a full how-to-migrate post, as there are many great articles <a href="https://www.google.com/#hl=en&amp;tbo=d&amp;sclient=psy-ab&amp;q=migrate+wordpress+to+octopress">available already</a>.</p>
<h2 id="migrate-old-blog-posts">Migrate old blog posts</h2>
<p>Believe it or not, I had <strong>364 blog posts</strong> when I started the migration, which meant a lot of energy went into importing those old articles. I used <a href="https://github.com/thomasf/exitwp">exitwp</a> to convert the wordpress-xml export of the blog posts, and this produced a reasonably good result. Still, I had to run some fixes&hellip;</p>
<ul>
<li>
<p>for code blocks:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"> perl -pi -e <span class="s1">&#39;s/([^\`]|^)(\`)([^\`]|$)/$1\n\`\`\`\n$3/g&#39;</span> *
</span></span></code></pre></div></li>
<li>
<p>to enable comments (as &lsquo;comments: true&rsquo; was missing from all posts)</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">find source/_posts/ -type f -print0 <span class="p">|</span> xargs -0 -I file sed -i <span class="s1">&#39;&#39;</span> <span class="s1">&#39;2 i \
</span></span></span><span class="line"><span class="cl"><span class="s1">  comments: true&#39;</span> file
</span></span></code></pre></div></li>
</ul>
<h2 id="categoriestagsurls">Categories/Tags/URLs</h2>
<p>I enabled the Octopress category list plugin and tags plugin, which you can see in the sidebar. Since I already had tags and categories on all posts, it was very important to keep the same URLs and not break them. The same goes for regular post URLs. Here are the relevant settings from the Octopress config file:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl"><span class="nt">root</span><span class="p">:</span><span class="w"> </span><span class="l">/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">permalink</span><span class="p">:</span><span class="w"> </span><span class="l">/:year/:month/:day/:title/</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">category_dir</span><span class="p">:</span><span class="w"> </span><span class="l">category</span><span class="w">
</span></span></span><span class="line"><span class="cl"><span class="w"></span><span class="nt">tag_dir</span><span class="p">:</span><span class="w"> </span><span class="s2">&#34;tag&#34;</span><span class="w">
</span></span></span></code></pre></div><p>Just keep in mind that if you have many tags, as I do, page generation time will increase a lot after you enable the tags plugin. You&rsquo;ve been warned!</p>
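<p>For illustration, the permalink template above works by substituting the date and slug taken from the post filename. Here is a tiny hypothetical sketch (plain Python, not Jekyll&rsquo;s actual code), assuming the standard <code>YYYY-MM-DD-title</code> post filename convention:</p>

```python
# Illustrative sketch of how a permalink template like
# /:year/:month/:day/:title/ maps a post filename to its final URL.
# This is NOT Jekyll/Octopress source code, just the idea.
permalink = "/:year/:month/:day/:title/"

def post_url(filename):
    # Jekyll-style post filenames look like: 2012-11-14-some-title.markdown
    name = filename.rsplit(".", 1)[0]
    year, month, day, title = name.split("-", 3)
    return (permalink.replace(":year", year)
                     .replace(":month", month)
                     .replace(":day", day)
                     .replace(":title", title))

print(post_url("2012-11-14-finally-migrated-to-octopress.markdown"))
# -> /2012/11/14/finally-migrated-to-octopress/
```

<p>Keeping the template identical to the old WordPress structure is what preserves every existing URL.</p>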
<h2 id="disqus-comments">Disqus comments</h2>
<p>Not working at all&hellip; I wrote a post specifically about this; check it out <a href="http://www.ducea.com/2012/11/12/disqus-comments-not-visible-in-octopress/">here</a>.</p>
<h2 id="feed-url">Feed Url</h2>
<p>My WordPress blog has been around for a while (6 years, more or less), and even though I&rsquo;ve always used <strong>feedburner</strong> for my feed, for some strange reason I&rsquo;ve always pointed readers at my own <a href="http://www.ducea.com/feed/">feed url</a>. This of course no longer worked with Octopress, so I had to set up a rewrite rule to avoid breaking everyone&rsquo;s feed reader:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-apacheconf" data-lang="apacheconf"><span class="line"><span class="cl"><span class="nb">RewriteEngine</span> <span class="k">On</span>
</span></span><span class="line"><span class="cl"><span class="nb">Options</span> +FollowSymLinks -Multiviews
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c"># Feed url</span>
</span></span><span class="line"><span class="cl"><span class="nb">RewriteRule</span> ^feed/?$ atom.xml [QSA,L]
</span></span></code></pre></div><h2 id="rewrite-non-www-to-www">Rewrite non-www to www</h2>
<p>This was done automatically by WordPress, but Octopress will happily serve the non-www domain as well. This can cause issues with search engines and such, so I wanted the same behaviour. Apache to the rescue again:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-apacheconf" data-lang="apacheconf"><span class="line"><span class="cl"><span class="nb">RewriteCond</span> %{HTTP_HOST} !^www [NC]
</span></span><span class="line"><span class="cl"><span class="nb">RewriteRule</span> $ http://www.%{HTTP_HOST}%{REQUEST_URI} [L,R]
</span></span></code></pre></div><h2 id="apache-optimizations-caching-compression-etc">Apache optimizations, caching, compression, etc</h2>
<p>After you generate your Octopress site, everything is static and fast by default. Still, you want to make sure that Apache has some basic caching and compression settings to make it even better. Here are the relevant parts from my config:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-apacheconf" data-lang="apacheconf"><span class="line"><span class="cl"><span class="c">#### CACHING ####</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;IfModule</span> <span class="s">mod_expires.c</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl"><span class="nb">ExpiresActive</span> <span class="k">On</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c"># 1 MONTH</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;FilesMatch</span> <span class="s">&#34;\.(ico|gif|jpe?g|png|flv|pdf|swf|mov|mp3|wmv|ppt)$&#34;</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nb">ExpiresDefault</span> A2419200
</span></span><span class="line"><span class="cl">  <span class="nb">Header</span> append Cache-Control <span class="s2">&#34;public&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/FilesMatch&gt;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c"># 3 DAYS</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;FilesMatch</span> <span class="s">&#34;\.(xml|txt|html|htm|js|css)$&#34;</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nb">ExpiresDefault</span> A259200
</span></span><span class="line"><span class="cl">  <span class="nb">Header</span> append Cache-Control <span class="s2">&#34;private, must-revalidate&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/FilesMatch&gt;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c"># NEVER CACHE</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;FilesMatch</span> <span class="s">&#34;\.(php|cgi|pl)$&#34;</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">  <span class="nb">ExpiresDefault</span> A0
</span></span><span class="line"><span class="cl">  <span class="nb">Header</span> set Cache-Control <span class="s2">&#34;no-store, no-cache, must-revalidate, max-age=0&#34;</span>
</span></span><span class="line"><span class="cl">  <span class="nb">Header</span> set Pragma <span class="s2">&#34;no-cache&#34;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/FilesMatch&gt;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/IfModule&gt;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c">### Compression ####</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;IfModule</span> <span class="s">mod_deflate.c</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;IfModule</span> <span class="s">mod_setenvif.c</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">        <span class="nb">BrowserMatch</span> ^Mozilla/4 gzip-only-text/html
</span></span><span class="line"><span class="cl">        <span class="nb">BrowserMatch</span> ^Mozilla/4\.0[678] no-gzip
</span></span><span class="line"><span class="cl">        <span class="nb">BrowserMatch</span> \bMSIE !no-gzip !gzip-only-text/html
</span></span><span class="line"><span class="cl">        <span class="nb">BrowserMatch</span> \bMSI[E] !no-gzip !gzip-only-text/html
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;/IfModule&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;IfModule</span> <span class="s">mod_headers.c</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">        <span class="nb">Header</span> append Vary <span class="k">User</span>-Agent env=!dont-vary
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;/IfModule&gt;</span>
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;IfModule</span> <span class="s">mod_filter.c</span><span class="nt">&gt;</span>
</span></span><span class="line"><span class="cl">        <span class="nb">AddOutputFilterByType</span> DEFLATE text/css application/x-javascript text/x-component text/html text/richtext image/svg+xml text/plain text/xsd text/xsl text/xml image/x-icon
</span></span><span class="line"><span class="cl">    <span class="nt">&lt;/IfModule&gt;</span>
</span></span><span class="line"><span class="cl"><span class="nt">&lt;/IfModule&gt;</span>
</span></span></code></pre></div><h2 id="isolated-when-working-on-a-new-post">Isolate when working on a new post</h2>
<p>If you have many posts, generating the Octopress site will be extremely slow (in my case a full generate takes about 2 minutes), which makes it basically impossible to work on a new post and preview the feedback locally. The solution is well documented: you isolate the single post you are working on, and when you are done you integrate all the other posts back before publishing:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">rake new_post<span class="o">[</span><span class="s1">&#39;Finally Migrated to Octopress&#39;</span><span class="o">]</span>
</span></span><span class="line"><span class="cl">rake isolate<span class="o">[</span>finally-migrated-to-octopress<span class="o">]</span>
</span></span></code></pre></div><p>and now <code>rake generate</code> and <code>rake preview</code> will only work with the new post. Finally, when you are done and ready to publish the awesome new post on the internets:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">rake integrate
</span></span><span class="line"><span class="cl">rake generate
</span></span><span class="line"><span class="cl">rake deploy
</span></span></code></pre></div><h3 id="others">Others</h3>
<ul>
<li>some small customizations to the theme (colors and such)</li>
<li>custom &ldquo;about me&rdquo; and contact asides</li>
<li>fix the github aside (updated to work with their latest api version and actually return the repos)</li>
<li>and of course the contact form (using a wufoo form)</li>
</ul>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/general" term="general" label="General" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/site" term="site" label="site" />
                             
                                <category scheme="https://www.ducea.com/tags/octopress" term="octopress" label="octopress" />
                             
                                <category scheme="https://www.ducea.com/tags/disqus" term="disqus" label="disqus" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Disqus comments not visible in Octopress]]></title>
            <link href="https://www.ducea.com/2012/11/12/disqus-comments-not-visible-in-octopress/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2010/05/03/reloaded/?utm_source=atom_feed" rel="related" type="text/html" title="Reloaded" />
                <link href="https://www.ducea.com/2009/04/14/thoughts-on-twitter-and-blogging/?utm_source=atom_feed" rel="related" type="text/html" title="Thoughts on twitter and blogging" />
                <link href="https://www.ducea.com/2009/01/15/blogging-in-2008/?utm_source=atom_feed" rel="related" type="text/html" title="Blogging in 2008" />
                <link href="https://www.ducea.com/2008/12/24/merry-christmas-happy-new-year/?utm_source=atom_feed" rel="related" type="text/html" title="Merry Christmas &amp; Happy New Year!" />
                <link href="https://www.ducea.com/2008/10/27/enabling-gravatars/?utm_source=atom_feed" rel="related" type="text/html" title="Enabling Gravatars" />
            
                <id>https://www.ducea.com/2012/11/12/disqus-comments-not-visible-in-octopress/</id>
            
            
            <published>2012-11-12T00:00:00+00:00</published>
            <updated>2012-11-12T00:00:00+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>After completing the migration of my blog from WordPress to Octopress, I was surprised to find that Disqus comments were <em><strong>not</strong> showing up on the site</em>. I had already migrated to Disqus in advance, and the WordPress blog was working just fine with the new format. However, once I switched to Octopress there were <strong>no comments active on the site</strong>. Strangely, the total number of comments for each post showed up just fine on the index page, but once you clicked on any post there were <em>no comments</em>. I tested adding a new comment and it did show up correctly in Disqus.</p>
<p>Trying to understand and debug this issue, I looked in <code>source/_includes/disqus.html</code> and found the code that generates the JavaScript variable <strong>disqus_identifier</strong> for the posts:</p>
<script type="application/javascript" src="https://gist.github.com/mdxp/4067374.js"></script>

<p>and looking at the HTML generated for some blog posts, the <code>disqus_url</code> and <code>disqus_identifier</code> variables looked OK at first glance, like this:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">var disqus_identifier = &#39;http://www.ducea.com//2012/11/12/disqus-comments-not-visible-in-octopress/&#39;;
</span></span><span class="line"><span class="cl">var disqus_url = &#39;http://www.ducea.com//2012/11/12/disqus-comments-not-visible-in-octopress/&#39;;
</span></span><span class="line"><span class="cl">var disqus_script = &#39;embed.js&#39;;
</span></span></code></pre></div><p>Still, on a closer look I was able to identify the issue: the url above has a double <strong>/</strong>, and even though that should not cause any problems and should identify the same url, Disqus was actually treating it as a separate identifier and hence not showing the comments associated with it.
Once I figured that out, it was very simple to see where it came from: the site <strong>url</strong> setting in _config.yml:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl">url: http://www.ducea.com/
</span></span></code></pre></div><p>and fixing it by removing the trailing slash:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-yml" data-lang="yml"><span class="line"><span class="cl">url: http://www.ducea.com
</span></span></code></pre></div><p>Regenerating and deploying the site:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">rake generate
</span></span><span class="line"><span class="cl">rake deploy
</span></span></code></pre></div><p>fixed the issue and the comments are now back on the site. (you can even try it out here on this post ;)</p>
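<p>To make the double-slash problem concrete, here is a minimal sketch (plain Python, not Octopress&rsquo;s actual code) of why the two identifiers differ:</p>

```python
# A trailing slash in the site url plus a leading slash in the permalink
# produces a double slash, and Disqus compares identifiers as raw strings,
# so the two URLs below count as different discussion threads.
site_url = "http://www.ducea.com/"   # trailing slash, as in the broken config
permalink = "/2012/11/12/disqus-comments-not-visible-in-octopress/"

broken = site_url + permalink                 # ...ducea.com//2012/...
fixed = site_url.rstrip("/") + permalink      # ...ducea.com/2012/...

# Same page, but two different identifiers as far as Disqus is concerned
print(broken == fixed)  # False
```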
<p>Hopefully this will help others in the same situation&hellip; if you also added an extra slash to the Octopress site url config and didn&rsquo;t realize it broke the Disqus comments.</p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/general" term="general" label="General" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/site" term="site" label="site" />
                             
                                <category scheme="https://www.ducea.com/tags/octopress" term="octopress" label="octopress" />
                             
                                <category scheme="https://www.ducea.com/tags/disqus" term="disqus" label="disqus" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[ChefConf 2012 - San Francisco]]></title>
            <link href="https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="related" type="text/html" title="First Chef Cookbook Contest Announced!" />
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
                <link href="https://www.ducea.com/2010/09/27/chef-intro-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Chef Intro @ SF Bay Area LSPE meetup" />
                <link href="https://www.ducea.com/2010/09/01/nodejs-chef-cookbook-released/?utm_source=atom_feed" rel="related" type="text/html" title="NodeJS chef cookbook released" />
                <link href="https://www.ducea.com/2011/12/03/getting-ready-for-lisa11-boston/?utm_source=atom_feed" rel="related" type="text/html" title="Getting ready for LISA11 - Boston" />
            
                <id>https://www.ducea.com/2012/05/18/chefconf-2012-san-francisco/</id>
            
            
            <published>2012-05-18T19:09:10+00:00</published>
            <updated>2012-05-18T19:09:10+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>This week <strong>Opscode</strong> hosted its inaugural user conference here in San Francisco, and it was an awesome event enjoyed by all Chef fans. Even though this was the first one (they are already planning future editions), it was by no means a small event, with more than <strong>400 people</strong> attending and the workshops that ran on Tuesday <em>sold out</em>.</p>
<p>Even though I did not attend any workshop (they came in 2 flavors, one targeted at a <em>sysadmin workflow</em> and one at <em>developers</em>), the general feeling from the people I talked with who attended was that it was a very good experience, with a lot of hands-on practical examples. On Tuesday afternoon I attended the &ldquo;<a href="http://chefconf2012.sched.org/event/bfe13edac99e2b4d8582f0cd1005ee73?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">ChefConf Pre-event Hackday: TEST ALL THE THINGS!!!</a>&rdquo; organized by <strong>Bryan Berry</strong>, and it was great; it showed how many people are interested in testing their infrastructure as code. It focused on cookbook testing (unit and integration testing), continuous integration with Jenkins, and other things like that ;)</p>
<p>The first full day of <a href="http://chefconf.opscode.com/">ChefConf</a> was Wednesday. The conference was structured with main presentations during the mornings and breakout sessions in the afternoon (with 2 main tracks and also a vendor one). From the beginning you could tell that this would be a very well-run conference; even though it was the first one, people like Jesse Robbins have a lot of experience running such events. Not surprisingly, <strong>ChefConf</strong> kicked off with <strong>Adam Jacob</strong>&rsquo;s &ldquo;<a href="http://chefconf2012.sched.org/event/5c0fac5d1c23207c0f8516bcf84045b9?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">State of the Union Part 1: Chef, Past and Present</a>&rdquo; (<a href="http://www.youtube.com/watch?v=bAWjqE5FCxI&amp;feature=plcp">video</a>). <strong>Jesse Robbins</strong> talked about the <strong>community</strong> around Chef, how it is a key part of Opscode&rsquo;s strategy, and their efforts to take it to the next level. He showed a very nice <a href="http://www.youtube.com/watch?v=ZIlWCE4FCqw&amp;feature=plcp">visualization</a> of the commits to the chef github repo.</p>
<p>There were many interesting talks during the day; most of them were recorded and will hopefully be made available <a href="http://www.youtube.com/user/Opscode/videos">online</a> soon, so you can watch them if you didn&rsquo;t have the chance to be here (or want to review them again). I particularly enjoyed:</p>
<ul>
<li>
<p><strong>Ron Vidal</strong> - <a href="http://chefconf2012.sched.org/event/b5e9e27b171a970572ddd93fb5eb44fe?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">Operations Secret Sauce: Incident Management</a> (<a href="http://www.youtube.com/watch?v=4d38Ena1Abo&amp;feature=plcp">video</a>); similar to Jesse Robbins&rsquo;s <strong><a href="https://www.usenix.org/conference/lisa11/gameday-creating-resiliency-through-destruction">GameDay</a></strong> talk, and a very nice addition: inspirational and full of interesting points.</p>
</li>
<li>
<p><strong>Jim Hopp</strong>&rsquo;s - <a href="http://chefconf2012.sched.org/event/b2b1a41277c11c865d55b733b4814c1a?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">Test-driven Development for Chef Practitioners</a> (<a href="http://www.youtube.com/watch?v=o2e0aZUAVGw&amp;feature=plcp">video</a>); very well prepared and presented. I hope to bring Jim to our <a href="http://www.meetup.com/The-Bay-Area-Chef-User-Group/">Chef Bay Area meetup</a> group to present something similar on the subject and run a testing hackathon.</p>
</li>
<li>
<p><strong>Patrick McDonnell</strong>&rsquo;s - <a href="http://chefconf2012.sched.org/event/2cef7591ae7f08836ff89f4dc223280e?iframe=yes&amp;w=700&amp;sidebar=no&amp;bg=no#?iframe=yes&amp;w=700&amp;sidebar=no&amp;bg=no#sched-body-outer">Lessons from Etsy: Avoiding Kitchen Nightmares</a>; people seem to love everything Etsy is doing, and they share a lot of their Chef workflow and open source various tools they write.</p>
</li>
<li>
<p>and many others…</p>
</li>
</ul>
<p>In the evening we had a great <strong>Ignite</strong> event run by <strong>Andrew Shafer</strong> in his inimitable way. We had 10 Ignite speakers, and in the middle there was a fun <em>karaoke ignite</em> that had 10 volunteers ramble over slides they had never seen before. If they recorded this and put it online, look up the ones by <strong>Stephen Nelson-Smith</strong> and <strong>John Vincent</strong>, as they were very entertaining.</p>
<p>The second day of the conference started with <strong>Christopher Brown</strong>&rsquo;s &ldquo;<a href="http://chefconf2012.sched.org/event/d60e4ab3b1f2fba2f996214792a649c1?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">State of the Union Part 2: Chef, the Future</a>&rdquo;, where he outlined Opscode&rsquo;s upcoming features and main areas of focus for Chef: becoming easier to install and use (the <em>omnibus</em> installer), becoming <em>enterprise</em> ready, a focus on <em>Windows</em>, and a strong focus on <em>quality</em>. Opscode is working on a project called <em>kitchen chef</em> that will allow testing cookbook functionality across various environments and platforms, quickly ensuring that cookbook quality is maintained across iterations. A lot of work has also been put into <em>reporting</em> and handlers. The server side has been completely rewritten in <em>Erlang</em> and <em>SQL</em> (from Ruby and CouchDB), and we should see this soon in both the open-source and Private Chef servers. You can easily tell that a lot of effort has gone into Private Chef, and it is quickly becoming an important asset for Opscode going forward.</p>
<p>There were many great talks during the day from speakers like Artur Bergman, Ben Rockwood, Jason Stowe, John Esser, Rob Hirschfeld, Theo Schlossnagle, etc. I finished my day just as I had started Tuesday, with another event focused on testing: &ldquo;<a href="http://chefconf2012.sched.org/event/1b5feddb619c6c09bd28f19d95a9c8be?iframe=no&amp;w=700&amp;sidebar=no&amp;bg=no">Test Driven Development Roundtable</a>&rdquo;, run by <strong>Stephen Nelson-Smith</strong> with a panel of <strong>Seth Chisamore</strong>, <strong>Jim Hopp</strong> and my friend <strong>Rob Berger</strong>. They went over the tools people are using these days and what is still missing and needs work on the testing front.</p>
<p>Overall, I think this was an <strong>awesome</strong> event and I hope to be able to attend the next one also (hopefully at the same place). My impression is that Opscode is ready to move forward and make the next step and grow the community even bigger: <strong>&ldquo;The revolution will not be televised - it will be coded with chef&rdquo;.</strong></p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/conferences" term="conferences" label="Conferences" />
                             
                                <category scheme="https://www.ducea.com/categories/configuration-management" term="configuration-management" label="Configuration management" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/chefconf" term="chefconf" label="chefconf" />
                             
                                <category scheme="https://www.ducea.com/tags/conferences" term="conferences" label="Conferences" />
                             
                                <category scheme="https://www.ducea.com/tags/devops" term="devops" label="devops" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[HowTo completely remove a file from Git history]]></title>
            <link href="https://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2010/08/17/background-a-running-process/?utm_source=atom_feed" rel="related" type="text/html" title="Background a running process" />
                <link href="https://www.ducea.com/2009/06/05/linux-tips-get-the-list-of-subdirectories-with-their-owner-permissions-and-full-paths/?utm_source=atom_feed" rel="related" type="text/html" title="Linux Tips: get the list of subdirectories with their owner &amp; permissions and full paths" />
                <link href="https://www.ducea.com/2009/03/05/bash-tips-if-e-wildcard-file-check-too-many-arguments/?utm_source=atom_feed" rel="related" type="text/html" title="Bash tips: if -e wildcard file check =&gt; [: too many arguments" />
                <link href="https://www.ducea.com/2009/01/19/running-multiple-instances-of-mysql-on-the-same-machine/?utm_source=atom_feed" rel="related" type="text/html" title="Running multiple instances of MySQL on the same machine" />
                <link href="https://www.ducea.com/2008/11/12/using-the-character-in-crontab-entries/?utm_source=atom_feed" rel="related" type="text/html" title="Using the % character in crontab entries" />
            
                <id>https://www.ducea.com/2012/02/07/howto-completely-remove-a-file-from-git-history/</id>
            
            
            <published>2012-02-07T11:40:06+00:00</published>
            <updated>2012-02-07T11:40:06+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>I just started working on a new project, and as you would expect, one of the first things I did was clone its git repository from <strong>github</strong>. The repo held just a few scripts and should have been tiny (~5M), but the clone from github took about an hour because the full repo folder was 1.5G… (with the biggest part under <strong>.git/objects/pack</strong>) Crazy… <em>What was in the git repository history that would cause something like this?</em> I assumed that at some point the repository had been much bigger (probably because of some file or files that no longer exist), but how could I find out which files those were? And more importantly, how could I remove them from history? If you came here from a <em>google search</em> on &ldquo;how to remove a file from git history&rdquo; you probably know there are plenty of docs and howtos on achieving this, but in my experience none of them really worked. That is why I decided to document the steps needed to identify the file in the git repo history that is using all that space, remove it completely, and bring the repository back to a manageable size.</p>
<p>First we need to identify the file causing the problem; for this we verify all the packed objects and look for the biggest ones:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n | tail -5
</span></span></code></pre></div><p>(and note the object IDs of the biggest blobs, in the first column). Then find the file name behind each of those objects:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">git rev-list --objects --all | grep &lt;object_id&gt;
</span></span></code></pre></div><p>Next, remove the file from all revisions:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">git filter-branch --index-filter &#39;git rm --cached --ignore-unmatch &lt;filename&gt;&#39;
</span></span><span class="line"><span class="cl">rm -rf .git/refs/original/
</span></span></code></pre></div><p>Edit .git/packed-refs and remove or comment out any external packed refs. Without this, the cleanup might not work; in my case I had refs/remotes/origin/master and some other branches.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vim .git/packed-refs
</span></span></code></pre></div><p>Finally, repack, clean up, and prune the now-unreachable objects:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">git reflog expire --all --expire-unreachable=0
</span></span><span class="line"><span class="cl">git repack -A -d
</span></span><span class="line"><span class="cl">git prune
</span></span></code></pre></div><p>Hopefully these steps will help you completely remove those unwanted files from your git history. Let me know if you have any problems after following these simple steps.</p>
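<p>To see the identification step end to end, here is a self-contained sketch (the repository and file names below are made up for the demo) that builds a throwaway repo with one big file, packs it, and maps the largest packed object back to its file name using the same verify-pack and rev-list commands shown above:</p>

```shell
#!/bin/sh
# Demo: find the file name behind the largest packed object.
# All names here (big.bin, the temp dir) are hypothetical demo values.
set -e
dir=$(mktemp -d)
cd "$dir"
git init --quiet
git config user.email demo@example.com   # throwaway identity for the demo
git config user.name demo
dd if=/dev/zero of=big.bin bs=1024 count=512 2>/dev/null   # a 512K file
git add big.bin
git commit --quiet -m 'add big file'
git gc --quiet                           # pack loose objects so *.idx exists
# Largest packed object (field 3 of verify-pack -v is the object size):
biggest=$(git verify-pack -v .git/objects/pack/*.idx \
          | sort -k3 -n | tail -1 | awk '{print $1}')
# Map the object ID back to a path:
name=$(git rev-list --objects --all | grep "^$biggest" | awk '{print $2}')
echo "largest object is $name"
```

Running this should report the demo's big binary file as the largest object, which is exactly the lookup you would do on a real bloated repository.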
<p>Note: if you want to test these steps here is how to quickly create a test repo:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl"># Make a small repo
</span></span><span class="line"><span class="cl">mkdir test
</span></span><span class="line"><span class="cl">cd test
</span></span><span class="line"><span class="cl">git init
</span></span><span class="line"><span class="cl">echo hi &gt; there
</span></span><span class="line"><span class="cl">git add there
</span></span><span class="line"><span class="cl">git commit -m &#39;Small repo&#39;
</span></span><span class="line"><span class="cl"># Add a random 10M binary file
</span></span><span class="line"><span class="cl">dd if=/dev/urandom of=testme.txt count=10 bs=1M
</span></span><span class="line"><span class="cl">git add testme.txt
</span></span><span class="line"><span class="cl">git commit -m &#39;Add big binary file&#39;
</span></span><span class="line"><span class="cl"># Remove the 10M binary file
</span></span><span class="line"><span class="cl">git rm testme.txt
</span></span><span class="line"><span class="cl">git commit -m &#39;Remove big binary file&#39;
</span></span></code></pre></div>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/linux" term="linux" label="Linux" />
                             
                                <category scheme="https://www.ducea.com/categories/tips-tricks" term="tips-tricks" label="Tips &amp; Tricks" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/git" term="git" label="git" />
                             
                                <category scheme="https://www.ducea.com/tags/howto" term="howto" label="howto" />
                             
                                <category scheme="https://www.ducea.com/tags/tips" term="tips" label="tips" />
                             
                                <category scheme="https://www.ducea.com/tags/tips-tricks" term="tips-tricks" label="Tips &amp; Tricks" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Getting ready for LISA11 - Boston]]></title>
            <link href="https://www.ducea.com/2011/12/03/getting-ready-for-lisa11-boston/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2011/11/30/interview-with-lisa11-program-co-chairs-tom-limoncelli-and-doug-hughes/?utm_source=atom_feed" rel="related" type="text/html" title="Interview with LISA11 Program Co-Chairs: Tom Limoncelli and Doug Hughes" />
                <link href="https://www.ducea.com/2010/11/08/ill-be-at-lisa10-this-week/?utm_source=atom_feed" rel="related" type="text/html" title="I&#39;ll be at LISA10 this week" />
                <link href="https://www.ducea.com/2010/10/04/interview-with-lisa10-program-chair-rudi-van-drunen/?utm_source=atom_feed" rel="related" type="text/html" title="Interview with LISA&#39;10 Program Chair Rudi van Drunen" />
                <link href="https://www.ducea.com/2010/06/30/lisa-2010-blogging-team-announced/?utm_source=atom_feed" rel="related" type="text/html" title="LISA 2010 Blogging Team Announced" />
                <link href="https://www.ducea.com/2011/06/20/velocity-2011-impressions/?utm_source=atom_feed" rel="related" type="text/html" title="Velocity 2011 impressions" />
            
                <id>https://www.ducea.com/2011/12/03/getting-ready-for-lisa11-boston/</id>
            
            
            <published>2011-12-03T22:53:45+00:00</published>
            <updated>2011-12-03T22:53:45+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>I&rsquo;m packing for <strong>Boston</strong> and will be there next week for <a href="http://www.usenix.org/events/lisa11/index.html">LISA11</a>. This will be my second year as part of the <a href="http://blogs.usenix.org/2011/12/02/lisa11-next-week-meet-your-blog-team/">LISA blogging team</a>, and after how much I enjoyed LISA last year in San Jose I wouldn&rsquo;t miss this one, even if it is on the other side of the country. I&rsquo;ve tried to finish as much work as possible so I can focus on the conference ;) but for various reasons, of course, this was not quite possible, and during the first days I will even be on call&hellip; In any case, I&rsquo;m sure this is going to be a great week full of awesomeness. I will be blogging for the <a href="http://blogs.usenix.org/">USENIX blog</a> every day, so be sure to follow it for fresh articles from me and the other members of our team (Ben, Rikki and Matt).</p>
<p>If you are going to <strong>LISA11</strong> in <strong>Boston</strong> next week, we should definitely meetup. Contact me on <a href="http://twitter.com/mariusducea">twitter</a> or <a href="http://www.ducea.com/contact/">email</a>.</p>
<hr>
<p><a href="https://www.usenix.org/blog/limoncelli-test">The Limoncelli Test</a> was a very interesting presentation by Tom Limoncelli based on a <a href="http://everythingsysadmin.com/the-test.html">blog post</a> he wrote earlier this year. If you haven&rsquo;t done so already, I strongly recommend taking the test to see how your sysadmin team ranks on <a href="http://goto.tomontime.com/test">&ldquo;The Limoncelli Test&rdquo;</a>.</p>
<p><a href="https://www.usenix.org/blog/recovering-linux-hard-drive-disasters">Recovering From Linux Hard Drive Disasters</a> is Theodore Ts&rsquo;o&rsquo;s signature training material on what to do if you have any sort of hard drive failure; it covers in depth how to recover from disasters caused by software or hardware failures.</p>
<p><a href="https://www.usenix.org/blog/gameday-creating-resiliency-through-destruction">GameDay: Creating Resiliency Through Destruction</a> (<a href="http://www.slideshare.net/jesserobbins/ameday-creating-resiliency-through-destruction">slides</a>): I very much enjoyed Jesse Robbins&rsquo;s presentation, in which he draws parallels between two of his greatest passions: firefighting and operations. Watch the <a href="http://www.youtube.com/watch?v=zoz0ZjfrQ9s">video</a>.</p>
<p><a href="https://www.usenix.org/blog/sregoogle-thousands-devops-2004">SRE@Google: Thousands of DevOps Since 2004</a>: Tom Limoncelli describes the technologies and policies that Google uses to do what is (now) called DevOps. Watch the <a href="http://www.youtube.com/watch?v=iIuTnhdTzK0&amp;feature=youtube_gdata">video</a>.</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/conferences" term="conferences" label="Conferences" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/conferences" term="conferences" label="Conferences" />
                             
                                <category scheme="https://www.ducea.com/tags/lisa" term="lisa" label="LISA" />
                             
                                <category scheme="https://www.ducea.com/tags/lisa11" term="lisa11" label="LISA11" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Interview with LISA11 Program Co-Chairs: Tom Limoncelli and Doug Hughes]]></title>
            <link href="https://www.ducea.com/2011/11/30/interview-with-lisa11-program-co-chairs-tom-limoncelli-and-doug-hughes/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2010/10/04/interview-with-lisa10-program-chair-rudi-van-drunen/?utm_source=atom_feed" rel="related" type="text/html" title="Interview with LISA&#39;10 Program Chair Rudi van Drunen" />
                <link href="https://www.ducea.com/2010/06/30/lisa-2010-blogging-team-announced/?utm_source=atom_feed" rel="related" type="text/html" title="LISA 2010 Blogging Team Announced" />
                <link href="https://www.ducea.com/2010/11/08/ill-be-at-lisa10-this-week/?utm_source=atom_feed" rel="related" type="text/html" title="I&#39;ll be at LISA10 this week" />
                <link href="https://www.ducea.com/2011/06/20/velocity-2011-impressions/?utm_source=atom_feed" rel="related" type="text/html" title="Velocity 2011 impressions" />
                <link href="https://www.ducea.com/2011/03/06/drupalcon-chicago-2011/?utm_source=atom_feed" rel="related" type="text/html" title="DrupalCon Chicago 2011" />
            
                <id>https://www.ducea.com/2011/11/30/interview-with-lisa11-program-co-chairs-tom-limoncelli-and-doug-hughes/</id>
            
            
            <published>2011-11-30T22:31:35+00:00</published>
            <updated>2011-11-30T22:31:35+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>One of the advantages of being a member of the <strong>LISA11 Blog Team</strong> is that I was able to talk to and interview this year&rsquo;s program co-chairs, <strong>Tom Limoncelli</strong> and <strong>Doug Hughes</strong>. This was a great honor for me, especially since I&rsquo;ve been a big fan of Tom&rsquo;s work for many years. The full article is available on the USENIX blog: &ldquo;<a href="http://blogs.usenix.org/2011/12/02/tom-limoncelli-and-doug-hughes-interview/">Tom Limoncelli and Doug Hughes Interview</a>&rdquo;</p>
<p>My colleagues from the LISA11 blogging team (Ben, Rikki and Matt) have also done some very interesting interviews with key people from LISA11 to get you ready for the event. Check out the <a href="http://blogs.usenix.org/">USENIX blog</a> for more from us over the next week.</p>
<p>Here is also a quick intro of our team: &ldquo;<a href="http://blogs.usenix.org/2011/12/02/lisa11-next-week-meet-your-blog-team/">LISA11 Next Week – Meet your blog team!</a>&rdquo;</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/conferences" term="conferences" label="Conferences" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/conferences" term="conferences" label="Conferences" />
                             
                                <category scheme="https://www.ducea.com/tags/interviews" term="interviews" label="interviews" />
                             
                                <category scheme="https://www.ducea.com/tags/lisa" term="lisa" label="LISA" />
                             
                                <category scheme="https://www.ducea.com/tags/lisa11" term="lisa11" label="LISA11" />
                             
                                <category scheme="https://www.ducea.com/tags/usenix" term="usenix" label="USENIX" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Build your own packages easily with FPM]]></title>
            <link href="https://www.ducea.com/2011/08/31/build-your-own-packages-easily-with-fpm/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2009/02/12/bcfg2-096-debian-package-for-etch/?utm_source=atom_feed" rel="related" type="text/html" title="Bcfg2 0.9.6 debian package for etch" />
                <link href="https://www.ducea.com/2007/01/21/debian-snapshot-archive/?utm_source=atom_feed" rel="related" type="text/html" title="Debian Snapshot Archive" />
                <link href="https://www.ducea.com/2009/07/31/debian-adopts-time-based-release-freezes/?utm_source=atom_feed" rel="related" type="text/html" title="Debian adopts time-based release freezes" />
                <link href="https://www.ducea.com/2009/03/10/iotop-simple-top-like-io-monitor/?utm_source=atom_feed" rel="related" type="text/html" title="iotop: simple top-like i/o monitor" />
                <link href="https://www.ducea.com/2009/03/09/iopp-howto-get-io-information-per-process/?utm_source=atom_feed" rel="related" type="text/html" title="iopp: howto get i/o information per process" />
            
                <id>https://www.ducea.com/2011/08/31/build-your-own-packages-easily-with-fpm/</id>
            
            
            <published>2011-08-31T15:13:02+00:00</published>
            <updated>2011-08-31T15:13:02+00:00</updated>
            
            
            <content type="html"><![CDATA[<p><strong>Building packages</strong> is a task that every system administrator ends up doing. Most of the time it is not a very interesting task, but someone has to do it, right? Normally you will end up modifying and tweaking an existing package, built by the maintainers of the Linux distribution you are using, to fit your own needs. In time you might even become familiar with the packaging system you are using (rpm, deb, etc.) and be able to write a spec file from scratch and build a new package when you need one. Still, this <em>process is complicated and requires a lot of work</em>.</p>
<p>Luckily, <a href="http://www.semicomplete.com/blog"><strong>Jordan Sissel</strong></a> has built a tool called <a href="https://github.com/jordansissel/fpm"><strong>FPM</strong></a> (Effing Package Management) for exactly this: easing the pain of building packages for your own infrastructure, customized to your own needs, without having to care about upstream rules, standards, and other limitations. This can be very useful for people deploying their own applications as rpms (or debs) and can simplify much of the process of building those packages.</p>
<p>FPM can be easily installed on your build system using rubygems:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install fpm
</span></span></code></pre></div><p>Once installed you can use fpm to build <strong>packages</strong> (targets):</p>
<ul>
<li>deb</li>
<li>rpm</li>
<li>solaris</li>
</ul>
<p>from any of the following <strong>sources</strong>:</p>
<ul>
<li>directory (of compiled source of some application)</li>
<li>gem</li>
<li>python eggs</li>
<li>rpm</li>
<li>node npm packages</li>
</ul>
<p>Use the command line help (<code>fpm --help</code>) or the <a href="https://github.com/jordansissel/fpm/wiki">wiki</a> to see full details on how to use it. I&rsquo;ll show some simple examples on how to build some packages from various input sources that I&rsquo;ve found useful myself.</p>
<h3 id="1-package-a-directory---output-of-a-make-install-command">1. Package a directory - output of a &lsquo;make install&rsquo; command</h3>
<p>This is how you would usually package an application that you would install with:<br>
<code>./configure; make; make install</code><br>
For example, here is how you can create an rpm of the latest version of memcached:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">wget http://memcached.googlecode.com/files/memcached-1.4.7.tar.gz
</span></span><span class="line"><span class="cl">tar -zxvf memcached-1.4.7.tar.gz
</span></span><span class="line"><span class="cl">cd memcached-1.4.7
</span></span><span class="line"><span class="cl">./configure --prefix=/usr
</span></span><span class="line"><span class="cl">make
</span></span></code></pre></div><p>So far everything looks like a normal manual installation (which would be followed by make install). Instead, we now install into a separate folder so we can capture the output:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">mkdir /tmp/installdir
</span></span><span class="line"><span class="cl">make install DESTDIR=/tmp/installdir
</span></span></code></pre></div><p>and finally using fpm to create the rpm package:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">fpm -s dir -t rpm -n memcached -v 1.4.7 -C /tmp/installdir
</span></span></code></pre></div><p>where <strong>-s</strong> is the input source type (directory), <strong>-t</strong> is the type of package (rpm), <strong>-n</strong> is the name of the package, <strong>-v</strong> is the version, and <strong>-C</strong> is the directory where fpm will look for the files.
Note: you might need to install various libraries to build your package; for example, in this case I had to install libevent-dev.</p>
<p>If you are packaging your own application, you can do this just by pointing fpm at your build folder and setting the version of the app. Here is an example for a deb package:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">fpm -s dir -t deb -n myapp -v 0.0.1 -C /build/myapp/0.0.1/
</span></span></code></pre></div><p>There are various other parameters you can use, but this is basically how simple it is to build a package from a directory.
Here is an example of how to define dependencies for the package you are building (using <strong>-d</strong>; repeat it as many times as needed):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">fpm -s dir -t deb -n memcached -v 1.4.7 -C /tmp/installdir \
</span></span><span class="line"><span class="cl">-d &#34;libstdc++6 (&gt;= 4.4.5)&#34; \
</span></span><span class="line"><span class="cl">-d &#34;libevent-1.4-2 (&gt;= 1.4.13)&#34;
</span></span></code></pre></div><h3 id="2-ruby-gems-or-python-egg---converted-to-packages">2. Ruby gems or python egg - converted to packages</h3>
<p>You can create a deb or rpm from a gem very simply with fpm:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">fpm -s gem -t deb &lt;gem_name&gt;
</span></span></code></pre></div><p>This will download the gem and create a package named rubygem-&lt;gem_name&gt;.
For example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">fpm -s gem -t deb fpm
</span></span></code></pre></div><p>will create a debian package for fpm: rubygem-fpm_0.3.7_all.deb</p>
<p>You can inspect it with <em>dpkg --info</em> and notice that in this case it nicely fills in all the fields: the maintainer, dependencies on various other gems, etc. Very cool.</p>
<p>If you use python and want to package python eggs, this works exactly the same way with -s python (fpm will download the python packages with easy_install first).</p>
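<p>As a sketch, the python conversion mirrors the gem example above. The package name below (&lsquo;requests&rsquo;) is an arbitrary example, not from the post, and fpm itself must be installed to actually run the conversion, so the command is only printed here:</p>

```shell
# Hypothetical sketch: convert a python package to a deb, mirroring the
# gem example. 'requests' is an arbitrary example name; fpm must be
# installed to actually run the command, so we only print it here.
pkg=requests
cmd="fpm -s python -t deb $pkg"
echo "$cmd"
# fpm would fetch the package with easy_install and emit something like
# python-requests_<version>_all.deb, with the metadata filled in.
```

Remove the `echo` wrapper (or pipe the output to `sh`) on a machine where fpm is installed to produce the package for real.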
<p>Overall <strong>FPM</strong> is a great tool that can help you <em>simplify the way you build your own packages</em>. Check it out and let me know what you think; and if you found it useful, don&rsquo;t forget to thank <a href="http://www.twitter.com/jordansissel"><strong>Jordan</strong></a> for his great work on this awesome tool.</p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/centos" term="centos" label="Centos" />
                             
                                <category scheme="https://www.ducea.com/categories/debian" term="debian" label="Debian" />
                             
                                <category scheme="https://www.ducea.com/categories/fedora" term="fedora" label="Fedora" />
                             
                                <category scheme="https://www.ducea.com/categories/linux" term="linux" label="Linux" />
                             
                                <category scheme="https://www.ducea.com/categories/rhel" term="rhel" label="RHEL" />
                             
                                <category scheme="https://www.ducea.com/categories/tools" term="tools" label="Tools" />
                             
                                <category scheme="https://www.ducea.com/categories/ubuntu" term="ubuntu" label="Ubuntu" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/deb" term="deb" label="deb" />
                             
                                <category scheme="https://www.ducea.com/tags/debian_packages" term="debian_packages" label="debian_packages" />
                             
                                <category scheme="https://www.ducea.com/tags/fpm" term="fpm" label="FPM" />
                             
                                <category scheme="https://www.ducea.com/tags/rpm" term="rpm" label="rpm" />
                             
                                <category scheme="https://www.ducea.com/tags/tools" term="tools" label="Tools" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[First Chef Cookbook Contest Announced!]]></title>
            <link href="https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
                <link href="https://www.ducea.com/2010/09/27/chef-intro-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Chef Intro @ SF Bay Area LSPE meetup" />
                <link href="https://www.ducea.com/2010/09/01/nodejs-chef-cookbook-released/?utm_source=atom_feed" rel="related" type="text/html" title="NodeJS chef cookbook released" />
                <link href="https://www.ducea.com/2011/08/15/building-vagrant-boxes-with-veewee/?utm_source=atom_feed" rel="related" type="text/html" title="Building Vagrant boxes with veewee" />
                <link href="https://www.ducea.com/2009/06/30/osbridge-configuration-management-panel/?utm_source=atom_feed" rel="related" type="text/html" title="OSBridge: Configuration Management Panel" />
            
                <id>https://www.ducea.com/2011/08/23/first-chef-cookbook-contest-announced/</id>
            
            
            <published>2011-08-23T11:18:14+00:00</published>
            <updated>2011-08-23T11:18:14+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>Yesterday <a href="http://www.opscode.com/"><strong>Opscode</strong></a>, the company behind <a href="http://www.opscode.com/chef/"><strong>Chef</strong></a>, <a href="http://www.opscode.com/blog/2011/08/22/cookbook-contest/">announced</a> the first ever <strong>chef cookbook contest</strong>. In order to participate in the contest you will need to write a new cookbook and submit it by the <em>end of September;</em> this is going to be a little tricky as there are many cookbooks already available on the <a href="http://community.opscode.com/cookbooks">community site</a>. So this is a great idea and it will take care of the few applications that don&rsquo;t already have chef cookbooks. The cookbooks which shows off the awesome Chef features will have better chances to win. The prizes are also interesting: iPad, gift cards, etc. Here are the full details and rules of the contest: <a href="http://www.opscode.com/blog/2011/08/22/cookbook-contest/">http://www.opscode.com/blog/2011/08/22/cookbook-contest/</a></p>
<p>So if you have an <strong>idea</strong> for a chef cookbook, <strong>now</strong> is the time to start working on it. I&rsquo;m offering my <strong>help for free</strong> to all my blog readers: I will help you write a cookbook by implementing your ideas, reviewing it, suggesting improvements, or whatever else you might need help with. Use the <a href="http://www.ducea.com/contact/">contact</a> form to email me (or DM me on <a href="http://twitter.com/mariusducea">twitter</a>) and let me know how I can help.</p>
<p>If you don&rsquo;t have time to write a new cookbook but you have a great idea for one that is missing from the opscode community site, please post it below in the <strong>comments section</strong>, and I&rsquo;m sure some of my blog readers will help create it.</p>
<p>Again, this is a brilliant idea from Opscode, and it creates a win-win situation for everyone. I&rsquo;m just curious: is this the first idea from their new community manager? If so, great job Jesse ;).</p>
]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/configuration-management" term="configuration-management" label="Configuration management" />
                             
                                <category scheme="https://www.ducea.com/categories/news-from-outside" term="news-from-outside" label="News from Outside" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/cookbooks" term="cookbooks" label="cookbooks" />
                             
                                <category scheme="https://www.ducea.com/tags/opschef" term="opschef" label="opschef" />
                             
                                <category scheme="https://www.ducea.com/tags/opscode" term="opscode" label="opscode" />
                            
                        
                    
                
            
        </entry>
    
        
        <entry>
            <title type="html"><![CDATA[Building Vagrant boxes with veewee]]></title>
            <link href="https://www.ducea.com/2011/08/15/building-vagrant-boxes-with-veewee/?utm_source=atom_feed" rel="alternate" type="text/html" />
            
                <link href="https://www.ducea.com/2009/06/30/osbridge-configuration-management-panel/?utm_source=atom_feed" rel="related" type="text/html" title="OSBridge: Configuration Management Panel" />
                <link href="https://www.ducea.com/2011/07/01/howto-upgrade-chef-from-0-10-to-0-10-2-rubygems-install/?utm_source=atom_feed" rel="related" type="text/html" title="HowTo upgrade Chef from 0.10 to 0.10.2 - rubygems install" />
                <link href="https://www.ducea.com/2010/09/27/chef-intro-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Chef Intro @ SF Bay Area LSPE meetup" />
                <link href="https://www.ducea.com/2010/09/01/nodejs-chef-cookbook-released/?utm_source=atom_feed" rel="related" type="text/html" title="NodeJS chef cookbook released" />
                <link href="https://www.ducea.com/2011/07/22/monitoring-with-icinga-sf-bay-area-lspe-meetup/?utm_source=atom_feed" rel="related" type="text/html" title="Monitoring with Icinga @ SF Bay Area LSPE meetup" />
            
                <id>https://www.ducea.com/2011/08/15/building-vagrant-boxes-with-veewee/</id>
            
            
            <published>2011-08-15T18:49:23+00:00</published>
            <updated>2011-08-15T18:49:23+00:00</updated>
            
            
            <content type="html"><![CDATA[<p>If you used <a href="http://vagrantup.com/"><strong>vagrant</strong></a> (great tool, right?) you have probably downloaded a basebox from some remote location to get you started. This is a great quick start, and there are many good boxes out there that you can use; <a href="http://www.vagrantbox.es/">vagrantbox.es</a> does a great job in listing various public vagrant boxes. But if you are like me, you probably will want to customize the boxes you are using; you might want to install them from scratch based on your own little/or/big customizations. Well if you are like that, then you will be happy to hear that <a href="http://www.jedi.be/blog"><strong>Patrick Debois</strong></a> had exactly the same problem when he decided to write <a href="https://github.com/jedi4ever/veewee"><strong>veewee</strong></a>. And veewee is exactly that missing part of vagrant that allows you to easily build your own vagrant boxes from scratch.</p>
<p>So let&rsquo;s see how we can use veewee. I&rsquo;m assuming you already have vagrant (and <a href="http://download.virtualbox.org/virtualbox/">virtualbox</a>) installed; if you don&rsquo;t, please install them first. To install <strong>veewee</strong> we just have to install the veewee gem:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">gem install veewee
</span></span></code></pre></div><p>Once veewee is installed you will see a new task added to vagrant: <strong>basebox</strong>.</p>
<p>Here is the list of the <strong>templates</strong> we get out of the box once we install veewee:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">**vagrant basebox templates**
</span></span><span class="line"><span class="cl">The following templates are available:
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;archlinux-i686&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;CentOS-4.8-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;CentOS-5.6-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;CentOS-5.6-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Debian-6.0.1a-amd64-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Debian-6.0.1a-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Fedora-14-amd64&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Fedora-14-amd64-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Fedora-14-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Fedora-14-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;freebsd-8.2-experimental&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;freebsd-8.2-pcbsd-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;freebsd-8.2-pcbsd-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;gentoo-latest-i386-experimental&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;opensuse-11.4-i386-experimental&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;solaris-11-express-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;Sysrescuecd-2.0.0-experimental&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.04.2-amd64-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.04.2-server-amd64&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.04.2-server-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.04.2-server-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.10-server-amd64&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.10-server-amd64-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.10-server-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-10.10-server-i386-netboot&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-11.04-server-amd64&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;ubuntu-11.04-server-i386&#39;
</span></span><span class="line"><span class="cl">vagrant basebox define &#39;&#39; &#39;windows-2008R2-amd64-experimental&#39;
</span></span></code></pre></div><p>This means that we can build a box based on <strong>any</strong> of the above templates. <em>That&rsquo;s awesome!</em> Let&rsquo;s say we want to build a debian squeeze box using veewee; we would have to run:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant basebox define &#39;debian-60&#39; &#39;Debian-6.0.1a-amd64-netboot&#39;
</span></span></code></pre></div><p>This will create a folder <strong>definitions/debian-60</strong> with the following files (the content of the veewee template):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">definition.rb
</span></span><span class="line"><span class="cl">postinstall.sh
</span></span><span class="line"><span class="cl">preseed.cfg
</span></span></code></pre></div><p>We can modify/tune any of those files based on our custom needs. The file <strong>definition.rb</strong> is the main definition of the template; here you define the memory size, disk size, iso file, etc. The content is very easy to understand, and you would normally not have to change much here. <strong>preseed.cfg</strong> is just a standard preseed file where you customize the actual install process (you could change the partitions or their type, the timezone setup, etc.). And finally, <strong>postinstall.sh</strong> is a bash script that runs at the end of the installation process; it installs ruby, gems, chef, and puppet, as well as the virtualbox guest additions (needed for shared folders).</p>
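<p>To give a concrete feel for it, here is a rough sketch of what a <strong>definition.rb</strong> might contain. The keys and values below are illustrative only (written from memory of the veewee template format, not copied from an actual template), so check the generated file for the exact contents:</p>

```ruby
# Hypothetical sketch of a veewee definition.rb; the real generated
# file will differ -- these keys and values are illustrative only.
Veewee::Session.declare({
  :cpu_count    => '1',
  :memory_size  => '256',                             # RAM for the build VM, in MB
  :disk_size    => '10140',                           # disk size, in MB
  :disk_format  => 'VDI',
  :os_type_id   => 'Debian',
  :iso_file     => 'debian-6.0.1a-amd64-netinst.iso', # placed in (or downloaded to) iso/
  :ssh_user     => 'vagrant',
  :ssh_password => 'vagrant',
  :postinstall_files   => ['postinstall.sh'],         # run after the base install
  :postinstall_timeout => '10000'
})
```

<p>Tweaking the memory or disk values here is usually all you need for a slightly beefier box.</p>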
<p>If you already have the iso, place it in <strong>&lsquo;currentdir&rsquo;/iso</strong>; if not, veewee will download it and place it in the appropriate folder before starting the install process:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant basebox build &#39;debian-60&#39;
</span></span></code></pre></div><p>This will start the installation, and you can watch all the steps it takes (the keystrokes as they are entered, etc.). This can take a while&hellip; Once it is done, you can validate the build with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant basebox validate &#39;debian-60&#39;
</span></span></code></pre></div><p>(This will run a few basic tests: whether it can connect to the vm as user vagrant, whether chef and puppet were installed, whether the shared folders are accessible, etc.)</p>
<p>And finally you can export it as a vagrant box with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant basebox export &#39;debian-60&#39;
</span></span></code></pre></div><p>and add it to vagrant:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant box add &#39;debian-60&#39; debian-60.box
</span></span></code></pre></div><p>and now you can use it in vagrant with:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-fallback" data-lang="fallback"><span class="line"><span class="cl">vagrant init &#39;debian-60&#39;
</span></span></code></pre></div><p>That&rsquo;s it. Very simple, and now we have our own box built from scratch. As a side note, I found this very useful for testing and troubleshooting preseed configurations ;). As you can see there are plenty of templates available in veewee, but if you create a new one please consider sharing it with others and sending it to Patrick on <a href="https://github.com/jedi4ever/veewee">github</a>. I&rsquo;m sure he will be happy to include it in newer versions of veewee. And if you found this useful don&rsquo;t forget to thank <a href="https://twitter.com/#!/patrickdebois">Patrick</a> for his great work on this awesome tool.</p>]]></content>
            
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/categories/configuration-management" term="configuration-management" label="Configuration management" />
                             
                                <category scheme="https://www.ducea.com/categories/linux" term="linux" label="Linux" />
                             
                                <category scheme="https://www.ducea.com/categories/macosx" term="macosx" label="MacOSX" />
                             
                                <category scheme="https://www.ducea.com/categories/tools" term="tools" label="Tools" />
                            
                        
                    
                 
                    
                         
                        
                            
                             
                                <category scheme="https://www.ducea.com/tags/chef" term="chef" label="chef" />
                             
                                <category scheme="https://www.ducea.com/tags/puppet" term="puppet" label="puppet" />
                             
                                <category scheme="https://www.ducea.com/tags/vagrant" term="vagrant" label="vagrant" />
                             
                                <category scheme="https://www.ducea.com/tags/veewee" term="veewee" label="veewee" />
                             
                                <category scheme="https://www.ducea.com/tags/virtualbox" term="virtualbox" label="virtualbox" />
                            
                        
                    
                
            
        </entry>
    
</feed>
