Paul Lam. Founding engineer for data-driven enterprise startups.

Simplifying Step Functions and Stepwise: Lessons Learned and a New Approach

2023-05-11T00:00:00-04:00

At Motiva, we use AWS Step Functions to manage all of our workflows. To make this simpler, we developed Stepwise, an open-source Clojure library that helps us coordinate various tasks like business processes, data pipelines, and even our machine learning workflows. By using Step Functions, we can effortlessly handle complex event-driven processes and monitor our operations. Although we have been using Step Functions and Stepwise in production for a couple years, we identified some areas where we can improve the developer experience. This post will share some bottlenecks we discovered and propose a new and improved version of a Step Functions library, but with Clojure all the way.

What do we mean Clojure all the way? We have been designing a new interface to define Step Functions state machines, which we will illustrate with an example of a pizza-making workflow. To create this state machine on AWS Step Functions, we would use the following code:

(def pizza-making-state-machine
  (sfn/->> request
           (sfn/parallel (make-dough)
                         (make-sauce)
                         (sfn/map {:iterate-over :ingredients} prepare-ingredients))
           (put-ingredients-on-dough)
           (bake)
           (sfn/wait 2 :minutes)
           (sfn/choice (comp not is-pizza-acceptable?)
                       ;; branch off to this fn if condition is true
                       (sfn/fail))
           (serve)))

(sfn/ensure-state-machine client :pauls-pizza-making-machine pizza-making-state-machine)

This looks like normal Clojure code, but it is actually a workflow. This is because AWS Step Functions keeps the state of your workflow at all times. So, if your server, processes, or any of your workers go down during any workflow execution, your execution will continue where it left off once your system comes back online.

We can define the workers for each step of the workflow in Clojure as follows:

(defn make-dough
  ;; SFN error handling configuration as metadata
  {:retry [{:error-equals     [:sfn.timeout]
            :interval-seconds 60
            :max-attempts     3}]}
  [coll]
  coll)

(defn put-ingredients-on-dough [coll] coll)

(defn bake [coll] coll)

(defn make-sauce [coll] coll)

(sfn/def-choice-predicate-fn is-pizza-acceptable? [m]
  true)

These functions define the steps in the pizza-making workflow. We can choose where to run these workers at runtime using the following code:

(sfn/run-here client
              {make-dough               {:concurrency 2}
               put-ingredients-on-dough {:concurrency 1}
               bake                     {:concurrency 1}})


(sfn/run-on-lambda client
                   {make-sauce {:timeout         40
                                :memory-size     512
                                :max-concurrency 50}})

Since the workers are defined as Clojure functions, we can choose to run them in containers or serverless functions at runtime.

Say, what if make-dough turns out to be an infrequent but bursty process that would be more suitable to run on serverless. But make-sauce takes too long before Lambda times out. We can switch the two like so:

(sfn/run-here client
              {make-sauce               {:concurrency 5}
               put-ingredients-on-dough {:concurrency 1}
               bake                     {:concurrency 1}})


(sfn/run-on-lambda client
                   {make-dough {:timeout         30
                                :memory-size     512
                                :max-concurrency 100}})

At Motiva, we have state machines to manage email delivery, data integration, and machine learning decisions, among others. However, we only have four developers on our team, and we want to improve our speed in delivering quality products with operational excellence. As demand grows and customers ask for more, we're seeing that our speed to orchestrate new business workflows as state machines is crucial to our competitiveness.

To achieve this goal, we plan to simplify our development process by using only one tool, Clojure, instead of using multiple tools like Amazon States Language and Terraform. By doing so, we can focus on delivering value to our customers.

So, what do you think? We're still in the design phase of this new library. If this is something that's of interest to you, please get in touch with me.

Moving from Lambda to SageMaker for inference

2023-04-07T00:00:00-04:00

A few years ago, we moved our machine learning inference endpoints to serverless in order to reduce operations upkeep. At that time, AWS Lambda seemed like an obvious choice, as many of our Clojure services were already running on Lambda, and we had a deployment process in place. We used our continuous integration platform to package our applications, upload them to S3, and then deployed resources with Terraform.

However, we faced a new challenge as our machine learning models grew beyond the 250 MB Lambda artifact limit. To resolve this, we needed to update our deployment pipeline to either pack our models as container images on Lambda (which has a more generous 10 GB limit) or find a new solution.

Enter SageMaker Serverless Inference. It offers all the benefits of Lambda with additional conveniences tailored for ML model deployments. One drawback is cost: according to our calculations, 1000 requests on a 2GB memory instance each running for 1 second cost 3.3 cents on Lambda, but 4 cents on SageMaker Serverless Inference. It's a 21% increase, but for a small company like us, the absolute amount is negligible compared to developer time saved.

In this blog post, I'll guide you through the steps of creating and using a SageMaker Serverless Inference endpoint, which we found to be a seamless experience. We're continuously impressed with the ecosystem developing around the MLOps community.

Creating a SageMaker endpoint

We decided to try SageMaker Serverless Inference and re-deployed one of our smallest machine learning models on SageMaker using the instructions from this Hugging Face notebook.

create an IAM role with AmazonSageMakerFullAccess policy on your AWS account (ref). Let's call it my_sagemaker_execution_role in this example.
create a S3 bucket for SageMaker to upload data, models, and logs:

import sagemaker

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

ensure execution IAM role and default S3 bucket:

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="my_sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

create a HuggingFaceModel instance by passing in env as the environment variables. It is important to note that we need to set the optional HF_API_TOKEN value because our model is private on Hugging Face. Hence, we must pass an API token to the environment for the container to successfully pull our private model.

env = {
    "HF_MODEL_ID": "MoritzLaurer/multilingual-MiniLMv2-L6-mnli-xnli",
    "HF_TASK": "zero-shot-classification",
    # "HF_API_TOKEN": HF_API_TOKEN,
}

huggingface_model = HuggingFaceModel(
    env=env,                      # configuration for loading model from Hub
    role=role,                    # iam role with permissions to create an Endpoint
    transformers_version="4.26",  # transformers version used
    pytorch_version="1.13",       # pytorch version used
    py_version="py39",            # python version used
)

deploy!

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
)

classifier = huggingface_model.deploy(
    serverless_inference_config=serverless_config, endpoint_name="my_demo_sagemaker_endpoint"
)

We really liked the fact that this can all be done in the same Python environment as our model development. This saves us from context switching between Python, Docker, and Terraform. However, we are unsure about using the SageMaker Hugging Face Inference Toolkit, which hides some of the nice features of SageMaker endpoints, such as support for production variants to perform A/B testing on models.

Requesting an inference

Wait for the SageMaker endpoint to be up and running. It takes a few minutes. After this, we can make a request. If you followed through further from the upstream instructions linked above, you can make a request immediately using the object returned from HuggingFaceModel.deploy(). However, in most cases, requests are made from a different process and at a later time. Therefore, we will not have access to the deploy() object at that point. Fortunately, this is easily achievable.

We have a couple of options. We can use the HTTP endpoint directly or leverage SageMakerRuntime. We opted to use SageMakerRuntime because we're already in Python and do not want to go through the process of writing an AWS authentication header for a HTTP request.

import boto3

client = boto3.client("sagemaker-runtime")

# request body for a zero-shot classifier
body = {
    "inputs": "A new model offers an explanation for how the Galilean satellites formed around the solar system’s largest world.",
    "parameters": {
        "candidate_labels": ["space", "microbiology", "robots", "archeology"],
    },
}

response = client.invoke_endpoint(
    EndpointName="my_demo_sagemaker_endpoint", # replace with your endpoint name
    ContentType="application/json",
    Body=json.dumps(body),
)

print(json.loads(response["Body"].read()))

This example request body schema is for a zero-shot classifier. Request bodies for other transformers.pipeline tasks are shown in this Hugging Face notebook.

Lastly, note that SageMaker Serverless endpoint has a concurrent invocation limit of 200 invokes per endpoint, whereas Lambda can handle tens of thousands.

Paul as a manager

2022-01-18T00:00:00-05:00

This is a live cheatsheet to give you an idea of what I offer and what I value as an engineering manager. I'll refine this post over time as I continue to grow.

What you can expect from me

My job as a manager is to enable you to do your best work so that we can grow the company and your career together. That means for me to understand your aspirations and to maximize your personal goals with that of the team's. I will support you by doing the chores that you don't want to do and push you hard on the work that you want to do.

There are many ways to getting things done. I don't expect you to adopt to my way but I do expect mutual understanding of how we each prefer to work. In particular, I push the team to be conscious of how they spend their time by focusing on what matters to the product and our customers. On any given week, I typically spend my time as follows:

40% on supporting the team. This could be anything from collaborating on technical designs, code reviews, completing unwanted tickets, or just getting to know you.
25% on development and operations. I pick up any gaps to keep our systems running smoothly.
25% on engineering management. I regularly consider these 3 aspects: organization, project, and technical. For organization management, that means working with the product team to ensure that our engineering strategy has alignment within the organization. For project management, ensuring that we are achieving our sprint goals. For technical management, keeping an eye on our technical debt and prioritizing longer term behind-the-scene work.
10% on personal growth. I am focusing on growing the skills that I need to be successful as a team lead and improving as a human being.

To enable product and customer mindfulness, it is my job to articulate the bigger picture to you so that you have the context to fully understand the problems that you're addressing. For example, I often like to collaborate on a one-pager problem statement to start a project before doing anything else.

I am tremendously focused and result-driven. I have an affinity for pragmatic, minimal effort to get just enough of a job done. I sometimes struggle when working with others who do not share these traits. Some things that are helpful for you to know about me:

I have low ego. I don't know what I'm doing more often than I realize. Sometimes I may inadvertently inconvenience you. When that happens, please accept my apology and help me improve.
I am driven by impact. Whether it is impact to the team, to customers, to the society, or to my family, I am proudest when seeing my effort make an impact in other people's lives, no matter how small.
I value open and transparent communication. I am a straightforward communicator. You can count on me to tell you what's on my mind. I will not take it personally if you disagree with me, but make sure that it is done openly so that we can discuss it.
I value a well-rested mind. Nothing can come between me and my sleep (except my kids 😅). When I don't sleep well, I don't perform well and get grumpy. I practice meditation, bodyweight training, and eating a pescatarian diet to keep my mind and body in shape.
I am a slow thinker. I'll often ask to sleep on an idea before getting back to you with a response.
I value understanding a problem deeply over clever solutions. I don't want to get ahead of ourselves and end up solving for the wrong thing.
I value feedback. Let me know what I've done well and what I can improve. I try my best to give others feedback immediately in context as I find that it is the most relevant.

What I expect from you

I trust you with the work; we hire smart people to solve complex problems. I will give you context and feedback, but you need to become the expert on the problem and own your solution. I expect you to own the work, to build relationships with our team, and to come to work with an empathatic attitude.

Support each other. Support can come in many forms, such as offering help on a difficult task, providing constructive feedback, or simply lending a listening ear. By supporting each other, we can develop a sense of trust and camaraderie.
Build relationships. We seek people who can develop deep relationships with others. You can be introvert, extrovert, or anything else, but the ability to build and maintain human relationships is key.
Humility. There are lots of smart, talented people in the world, and there’s usually someone smarter than you. If you can’t accept that, you’re going to have a tough time listening to others, developing healthy relationships, or understanding others.
Make an impact. You're hired not just to code but to make a sizable business impact. Coding is a means to an end.
Balance. I like my job but I don't tie my identity to it. If you're feeling sick or unwell, take some time off. If it's a nice day outside and you'd rather not work, take some time off.
Make decisions. If you are stuck, figure out what you need to unblock yourself. When you make a decision, document your rationality and communicate to the rest of the team.
Make the work visible. Make sure that there is a shared understanding of the problem that we are solving, any key deliverables have due dates established, and you are communicating the progress of the work. Capture feedback from stakeholders on your ticket and close the loop.
Ask questions. If you don’t know something, ask. If something is unclear, ask. Asking questions is the best way for us to come to a shared understanding as a team.

Dealing with On-Call Anxiety at a Startup

2021-12-25T00:00:00-05:00

Being on-call 24/7 for 18 months straight was one of the most anxiety-inducing responsibilities that I've ever had. I had to make sure that I always had my phone within reach, my laptop accessible, and my mind ready to fix any critical system failure. I sometimes jolted up in the middle of the night, believing there was a page on my phone when there wasn't. It wasn't until months later that I realized I had a problem, when my partner accused me of being irritable lately and it dawned on me that she was right.

If there's one piece of advice that I would give to fellow software developers hesitant about performing on-calls, it is this; don't do it! Don't take a job that requires it if you're feeling anxious about on-calls. We're blessed to be in a hot IT labour market now, so I'm sure that you have plenty of options available. However, if you're proceeding with being on-call (there are some good reasons to) and want to minimize the drain, continue reading.

What is it?

On-Call duty

I define on-call as being on pager duty for a period of time, expected to respond to urgent system issues. Typically, developers would be on rotation for 7 consecutive days of on-call duty out of every 4-6 weeks. On-call is a common practice with software teams that practice DevOps such that "you build it, you run it" is the expectation. In such teams, on-call duty is part of the development process, as opposed to a traditional separation of development team versus operations team.

Anxiety

The weird thing about my anxiety is that I find myself less anxious when there are more alerts happening regularly. One of the worst times was when I was on a family vacation for the first time in a long while, and there hadn't been any alert for weeks prior.

Anxiety is distress or uneasiness of mind caused by a fear of danger or misfortune. Wikipedia

According to the American Psychological Association, anxiety differs from stress in that anxiety is defined by a persistent worry that won't go away, whereas stress is typically caused by an external trigger. I am anxious about the prospect of our system going down. But once it's down, I get stressed about debugging the failed system while our customers are pounding on our support team. In fact, I gave a talk about the mental stress of debugging production applications.

Why on-call?

Motiva A.I. orchestrates digital marketing campaigns for our enterprise clients. Imagine a scenario where if you sign up for a newsletter at Verizon for Business (one of our customers), you'd expect to receive an email from them. Well, not if Motiva is down. Establishing an on-call practice ensures that we can immediately respond to any service interruptions.

"I can't wait to spend my weekend waiting for a system outage!", said nobody.

Keep in mind that on-call is a last line of defense to keep things running. If high-availability is a business requirement, then that consideration should be an integral part of the product development process right from the start.

For the individual software developer, on-call can be a helpful practice to hone your debugging and software engineering skills. Bugs happen when an unexpected event pushes your known system into an unknown state. Debugging in such an environment can be a helpful process to push your understanding of your systems and your tools. Moreover, I personally adhere to a pain-driven development mentality. Developers that feel the pain caused by the system they built will build better systems.

Dealing with on-call anxiety

As a developer

Seek professional help

If you're feeling anxious about your on-call duty, my first advice is to seek a workplace therapist. Second to that, I recommend working through this Anxiety & Worry Workbook by Clark and Beck.

Accept failures

What ultimately helped me mentally is accepting the fact that the worst that can happen if I miss an alarm or two isn't so bad afterall. We might lose a customer or two, which is pretty bad for an enterprise startup with only a handful of customers. But what's worse is getting burnt out and not being able to function anymore. Growing a startup is a marathon, and I shouldn't lose the race over any single bump. Having said that, I understand that the ability to step back is easier said than done. That's why I recommend seeking a therapist or working through that anxiety workbook to ground yourself first. It took me months of actively working on easing my mind before finding my peace.

Personal tips

While I worked on my mental health, here are some small adjustments that I made to improve my well-being.

changed my pager tone to something more relaxing
took it easy when I got paged, resolved it when I could and did not fret about jumping onto my laptop immediately
blocked out certain hours in our on-call scheduling system during expected slower weeknights so that I could rest easy. Getting woken up in the middle of the night was my biggest anxiety. This gave me peace of mind.
this wasn't an option for me, but I wish I could have opted out of on-call duties for a couple months when I was going crazy

Operational tips

In terms of operation, it would suck to keep getting paged for the same issue. Make sure that any incident is followed by a retrospective with actions to mitigate the problem in the future. We write up a one-page incident report for each incident, highlighting who it affected, why it happened, and suggestions for future prevention with actual tickets scheduled to be worked on. This give visibility to the product team around delivery expectation.

A second operational tip I find useful is that during an incident, don't try to do too much. Focus on getting the system back up, or at least the mission critical parts, even if they will be in a limping state. Leave the fixing for the team during work hours. For example, a couple times I would just pause our affected workers before spending any more time debugging. It's easier to explain to our customers that the system stopped working than to explain that it did the wrong thing. This is a debugging tip but it helped with my anxiety because I lowered the bar for what needs to be done substantially, i.e. from fixing to pausing the system gracefully.

As a manager

Be wary of your bus factor for mission critical components

As a serial technical co-founder, I often find myself taking on-call responsibility for months at a time, simply because there's no one else available. What is different this time is that we had been around for 4 years already when this long on-call stretch started unexpectedly. It happened because our only other senior backend developer left us in mid-2020.

People come and go. It was my fault for letting myself be stuck in this situation. We have other developers on the team, but they've been working on more interesting and valuable parts of our product. I've let myself be left behind as the only one responsible for our legacy, but still mission critical, components.

the more mission critical a component is, the more eyes should be on it

We categorize our application workflows as i) mission critical, ii) essential, or iii) everything else. Only a couple mission critical workflows on our system are allowed to raise on-call alarms outside of work hours. This reduces support fatigue and focuses our attention on what matters.

Distributed workforce

Motiva has been a remote-first company from the start. One thing we did well early on is to hire across time zones for our engineering team. My previous senior engineer was on the other side of the world from me (not deliberately to such an extreme; suitable partners are just hard to find). That worked wonders for our on-call and support needs. However, that obviously comes with its challenges as we had to rely heavily on asynchronous collaboration methods for actual development work.

Cover each other

I left myself behind in my case. But having lived through this trauma, I wouldn't put this on anyone else on my team. In addition to sharing knowledge across critical components, nobody should be on-call alone. I make sure that we have a clear escalation procedure in place so that whoever is on-call feels supported.

In reality for a startup, there's only so many people that can play musical chair in our team of ten. That's why we built semi-automated diagnostic tools for non-developers on our team to help out. The more issues that our customer support team can resolve, the fewer tickets will need to be escalated to our devs.

On the other end of the escalation ladder, I make sure that our team knows that I'm always available to whoever is on-call. That's the unfortunate responsibility of being the tech lead at a small startup. Luckily, my team only had to call me once outside of my on-call schedule this year.

Set clear expectations

Just like any team process, we make sure that a new engineering team member taking on-call duty knows that they are not expected to fix the world, knows that it's ok to take it easy, and knows that they have our support. If anything, err on the side of escalating early for their first few incidents so that we can pair on the incident with them.

Early on in this post, I mentioned that I wouldn't recommend taking on a job requiring on-call. If you don't think that on-call is for you, then don't do it. However, this is an unavoidable responsiblity if you want to work in a small startup. Something mission critical will eventually fail, and somebody needs to fix it. For such times, I'd rather that we have a clear on-call schedule with clear expectations than scramble to fight the fire.

Build robust and maintainable systems, but we don't have time!

The lasting remedy for my on-call anxiety was building confidence in our systems so that I know we'll be fine if I'm unavailable. We dug ourselves out of that technical debt. For two quarters in 2021, we directed all engineering resources to rebuilding the most vulnerable parts of our system as high-availability workflows.

Developers that feel the pain caused by the system they built will build better systems.

Hold on a second! Did I mention that I got stuck with this endless on-call stint because one of our two backend developers left? Between scaling our system to handle the 400% increase in system load this year, hiring for a replacement, building new features, reviewing code, mentoring, ... how did we find time to address this rotting technical debt?

Engineering management in a startup setting is still something that I'm learning even after 10 years. In this case, I couldn't take it anymore and had to convince the rest of the company that we had to stop everything to focus on addressing some technical debt known for years. The fact that it got to the point where my health was at stake until I prioritized is a failure on my part. Having said that, I learned a lot over these past few years in terms of maintaining an enterprise system and engineering management. I've been heads-down building Motiva for the past 6 years. If you find this post useful, take a look at my blog. I plan to write regularly over 2022 to capture some of my learnings.

Orchestrating Pizza-Making: A Tutorial for AWS Step Functions with Stepwise

2021-10-26T00:00:00-04:00

In this post, I will step through creating a AWS Step Functions state machine using Stepwise for a hypothetical pizza-making workflow. Before we get to that though, in case you're wondering why would you care about AWS Step Functions, or application workflow orchestration in general, I discussed our 4-year journey towards an message-based, orchestrated architecture previously.

Let's make some pizzas! Suppose these are the 4 steps to making a tasty Pizza Margherita:

make dough
make sauce
put ingredients on dough
bake

And suppose you built some robots to perform each of the step. How would you orchestrate your robots to work together? Well, one option is to use an application workflow orchestration engine. We will be using AWS Step Functions via Stepwise.

Why Stepwise?

Stepwise is our open source library to manage Step Functions using Clojure. Why would you use Stepwise instead of an infrastructure as code (IoC) tool, like CDK or Terraform, to manage Step Functions?

If your workers are written in Clojure, then using Stepwise would help keep your state machine definitions, which is essentially business logic, and application code in one place. Thus, making your development and maintenance experience more seamless. And as you will see in this post, Stepwise has a few features that will make your time working with Step Functions easier too.

Creating a state machine

Let's start easy and create a simple Step Functions (SFN) state machine. Launch a REPL from your local Stepwise repository,

paul@demo:~/stepwise$ lein repl

...

stepwise.dev-repl=>

Paste the following code onto your REPL to create your first state machine on SFN. This assume that your environment is already setup with awscli and that your account has permission to manage SFN.

(require '[stepwise.core :as stepwise])

(stepwise/ensure-state-machine

  :make-pizza ;; name of your state machine

  {:start-at :make-dough

   :states
   {:make-dough {:type     :task
                 :resource :make-pizza/make-dough
                 :next     :make-sauce}

    :make-sauce {:type     :task
                 :resource :make-pizza/make-sauce
                 :next     :put-ingredients-on-dough}

    :put-ingredients-on-dough {:type     :task
                               :resource :make-pizza/put-ingredients-on-dough
                               :next     :bake}

    :bake {:type     :task
           :resource :make-pizza/bake
           :end      true}}})

; output
; "arn:aws:states:us-west-2:111111111111:stateMachine:make-pizza"

We called ensure-state-machine to create a state machine named :make-pizza, specifying 4 states. The states are connected serially with the :next keyword. For information on the important concepts of SFN, AWS provides helpful documentation here.

To see what you created on AWS, navigate to your SFN dashboard on AWS and find your make-pizza state machine. The diagram below (from my SFN dashboard) shows the corresponding layout of your make-pizza state machine.

This state machine doesn't do anything yet. We still need to:

assign workers to each of the step
start an execution

Assign workers to the steps

This is where the magic of Stepwise comes into play. If you're using an IoC tool to manage Step Functions, then you'd have to either create a Step Functions Activity or a AWS Lambda to run your application logic. But with Stepwise, we simply use the stepwise/start-workers! function to subscribe your Clojure functions to the SFN state machine.

Note in the code block below that the key names match the resource names defined in the state machine. Stepwise will automatically create the Step Functions Activity and run your workers in a core.async background process to respond to your Step Functions jobs.

(def workers
  (stepwise/start-workers!
    {;; key names must match one of your state machine resources
     :make-pizza/make-dough
     (fn [_] (println "making dough..."))

     :make-pizza/make-sauce
     (fn [_] (println "making sauce..."))

     :make-pizza/put-ingredients-on-dough
     (fn [_] (println "putting on ingredients..."))

     :make-pizza/bake
     (fn [_]
       (println "baking...")
       :done)}))

; output
; #'stepwise.dev-repl/workers

Run an execution

WARNING: there might be some AWS cost incurring to follow through with this tutorial from this point forward.

We have our pizza-making state machine defined. We have our pizza-making workers ready. Let's run this thing once on AWS.

(stepwise/start-execution!! :make-pizza {:input {}})

; output
;{:arn "arn:aws:states:us-west-2:111111111111:execution:make-pizza:242630d3-1b1c-4a76-b3c0-1bac266c29b2"
; :input {}
; :name "242630d3-1b1c-4a76-b3c0-1bac266c29b2"
; :output "done"
; :start-date #inst "2021-10-25T01:28:07.873-00:00"
; :state-machine-arn "arn:aws:states:us-west-2:111111111111:stateMachine:make-pizza"
; :status "SUCCEEDED"
; :stop-date #inst "2021-10-25T01:28:08.807-00:00"}

For Stepwise API, *-!! suffix to a function name denotes blocking calls. *-! denotes non-blocking calls. Similar to core.async convention. We're using the blocking version of start-execution here. There is a corresponding non-blocking start-execution! available.

You might have noticed that we passed in an empty map as input. For sake of simplicity, we're ignoring input and output entirely in this tutorial. Everybody is getting the same Pizza Margherita.

Operational features

As discussed in my previous post, one of the main benefits of using an application workflow orchestration engine instead of rolling your own are the provided operational features. I'll show a couple examples that I find the most useful. Execution history and error handling.

Execution history

Step Functions keep track of every one of your executions. One of the ways to access that logs is to use the AWS Console. On the make-pizza state machine view, you can see your list of execution history.

Not only do you get a high-level list of executions, you have access to detailed history of each of your execution with input and output information for each step within the state machine, as well as information of each state transitions.

Handling errors

Our robots can be prone to failure. What happens if one of them fail? Let's say the dough-making robot fails some of the time. One of the ways to handle that is to retry the task upon failure. Let's add a retry logic to our state machine definition.

{:make-dough {:type     :task
              :resource :make-pizza/make-dough
              :retry [{:error-equals     :States.TaskFailed
                       :interval-seconds 30
                       :max-attempts     2}]
              :next     :make-sauce}}

We've configured :make-pizza/make-dough worker to retry a second attempt after 30 seconds if a task fail.

State machine control flow

Guess what? Your Pizza Margheritas are in high demand. We want to save some time in your workflow by making the dough and sauce simultaneously. We can do that with a Parallel state provided by SFN.

Putting all of these together, here is our final state machine definition with re-try for make-dough worker and simultaneous runs for make-dough and make-sauce workers.

(stepwise/ensure-state-machine
  :make-pizza
  {:start-at :make-base

   :states
   {:make-base
    {:type     :parallel
     :branches [{:start-at :make-dough
                 :states   {:make-dough
                            {:type     :task
                             :resource :make-pizza/make-dough
                             :retry    [{:error-equals     :States.TaskFailed
                                         :interval-seconds 30
                                         :max-attempts     2}]
                             :end      true}}}

                {:start-at :make-sauce
                 :states   {:make-sauce
                            {:type     :task
                             :resource :make-pizza/make-sauce
                             :end      true}}}]
     :next :put-ingredients-on-dough}

    :put-ingredients-on-dough {:type     :task
                               :resource :make-pizza/put-ingredients-on-dough
                               :next     :bake}

    :bake {:type     :task
           :resource :make-pizza/bake
           :end      true}}})

This definition is getting a bit involved. The flow diagram representation is easier to understand what's happening.

Worker concurrency

Let's scale this workflow further. You built more robots to make pizzas. Let's increase the number of workers for this state machine to make use of the increased capacity. You can define worker concurrency with Stepwise when you call start-workers! by passing in a second argument map containing a task-concurrency key like so.

(def workers
  (stepwise/start-workers!
    {:make-pizza/make-dough
     (fn [_] (println "making dough..."))

     :make-pizza/make-sauce
     (fn [_] (println "making sauce..."))

     :make-pizza/put-ingredients-on-dough
     (fn [_] (println "putting on ingredients..."))

     :make-pizza/bake
     (fn [_]
       (println "baking...")
       :done)}

    {:task-concurrency
     {:make-pizza/make-dough               2
      :make-pizza/make-sauce               2
      :make-pizza/put-ingredients-on-dough 1
      :make-pizza/bake                     4}}))

Suppose your ensemble of robots can make 2 doughs, 2 sauces, and bake 4 pizzas at a time. We've set the worker concurrency to reflect that capacity. So now, any simultaneous executions happening will try to fill up those workers before queueing starts.

Summary

In this tutorial, we've set up a make-pizza state machine on AWS Step Functions using Stepwise. We demonstrated the core API of Stepwise: ensure-state-machine, start-workers!, and start-execution!!. We also ran through some features of Step Functions: handling errors and parallel flow. As well as a couple features of Stepwise such as seamless state machine and workers development, and worker concurrency.

If you want to find out more about Step Functions or Stepwise, check out the AWS Step Functions web site or Stepwise repository.

Took me 4 years to realize we'd been orchestrating workflows

2021-10-10T00:00:00-04:00

It's been more than 5 years since we founded Motiva. What we do fundamentally has been the same; helping companies send the most useful messages to the right people at the most convenient time. In this post, I'd like to discuss workflow orchestration and why I wish I'd known about it earlier.

Here's a screenshot of one of our workflow execution histories (an execution = a specific workflow run). Not only is there status information to each step, but we also have a record of the input and output for every step. This granular tracking is valuable for diagnosis and the high-level visibility makes designing complex workflows managerable. We process hundreds of executions per hour and each execution handles up to a couple millions of small decisions by interacting with internal and external systems.

Let's run through the evolution of our Process Batch workflow from the beginning to illustrate how we got to this point, using a sample of the typical kind of job we regularly perform at Motiva.

The task is to send certain emails to particular email addresses (contacts).

We aren't MailChimp though. One of our product's selling points is that our machine learning can determine the most useful* messages for the right people. Thus, identifying which particular emails to send to whom and at what time is the real work of this Process Batch workflow.

We query our machine-learning engine to map the "most likely useful" asset to each target. This defines what needs to be done for each batch.

From the list of email addresses, we need to:

know when to send to each contact

know which email variation to send to each contact

send out all the emails accordingly

Typically, this list of email addresses are contacts that either signed up for a particular newsletter or registered for a particular product. Thus, they are contacts with a history of engaging with our clients; providing a history of data for our machine-learning models to optimize on... but that's another story.

This Process Batch workflow comprises 3 activities:

scheduler -- enables you to do something at a specific time
asset picker -- chooses the email to be sent for each individual
sending out the emails

Quick and dirty first version

We built our first system in a few weeks as a service-oriented architecture. The system includes an email Campaign Orchestrator service, a Machine-Learning service, and an Email-Sender service, which all communicate via HTTP. I added a cron job onto our server to ping the Campaign Orchestrator service every few minutes. On receiving this heartbeat ping, the Campaign Orchestrator service would query our database table of schedules and create batches ready to be sent.

Here's a sequence diagram of the interactions between the services for this Process Batch workflow.

This 2016 version didn't last long though. The biggest problem was that since actions were performed in an impromptu manner, we had no observability when things weren't working for any reason. To diagnose which branch of logic a workflow took, I had to match our logs with database records to step through logical forks one by one. It was a slow and tedious chore. We had many basic reliability issues in our early days. I still recall having to debug a production problem after a few drinks around Christmas time while my CEO and one of our clients was waiting on the line -- a stressful experience that I gave a talk on last year.

Materialize batches for observability

Naturally, the improvement needed was to materialize all the impromptu events as database records ahead of time and determine everything beforehand. It meant, for example, that instead of doing a query of the system at each time interval, working out what needs to be done and processing the batch on the spot, we do all that at the beginning of a campaign once and for all. What this translates to in terms of easier diagnostic is that we can query our database for what decision was made at any point of interest in a workflow. It's a deterministic system that goes straight for the win.

There are two major changes to this second version.

we moved the scheduler out of the Process Batch workflow to run only once per campaign
all campaign batches are determined and materialized in advance

In this scenario, our Process Batch workflow has split into two workflows. The first part is the Activation workflow, which happens once and only once when the user finishes configuring a campaign. During this Activation workflow, the Campaign Orchestrator assign send-time schedules to every contacts; for example, Contact A should be emailed at 9am and Contact B should be emailed at 3pm.

Our Process Batch v2 workflow is still regularly triggered by a cron job. But instead of needing to determine which batches to send impromptu, we simply query our database for these pre-scheduled batches that are ready to be sent.

This second version lasted a few more months. But as our product was getting more use from our customers, we needed to scale the workflow. Particularly as some batches could take many minutes to process, whereas others would take only seconds. We needed a way to ensure that all batches get processed in reasonable time.

Message-based architecture for scaling

We chose to add a queue to our Process Batch workflow in order to choreograph batch worker allocation. Each Process Batch request was created as an independent message and put into the Process This Batch queue. When we needed to process a batch, we sent a Process Batch message with the required data to the queue. To respond to these messages in the queue, we had N number of deployed workers in the Campaign Orchestrator service subscribing to the queue.

That's a lot of words. Perhaps it's easier with a diagram. Note the addition of the Process This Batch queue.

Scaling

Scaling was simply a matter of deploying more workers to increase the N value. Luckily, we kept our components and services decoupled, so scaling was easy. Additional benefits from placing our Process Batch workflow in a queue were:

failure handling
high-level observability

Failure handling

In the previous v2 version of this workflow, we needed to explicitly scan our database for failed batches in order to recover them. With this queue-based v3 workflow, the corresponding message for a failed batch was simply not acknowledged and the message would automatically be put back into the queue to be processed again. The big caveat here is that batches need to be idempotent so that processing the same batch more than once doesn't cause any unexpected behaviour.

Observability

With Process Batch requests going through an actual queue within our infrastructure (as opposed to an in-process queue), we achieved high-level monitoring. We configured various alarms in our queue metrics (i.e., CloudWatch alarms on AWS SQS metrics) such as "age of oldest message in the queue in a 5-min period", "number of messages sent in a period", and "number of messages received in a period". Such alarms provided an overview of the health of our Process Batch queue, and by extension, the health of our Process Batch workflow. These metrics replaced a few database queries that we used to run to get similar data.

Meanwhile, as the demand for our product grew, we scaled our other workflows by using this setup and created numerous queues in the process. Over the following couple of years, we moved to an entirely message-based architecture. Below is our extended Process Batch v3.1 workflow where the intra-system calls -- both to the Machine-Learner and the Email Sender -- were communicated through their own respective queues.

The database journal writes for each worker constitute a small but significant addition to the above graph. What's shown here is a simplistic illustration. This database action could mean writing actual database journal records to log job status, or more often, the transactional reversion of database writes if a job fails. These actions are important because we spent a lot of time accounting for steps failing in different parts of a workflow. Keeping track of each step facilities easier diagnostics.

A more accurate view of our Process Batch v3.1 workflow looks like this:

Using this messaged-based architecture, we handled a growth of 400% serving a handful of Fortune 500 companies, supported by a team of only 2 backend engineers. Even so, this version still wasn't good enough. Our product was still experiencing unexpected issues every few weeks.

If only we could have finer-grained control of our workflows to account for even more state transition cases, we thought. But as you can see, even this simplistic decision diagram above is already a bit confusing even after omitting a few details including re-try limits and parallelization. Our increasing reliance on managing states between asynchronous queue workers was getting tedious and hard to manage. We were violating the 'don't repeat yourself' principle with our boilerplate operational application code and system infrastructure.

The Send Emails validation step in the above decision-flow diagram, for example, creates a decision path in our Process Batch workflow. If all emails were successfully sent, we proceed normally, but if any of the emails failed to send, we break off to an alternate path to re-try only the unsent emails. I coded this path with 2 separate workers over 2 separate queues; one paired worker and queue for the happy path, and a second pair for the failed path. Every extra decision in our workflow therefore became yet another worker and queue pair. With this naive implemenation, our need to impose finer-grained control of our workflows was becoming unmanageable.

Application Workflow Orchestration is a solved problem

What we needed was an application workflow orchestration engine. All those distinct queues we created to cover every decision path and the database journals used to keep track of step statuses are boilerplate logic for solving a known problem. Take a look at AWS Step Functions for an example of a workflow orchestration engine as a service. It's also what we ended up using for production.

AWS Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so that developers can focus on higher-value business logic.

These feature -- manage failures, retries, ... and observability -- match the exact operational requirements that I've been building up to in this blog post and in our actual system.

It's worth pointing out that there are plenty of other workflow orchestration engines to choose from. AWS Step Functions just happens to fit our particular use cases.

Failure handling as decision paths

Below is a screenshot from the AWS Step Functions UI on one of our workflow definitions. Defining a workflow is simply a matter of writing a JSON definition (in our case, we use Stepwise, an open-source Clojure library on AWS Step Functions) to specify the state transition logic. All the boilerplate operational aspects, including creating implicit queues and tracking execution history (see next section), are taken care of for you. In this diagram alone, I would have needed to create 7 queues and 8 sets of status records for this workflow. I can't stress enough how the freedom from needing to explicitly create an endless number of queues and journal entries is a huge time saver!

Along with a JSON workflow definition to create this state machine, you have to assign workers to each of the steps. We simply re-used our existing workers because our system had alredy been implemented as a message-based architecture.

Once you have your workflow definition and workers assigned, Step Functions will automatically route output from one worker as input to the next according to the logic in your workflow definition. Routing options such as timeouts, re-tries, and route by output values are all available.

Monitoring 2.0: History for each execution

High-level CloudWatch metrics such as step failure counts over time are provided. These metrics are equivalent to the queue metrics mentioned previously. Moreover, Step Functions also provide a complete step-by-step history for each individual execution! This is valuable as it enables non-developer on the team to inspect particular executions for customer support.

I hope this has been a useful summary of our application business workflow management journey. In the next post, we will dive into our Stepwise library to illustrate how to set up and execute an AWS Step Functions workflow.

We are hiring a ClojureScript developer in Africa

2017-05-29T00:00:00-04:00

The InGrower Mobile App has been put to the test in the hands of a few smallholder chicken farmers in Mozambique since January 2016. We have learnt what's needed and is proceeding to the next stage of development. That's why we're looking to hire a ClojureScript developer to build the next version of our mobile app.

One of my longer term goals is to grow a distributed technical team in East Africa that can tackle various social impact challenges in the region. This first hire will play an important role in setting the direction for the team.

Link to job description

InGrower Mobile is a side business that I've been working on since September 2015. We are an early stage social enterprise in the I.C.T. for Development (smallholder farming) space. InGrower Mobile came about from my co-founder's experience operating an agri-business incubator in Mozambique since 2010. We admit unemployed individuals, then provide training and support to help them start their own agriculture business to make money. For InGrower Mobile, our product is a mobile application that takes what we learned from the past 7 years running our agri-business incubator and applying them online to make our training and support available for a wider audience across East Africa.

A Call to Scandinavian Investigative Journalists

2016-11-15T10:00:00-05:00

tl;dr Are you a journalist interested in understanding online threats to your operations as a journalist and protecting the network of people that have given you their trust -- colleagues, informants, collaborators, partners, and social connections? I would love to talk to you.

With both Brexit and Trump this year, for better or for worse, it's clear that extreme actions are gaining popularity with the people. This saying comes to mind, "most of the evil in this world is done by people with good intentions." Things are most likely going to be fine. But in the 1% chance that shit is going to hit the fan, maintaining freedom of speech is going to be critical. We've seen from history (as recently as Tunisia, Gaddafi) that freedom of information is one of the first to go when human rights are trampled.

My colleagues in Silicon Valley is co-hosting a digital security workshop with Bloomberg in the US. I am looking to start something similar in Scandinavia. At the moment, I'm looking to interview investigative journalists on what questions or concerns you might have regarding online threats to the work that you do. The goal of the interview is for me to listen and identify topics to cover at the planned encryption and operational security free seminar for investigative journalists in 2017.

The planned seminar is structured to help investigative journalists understand the threats that you, as a journalist, face online, and how to design effective responses to those threats. At the conclusion, participants will have an understanding of what a threat model is, how to design a reasonable security posture, best practice behaviours, and, to a lesser extent, what tools are available.

If you know any reporter or journalist that might be interested to have a chat with me, please put me in touch with them. You can find my contacts on the left sidebar.

Facebook-Loving Farmers of Mozambique

2016-03-10T00:00:00-05:00

We run out of drinking water one afternoon at our house on the 300-hectare farm. The region's power grid has been down for over a day. Our water tanks have run dry. It is 38 degrees Celsius, and the African summer sun is high in the sky. This happened on the first week of my month-long stay with our smallholder farmers in Moamba, Mozambique. It also sets the tone for what to expect. This is Africa. Nothing is easy.

Rewinding back five weeks, our first cohort of smallholder farmers were finishing trial runs of our product, InGrower App. We got lots of feedback. More inputs here. Move the box there. The usual pointing out the obvious. Users can tell you what's wrong, but they cannot tell you what to build. As we're an early stage startup that is still finding a product-market fit, we desperately need to figure out the latter. It is then that I see the only way to do this is for me to be in the same room as our users. And that is how I found myself living with our users on a chicken farm in Moamba.

One of the first questions that I asked my current co-founder when he pitched the InGrower idea to me was – do farmers in Mozambique use smartphones? I am now able to answer my own question. While living there, power and water often go out (though not usually at the same time), but the 3G mobile internet is always on. So what do you do when there is no power or water, and it's too hot to do anything? You browse Facebook on your $30 Android smartphone, like everybody else.

Of our first 14 trial users, two are heavy users of our product. These power users have been regular Facebook users for more than three years. They are used to navigating a mobile user interface, and understand the benefits of having data online. These are the users that we want as our initial champions for the app.

The farm is 10 km away from Moamba village. We often roar into the village sitting on the back of a tractor.

This is Moamba market:

This is my bed during my stay at the farm:

I specifically asked the manager before arriving to help me set up a mosquito net. Malaria is a real danger here, especially at the height of summer in January, when I was present. But what do you know? There were few mosquitoes in the house. Instead, I had giant bugs of all shapes and forms visiting me every night. Thank god I had the net!

I ate with the staff and farmers at the house. We all chipped in with US$20 per person each month. That was the budget for a month of groceries for them. Average monthly salary is US$100 per month in the country. A piece of bread cost about $0.15 and a whole chicken was $4 at the market. This was our typical lunch at the house:

There was a limit of two small pieces of meat per person for every meal when there was anything to eat at all. There were a couple of days when I had only a piece of bread and some crackers all day. I don’t know how the other guys who actually had to do physical labour managed to work on an empty stomach.

Sometimes, some of the guys went fishing or picked coconuts, and then our housekeeper would make something special.

At other times, I snuck out to the village and had a shameless feast. All of this grilled sausage and food is $6 and enough to serve four people:

A typical day goes like this for the chicken farmers of Moamba. They come in from the village at 7am. Tend to their chicken house. Breakfast at 10am. Back to their chicken house. Lunch at 2pm. By then, the sun and summer heat is in full blast. They chill in the staff house until 5pm, then head home back to the village.

I came to Moamba to do product development. In addition to all the learning and understanding that I received, the experience itself has been very special. Hanging out in the afternoon with the merry gang, and with the cool breeze blowing against your face. Riding into the village on a tractor. Brushing my teeth every evening out in the open, under the milky way. Nothing is easy here. But there's a certain romance in expecting the unexpected in daily life in Mozambique.

P.S. the title of this post is a play on Craig Mod’s article.

Immersing with Users in Mozambique

2016-01-14T00:00:00-05:00

I've been living on a farm in Mozambique for the past week to get to know our users for InGrower. My cofounder has been working here for 5 years. But still, the fact that I'm a product lead whom knows almost nothing about the lives of our users outside the context of our product trouble me. Soon after starting our alpha trial last month, I informed our team that I'm going over for user research.

Immersion is one of several ways to get to know your users. According to IDEO,

The Inspiration phase is dedicated to hearing the voices and understanding the lives of the people you’re designing for. The best route to gaining that understanding is to talk to them in person, where they live, work, and lead their lives. Once you’re in-context, there are lots of ways to observe the people you’re designing for. Spend a day shadowing them, have them walk you through how they make decisions, play fly on the wall and observe them as they cook, socialize, visit the doctor—whatever is relevant to your challenge.

The insight here is that a product doesn't operate in a vacuum. To build a product that our users would love, I need to understand who they are, what motivates them, what keeps them up at night. So that we deliver a product built for them for their lives in Mozambique and not what I assume to be while living in Boston or Copenhagen.

One of the many things that I learned here is that our users have a high tolerance for bad user interface. If they want something, they'll make it work. The big question for us then is creating something of tangible value to them. Having shadowed and spoken to a handful of local smallholding farmers, I think I have a better sense of what that actually means.

This is our makeshift kitchen that my housemates assembled. Everything here are hacked together in one way or another.