Theaboutbox Tech Blog

Claude Artifacts – A Deep Dive

Cameron — Sat, 31 Aug 2024 20:47:54 +0000

Anthropic recently added a new feature to Claude: Artifacts, a “dedicated window to instantly see, iterate, and build on the work you create with Claude”, and I immediately felt that this would be a game-changer. It’s rough, but you can literally build a simple app by describing what you want on the left side, and Claude will write the code and run it on the right side. That’s it.

What are Artifacts?

An artifact can be a document, a visualization, or a simple app or game. If Claude generates code, it will run in the side panel automatically and can be published online with a single click.

Taking Artifacts for a Test Drive

Large Language Models are pretty good at implementing the kinds of things that developers do all day every day because there is plenty of representative data in their training sets. To see how good Claude is at doing something a little different, I gave it an off-the-wall example to see how well it adapts. I decided to try Pong, with three paddles. I started out with a simple prompt:

I want to make a variation of Pong, but there are three paddles that move along the edges of a triangle.

Claude’s first attempt was…something:

Claude’s first attempt at creating a three-paddled pong game

Iterating based on feedback

But that’s okay. I gave Claude some feedback to see if it would fix the problem:

Instead of the paddles rotating, can they move along the edges of the triangle?

Claude spent about 30 seconds updating the code and I ended up with a quirky but almost playable version:

Claude’s second attempt is almost playable

It’s kinda janky, but it’s getting closer. Honestly, I really love the way the paddles move along with the mouse. It is intuitive and makes it quite simple to control three paddles with one trackpad. The biggest issue is that the paddles are, strangely, facing the wrong direction. So I provided some more feedback:

Can we have the paddles face the same direction as the edges of the triangle?

The third attempt has the right idea but is a bit buggy

You’ll also notice it is explaining the changes it’s made in a lot of detail. There are still some issues, like the ball slowing down when the paddles move and the score incrementing by more than 1 when it hits a paddle. I also wanted to introduce a level system so the ball would speed up as you successfully hit it more often. Over a couple more rounds of feedback, I ended up with the version linked below:

Three paddle pong MVP. Click the image to play.

I did not write a single line of code. I described what I wanted, iterated back and forth, and it figured out how to control the three paddles with one trackpad and all of the ball-bouncing physics, which is made a lot more complicated by having two of the walls at an angle.
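To give a sense of why the angled walls complicate things: bouncing off a horizontal or vertical wall only means flipping one velocity component, but an angled wall needs a reflection about the wall's normal. The following is not the code Claude generated, just a minimal Python sketch of the standard reflection formula v' = v - 2(v·n)n. The `reflect` helper and its arguments are hypothetical names chosen for illustration.

```python
import math

def reflect(velocity, wall_start, wall_end):
    """Reflect a 2D velocity off the wall running from wall_start to wall_end."""
    (vx, vy), (x1, y1), (x2, y2) = velocity, wall_start, wall_end
    # Direction of the wall and a unit normal perpendicular to it.
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length
    # Standard reflection: v' = v - 2 (v . n) n
    dot = vx * nx + vy * ny
    return (vx - 2 * dot * nx, vy - 2 * dot * ny)

# A ball moving down and to the right hits a flat floor and bounces upward.
print(reflect((1.0, -1.0), (0.0, 0.0), (1.0, 0.0)))
# The same function handles a wall tilted at 60 degrees, like a triangle's side.
print(reflect((3.0, 4.0), (0.0, 0.0), (1.0, math.sqrt(3))))
```

Because the formula only depends on the wall's normal, the same helper would work for all three sides of the triangle, and the ball's speed is preserved by the bounce.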

Where the context ends

Unfortunately, once I got to that point, after my 14th iteration, Claude kept introducing bugs. I would get errors, and when it fixed the errors it would create new bugs. It’s a known problem with LLMs that they start to act strangely and slow down as conversations get longer and longer. But I wanted to fix an issue where the paddles do not move on mobile devices, and I wanted to add a button to pause the game.

So, I decided to start with a clean slate, started a new conversation and pasted the data into it:

Starting a new chat

Boom! That worked the first time. I chatted back and forth for a few iterations to make some improvements:

Added a pause button
Increased the speed of the ball at the start of the game
Increased the change of speed as the player levels up

I ended up with this “final” version:

The game you didn’t know you needed.
Click the image to play

Conclusion

Having AI generate code is far from new. We’ve had that for a few years with ChatGPT and GitHub Copilot. What is new and powerful is the ability to iterate effortlessly combined with the ability to publish your results. It is early days, so it is best suited for “single serve” applications, but this is the first time I’ve been able to create a simple app just by typing what I want in plain English and publish it on the Internet.

Next time, I will probably ask Claude not to explain itself, just write code. The more content that LLMs generate, the more likely they are to introduce a mistake.

I can think of some interesting real-world uses for something like this:

Interactive data visualizations – You can upload spreadsheets or PDF files and ask Claude to visualize data for you.
Interactive wireframes – You can upload a drawing or screenshot of a user interface and ask Claude to create a live web-based version.
Single-serve information applications – You can upload a document or some other information and create a simple site around it. Or upload your notes and have Claude create an interactive quiz.

It is going to be exciting to see where this product, and this style of interactive development, go from here. I can see a world where AI is a pair programmer and I am just giving feedback and direction.

Genieous: Part 1 – Motivation

Cameron — Sun, 28 Apr 2024 17:17:57 +0000

Last week, a colleague and I released an app: Genieous (App Store link). The concept is pretty simple: if you don’t know what to make with the food you have in the house, Genieous is a cheeky assistant that helps you create recipes. It’s a free app; you should download and try it.

While we hope that the genie can live off of tips and be self-sustaining, I had another motivation: AI is going to change the way we process information. Those who wish to wield AI as a tool should seek to understand it. And, the best way to understand anything is through real-world experience. Also, I frequently stare at the refrigerator wondering what to make for dinner, and I need inspiration.

I quickly realized that every benefit and challenge that exists in the world at large for AI also exists in this tiny little app, and it is a low-stakes way to explore ways of dealing with these issues. I’ll discuss all of these in more detail in future articles. These are some of the challenges that we needed to think through, even with this small-scale app:

Strategy

If you are thinking about incorporating AI into your product or business, one of the first questions you need to answer is: where will you get the most leverage out of using AI? With generative AI, especially large language models, you need to account for it giving wrong answers much of the time.

One way to think about this is in terms of convexity. The best framing I have found for this is from Nassim Nicholas Taleb who, in Understanding is a Poor Substitute for Convexity, defines it as “asymmetry between the gains (as they need to be large) and the errors (small or harmless)”.

The domains that have the property that the benefit of being right far outweighs the cost of being wrong are the most appropriate places to deploy generative AI. In the case of Genieous, cooking is naturally a trial-and-error process. If we can give someone a lot of ideas that they can take or leave as they see fit, the benefit (a great meal you can enjoy over and over again and impress people) far outweighs the drawbacks (a bad meal that you can throw away and order takeout instead).
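The asymmetry can be made concrete with a little arithmetic. This is a rough sketch with made-up numbers, not a measurement from Genieous: `expected_payoff` is a hypothetical helper, and the probabilities, gains and losses are assumptions picked only to show the shape of the argument.

```python
def expected_payoff(p_good, gain, loss):
    """Average payoff per suggestion when a fraction p_good of suggestions work out."""
    return p_good * gain - (1 - p_good) * loss

# Hypothetical convex domain: only 30% of recipe ideas are keepers, but a keeper
# is worth ten times what a wasted dinner costs.
recipes = expected_payoff(p_good=0.3, gain=10.0, loss=1.0)

# Hypothetical symmetric domain: same hit rate, but a miss costs as much as a hit earns.
symmetric = expected_payoff(p_good=0.3, gain=1.0, loss=1.0)

print(f"convex domain:    {recipes:+.2f} per suggestion")
print(f"symmetric domain: {symmetric:+.2f} per suggestion")
```

With the same error rate, the convex case comes out positive and the symmetric case negative, which is the point of looking for domains where mistakes are cheap.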

User Experience

How do you take something that is very complex and open-ended and put it into a user-friendly product? How do you make the experience seamless and not feel like you are interacting with a chatbot? I really wanted to solve the problem of “what to make with the food I have” with the least amount of friction possible. I also wanted users to feel like they are focused on the problem at hand, rather than playing with a large language model.

Model Selection

There are thousands of different language models, from ones small enough to run on-device to the top-of-the-line models from OpenAI, Google and Anthropic. They all have different capabilities and pricing models. Which one is the best for this application, assuming steadily growing usage? There are chat models and instruction-tuned models; is one better than another?

Prompt Engineering

Language models can be very sensitive to how they are asked a question, and prompting techniques vary wildly across models. How should we ask an AI to generate recipes with a set list of ingredients? We want to get good results, but we also want it to be easy for the computer to interpret the response. And, since many vendors charge per input and output token, we want to be efficient in how we ask and what we ask for. Often, the ‘better’ a model gets, the ‘slower’ the generation can be.
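One common way to make a response easy for the computer to interpret is to ask for JSON and parse it defensively. The sketch below is an illustration only, not the prompt Genieous uses: `build_prompt`, `parse_recipes`, the wording of the prompt and the JSON field names are all assumptions, and no model is actually called.

```python
import json

def build_prompt(ingredients, max_recipes=3):
    """Build a short prompt that asks for a machine-readable reply."""
    items = ", ".join(sorted(ingredients))
    return (
        f"Suggest up to {max_recipes} recipes using only: {items}. "
        'Reply with JSON only: [{"name": str, "ingredients": [str], "steps": [str]}]'
    )

def parse_recipes(raw):
    """Parse the model's reply; return [] if it is not the JSON list we asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only entries that at least look like a recipe.
    return [r for r in data if isinstance(r, dict) and "name" in r]

print(build_prompt(["eggs", "cheddar", "spinach"]))
# A hand-written stand-in for a model reply, used to exercise the parser.
reply = '[{"name": "Spinach omelette", "ingredients": ["eggs", "spinach"], "steps": ["Whisk", "Cook"]}]'
print(parse_recipes(reply))
```

Keeping the prompt short and the output format tight also helps with the per-token cost mentioned above, since both the request and the reply carry less filler.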

Evaluation

I’ve used the word ‘better’ a lot without saying anything about what it means to be ‘better’. It is impossible to make informed decisions about what model to use and how to interact with it without some way of evaluating how good the answers you are getting are. How will you know if changes are better or worse? What does it mean for a recipe to be good? Is it more important for the model to generate interesting and creative recipes, even if it adds an extra ingredient or two? Should the model generate a different list of recipes every time when given the same ingredients? If two different models generate the same recipe, which one does a better job?

In the very beginning, you’ll probably base a lot of this on ‘vibes’. But you’ll want to keep track of what models, prompts and settings result in the subjectively best output. As a product gets more mature, you’ll want to augment that with some sort of rubric for scoring model performance.

Evaluating model performance requires data, and you will want to start collecting as much as you can as early in the project as possible.
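As a rough idea of what keeping track of models, prompts and a scoring rubric could look like, here is a minimal sketch. The criteria, weights, model names and scores are all hypothetical placeholders, not the rubric used for Genieous.

```python
from dataclasses import dataclass

# Hypothetical rubric: each criterion is scored 0-5 by a reviewer, then weighted.
WEIGHTS = {"uses_only_given_ingredients": 0.5, "clarity": 0.3, "creativity": 0.2}

@dataclass
class Evaluation:
    model: str
    prompt_version: str
    scores: dict  # criterion name -> score from 0 to 5

    def weighted_score(self):
        return sum(WEIGHTS[c] * self.scores[c] for c in WEIGHTS)

def best(evaluations):
    """Return the evaluation with the highest weighted score."""
    return max(evaluations, key=lambda e: e.weighted_score())

runs = [
    Evaluation("model-a", "v1", {"uses_only_given_ingredients": 5, "clarity": 4, "creativity": 2}),
    Evaluation("model-b", "v1", {"uses_only_given_ingredients": 3, "clarity": 5, "creativity": 5}),
]
winner = best(runs)
print(winner.model, winner.prompt_version, round(winner.weighted_score(), 2))
```

The weights are where the product decisions live: if creative recipes matter more than sticking strictly to the ingredient list, changing the weights changes which model wins.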

Model Improvements

New models come out all of the time and new research is coming out on techniques to make generation better. Providing context, changing parameters like model temperature and fine-tuning are all techniques that can be used to improve model performance.
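Temperature is the easiest of these to illustrate. In the usual formulation, the model's raw scores (logits) are divided by the temperature before being turned into probabilities, so a low temperature concentrates probability on the top choice and a high temperature spreads it out. This is a generic sketch of that idea with made-up logits, not any vendor's implementation.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words.
logits = [2.0, 1.0, 0.5]
for temperature in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

For a recipe app, that trade-off maps onto the earlier question of whether the same ingredients should produce the same recipes every time or a different list on each request.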

Alignment

Alignment refers to the process of steering an AI towards “a person’s or group’s intended goals, preferences or ethical principles.” [Wikipedia] Even a lowly recipe creation app touches on alignment. If a person has allergies, health concerns, or dietary restrictions, an aligned recipe generation AI will provide recipes that take their needs into account.

But what if someone intentionally provides a bunch of ingredients that are not edible? What is the right thing to do? Should the AI refuse to generate recipes? Should it provide a warning and do it anyways? Should it happily play along with the human making the request? These are issues anyone deploying AI needs to grapple with.

Hallucinations

Hallucinations refer to the tendency of large language models to generate text that is not grounded in reality. Everyone who has used ChatGPT has noticed sometimes it will just make things up. In Why Reliable AI Requires a Paradigm Shift, Alejandro Morfis writes:

Rather than storing explicit factual claims, LLMs implicitly encode information as statistical correlations between words and phrases. This means the models do not have a clear, well-defined understanding of what is true or false. They can just generate plausibly sounding text.

This is a classic “it’s even worse than it appears” moment. Everything that large language models generate is a hallucination; luckily, the statistically probable answer is quite often a good one.

Toxicity

Toxicity refers to the tendency of models to generate offensive content. Researchers struggle to refine this definition further, so it is often an “I know it when I see it” thing. A recent paper on this subject defines it as “stress by contradiction of accepted morality and norms of interaction with respect to the situational and verbal context of interaction.” A rich ethical debate can be had around this concept.

Even if there were objective criteria to define toxicity, it is impossible to eliminate. Again from Why Reliable AI Requires a Paradigm Shift:

Recent research suggests that if there is a sentence that can be generated at all, no matter how low its base probability, then there is a prompt that will generate it with almost 100% certainty

That means, if you allow free input, there is no way in practice to prevent a language model from producing toxic output.

In the case of Genieous, we can also take this more literally: if the ingredients or the way they are handled would be toxic, then this is a problem. Some ingredients are always toxic, and there are others that are dangerous if not handled properly. (Chicken is one of the most under-rated dangerous ingredients based on food poisoning incidents.)

Bias

On Measures of Biases and Harms in NLP defines bias as a “skew that produces a type of harm towards different social groups.” Much of the time, bias happens because the data that a model is trained on is skewed towards particular groups, or is missing data representing other groups. That causes language models to encode biases in their weights. It is impossible to eliminate bias, but being able to measure it, understand it and improve is a worthy pursuit.

The harm from bias in a lowly recipe app is limited, but we want to please every palate, and drawing from the entire world’s culinary wisdom will produce better outcomes for everyone.

Is every billable hour ‘amateur hour’?

Cameron — Thu, 07 Sep 2023 18:54:24 +0000

I have been thinking all morning about Every Billable Hour is Amateur Hour. Erik argues compellingly against the pitfalls of freelancers building their ventures on Time and Materials (T&M) projects. You should read the whole thing, but to summarize:

T&M can commoditize the service you offer, leading to downward rate pressure.
While T&M promotes grinding, Firm Fixed Price (FFP) drives the balance of maximum value for minimum time.
Companies often regard T&M contractors as employees, sans the paperwork.

Can you scale a T&M consulting practice?

It is possible to achieve scale with hourly billing: hire employees or subcontractors and charge enough margin to clients to cover bench time and administrative overhead. This is what most professional services firms do. It can be lucrative but it scales the perverse incentives too. The most successful tech consulting practices that take this approach tend to find a niche and provide staff-augmentation. One of my colleagues ran a practice sourcing security-cleared network engineers and IT operations people for classified DoD projects. The niche helps fight the commoditization and the business value to the client is in the firm’s ability to network and find difficult-to-source candidates.

The downsides also scale with the size of the practice. Once you hire people to provide services, you need to keep them utilized in order to make money. If you are a practice lead it is almost certain that a big part of your bonus is tied to hitting a utilization number for the people in your practice. The best way to hit utilization numbers is to optimize for projects where your people will grind the highest number of hours for the highest rate possible. Chances are these projects will not have enough in common to have any real repeatability. A consistent influx of projects and timely payments from clients is essential; otherwise, the system’s fragility becomes apparent, leading to erratic hiring and layoffs.
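The pressure that utilization puts on a practice is easy to see with some back-of-the-envelope arithmetic. This is a sketch with hypothetical numbers: `break_even_rate`, the 2,080-hour working year and the $208,000 fully loaded cost are assumptions chosen to keep the math simple, not figures from any real firm.

```python
def break_even_rate(annual_cost, utilization, hours_per_year=2080):
    """Hourly bill rate needed to cover one person's fully loaded annual cost."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    billable_hours = hours_per_year * utilization
    return annual_cost / billable_hours

# Hypothetical consultant costing $208,000 per year including overhead.
for utilization in (1.0, 0.8, 0.6):
    rate = break_even_rate(annual_cost=208_000, utilization=utilization)
    print(f"{utilization:.0%} utilized -> break-even rate ${rate:,.2f}/hour")
```

Every hour on the bench pushes the break-even rate up for the hours that are billed, which is why practice leads end up optimizing for long, high-rate engagements.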

Are there hybrid models?

It is also not an ‘all-or-nothing’ proposition. My wife’s public relations firm works on a T&M basis with some clients and on a per-deliverable basis with others, having a price schedule for white papers, articles, press releases, social media posts, etc. In the IT world, it is common to build an application on a T&M model and provide ongoing sustainment / managed services on an FFP basis.

Pitfalls of Firm Fixed Price (FFP) Projects

I completely agree with Erik’s view that FFP makes us have skin in the game when defining scope and correctly focuses us on providing the most value in the smallest amount of time. It also encourages us to develop unique IP and solutions that give us an edge. That said, FFP is scary. I could write a book on mistakes I’ve made throughout my career negotiating and executing firm fixed price projects, but here are some highlights:

Don’t have 100% of the payment due upon acceptance

Typically FFP projects define a total price with a series of milestone payments. There is an art to defining the milestones so that you can have healthy cash flow and customers feel comfortable, but never have all the money due at the end. If your customer pays you early in the project as promised you will probably not have issues at the end. When possible, it is nice to have a small ‘get started’ payment to kick off the project, ensuring that everyone has skin in the game.

Have some guard rails for time

There should be some upper bound on time. Every project has unknowns that affect the schedule. The upper bound should be high enough that the client feels like it is really fixed-price, but it should provide some leverage to change the scope if large enough unknowns are discovered.

Be explicit with assumptions

Do you require physical access to a facility? Software licenses? Do you require certain stakeholders to provide you information or feedback within a reasonable amount of time? Are there any other things that need to be in place in order to be successful? This is the time to think clearly about the risks that affect your ability to deliver value in a reasonable amount of time, discuss them with the client and agree on them.

Under-rated capabilities of Large Language Models: Introduction

Cameron — Sat, 29 Apr 2023 02:29:20 +0000

Background

Large Language Models have been with us for a couple of years. OpenAI released the GPT-3 model on June 11, 2020. It generated a lot of buzz in technical circles, but the model was not packaged in a way that was usable by the masses. That changed on November 30, 2022 when OpenAI released ChatGPT, running an improved version of the GPT-3 Model (GPT 3.5) and bringing AI / large language model capabilities to the masses.

ChatGPT is set up with a simple message-based interface, like iMessage or WhatsApp, that makes it easy for users to send text to the model and for the model to generate a response. It immediately showed promise writing essays, generating summaries, answering questions and performing basic computer programming tasks. Pretty quickly users realized that these models generate plausible text that is mostly, but not completely correct.

On March 14, 2023, OpenAI released GPT-4, a major upgrade from GPT-3. It accepts much longer input for questions, can do real math and analysis and it is much less likely to lie in its responses to a user.

From Hope to Hype back to Hope again

GPT-4 makes one heck of a first impression. It can take dozens of pages of context and extract information instantly. It can summarize and reason. It can generate websites and simple apps from a prompt and then incorporate suggestions. It came out 15 weeks after GPT 3.5. My feeling was, if this is the pace of progress now, the world is suddenly a different place than it was the day before.

After some time playing around with it, I found that it still makes things up, often enough that it’s not possible to blindly trust it. It’s not going to turn every industry around overnight. It also seems like we are not careening towards a singularity any time soon. The industry’s focus is on understanding and optimizing capabilities and making models more focused or reliable.

Meanwhile, I’ve found the GPT models to be a great productivity boost for the parts of my work that are tedious, and I have been poking around the edges trying to find unexpected places where these models perform well because that’s where there will be unexpected advances. For example, large language models can gather information, formulate plans and guide their execution in a number of settings, such as chemical engineering. With some teasing, they can perform well at reasoning tasks.

Stay tuned, we’ll dig into these and more, providing examples along the way.

Pyenv, Pipenv, Dotenv, oh-my-zsh challenges

Cameron — Wed, 26 Oct 2022 15:41:59 +0000

Filing this one for: I hope I find this post when I have this problem again.

Like many developers, I need to juggle multiple Python versions: Python 3.8 to run in AWS Lambda, Anaconda for machine learning models, and the latest stable version for everyday development.

It seems like these days, the multi-python toolchain looks like this:

pyenv to install other Python versions, along with pyenv-virtualenv to better support conda.
venv, generally managed by higher-level tools like Poetry, Pipenv or Hatch.
Dotenv to keep track of environment variables that need to be provided to 12-factor applications.
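For anyone who has not used the dotenv pattern: a .env file is a list of KEY=VALUE lines that get copied into the process environment at startup. The sketch below is a deliberately simplified stand-in for what a real loader does, assuming one KEY=VALUE pair per line. `load_env_file` is a hypothetical helper, and it ignores details such as `export` prefixes and multi-line values that real tools handle.

```python
import os

def load_env_file(path=".env", environ=os.environ):
    """Read KEY=VALUE lines from path into environ without overriding existing keys."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip().strip("'\"")
            # Values already set in the environment win over the file.
            environ.setdefault(key, value)
            loaded[key] = environ[key]
    return loaded
```

Calling `load_env_file()` at the top of a script would pick up a .env file in the current directory. Letting existing environment variables win is a design choice here, so that a value set in the shell or by a deployment platform is not silently replaced by the file.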

I am also a fan of oh-my-zsh to automate some basic tasks like:

Showing Git branch and commit status in the terminal
Activating a project’s virtual environment
Adding the variables in .env to the current environment

But when I tried to get this all working on my new Mac, I could not activate any virtual environment. Ultimately I discovered it was because the shell would prompt me to source .env, and that prompt somehow interfered with activating the virtual environment. There seem to be two solutions:

Tell oh-my-zsh to always source .env. This breaks the virtual environment the first time, so you need to exit and then cd back to the directory. Or,
Don’t use the .env plugin.

Self-Serving First Post

Cameron — Fri, 07 Oct 2022 16:45:08 +0000

After a long hiatus from screaming into the ether, I warm up my vocal cords again to shout into the void. Perhaps the void will shout back. Perhaps two years from now when I cannot figure out a basic problem, I will concoct some Google searches that will lead me back here to learn that former-me solved this problem after way-too-much effort.

But now I find myself confronted with messy problems, and writing about them helps to give them some form and structure, and that’s enough, for now.