Hartley Brody

How to Scrape Amazon.com: 19 Lessons I Learned While Crawling 1MM+ Product Listings

Hartley Brody — Wed, 03 Aug 2016 18:55:13 +0000

In its simplest form, web scraping is about making requests and extracting data from the response. For a small web scraping project, your code can be simple. You just need to find a few patterns in the URLs and in the HTML response and you’re in business.

But everything changes when you’re trying to pull over 1,000,000 products from the largest ecommerce website on the planet.

When crawling a sufficiently large website, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.

This was my first time doing a scrape of this magnitude. I made some mistakes along the way, and learned a lot in the process. It took several days (and quite a few false starts) to finally crawl the millionth product. If I had to do it again, knowing what I now know, it would take just a few hours.

In this article, I’ll walk you through the high-level challenges of pulling off a crawl like this, and then run through all of the lessons I learned. At the end, I’ll show you the code I used to successfully pull 1MM+ items from amazon.com.

I’ve broken it up as follows:

High-Level Challenges I Ran Into

There were a few challenges I ran into that you’ll see on any large-scale crawl of more than a few hundred pages. These apply to crawling any site or running a sufficiently large crawling operation across multiple sites.

High-Performance is a Must

In a simple web scraping program, you make requests in a loop, one after the other. If a site takes 2-3 seconds to respond, then you’re looking at making 20-30 requests a minute. At this rate, your crawler would have to run non-stop for about a month before you made your millionth request.

Not only is this very slow, it’s also wasteful. The crawling machine is sitting there idly for those 2-3 seconds, waiting for the network to return before it can really do anything or start processing the next request. That’s a lot of dead time and wasted resources.

When thinking about crawling anything more than a few hundred pages, you really have to think about putting the pedal to the metal and pushing your program until it hits the bottleneck of some resources — most likely network or disk IO.

I didn’t need to do this for my purposes (more later), but you can also think about ways to scale a single crawl across multiple machines, so that you can even start to push past single-machine limits.

Avoiding Bot Detection

Any site that has a vested interest in protecting its data will usually have some basic anti-scraping measures in place. Amazon.com is certainly no exception.

You have to have a few strategies up your sleeve to make sure that individual HTTP requests — as well as the larger pattern of requests in general — don’t appear to be coming from one centralized bot.

For this crawl, I made sure to:

Spoof headers to make requests seem to be coming from a browser, not a script
Rotate IPs using a list of over 500 proxy servers I had access to
Strip “tracking” query params from the URLs to remove identifiers linking requests together

The Crawler Needed to be Resilient

The crawler needs to be able to operate smoothly, even when faced with common issues like network errors or unexpected responses.

You also need to be able to pause and continue the crawl, updating code along the way, without going back to “square one”. This allows you to update parsing or crawling logic to fix small bugs, without needing to rescrape everything you did in the past few hours.

I didn’t have this functionality initially and I regretted it, wasting tons of hours hitting the same URLs again and again whenever I needed to make updates to fix small bugs affecting only a few pages.

Crawling At Scale Lessons Learned

From the simple beginnings to the hundreds of lines of Python I ended up with, I learned a lot in the process of running this project. All of these mistakes cost me time in some fashion, and learning the lessons I present here will make your amazon.com crawl much faster from start to finish.

1. Do the Back of the Napkin Math

When I did a sample crawl to test my parsing logic, I used a simple loop and made requests one at a time. After 30 minutes, I had pulled down about 1000 items.

Initially, I was pretty stoked. “Yay, my crawler works!” But when I turned it loose on the full data set, I quickly realized it wasn’t feasible to run the crawl like this at full scale.

Doing the back of the napkin math, I realized I needed to be doing dozens of requests every second for the crawl to complete in a reasonable time (my goal was 4 hours).

This required me to go back to the drawing board.
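Here’s a minimal sketch of that napkin math, using the numbers above (about 1,000 items in 30 minutes, a 4-hour goal). The function names are made up for illustration, and it assumes one request per item, which lesson 17 later improves on.

```python
def required_requests_per_second(total_requests, hours):
    """Request rate needed to finish `total_requests` within `hours`."""
    return total_requests / (hours * 3600)


def sequential_crawl_days(total_requests, seconds_per_request):
    """Days a one-at-a-time loop needs if each request takes `seconds_per_request`."""
    return total_requests * seconds_per_request / 86400


# Sample crawl: about 1,000 items in 30 minutes.
sample_rate = 1000 / (30 * 60)                            # ~0.56 items per second
days_at_sample_rate = 1_000_000 / sample_rate / 86400     # ~20.8 days for 1MM items
target_rate = required_requests_per_second(1_000_000, 4)  # ~69.4 requests per second
```

At the sample crawl’s pace the full run would take roughly three weeks, while the 4-hour goal needs about 69 requests every second.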

2. Performance is Key, Need to be Multi-Threaded

In order to speed things up and not wait for each request, you’ll need to make your crawler multi-threaded. This allows the CPU to stay busy working on one response or another, even when each request is taking several seconds to complete.

You can’t rely on single-threaded, network blocking operations if you’re trying to do things quickly. I was able to get 200 threads running concurrently on my crawling machine, giving me a 200x speed improvement without hitting any resource bottlenecks.
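To make the idea concrete, here’s a small sketch using Python’s standard concurrent.futures thread pool. It is not the code from my crawler: fake_fetch is a stand-in that sleeps instead of making a real request, and the URLs and thread count are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, delay=0.05):
    """Stand-in for a blocking HTTP request; sleeps instead of touching the network."""
    time.sleep(delay)
    return url, 200


def crawl_concurrently(urls, max_threads=20):
    """Fetch every URL with a pool of worker threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(fake_fetch, urls))


urls = ["https://example.com/page/%d" % i for i in range(40)]
started = time.time()
results = crawl_concurrently(urls)
# 40 sequential fetches would sleep for about 2 seconds in total;
# with 20 threads the waits overlap and the batch finishes much sooner.
elapsed = time.time() - started
```

Because the threads spend nearly all their time waiting on the network, adding more of them keeps paying off until you hit one of the bottlenecks described in the next lesson.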

3. Know Your Bottlenecks

You need to keep an eye on the four key resources of your crawling machine (CPU, memory, disk IO and network IO) and make sure you know which one you’re bumping up against.

What is keeping your program from making 1MM requests all at once?

The most likely resource you’ll use up is your network IO — the machine simply won’t be capable of writing to the network (making HTTP requests) or reading from the network (getting responses) fast enough, and this is what your program will be limited by.

Note that it’ll likely take hundreds of simultaneous requests before you get to this point. You should look at performance metrics before you assume your program is being limited by the network.

Depending on the size of your average requests and how complex your parsing logic is, you also could run into CPU, memory or disk IO as a bottleneck.

You also might find bottlenecks before you hit any resource limits, like if your crawler gets blocked or throttled for making requests too quickly.

This can be avoided by properly disguising your request patterns, as I discuss below.

4. Use the Cloud

I used a single beefy EC2 cloud server from Amazon to run the crawl. This allowed me to spin up a very high-performance machine that I could use for a few hours at a time, without spending a ton of money.

It also meant that the crawl wasn’t running from my computer, burning my laptop’s resources and my local ISP’s network pipes.

5. Don’t Forget About Your Instances

The day after I completed the crawl, I woke up and realized I had left an m4.10xlarge running idly overnight.

I probably wasted an extra $50 in EC2 fees for no reason. Make sure you stop your instances when you’re done with them!

6. Use a Proxy Service

This one is a bit of a no-brainer, since 1MM requests all coming from the same IP will definitely look suspicious to a site like Amazon that can track crawlers.

I’ve found that it’s much easier (and cheaper) to let someone else orchestrate all of the proxy server setup and maintenance for hundreds of machines, instead of doing it yourself.

This allowed me to use one high-performance EC2 server for orchestration, and then rent bandwidth on hundreds of other machines for proxying out the requests.

I used ProxyBonanza and found it to be quick and simple to get access to hundreds of machines.

7. Don’t Keep Much in Runtime Memory

If you keep big lists or dictionaries in memory, you’re asking for trouble. What happens when you accidentally hit Ctrl-C when 3 hours into the scrape (as I did at one point)? Back to the beginning for you!

Make sure that the important progress information is stored somewhere more permanent.

8. Use a Database for Storing Product Information

Store each product that you crawl as a row in a database table. Definitely don’t keep them floating in memory or try to write them to a file yourself.

Databases will let you perform basic querying, exporting and deduping, and they also have lots of other great features. Just get in a good habit of using them for storing your crawl’s data.
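As an illustration, here’s a sketch that stores products in SQLite through Python’s built-in sqlite3 module. SQLite is a stand-in chosen so the example runs anywhere; the table layout, the column names and the choice of image_url as the dedupe key are assumptions, not the schema my crawler used.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the products table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               id INTEGER PRIMARY KEY,
               title TEXT,
               url TEXT,
               price TEXT,
               image_url TEXT UNIQUE
           )"""
    )
    return conn


def save_product(conn, product):
    """Insert one product row; return True if stored, False if it was a duplicate."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO products (title, url, price, image_url) "
        "VALUES (?, ?, ?, ?)",
        (product["title"], product["url"], product["price"], product["image_url"]),
    )
    conn.commit()
    return cur.rowcount == 1
```

The UNIQUE constraint lets the database do the deduping: a second insert with the same image URL is silently ignored instead of creating another row.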

9. Use Redis for Storing a Queue of URLs to Scrape

Store the “frontier” of URLs that you’re waiting to crawl in an in-memory cache like redis. This allows you to pause and continue your crawl without losing your place.

If the cache is accessible over the network, it also allows you to spin up multiple crawling machines and have them all pulling from the same backlog of URLs to crawl.
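A queue like this only needs two operations. The sketch below assumes the third-party redis-py client (pip install redis) and a Redis server on localhost; the key name listing_queue and the helper names are hypothetical.

```python
QUEUE_KEY = "listing_queue"  # hypothetical key name


def connect(host="localhost", port=6379):
    """Connect using the third-party redis-py client; imported lazily on purpose."""
    import redis
    return redis.Redis(host=host, port=port, db=0)


def enqueue_url(client, url):
    """Add a URL to the end of the shared frontier."""
    client.rpush(QUEUE_KEY, url)


def dequeue_url(client):
    """Pop the next URL to crawl, or return None when the frontier is empty."""
    value = client.lpop(QUEUE_KEY)
    if isinstance(value, bytes):  # redis-py returns bytes unless decode_responses=True
        value = value.decode("utf-8")
    return value
```

Since the list lives in Redis rather than in the crawler process, you can kill the crawler, fix a bug and restart it, and dequeue_url(connect()) picks up where the last run left off.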

10. Log to a File, Not `stdout`

While it’s temptingly easy to simply print all of your output to the console via stdout, it’s much better to pipe everything into a log file. You can still watch the log lines come in, in real time, by running tail -f on the logfile.

Having the logs stored in a file makes it much easier to go back and look for issues. You can log things like network errors, missing data or other exceptional conditions.

I also found it helpful to log the current URL that was being crawled, so I could easily hop in, grab the current URL that was being crawled and see how deep it was in any category. I could also watch the logs fly by to get a sense of how fast requests were being made.
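One way to set this up is with Python’s standard logging module. This is a minimal sketch; the logger name, file name and line format are arbitrary choices.

```python
import logging


def build_logger(path="crawler.log"):
    """Return a logger that appends timestamped lines to the file at `path`."""
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        # Writing the file as UTF-8 keeps non-ASCII product titles from breaking logging.
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Calling build_logger().info("crawling %s", url) before each request gives you the running list of URLs described above.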

11. Use `screen` to Manage the Crawl Process instead of your SSH Client

If you SSH into the server and start your crawler with python crawler.py, what happens if the SSH connection closes? Maybe you close your laptop or the wifi connection drops. You don’t want that process to get orphaned and potentially die.

Using the built-in Unix screen command allows you to disconnect from your crawling process without worrying that it’ll go away. You can close your laptop and simply SSH back in later, reconnect to the screen, and you’ll see your crawling process still humming along.

12. Handle Exceptions Gracefully

You don’t want to start your crawler, go work on other stuff for 3 hours and then come back, only to find that it crashed 5 minutes after you started it.

Any time you run into an exceptional condition, simply log that it happened and continue. It makes sense to add exception handling around any code that interacts with the network or the HTML response.

Be especially aware of non-ascii characters breaking your logging.
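A simple pattern is to wrap each unit of work in one try/except that logs and moves on. In this sketch, fetch and parse are whatever functions you pass in; process_url is a hypothetical helper name.

```python
import logging

logger = logging.getLogger("crawler")


def process_url(url, fetch, parse):
    """Fetch and parse one URL; log any failure and return None instead of crashing."""
    try:
        html = fetch(url)
        return parse(html)
    except Exception:
        # logger.exception records the traceback so the failure can be investigated
        # later; %r keeps unusual characters in the URL from breaking the log line.
        logger.exception("Failed to process %r", url)
        return None
```

A worker that gets None back simply skips that page and pulls the next URL off the queue.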

Site-Specific Lessons I Learned About Amazon.com

Every site presents its own web scraping challenges. Part of any project is getting to know which patterns you can leverage, and which ones to avoid.

Here’s what I found.

13. Spoof Headers

Besides using proxies, the other classic obfuscation technique in web scraping is to spoof the headers of each request. For this crawl, I just grabbed the User Agent that my browser was sending as I visited the site.

If you don’t spoof the User Agent, you’ll get a generic anti-crawling response for every request to Amazon.

In my experience, there was no need to spoof other headers or keep track of session cookies. Just make a GET request to the right URL — through a proxy server — and spoof the User Agent and that’s it — you’re past their defenses.
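Here’s a sketch of that combination using only the standard library’s urllib. The User-Agent string and the proxy addresses are placeholder assumptions; substitute the header your own browser sends and the proxies your provider gives you.

```python
import random
import urllib.request

# Placeholder values, not real credentials or working proxies.
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
)
PROXIES = ["http://10.0.0.1:8080", "http://10.0.0.2:8080", "http://10.0.0.3:8080"]


def build_request(url):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def build_opener_with_random_proxy():
    """Pick one proxy at random; return it with an opener that routes through it."""
    proxy = random.choice(PROXIES)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch `url` through a random proxy; needs real proxies configured to work."""
    _, opener = build_opener_with_random_proxy()
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()
```

Picking a new proxy for every request spreads the traffic across the whole pool, so no single IP makes more than a small share of the requests.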

14. Strip Unnecessary Query Parameters from the URL

One thing I did out of an abundance of caution was to strip out unnecessary tracking parameters from the URL. I noticed that clicking around the site seemed to append random IDs to the URL that weren’t necessary to load the product page.

I was a bit worried that they could be used to tie requests to each other, even if they were coming from different machines, so I made sure my program stripped down URLs to only their core parts before making the request.
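One way to do the stripping is with an allowlist of query parameters. In this sketch the parameter names in KEEP_PARAMS are assumptions for illustration; check which ones the pages you’re crawling actually need.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the only parameters this crawl needs; everything else is dropped.
KEEP_PARAMS = {"node", "page"}


def strip_tracking_params(url):
    """Return `url` with only allowlisted query parameters and no fragment."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in KEEP_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

An allowlist is safer than a blocklist here, since any new tracking parameter the site introduces gets dropped automatically.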

15. Amazon’s Pagination Doesn’t Go Very Deep

While some categories of products claim to contain tens of thousands of items, Amazon will only let you page through about 400 pages per category.

This is a common limit on many big sites, including Google search results. Humans don’t usually click past the first few pages of results, so the sites don’t bother to support that much pagination. It also means that going too deep into results can start to look a bit fishy.

If you want to pull in more than a few thousand products per category, you need to start with a list of lots of smaller subcategories and paginate through each of those. But keep in mind that many products are listed in multiple subcategories, so there may be a lot of duplication to watch out for.

16. Products Don’t Have Unique URLs

The same product can live at many different URLs, even after you strip off tracking URL query params. To dedupe products, you’ll have to use something more specific than the product URL.

How to dedupe depends on your application. It’s entirely possible for the exact same product to be sold by multiple sellers. You might look for ISBN or SKU for some kinds of products, or something like the primary product image URL or a hash of the primary image.
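As one example, here’s a sketch of the image-hash approach. It assumes each product is a dict with the raw bytes of its primary image under a hypothetical image_bytes key, and the choice of SHA-1 is arbitrary.

```python
import hashlib


def image_fingerprint(image_bytes):
    """Return a hex digest that identifies the exact bytes of an image."""
    return hashlib.sha1(image_bytes).hexdigest()


def dedupe_products(products):
    """Keep the first product seen for each distinct primary-image fingerprint."""
    seen = set()
    unique = []
    for product in products:
        key = image_fingerprint(product["image_bytes"])
        if key not in seen:
            seen.add(key)
            unique.append(product)
    return unique
```

Note that this only catches byte-for-byte identical images; the same photo re-encoded or resized would hash differently.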

17. Avoid Loading Detail Pages

This realization helped me make the crawler 10-12x faster, and much simpler. I realized that I could grab all of the product information I needed from the subcategory listing view, and didn’t need to load each product’s detail page.

I was able to grab 10-12 products with one request, including each of their titles, URLs, prices, ratings, categories and images — instead of needing to make a request to load each product’s detail page separately.

Whether you need to load the detail page to find more information like the description or related products will depend on your application. But if you can get by without it, you’ll get a pretty nice performance improvement.

18. Cloudfront has no Rate Limiting for Amazon.com Product Images

While I was using a list of 500 proxy servers to request the product listing URLs, I wanted to avoid downloading the product images through the proxies since it would chew up all my bandwidth allocation.

Fortunately, the product images are served using Amazon’s CloudFront CDN, which doesn’t appear to have any rate limiting. I was able to download over 100,000 images with no problems — until my EC2 instance ran out of disk space.

Then I broke out the image downloading into its own little python script and simply had the crawler store the URL to the product’s primary image, for later retrieval.

19. Store Placeholder Values

There are lots of different types of product pages on Amazon. Even within one category, there can be several different styles of HTML markup on individual product pages, and it might take you a while to discover them all.

If you’re not able to find a piece of information in the page with the extractors you built, store a placeholder value like “” in your database.

This allows you to periodically query for products with missing data, visit their product URLs in your browser and find the new patterns. Then you can pause your crawler, update the code and then start it back up again, recognizing the new pattern that you had initially missed.
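In code, that workflow might look like the sketch below. The regex patterns are hypothetical examples of two markup styles, not Amazon’s real HTML, and the function names are made up for illustration.

```python
import re

PLACEHOLDER = ""  # value stored when no extractor matches

# Hypothetical patterns: real listing markup varies and needs its own extractors.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">([^<]+)</span>'),
    re.compile(r'<span class="sale-price">([^<]+)</span>'),
]


def extract_price(html):
    """Try each known pattern in turn; fall back to the placeholder if none match."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return PLACEHOLDER


def needs_review(products):
    """Return the URLs of products whose price could not be extracted."""
    return [p["url"] for p in products if p["price"] == PLACEHOLDER]
```

When needs_review turns up a new style of page, you add one more pattern to the list and restart the crawler.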

How My Finished, Final Code Works

TL;DR: Here’s a link to my code on GitHub. It has a readme for getting you set up and started on your own amazon.com crawler.

Once you get the code downloaded, the libraries installed and the connection information stored in the settings file, you’re ready to start running the crawler!

If you run it with the “start” command, it looks at the list of category URLs you’re interested in, and then goes through each of those to find all of the subcategory URLs that are listed on those pages, since paginating through each category is limited (see lesson #15, above).

It puts all of those subcategory URLs into a redis queue, and then spins up a number of threads (based on settings.max_threads) to process the subcategory URLs. Each thread pops a subcategory URL off the queue, visits it, pulls in the information about the 10-12 products on the page, and then puts the “next page” URL back into the queue.

The process continues until the queue is empty or settings.max_requests has been reached.

Note that the crawler does not currently visit each individual product page since I didn’t need anything that wasn’t visible on the subcategory listing pages, but you could easily add another queue for those URLs and a new function for processing those pages.
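The loop described above can be sketched roughly as follows. This is a simplified illustration, not the code from the repository: it uses an in-process queue.Queue where the real crawler uses redis, and process_listing is a function you supply that returns the products on a page plus the “next page” URL (or None).

```python
import queue
import threading


def run_crawl(start_urls, process_listing, max_threads=4, max_requests=100):
    """Process URLs from a shared queue until it is empty or max_requests is hit."""
    frontier = queue.Queue()
    for url in start_urls:
        frontier.put(url)

    products = []
    lock = threading.Lock()
    state = {"requests": 0}

    def worker():
        while True:
            with lock:
                if state["requests"] >= max_requests:
                    return
                try:
                    url = frontier.get_nowait()
                except queue.Empty:
                    return
                state["requests"] += 1
            found, next_url = process_listing(url)
            with lock:
                products.extend(found)
            if next_url:
                # The same worker loops again, so a queued next page is always
                # picked up even if the other workers have already exited.
                frontier.put(next_url)

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return products
```

One simplification to be aware of: a worker exits as soon as it finds the queue empty, so concurrency can drop near the end of a crawl even though no URL is lost.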

Hope that helps you get a better sense of how you can conduct a large scrape of amazon.com or a similar ecommerce website.

If you’re interested in learning more about web scraping, I have an online course that covers the basics and teaches you how to get your own web scrapers running in 15 minutes.

Facebook Messenger Bot Tutorial: Step-by-Step Instructions for Building a Basic Facebook Chat Bot

Hartley Brody — Wed, 15 Jun 2016 21:23:26 +0000

First there were desktop software products, then everything moved to the web. Then there were email-based products and even SMS-based ones. The latest craze in software interfaces is messenger bots, and Facebook has the largest chat platform by a long shot.

In this tutorial, I’ll show you how to build your own Facebook Messenger Chat Bot in Python. We’ll use Flask for some basic web request handling, and we’ll deploy the app to Heroku.

Let’s get started.

Step #1: Create a Working Webhook Endpoint

We’ll get into the meat of sending and receiving messages in a bit, but first you need to have a working endpoint that returns a 200 response code and echoes back some information in order to verify your bot with Facebook.

First, git clone the GitHub repository that I set up for this project:

git clone git@github.com:hartleybrody/fb-messenger-bot.git

Then, cd into it and install Python dependencies:

mkvirtualenv test-bot
pip install -r requirements.txt

For simplicity, we’ll deploy this to Heroku, but you could also deploy this Flask web app to any server you have access to.

Assuming you already have the Heroku CLI Toolbelt installed, you can run

heroku create

to get your new application set up.

We’re also using Heroku’s convention for the Procfile to tell it how to run the app, but you could set this up on your own server with something like nginx in front of one or more gunicorn processes.

To verify that Heroku can run things locally on your machine, start your local server with:

heroku local

Then, in your browser, visit http://localhost:5000/ and you should see “Hello world”.

Kill the local server with Ctrl+C. To deploy this endpoint to Heroku

git push heroku master

And to open it in your browser

heroku open

Now you’ve got a “working” webhook URL that you can use to set up your bot. Make sure you grab the full https://*.herokuapp.com URL from your browser since we’ll need it in a bit.

Step #2: Create a Facebook Page

If you don’t already have one, you need to create a Facebook Page. The Facebook Page is the “identity” of your bot, including the name and image that appears when someone chats with it inside Facebook Messenger.

If you’re just creating a dummy one for your chatbot, it doesn’t really matter what you name it or how you categorize it. You can skip through most of the setup steps.

In order to communicate with your bot, people will need to go through your Page, which we’ll look at in a bit.

Step #3: Create a Facebook App

Go to the Facebook Developer’s Quickstart Page and click “Skip and Create App ID” at the top right. Then create a new Facebook App for your bot and give your app a name, category and contact email.

You’ll see your new App ID at the top right on the next page. Scroll down and click “Get Started” next to Messenger.

Step #4: Set Up Your Messaging App

Now you’re in the Messenger settings for your Facebook App. There are a few things in here you’ll need to fill out in order to get your chatbot wired up to the Heroku endpoint we set up earlier.

Generate a Page Access Token
Using the Page you created earlier (or an existing Page), click through the auth flow and you’ll receive a Page Access Token for your app.

Click on the Page Access Token to copy it to your clipboard. You’ll need to set it as an environment variable for your Heroku application. On the command line, in the same folder where you cloned the application, run:

heroku config:add PAGE_ACCESS_TOKEN=$your_page_token_here

This token will be used to authenticate your requests whenever you try to send a message or reply to someone.

Set Up Webhook
When you go to set up your webhook, you’ll need a few bits of information:

Callback URL – The Heroku (or other) URL that we set up earlier.
Verification Token – A secret value that will be sent to your bot, in order to verify the request is coming from Facebook. Whatever value you set here, make sure you add it to your Heroku environment using heroku config:add VERIFY_TOKEN=$your_verification_token_here
Subscription Fields – This tells Facebook what messaging events you care about and want it to notify your webhook about. If you’re not sure, just start with “messages,” as you can change this later

After you’ve configured your webhook, you’ll need to subscribe to the specific page you want to receive message notifications for.

Once you’ve gotten your Page Access Token and set up your webhook, make sure you set both the PAGE_ACCESS_TOKEN and VERIFY_TOKEN config values in your Heroku application, and you should be good to go!
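For reference, the verification handshake works like this: Facebook sends a GET request to your callback URL with hub.mode, hub.verify_token and hub.challenge query parameters, and expects the challenge echoed back. Here’s a framework-free sketch of that check; verify_webhook is a hypothetical helper name, and the response strings are arbitrary.

```python
def verify_webhook(query_params, expected_token):
    """Return (body, status_code) for Facebook's webhook verification GET request."""
    if query_params.get("hub.mode") == "subscribe" and query_params.get("hub.challenge"):
        if query_params.get("hub.verify_token") != expected_token:
            # Someone other than Facebook (or a misconfigured token) is calling.
            return "Verification token mismatch", 403
        # Echo the challenge back so Facebook accepts the webhook.
        return query_params["hub.challenge"], 200
    # Any other GET request, such as a browser visit, just gets a greeting.
    return "Hello world", 200
```

In a Flask view you could call it with request.args and os.environ["VERIFY_TOKEN"] and return the resulting tuple directly.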

Step #5: Start Chatting with Your Bot

Go to the Facebook Page you created and click on the “Message” button, next to the “Like” button near the top of the page. This should open a message pane with your Page.

Start sending your Page messages and the bot should reply!

To see what’s happening, check the logs of your application

heroku logs -t

You should see the POST data that Facebook is sending to your endpoint whenever a new message is sent to your Page’s bot.

Here’s an example JSON POST body that I got when I sent “does this work?” to my bot

        {
            "object":"page",
            "entry":[
                {
                    "messaging":[
                        {
                            "message":{
                                "text":"does this work?",
                                "seq":20,
                                "mid":"mid.1466015596912:7348aba4de4cfddf91"
                            },
                            "timestamp":1466015596919,
                            "sender":{
                                "id":"885721401551027"
                            },
                            "recipient":{
                                "id":"260317677677806"
                            }
                        }
                    ],
                    "time":1466015596947,
                    "id":"260317677677806"
                }
            ]
        }

By default, the bot should respond to everything with “got it, thanks!”

Step #6: Customize Your Bot’s Behavior

Here’s where we finally start to dive into the code.

There are really only two key parts to a messaging bot: receiving and sending messages

Receiving Messages
We handle incoming messages starting on line 24 inside app.py, in our `webhook()` view function.

First we load in the JSON POST data that’s sent to the webhook from Facebook whenever a new messaging event is triggered, usually when someone sends a message to our Page.

Then we loop over each entry — in my testing experience, there’s only ever been one entry sent to the webhook at a time.

Then we loop over each of the messaging events. Here, there may be several messaging events.

In step #4, we told Facebook what message types we want our webhook to be notified about. If you followed my advice, then our endpoint will only receive “message” events, but we could also receive delivery confirmations, optins and postbacks (more on those later). I left some code in place for detecting those other types of messaging events, but I don’t actually handle them.

The messaging event that will be most useful to most applications will be the “message” event, meaning someone has sent your Page a new message. I wrote some basic code to handle that event, parsing out the sender’s ID, and simply responding back to them.
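The nested loops described above can be sketched as a small generator. This is an illustration based on the JSON body shown in step #5, not a copy of the code in app.py; extract_messages is a hypothetical name.

```python
def extract_messages(data):
    """Yield (sender_id, text) for every text message in a webhook POST body."""
    if data.get("object") != "page":
        return
    for entry in data.get("entry", []):
        for event in entry.get("messaging", []):
            message = event.get("message")
            # Delivery confirmations, optins and postbacks have no "message" key.
            if message and "text" in message:
                yield event["sender"]["id"], message["text"]
```

Run against the example body from step #5, it yields the sender’s ID along with the text “does this work?”.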

Sending Messages
In order to send a simple text message, you only need two things:

the recipient’s Facebook ID
the text of the message you want to send

I’ve created a simple send_message() function that automatically hits the Facebook API and sends those pieces of information.

Remember that the request is authenticated using the PAGE_ACCESS_TOKEN environment variable that we got back in step #4.
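To show what such a request contains, here’s a sketch that builds (but doesn’t send) the POST using only the standard library. It is a stand-in for the repository’s send_message(), not a copy of it, and the v2.6 endpoint URL is an assumption based on the API version current at the time of writing.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Send API endpoint and version in use when this was written.
SEND_API_URL = "https://graph.facebook.com/v2.6/me/messages"


def build_send_request(recipient_id, text, page_access_token):
    """Build the POST request that asks Facebook to deliver a text message."""
    query = urllib.parse.urlencode({"access_token": page_access_token})
    body = json.dumps({"recipient": {"id": recipient_id}, "message": {"text": text}})
    return urllib.request.Request(
        SEND_API_URL + "?" + query,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen() would actually send the message, with the token read from the PAGE_ACCESS_TOKEN environment variable.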

There are many more complex message types you can send, including messages with images and buttons. More information on those message types here.

Important to note is the ability to send a “postback” button in a message. These are essentially buttons that, when tapped by a user, send a postback messaging event to your webhook.

This essentially allows users to “press buttons” in your app, all while inside Facebook Messenger. You could use this for placing an order, confirming a request or lots of other things.

Whenever a user taps a postback button, your webhook is notified and can perform any sort of subsequent follow-up action necessary.
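For illustration, a message with postback buttons uses the button template, which looks roughly like the sketch below. build_postback_message is a hypothetical helper, and the titles and payload strings in the usage note are made up.

```python
def build_postback_message(recipient_id, prompt, buttons):
    """Build a button-template message; `buttons` is a list of (title, payload) pairs."""
    return {
        "recipient": {"id": recipient_id},
        "message": {
            "attachment": {
                "type": "template",
                "payload": {
                    "template_type": "button",
                    "text": prompt,
                    "buttons": [
                        {"type": "postback", "title": title, "payload": payload}
                        for title, payload in buttons
                    ],
                },
            }
        },
    }
```

For example, build_postback_message(sender_id, "Confirm your order?", [("Yes", "ORDER_CONFIRM"), ("No", "ORDER_CANCEL")]) produces a body you can POST to the same endpoint used for text messages. When the user taps a button, the payload string comes back to your webhook in a postback event.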

Step #7: Submit Your App to be Reviewed

While you’re testing your bot, only you and other Page admins can message with the bot directly. You have to go through a review process before your bot is open to the world, ready to chat with anyone.

Facebook seems to be very thorough in their review process, and with good reason. The code for a messaging bot runs on your own servers and could change at any time, without Facebook knowing.

They seem to be trying hard to make sure you’re a good actor, and not submitting a simple dummy app to get approved, only to change it to some spam bot down the road.

Obviously, they could still revoke your API access tokens if you did that, but they’d rather not have any abuse on the Messenger platform at all.

Go back to your Messenger App Settings page that we used in Step #4. Scroll down to “App Review for Messenger” and click “Request Permissions.”

Request the permissions that you need, and then you’ll be taken to the “Review Status” page. This page requires a ton of information to ensure that developers aren’t going to abuse the platform.

It requires you to

check several boxes verifying that you’ve read their policies and guidelines
promise you won’t engage in unsolicited, outbound messaging
describe how you’re going to interact with users through your bot
provide a test user that the review team can use to interact with your bot
upload a screencast of you interacting with your bot via Messenger
have a privacy policy
verify that you’re explaining the bot and setting expectations with users

On this page, you can also ask to be granted extra information about users, like their email or profile information.

Then it all goes to the Facebook review team to sign off and give you full access to the Messenger platform. More information about the approval process here.

Even if you don’t intend to go all the way through the review process, hopefully you’ve learned a thing or two about how to build a simple chat bot for Facebook Messenger.

Check out my code here.

7 Reasons I Won’t Sign Your NDA Before a Coffee Meeting

Hartley Brody — Wed, 17 Feb 2016 20:17:38 +0000

In my work as a full stack web developer, I often meet clients who request that I sign a Non-Disclosure Agreement (NDA) at various stages of the project.

By signing an NDA, the client is basically asking me to agree that I won’t take their idea and work on it myself, or share it with anyone else who will.

For most entrepreneurs, that sounds like a smart idea, right?

Here’s why I won’t sign them.

I Hear a Lot of Ideas

For every client that I end up starting a project with, I often talk to a dozen different businesses at various stages of needing help. As such, I hear lots of product ideas in any given week.

Have you ever started a sentence with “Oh, I can’t remember where I read this, but…” It’s hard to keep track of who told me what when you consider the sheer number of business ideas and products that I hear.

Similarly, it’s hard for me to keep track of which ideas are protected by an NDA. It’s not worth my time to make a catalogue of who told me exactly what idea and what agreement I had with them. It’s much easier to simply never sign NDAs.

No Value for Me

If you ask me to sign an NDA before we even have an introductory meeting to talk about the project, then by definition I know very little about you or your idea.

In the legal system, whenever two parties sign a contract, there’s a concept of “consideration” which basically means that all parties are receiving something of value by signing.

I don’t even know what we’re going to be talking about yet! How am I supposed to know what the value is or what I’m getting by signing the NDA?

Creates Liability for Me

Without knowing anything about your business, I can’t know ahead of time what ideas I’m agreeing to not compete with you on.

What if you tell me the idea and it’s something I already worked on or have thought of myself? Now you have a claim to any ideas I might already have about that topic.

Creates Liability for my Other Clients

Similarly, if I’m working with another client — or end up working with someone down the line — and that person comes up with their own version of your idea on their own, I could open them to liability.

If you find out that I worked on a project that sounds similar to your idea, you’d sue me and potentially my other client. I wouldn’t want to expose my other clients to that risk.

Sign of a Worthless Idea

Most experienced entrepreneurs know that ideas are a dime a dozen. I have several notebooks full of “million dollar startup” ideas lying around my apartment.

Execution on an idea is what matters. That’s what creates a valuable business. If the idea itself is so easy to execute on that it must be kept secret, then it’s probably not very strong.

Sign of an Overly Litigious Client

I’m a small business too, and as a small business owner, I want to play it conservative and keep myself out of any potential legal trouble.

As a contractor who relies on my reputation, I work on relationships founded in trust and us both keeping up our ends of the bargain on a good-faith gentlemen’s agreement.

Why would I intentionally meet with a potential client who comes in guns blazing with legal documents for me to sign? That’s a big red flag that you’re not very trusting and likely not trustworthy.

They’re not really worth anything

Now you may be reading all of this thinking — “woah woah woah, calm down! NDAs aren’t that big of a deal, I’m not going to sue you!”

You realize that’s the point of an NDA right? To make it easier for you to sue me?

If your legal advisor is “making” you have people sign it, just stop. NDAs are basically a super easy way for lawyers to cash in on naive first-time founders.

If you really think an NDA will stop someone from “stealing your idea” do some research into how valuable they are. Spoiler alert: they’re very rarely enforced, and the burden of proof is very high. You’re not likely to win, even in questionable cases.

I hope that gives you a better idea of my perspective on the issue. If you’re still okay with chatting, I’m happy to proceed without an NDA in the early stages.

If we actually start speccing out a project together and I’ll be getting into the guts of the execution, then I’d be happy to sign one.

But if you really insist on me signing one this early, then I’m afraid I must say no.

The 3 Mistakes Every Junior Developer Makes (And How to Stop Making Them)

Hartley Brody — Wed, 14 Oct 2015 19:40:06 +0000

As more and more people are learning to code, there are an increasing number of new developers in the work force. Code bootcamps are springing up everywhere that promise to land candidates a job with only a few months of experience.

Having worked with a number of junior developers (and having been one myself, at one point) I’ve noticed a lot of the same mistakes crop up.

In an effort to help junior developers level up to eventually become engineers, I decided to enumerate some of the mistakes I see most often, as well as my recommended solutions and takeaways.

Saying You’re “Done” with a Task Prematurely

This is probably the most common so I mention it first. New developers have a tendency to slap the last semicolon onto whatever code they’re writing, hastily commit it and proudly declare that they’re “done” to the rest of the team.

As soon as they’ve finished their first pass at an implementation, it’s “ready” for code review and testing. Except usually, it’s not.

Features of the product are very obviously broken, and it only takes someone a few seconds to notice. Or their code only implements half of the features it was supposed to, or doesn’t work for common use cases.

Since the junior developer has been heads down in the weeds wrestling with language syntax or the API of some new library, they haven’t yet lifted their head to survey the scene of where they came from and where they still need to go.

As a result, code reviews can be brutal, product and business stakeholders are confused about why things don’t work as they should, and the team generally loses faith in the junior developer’s autonomy. “Why would she say this is done when it clearly needs more work?”

Solution
The solution I always advocate for is the “step away from your desk” maneuver. If you’ve pushed up your last commit and are ready to check off the Asana ticket or move the Trello card: PAUSE.

Go get some water, walk around for 5 minutes and then come back to your desk. Reread the spec or ticket that you’re working from, and go back through the email thread where the last-minute adjustments were made. Make sure that everything that’s supposed to be in there is.

Then, try to actually use the feature you just built. Do a bit of your own quality assurance, even if that’s supposed to be “someone else’s job”. Click around the product and make sure nothing is obviously wrong.

Takeaway
As an engineer, your job isn’t just to write code, it’s to build a product or add features — and do no harm in the process.

Putting in the extra few minutes to review your work before you show it off will save you lots of embarrassment and help build the team’s trust that you know what you’re doing.

Implementing the First Solution You Think Of

Most of a junior developer’s first few projects will be largely spec’d out for them. They’ll be handed not only “what” to build, but also a list of steps for “how” to build it, at a high level.

But eventually, there will come a time when the team lead decides to trust a new developer to come up with their own technical implementation plan. And while this can be a great chance for someone who is newer to “earn their stripes,” it’s often a place where many falter.

The developer will come up with a plan forward, and stop there. “I’ve got it!”

The reality in software is that there are often dozens of different ways to implement something. Some solutions may reuse existing parts, while others may require new dependencies, abstractions or technologies.

To feel confident about a path forward, a good engineer will think through and then weigh the pros and cons of multiple potential solutions before establishing a path forward.

“We could use this library, but that adds a dependency. Or we could implement it ourselves, but it might take a while and be harder to test.”

When the junior developer presents their plan to the team and the team starts asking questions about it, the developer is often caught off guard, having not really considered where their plan might have some holes in it, or why it’s better than competing ideas.

Solution
If you’re putting together a technical plan, try to come up with two or three alternative solutions, and understand when and how some are better or worse than each other.

Think up a list of a few pros and cons for each solution, considering both technical merits as well as business ones (“this would save us time, be cheaper to implement, etc”).

Takeaway
The opportunity to come up with your first technical spec or implementation plan can be a great moment in your career.

To really hit it out of the park, do your research and be ready to defend your recommendations — don’t just present the first ones you come up with.

Forgetting that Code is Read More than it is Written

Most code that gets shipped will ideally live on production for more than just a few weeks. Which is plenty of time for everyone who worked on it to completely forget everything about it.

While it might be written over the course of a few hours or days, it’ll likely be read dozens of times over months and years by other engineers.

Every line of code that’s written should be clear, concise and self-documenting. Abstractions should make sense and be reusable in other contexts. Code should follow a style guide for consistency.

By definition, someone who is new to the profession won’t have experience trying to find bugs or add features to code that they wrote a long time ago.

Hopefully they’ll have some experience reading and working with other engineers’ old code, but they might not appreciate its elegance or clarity (or lack thereof).

One of the most important things you can learn as a junior developer is how to write clear, readable code. While your language’s interpreter might not care if you use confusing, nonsensical variable or function names, your colleagues certainly will.

While doing a code review a few years ago, I saw a JavaScript function called is_ready() which, from its name alone, would seem to return a boolean indicating whether something was “ready”.

But instead, it returned a jQuery node, which resulted in code like var modal = is_ready();. And while that’s valid JavaScript that any browser would happily interpret, as a human I could make no sense of what was happening here.

Solution
After you’ve gotten a new feature working, go back and read through all the code you just wrote. Maybe you started out calling a function one way but then you added more parameters and changed the implementation slightly and now it does something different. Rename it (and update all the places you call it) so that it reads closer to English and makes sense.

Takeaway
Deciding what is “more readable” can sometimes be subjective. But it’s good practice to work on it in your own code and to try to recognize it in other people’s code as well.

Final Thought

If you’re a junior developer who feels like you’ve made some of these mistakes, don’t be discouraged! I decided to write this because I’ve seen so many new engineers who needed help learning about these same issues.

Being a good software engineer is an ongoing learning process: not just learning how to wrangle code, but also learning how to be a productive, effective and well-liked member of your team.

Scaling Your Web App 101: Lessons in Architecture Under Load

Hartley Brody — Wed, 09 Sep 2015 00:47:41 +0000

It’s the classic champagne problem that most successful web apps will deal with — there are so many users on your site that things are starting to get bogged down.

Pages load slowly, network connections start timing out and your servers are starting to creak under heavy load. Congratulations — your web app has hit scale!

But now what? You need to keep everything online and want the user’s experience to be fast — speed is a feature after all.

Scaling Comes at a Price

But before we go any further, an important caveat — you shouldn’t attempt to “scale” your web app before you’ve actually run into real scaling problems.

While it may be fun to read about Facebook’s architecture on their engineering blog, it can be disastrous to think that their solutions apply to your fledgling project.

A lot of the solutions to common scaling bottlenecks introduce complexity, abstraction and indirection, which make systems more difficult to reason about. This can create all sorts of problems:

Adding new features takes longer
Code can be harder to test
Finding and fixing bugs is more frustrating
Getting local and production environments to match is more difficult

You should only be willing to accept these tradeoffs if your app is actually at the limits of what it can handle. Don’t introduce complexity until it’s warranted.

As the famous quote goes:

Premature optimization is the root of all evil.
— Donald Knuth

Find the Actual Bottleneck using Metrics

The first step to alleviating any problem — in software or otherwise — is to clearly and accurately define what the problem actually is.

A problem well stated is a problem half-solved.
— Charles Kettering

For a web app that’s under too much load, that means finding out what resource your application is running out of on the server.

At a high level, the answer is usually going to be one of four things:

Memory
CPU
Network I/O
Disk I/O

Until you figure out what resource your application is bounded by, no one can help you scale your app and any solutions you come up with will be complete guesses.

Figuring out what you’re bounded by means checking your resource monitoring — or adding some if you’ve never done it before.

What gets measured, gets managed
— Peter Drucker

If you’re managing your own servers, installing Munin is a great first step. If you’re running on Amazon’s EC2, AWS offers some decent instance monitoring out of the box. If you’re on Heroku, New Relic seems to be the best approach.

Use the graphs to look for spikes or flat tops. These usually imply that some resource was overwhelmed or completely at capacity and couldn’t handle any new operations.

If you don’t see any resources that seem to be at capacity, but your app is just slow in general, sprinkle some logging throughout heavily-used operations and check the logs to see if there’s some resource that’s taking a long time to load over the network.

It could be that another server is introducing delays — potentially your database server or a third-party API.
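To make “sprinkle some logging” a bit more concrete, here’s a minimal sketch of a timing helper using only Python’s standard library. The name timed, the "timing" logger and the half-second threshold are my own placeholders, not anything from a particular framework.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing")  # hypothetical logger name


@contextmanager
def timed(label, slow_threshold=0.5):
    """Log how long the wrapped block took, in seconds.

    Anything at or above slow_threshold is logged as a warning so that
    slow database or API calls stand out in the logs.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        level = logging.WARNING if elapsed >= slow_threshold else logging.INFO
        logger.log(level, "%s took %.3fs", label, elapsed)
```

You would wrap a suspect operation in something like with timed("load user from database"): and then look through the logs for the labels that keep showing up as warnings.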

If you host your database on a different machine than your web servers (which you should) it’s important to check your resource monitoring for that machine as well as for your web servers.

The database is usually the first place scaling issues start to show up.

Scaling a Web App from 10,000 Feet

Now that you’ve got a much better sense of what the problem is, you should start to tackle it by trying the simplest solution that directly addresses the issues — remember, we’re always trying to avoid adding unnecessary complexity.

At a high level, the goal of any scaling solutions should be to make your web stack do less work.

If you’ve already figured out the answer to a query, reuse it. Or if you can avoid computing it or looking it up altogether, even better.

In a tangible sense, this usually means one of the following:

Store results of common operations so you’re not repeating work
Reuse data you’ve already looked up, even if it’s a bit stale
Avoid doing complex operations in the request-response cycle
Don’t make requests from the client for things it already has

These all basically boil down to some form of caching.

Memory is not only inexpensive to add to a server, it’s usually many orders of magnitude faster for accessing data when compared to disk or the network.

Hosting Topology

Whether your application is hosted in the cloud or on hardware, some part of your stack will inevitably fail. You should host and arrange your web servers to take this into account.

Your domain should point to some sort of load balancer, which should then route requests between two or more web servers.

Not only does this setup make it easy to survive failures, it also makes handling increased load easier as well.

With a load balancer in front of two web servers, you can horizontally scale your application by bringing up new web servers and putting them behind the load balancer. Now the requests are spread across more machines, meaning each one is doing less work overall.

This allows you to grow your application gracefully over time, as well as handle temporary surges of traffic.

I should also add that setting up a load balancer and two web servers is a one-time setup that doesn’t add much on-going complexity, so it’s something you should consider doing up-front, even before you’ve run into scaling problems.
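In practice you’d use an off-the-shelf load balancer rather than writing one, but a toy version shows why this setup survives failures and scales horizontally. This is only a sketch: the class name, the server names and the round-robin strategy are all assumptions I picked for illustration.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def add_server(self, server):
        # Horizontal scaling: new machines simply join the rotation.
        self.servers.append(server)
        self.healthy.add(server)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy web servers")
```

If one server goes down, requests keep flowing to the rest, and adding a third server means each machine sees fewer requests.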

Cache Database Queries

This is one of the simplest improvements you can make. There are usually a few common queries that make up the majority of load on your database.

Most databases support query logging, and there are many tools that will ingest those logs and run some analysis to tell you what queries are run most frequently, and what queries tend to take the longest to complete.

Simply cache the responses to frequent or slow queries so they live in memory on the web server and don’t require a round-trip over the network or any extra load on the database.

Obviously, data that’s cached can grow “stale” or out-of-date quickly if the underlying information in the database is updated frequently. Your business or product requirements will dictate what can or can’t be cached.
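Here’s a rough sketch of what caching a query result in the web server’s memory might look like, with a time-to-live so stale data eventually gets refreshed. TTLCache and get_or_compute are hypothetical names, and a real app would more likely lean on its framework’s cache layer.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._store = {}    # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the database entirely
        value = compute()  # the slow or frequent query goes here
        self._store[key] = (now + self.ttl, value)
        return value
```

The TTL is where your product requirements come in: a 60-second TTL means users might see data that’s up to a minute old.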

Database Indexes

Database indexes turn needle-in-a-haystack type lookups from O(n) full scans into roughly O(log n) tree lookups (or O(1) for hash indexes).

In layman’s terms, this means the database can find the right row immediately, rather than having to compare the queried conditions against every single row in the table.

If your table has tens of thousands of rows, this could shave a noticeable amount of time off of any queries that use that column.

As a very simple example, if your application has profile pages that look up a user by their handle or username, an un-indexed query would examine every single row in the users table, looking for the ones where the “handle” column matched the handle in the URL.

By simply adding an index to that table for the “handle” column, the database could pull out that row immediately without requiring a full table scan.
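You can see the difference for yourself with SQLite, which ships with Python. This sketch builds a throwaway users table (the schema and index name are made up for the example) and asks the database how it plans to run the lookup before and after adding the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, handle TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (handle, name) VALUES (?, ?)",
    [("user{}".format(i), "User {}".format(i)) for i in range(10000)],
)


def query_plan(conn, handle):
    """Return SQLite's description of how it would run the handle lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM users WHERE handle = ?", (handle,)
    ).fetchall()
    # The human-readable detail is the last column of each row.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn, "user42")   # a full scan of the users table
conn.execute("CREATE INDEX idx_users_handle ON users (handle)")
plan_after = query_plan(conn, "user42")    # a search using idx_users_handle

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first should mention a scan of the table and the second should mention the index.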

Session Storage

A lot of applications handle sessions by storing a session ID in a cookie, and then storing the actual key/value data for each and every session in a database table.

If you find your database is getting slammed and your application does a lot of reading and writing to session data, it might be smart to rethink how and where you store your session data.

One option is to move your session storage to a faster, in-memory caching tool like redis or memcached.

Since these use volatile memory rather than persistent disk storage (which most databases use) they’re usually much faster to access — but the tradeoff is that you run the risk of losing all of your session data if the caching system needs to reboot or go offline.

Another option is to move the session information into the cookie itself. This obviously leaves it open to being tampered with by the user, so it shouldn’t be used if you’re storing anything private or sensitive in the session.

By moving session data out of the database, you’ll likely eliminate several database queries per page load, which can help your database’s performance tremendously.
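If you do go the cookie route, one common mitigation for tampering (my addition, not something covered above) is to sign the cookie so the server can detect changes. This is a minimal sketch with the standard library; SECRET_KEY and the function names are placeholders, and note that signing does not hide the contents from the user.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-long-random-secret"  # hypothetical; keep it server-side


def encode_session(data, key=SECRET_KEY):
    """Serialize session data and append an HMAC signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode("utf-8"))
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + sig


def decode_session(cookie, key=SECRET_KEY):
    """Return the session dict, or None if the cookie is malformed or altered."""
    try:
        payload, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Most web frameworks have a hardened version of this built in, so treat this as an illustration of the idea rather than something to ship.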

Run Computations Offline

If you have some long-running queries or complex business logic that takes several seconds to run, you probably shouldn’t be running it in the request-response cycle during a page load.

Instead, make it “offline” and have a pool of workers that can chug away at it and put the results in a database or in-memory cache.

Then when the page loads, your web server can simply and quickly pull the precomputed data out of the cache and show it to the user.

A drawback here is that the data you’re showing the user is no longer “real time,” but having data that’s a few minutes old is often good enough for many use-cases.

If the data really takes a long time to generate, see if it can be parallelized so that multiple workers can work on different parts of the computation at the same time.

You’ll probably want to set up another cluster of machines for the work queue and the workers, since those will likely have different scaling properties than your web servers.
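As a sketch of the overall shape: a background job splits the slow computation into chunks, farms them out to a pool of workers and drops the result into a cache that the page handler reads from. The dict stands in for redis or memcached, the “revenue report” is an invented example, and the thread pool stands in for what would really be separate worker processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

report_cache = {}  # stand-in for an in-memory cache like redis or memcached


def summarize_chunk(orders):
    """The slow part of the work, applied to one slice of the data."""
    return sum(order["total"] for order in orders)


def refresh_report(all_orders, workers=4, chunk_size=1000):
    """Run offline (say, every few minutes), never during a page load."""
    chunks = [
        all_orders[i:i + chunk_size]
        for i in range(0, len(all_orders), chunk_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(summarize_chunk, chunks))
    report_cache["revenue"] = sum(partials)


def handle_page_request():
    """The request-response cycle only does a fast cache read."""
    return report_cache.get("revenue")
```

The page handler never waits on the computation. At worst it shows a number that’s a few minutes old, or nothing at all until the first refresh finishes.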

To take this architectural style to its logical conclusion, you can generate the HTML for your entire web app offline and simply serve it to users as static files.

This is the inspiration behind static site generators that are used to power a growing number of blogs, and it’s what the New York Times did to serve election night results.

HTML Fragment Caching

If you’re rendering HTML templates on the server-side, you want to avoid having your template engine wasting CPU cycles on every request generating the same HTML over and over again for content that doesn’t change often.

If there are certain sections of your site’s markup that change very infrequently — say the navigation, footer or sidebar — then that HTML should be cached somewhere and reused between requests.

Pay special attention to high-traffic pages. Sometimes you’ll be able to cache most of the page except for a few more dynamic or “real time” sections.
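Here’s a small sketch of the idea using Python’s functools.lru_cache: the navigation markup is rendered once per section and reused, while the per-user greeting is rendered on every request. The function names and markup are invented for the example, and your template engine probably has its own fragment caching helpers.

```python
from functools import lru_cache
from html import escape

render_count = {"nav": 0}  # only here to show how often the nav is rebuilt


@lru_cache(maxsize=None)
def render_nav(site_section):
    """Rarely-changing fragment: rendered once per section, then reused."""
    render_count["nav"] += 1
    items = "".join(
        '<li class="{}"><a href="/{}">{}</a></li>'.format(
            "active" if link == site_section else "", link, link.title()
        )
        for link in ("home", "blog", "about")
    )
    return "<nav><ul>" + items + "</ul></nav>"


def render_page(site_section, username):
    """Cached fragment plus a dynamic section rendered on every request."""
    return render_nav(site_section) + "<p>Hello, " + escape(username) + "</p>"
```

When the navigation does change, calling render_nav.cache_clear() throws away the cached markup so the next request rebuilds it.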

Putting Work Into Queues

We talked about using queues and workers for caching output that takes a long time to generate. You can also use workers for processing large amounts of input asynchronously.

This has the effect of taking large, slow chunks of work and breaking them out from the main request-response cycle and taking it completely off your web servers.

Say you have a way for someone to import a CSV of their contacts and several people upload 50MB files. Instead of sending all of that data to a web server and having it take up memory and block the CPU while it’s being processed, put it on a static file host like S3 and have a worker that periodically checks for new file uploads and processes them offline.

You do have to be careful when you start putting lots of business logic into workers. You have to make sure you’re keeping track of what still needs to be processed, what is currently being processed, and what failed and needs to be processed again.

You also need to make sure you have enough workers running, otherwise the work queue will grow longer and longer, leading to silent delays that are easy to miss.
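The bookkeeping described above can be sketched as a small in-memory tracker. A real system would keep this state in a durable queue or database, and JobTracker, its method names and the three-attempt retry limit are assumptions for illustration only.

```python
from collections import deque


class JobTracker:
    """Tracks what is pending, in progress, done and permanently failed."""

    def __init__(self, max_attempts=3):
        self.pending = deque()
        self.processing = {}
        self.done = {}
        self.failed = {}
        self.attempts = {}
        self.max_attempts = max_attempts

    def enqueue(self, job_id, payload):
        self.pending.append((job_id, payload))

    def run_next(self, handler):
        """Process one job; return False when there is nothing left to do."""
        if not self.pending:
            return False
        job_id, payload = self.pending.popleft()
        self.processing[job_id] = payload
        self.attempts[job_id] = self.attempts.get(job_id, 0) + 1
        try:
            self.done[job_id] = handler(payload)
        except Exception as exc:
            if self.attempts[job_id] < self.max_attempts:
                self.pending.append((job_id, payload))  # retry later
            else:
                self.failed[job_id] = repr(exc)  # give up, keep a record
        finally:
            del self.processing[job_id]
        return True

    def backlog(self):
        """Queue length: worth graphing so delays don't stay silent."""
        return len(self.pending)
```

Watching backlog() over time is the cheap way to notice that you don’t have enough workers before your users do.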

Client Side Improvements

Of course, another great way to decrease the load on your web servers is to decrease the number of requests they have to deal with.

Even with the same number of users in your app, there are a number of client-side improvements that can lower the number of requests your web stack has to deal with.

HTTP Caching
Just like you want to cache database queries to avoid regenerating answers you already know, you should avoid having the browser ask for content that it has already downloaded.

You should use HTTP caching headers for all of your static files: CSS, JavaScript and images.

Google and Mobify provide great overviews on how to use the headers, and your web framework will likely have some helpers to make it even easier.
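To give a feel for what those headers do, here’s a framework-free sketch. The function names, the one-year max-age and the hash-based ETag are my own choices for the example; a real framework or static file server would handle this for you.

```python
import hashlib


def static_file_headers(body, max_age=31536000):
    """Caching headers for a static file (max_age is in seconds)."""
    etag = '"' + hashlib.sha256(body).hexdigest()[:16] + '"'
    return {
        "Cache-Control": "public, max-age={}".format(max_age),
        "ETag": etag,
    }


def respond(body, request_headers):
    """Return (status, headers, body), skipping the body if the browser has it."""
    headers = static_file_headers(body)
    if request_headers.get("If-None-Match") == headers["ETag"]:
        return 304, headers, b""  # Not Modified: nothing to re-send
    return 200, headers, body
```

With Cache-Control set, the browser won’t even ask again until the max-age runs out. The ETag covers the case where it does ask, letting the server answer with an empty 304 instead of re-sending the file.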

Content Delivery Network
Ideally, your web servers wouldn’t serve any static content at all. You don’t need the overhead of loading your entire web framework or language runtime in order to serve a static file off of disk.

You should host your static content on a static file host that’s purpose built for sending files over the network. You can set up a simple one with nginx or use a dedicated service like Amazon’s S3.

When you’re splitting out your static content, it’s good to give the static file host a different CNAME or subdomain.

Once you’ve gotten that set up, it’s usually pretty straightforward to add a Content Delivery Network in front of your static file host.

This distributes your static content even faster — often without the requests ever reaching your web stack.

The CDN is essentially a geographically-distributed file cache that serves up copies of your static files that are often geographically closer to the end-user than your server.

The CDN will often have options to minify and gzip your static content to make the files even smaller and faster to send to the client.

When All Else Fails

If you’re still having issues with high load and you’ve cached as much as you can and your server budget is maxed out, there are still some less-than-ideal options you have to handle excess load.

Back Pressure
Back pressure is just a way of telling people they have to wait because the system is slow. If you’ve ever tried going to a cafe, seen a line out the door and decided to go elsewhere, that’s a great example of back pressure.

Even without much work on your end, back pressure can be implicit. The site will be slow for users, which will discourage them from clicking around as much.

They also might see increased error rates when they try to load things — think Twitter’s Fail Whale.

You can also make back pressure explicit by adding messaging to the UI telling people that parts of your application are temporarily disabled due to high demand.

Shed Load
The other option is to shed load. You’re acknowledging that you’re not able to respond to everyone’s requests, so you’re not even going to try.

This is the nuclear option. Set aggressive timeouts in your web server and let requests hang and return blank pages. Some people’s data will get lost in cyberspace.

Try to add some temporary messaging to let people know that you’ll be back soon, but be prepared for a PR fallout.
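A slightly gentler variant of shedding load is to cap how many requests you’ll work on at once and immediately turn away the rest with a message. This is only a sketch: the LoadShedder name, the 503 response and the 30-second Retry-After value are assumptions, not settings from any particular web server.

```python
import threading


class LoadShedder:
    """Reject requests outright once max_in_flight are already being handled."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, handler):
        # Don't queue up behind busy workers: fail fast instead.
        if not self._slots.acquire(blocking=False):
            return (
                503,
                {"Retry-After": "30"},
                "We're over capacity right now. Please try again shortly.",
            )
        try:
            return 200, {}, handler()
        finally:
            self._slots.release()
```

The requests you do accept stay fast, and the ones you reject at least get a clear message rather than a blank page.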

—

Did you know I do consulting? If you need help scaling your web application, drop me a line!

Focus on the Product, Not the Code: How I Build Software for Clients

Hartley Brody — Wed, 22 Jul 2015 02:29:33 +0000

While there are tons of resources designed to help people learn to code, there aren’t as many resources for helping people learn to build software products, at a higher level.

“Writing code” is largely a vocational skill, just like swinging a hammer is — but presumably you’re using that skill to actually build something.

In my experience, the reason software projects fail (taking too long, going over budget, becoming too complex) isn’t necessarily bad coding practices. It’s that there was too much focus on writing code, and not enough on building a product.

It’s deceptively simple (and all too common) for a non-technical person to come up with a few sentences describing an idea, shoot it to someone with coding chops and say “code this for me.”

It’s essentially the modern-day equivalent of someone sketching a building on a napkin, handing it to a carpenter and telling them to start swinging their hammer.

There are often huge structural decisions — as well as a million tiny implementation details — that need to be fleshed out before you even start writing a line of code.

If the first thing you do with a product idea is start coding, your project is almost certainly doomed.

Why You Need a Detailed Specification

If you don’t have a detailed plan for how you’re going to build your product, then your engineers will inevitably make all sorts of seemingly small decisions that might not be in line with the original product vision, or what the business stakeholders need.

Those decisions will then have to be rolled back once the business stakeholders discover them, wasting a lot of time and energy, and also undermining the entire codebase.

Engineers: What color should we make the carpets? Oh, they didn’t specify? Okay, well uh, red then?
Business (after walking in and seeing the red carpets): Red is our main competitor’s color! How could you have chosen this? Tear them up and recarpet the office with blue!

If you don’t have enough detail, then you’ll end up spending a ton of time writing code that ends up getting thrown out or refactored very quickly. This is the main source of delays in projects that have taken too long or gone over-budget.

If you had specified the carpet color from the beginning, then you wouldn’t have to lay it down twice, and spend the time and effort ripping it up in between.

But in a software project, sometimes these last-minute refactors are even more problematic than simply rewriting sections of code. Instead, they’re more akin to trying to change the foundation of a building from cement to wood pylons after several stories have already been built.

It’s not just that you have to change one or two pieces of the code — even small changes can have big ripple effects across the codebase. There are tons of abstractions and relationships in even seemingly simple pieces of software. Those all get created based on assumptions about how the finished product should look.

A real-world example of a seemingly minute detail that could have huge implications on a project is deciding if a user is allowed to have multiple email addresses or just one.

Adding support for multiple email addresses — after the assumption was that there would only be one — isn’t simply a matter of adding a new field to the user table, it can have implications on the login system, notification and emailing systems, billing and admin, profiles, sharing and lots of other interconnected systems.

It’s extremely important to get as many things right as you can in your first pass of writing code.

The best way to avoid ambiguity and the problems and delays it creates is to come up with a plan. This should be done with all of the stakeholders — business and technical — working together, to explicitly lay out how the most important pieces of the product will look and function.

How this Process Developed

Before I was a freelancer, I was employee #1 at an ad tech startup. We weren’t a traditional single-product company. Instead, we built dozens of products, some of which were user facing, while many were internal ad- or analytics-related products just for us or our partners.

We had lots of product ideas that we wanted to test, but we couldn’t afford to spend months bringing them to market. We had to ship quickly, experiment and iterate, so our product development process had to be very efficient.

Genetics researchers study fruit flies because their lifespans are so short, meaning researchers can observe many generations very quickly.

We built (and threw away) so many products in such short time periods, it proved to be a similarly effective laboratory for testing product development strategies.

We experimented with several different iterations of product development processes and were ruthlessly honest about what was working and what wasn’t. I learned a lot about building products in my time there.

I’ve since taken what we learned and the general processes we developed and adapted them for various clients across a dozen different projects.

The Product Spec

The first step is to produce the Product Spec. This is a non-technical document and should usually be written by the client or product owner, with assistance where needed.

I have a Google Doc template which is essentially a structured list of questions designed to help the client brain dump their ideas about the product.

The exact questions change from product to product, but there are usually sections about the following:

Users: Who are they, what are their goals, why are they using the product?
Features: What features are required, what are nice to have, and what are backlogged for later?
Components and Screens: What are the pages or URLs the product will have, and are there non-user facing parts?
User Experience (UX): Taking what we know about users and what we want to build, what are their main flows and experiences within the product? How would a user perform the core actions in the product?
Open Questions: What are the things we don’t yet know or made assumptions about that should be tested to make sure everything is viable and will work like we expect it to? We’ll list open questions and then try to answer them within the document, to have a record of potential problems and the solutions we came up with.

After the Product Spec is complete, the client will have a thorough document that describes the Minimum Viable Product (MVP) as well as a backlog of potential features to be added down the line.

This can be used to guide a product road map, to help with a pitch to investors, to find an outside development agency, or just generally to help the client wrap their mind around what they’re building.

The Technical Spec

The next step is the Technical Spec. This is a technical document that lays out the high-level structural and architectural decisions that need to be made before you can start writing software.

While the Product Spec describes what we’re building, the Technical Spec describes how we’re building it.

It’s important that the Product Spec is finished and agreed upon before beginning work on the Tech Spec, so that time isn’t needlessly spent designing a system on top of changing requirements.

While the Product Spec might be similar to an architectural rendering, showing the glassy facade of a building from 1000 feet away, the Technical Spec is more like the blueprint, working out some of the important structural details at a lower level.

For the Technical Spec, I’ll often do the bulk of the writing. If the client has in-house technical resources, then I’ll work with them to ensure their approval.

Common sections include:

Core Technology: Programming languages, frameworks, open sourced libraries, public data sets, important third-party APIs. What are the core pieces the product will be built with?
Data Models: What are the main objects that this product deals with? What are their properties and methods? What are the relationships between them?
User Flows: As users follow the key flows from the Product Spec, what will happen at a tech level? How and when will things be processed? When and where will data be saved?
Technical Diagram: How will systems communicate? How will 3rd-party services be integrated?
Infrastructure & Ops: Hosting? Deploying? Logging? Testing? Monitoring? Analytics?
Open Questions: Are there technical feasibility issues? Do we need to hack together quick prototypes to make sure something is possible? Just like with the Product Spec, we’ll enumerate these and then try to go through and answer them before moving on to the next steps.

Once the Technical Spec is complete, the client will have a very detailed picture of the complexity and scope of their product.

If there were questions of technical feasibility, those will hopefully be answered. If they want a quote for how much time it will take to build, or at what cost, this document will make those answers very clear.

Next Steps

Once we’ve gotten to this point — both a Product and Technical Spec, signed off by the necessary stakeholders — now we’ve got a much clearer sense of what needs to be built and our plan of attack.

We’ve taken the back of the napkin sketch of the office building and turned it into an architectural rendering as well as a clear, detailed blueprint for what needs to be built and how it’ll all fit together.

At this point, we’ve laid everything on the table and can decide how we’d like to proceed.

I’ve had clients who looked at the Product Spec and realized they could actually just build their product using an off-the-shelf survey tool for $10/month instead of paying thousands for something custom, like they had planned.

Sometimes clients will take the detailed Tech Spec and look for cheap overseas coders to build it on eLance and oDesk. We actually did this for some projects at my last startup with great success.

Sometimes the process of writing these documents is so informative that the client decides to take it in a completely new direction, and that’s okay. There’s no commitment that we have to work together moving forward — although I’d usually love to!

If they decide to move forward, we can come up with a quote for the full product or break it down into smaller deliverables.

There should be very few surprises at this point, so it’s much more reasonable to set deadlines and time estimates than it is when you just have a few aspirational sentences about a product.

Just Enough Process

Now some of you might be thinking, “Top down, waterfall-style approaches to building software are bad! The market changes, issues crop up, you can’t possibly enumerate everything up front! You need to be agile.”

Surfacing technical issues to clients — as well as frequent, clear communication — is definitely part of the process. But setting out without a plan other than “being agile” is a great way to waste a lot of time chasing dead ends and writing code that you have to throw away or constantly refactor.

I also only try to employ these steps for building Minimum Viable Products (MVP), or adding features to existing products, so we’re just fleshing out the minimum feature set to start testing and learning about user behavior. The documents never describe more than a few weeks of actual development work.

The Product Spec includes a backlog section for features the client would like to add eventually, but this can change once we start learning more about how customers are using the product.

Don’t be Dogmatic

As with many things in life, it’s important not to be too dogmatic about always doing things a certain way. For really small features, these steps might be overkill, while some of the questions might not make sense for certain kinds of products.

I generally think of these steps as guidelines, to help guide the product development conversation, and ensure all stakeholders are on the same page about where we are in the process and what problems we’re focused on solving.

If you think you could use my help on your product, drop me a line!

How I Learned to Code in Only 6 Years: And You Can Too!

Hartley Brody — Wed, 29 Apr 2015 18:59:03 +0000

A few weeks ago, a friend texted me asking for advice. She was interested in learning how to code and wanted to know how I had done it.

While I did take a handful of Computer Science classes in college, I consider most of my relevant, day-to-day software development skills to be things I picked up through self-guided learning.

My initial advice for her was going to be pretty banal — sign up for Codecademy or Treehouse or one of the many “learn to code in 12 weeks” bootcamps.

But before I could send her that text, I realized that my own path to becoming a full-time, freelance software developer didn’t really look anything like that.

While there are a growing number of programs, classes and websites that purport to teach you coding skills in a short amount of time — and I’ve played with a few of them myself — I don’t really see them as an effective path to learning the kinds of skills one needs to be a competent software developer.

And so, to answer her question, I decided to take some time to look back on what I actually did, and what got me to the point I’m at today, earning a living writing code for people.

Text files ending in .html

It all really got started for me out of sheer boredom over winter break in 2008, during my freshman year of college. Having refreshed my Facebook News Feed for the millionth time that day and not found anything interesting, I decided to click the magic “view source” option in the browser and see if I could understand any of the HTML. Of course, it was all completely indecipherable to me, but I did notice the “.php” extension in the facebook.com/home.php URL.

That piqued my interest and after a bit of googling I discovered that PHP was some kind of language that would produce HTML for a browser to read. It all sounded really complicated so I figured I’d just start with the HTML part.

I opened up Notepad on my Windows laptop and saved a file to the desktop, making sure to change the extension from “homepage.txt” to “homepage.html”. From there, I read through the w3schools tutorials on HTML and built a page using table elements for layouts.

I’ll always remember the first time I opened a new tab in my browser and opened the HTML file on my desktop and saw a freaking web page that I had just freaking made. I mean look at it! It looks like a web page, and I made it!

It was probably the first big “AHAHAHA!” moment that got me hooked on building stuff with code. I felt like a superhuman. Maybe I could build a site like Facebook! But not quite yet…

I bought a domain, picked a $5 webhost and figured out how to upload my shiny new HTML file so that the world could see it. I proudly emailed my site to some of the guys that worked on the campus life blog.

After a few days, I hadn’t heard back from them, but I bumped into one of them on the quad. “Yah… your site is… umm… you should really learn CSS,” he told me, shuffling his feet and avoiding eye-contact. “Using tables for your layout is a bad idea, and using inline styles is… you should learn CSS.”

I felt so ashamed. Rather than being a beautiful work of art, someone who knew what they were talking about told me, ever-so-politely, that my code was a steaming pile of turds. Maybe this coding thing was going to be a lot harder than I thought.

I went to the campus library and found a book on CSS that was probably a thousand pages long. “Nope, not reading that,” I thought. I’ll stick with using tables for layouts, thank you very much.

Getting started with PHP

Fast forward a few months and I had added several HTML files to my website. But it felt lacking, it didn’t “do” anything, just showed the same text to every visitor, over and over again.

I started looking into PHP again. At this point, I had learned a thing or two about if-statements and for-loops from my Computer Science 101 class, so it was a bit less intimidating.

My first project was adding a “guestbook” to my website, that would email me whenever someone left a note. This was super exciting because now my website actually did something! People could interact with it and change it so it wasn’t always the same!

You might think that I had to set up a database and “learn” SQL in order to be able to insert people’s comments into said database, and then read those comments back out when the page was loaded.

What actually happened was that I spent many hours googling around for “PHP guestbook script”, copy/pasting the parts of other people’s code that seemed to be working until it did what I wanted. I was a hacker, not a developer.

I got a few “nice job” comments from my friends over the next few days, and each email notification I received made me so proud that I had built something that worked on the internet.

But it only took a few more days before the spambots found my lonely little website and proceeded to fill my guestbook with ads for enlargement supplements. My wonderful little guestbook was being vandalized by strangers on the internet whom I didn’t even know!

“The internet must be a scary place.”

Building a Blog

By the fall of 2009, blogging was starting to be something that average-joe people did. It didn’t occur to me that there’d already be software built to make this process easy, so I set out to build my own little blogging engine so that I could participate.

I created a folder for all the blog entries, and would add HTML files whenever I wanted to add a new article. I gave them datestamp file names like “2009-10-04.html” and wrote a simple loop in PHP to iterate over files in the directory, concatenate them together, add a header and footer and build my HTML homepage.

There was no way to view a single article by itself, there was only one page with every article strung together. Wanting to give my “audience” (aka my parents and college roommate) more control over their browsing experience, I began learning AJAX and added some buttons that let a visitor choose which month’s articles they wanted to see. Then the for-loop that strung together articles would filter out ones that didn’t have the right numbers in the filename. Brilliant! It felt so simple and elegant and look at me making a freaking blog as if I know what I’m doing. People might actually think I know how to do this stuff.
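The original was PHP and is long gone, so what follows is only a rough Python sketch of the idea, not the actual code. The function name `build_homepage`, the header and footer strings, and the `month` argument are all made up for illustration. The month filter just checks whether a filename starts with something like "2009-10".

```python
from pathlib import Path

# Placeholder header and footer; the real ones are not in the article.
HEADER = "<html><body><h1>My Blog</h1>\n"
FOOTER = "</body></html>\n"


def build_homepage(entries_dir, month=None):
    """Concatenate datestamped entry files into one HTML page.

    month is an optional "YYYY-MM" prefix. Entries whose filenames
    don't start with it are skipped.
    """
    parts = [HEADER]
    # Names like 2009-10-04.html sort chronologically as plain strings,
    # so a reverse sort puts the newest entry first.
    for path in sorted(Path(entries_dir).glob("*.html"), reverse=True):
        if month is not None and not path.name.startswith(month):
            continue
        parts.append(path.read_text(encoding="utf-8"))
    parts.append(FOOTER)
    return "".join(parts)
```

Calling `build_homepage("entries")` would string every article together, while `build_homepage("entries", month="2009-10")` would keep only October's.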

At some point, as a prank, one of my friends on the swim team started going into the school library and setting the browsers’ homepage to be my new blog. I didn’t have any analytics set up so I had no idea I was getting all of this traffic until an acquaintance from one of my classes mentioned he had seen my blog on the library’s computer.

A bit ashamed of the new-found attention, and worried that people might think I was the one who was setting my site as the homepage, I hunted down the perpetrator and asked him to stop. His response was to laugh maniacally and basically say, “make me.”

Realizing that diplomacy had failed, I knew that I needed to resort to bigger guns: code. So I hatched a plan. If a visitor was coming to my website from an IP address on campus, I’d redirect them to an embarrassing picture I dug up on Facebook of said friend, drunk at a party, with a note about how he had been messing with the school computers, and a link to Google, a site the unsuspecting visitor might actually have wanted to go to.

I visited the “what is my IP address” websites from a few different spots on campus to get a sense of the IP address ranges an on-campus visitor might have. Then it took a few hours of googling around for how to find a visitor’s IP address in the request, how to read and write HTTP headers, and how to redirect users to a different page, but eventually, my plan was complete. I called a few friends at other schools and had them visit my site to make sure they weren’t seeing the image, and then I tried it from the library computers to make sure they were.
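For the curious, the check boils down to something like the Python sketch below. This is not the PHP I wrote back then. The network ranges are reserved documentation blocks standing in for the real campus ranges, and `PRANK_URL`, `is_on_campus` and `respond` are hypothetical names I'm using only for illustration.

```python
import ipaddress

# Stand-in ranges (reserved for documentation), not any real campus network.
CAMPUS_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

PRANK_URL = "/prank.html"  # hypothetical path to the embarrassing page


def is_on_campus(remote_addr):
    """Return True if the visitor's address falls in a campus range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address, so treat it as off campus.
        return False
    return any(addr in net for net in CAMPUS_NETWORKS)


def respond(remote_addr):
    """Return (status, headers) for a request from remote_addr."""
    if is_on_campus(remote_addr):
        # A 302 with a Location header sends the browser elsewhere.
        return 302, {"Location": PRANK_URL}
    return 200, {"Content-Type": "text/html"}
```

Testing it from off campus and then from the library is the same as calling `respond` with one address outside the ranges and one inside.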

I felt like a mad scientist. Look at how much power I wielded with my simple website! Someone was trying to use it to embarrass me, but I was able to come up with a clever solution and a few lines of code to turn it around on him!

Learning More About Media & Content on the Web

In the spring of 2010, I started a college music blog with a friend and former roommate. This time, instead of building my own blogging engine, I did a bit more research and discovered blogger and wordpress and found them to be infinitely more powerful and flexible than anything I had written.

Since I didn’t know about version control, I just threw all of my old code away. Won’t need this anymore!

By the fall of 2010, the music blog was having fairly moderate success — thousands of daily visitors and climbing. I remember the first time I walked by someone I didn’t know who was on their computer, browsing the site.

Hell yah, baby! The internet has so much potential for building cool stuff and reaching new people!

That fall, I came up with the idea of creating a site that would aggregate content from the top music blogs. Basically an RSS reader for people who didn’t know what that was or want to use one.

I did a bit of googling about how to read RSS feeds and found a lot of people recommended using a particular library. I didn’t know what a “library” was, but I looked at some example code that used it and it seemed pretty cool.

You mean I don’t have to fetch the URLs and parse the different XML formats myself? I can just tell this library the sites that I care about and then simply read out their content? And I can use this code for free? Wow, open source code is great! Thanks, whoever wrote this.

I put the site together in about a week and coordinated a big, splashy launch with the sites it featured. I pulled in 14,000 visitors that first week.

Learning a Second Language, Properly This Time

By the summer of 2011, I was an intern on the marketing team at a mid-sized software company. The company had open seating where all the teams were mixed, so I was sitting among real programmers and tried to overhear their conversations to find out what that was like.

I heard that you could now buy servers from amazon.com and that Twitter released some CSS that everyone could use. Javascript was getting more popular, but Python was the way to go if you want to have code running on the server.

I offered to build some internal tools for the marketing team. I had never built something that wasn’t my idea before, and certainly never anything that a real business relied on, so I was really nervous.

I decided I needed to level-up and start writing code that was written cleanly and that executed efficiently. No more copy/pasting other random code snippets from the internet, I needed to Know What I Was Doing™.

The code I wrote wasn’t just going to be hidden on my web server — other, real developers might actually see it, so it needed to be in ship-shape.

I decided I had to learn Python, so I read the 20 chapter introduction on python.org and got all sorts of confused about all sorts of stuff. What’s a tuple for? I don’t understand how default parameters work. What the heck is a splat?

I learned about object-oriented programming and how to write files of code that didn’t just execute from top to bottom (“scripting”, I learned that was called).

Eventually I put together a data pipeline that took data from one system, changed its formatting, and inserted it into another system. I had to learn about SOAP and APIs and these reverse-API things called “webhooks”.

The pipeline helped save a coworker hours each week of manual data processing. She was elated that I had automated this work for her, and she was super encouraging about it. “Thank you so much! This is amazing!”

A few months later though, she was singing a different tune. For some unknown reason, my code was no longer working and everything was getting out of sync. “Where is all my data?? It needs to be there! Everything is getting backed up! What’s going on?”

Holy shit. I had no idea what to do or what had happened. My code didn’t change, and the server seems to still be up. “Uhhh….” was all I could muster.

Here I was, thinking I was a programmer and building things that a large company trusted to move important, timely data around. When it broke, I had no idea what to do.

I recruited a few friends on the engineering team to help me figure out what was going on and one of them quickly suggested one of the APIs I was using had changed. “Of course! That made so much sense, why couldn’t I have thought of that…”

I was disappointed in myself and I felt embarrassingly incompetent. Maybe I should stick to marketing. This coding thing seems really hard. I don’t know if I’m cut out for it.

A Fragile “Web Developer” Complex

Several months after that incident I was asked to work on a public web app we were building to help generate leads for the business.

It was going to be written in Python but it would also use this thing called a “web framework” that helped you organize your code and did a lot of things for you like handling requests and talking to databases and stuff.

“Django,” it was called. The difficulty I had pronouncing its name was foreshadowing for the massive struggle I would soon encounter when I tried to actually build things with it.

This time, there were a few engineers from the core product team assigned to work with me and a few other interns on the project. All I really remember from those next few weeks was struggling for hours and hours just trying to get the code to run on my Windows laptop, and feeling like a complete failure.

Nothing was working and I was too embarrassed to ask for help, but I also couldn’t make heads or tails of the errors I was getting by myself.

Fortunately, the real engineers I was working with were very patient and kind. “Yah, Windows is tough, it’s not your fault,” they’d say, reassuringly. “Believe it or not, I used to struggle with this kind of stuff too.” I assumed they were lying to help me save face, but I appreciated the sentiment.

Eventually, I started to get more comfortable working with Django and Python. I hung out in the Computer Science department on campus and started reading books about Python. I remember the first time I successfully used a list comprehension and a lambda function. Fuck yah, this is like real programming!

I decided to rebuild my college music blog aggregator in Python and planned to add a bunch of features. One of which was the ability to learn your preferences based on the articles you clicked on and somehow highlight other articles you might like.

I started reading a bit about machine learning and, to my surprise, it wasn’t totally over my head. I could follow along and build a few of the toy projects in the book that I checked out from the library. But eventually it was finals week and the book was due back and I never implemented anything. Maybe some day.

By this time, I had gotten a bit of a reputation as a “coder” on campus and I was approached by a fellow senior who had an idea for a startup. The idea was born out of the struggle his younger sister was having sharing and discussing prom dress ideas. He had a cool name for it and it seemed like something I could reasonably build. And so, the social network for fashion was born.

The server-side Python stuff was pretty straightforward, the hard part was making the website look and feel like a social network. Facebook had set the bar really high — people expected everything to happen magically with no page reloads. If someone posted a comment with a URL, that comment needed to show up in the discussion thread immediately, the URL needed to be detected and turned into a clickable link, and we needed to try to pull in a preview of the page they’d linked to. All of that just to support URLs in comments!

I wove a tangled web of jQuery with tons of nested callbacks stuffed inside $(document).ready(). Things were getting unwieldy, fast. At one point I decided it would be a good idea to upgrade the version of bootstrap we were using and I noticed a bunch of tiny things broke. I had no comprehensive way to test everything or manage other people’s code that I was depending on. I got very little sleep that week, staying up to odd hours playing whack-a-mole with these bugs.

But despite how embarrassingly ugly the code was, the site looked alright and functioned decently. Smart people that I trusted to be honest told me the site looked “cool”. Man, look at me tricking people into thinking I can build quality software.

I started reading more books and blogs about building software for the web at scale. I started learning about best practices and why they were important. It wasn’t just that your code needed to barely function, it should be easy to read, scale and maintain in the future too.

By the time graduation rolled around in the spring of 2012, the site ended up petering out, as most startups do, but I was really proud of what we built.

Sneaking onto a Software Engineering Team

The job offer I had after graduation was to join the marketing team at the same company where I had screwed up the pipeline and taken forever on the web app.

I was supposed to join a group within marketing known as the “Marketeers” who built stuff to support marketing. But a few weeks before my start date, that group was moved from being within marketing to being a part of the product team, where it would have better engineering support and management.

I was given the option — I could either stay in marketing and join a different group, or I could move over to the product team and be an engineer. I’d have a quick interview to make sure I wouldn’t be over my head, but if I got the green light, I’d officially become a “product engineer”.

The interview was with a senior engineering manager who had a reputation for being curt and ruthless. To say I was nervous was a tremendous understatement. He asked me to show him something I had worked on and I tried pulling up the homepage of the social network I had built. It took 30 excruciating seconds to load.

He glanced at me sideways. “Why is this so slow?” I rambled something about lots of database queries and shuffling data around and maybe one of the servers is down. “How would you debug this?” I had heard of the Django Debug Toolbar which makes it easy to see where your slow queries are in an application, so I mentioned that.

He scrolled around and clicked on a few things. “Alright” he said. I was in.

I guess I had built a few pieces of software and had definitely learned a lot in the past year or so. And now, here I was, on an actual software engineering team.

I was working with other real engineers who followed best practices. I was using a proper IDE and a well-tuned development environment that I set up in a single morning.

I got to hang out with Javascript developers and ops gurus and ask them a million questions. I had an opportunity to see what real-life, production, high-quality code for thousands of paying customers looks like.

Things were going great until suddenly, they weren’t. On a Friday afternoon, I accidentally hard-coded some admin configuration data, shipped it to production, and then packed up and went home. Within a few minutes, part of the product was broken for all customers, and I had taken down the homepage of the company I worked for. A coworker was able to roll back my commit and someone else had to restore the homepage from a backup.

“What am I doing here? I don’t deserve to be here.”

Gaining Confidence

By the middle of 2013, I had started working on a few more independent side projects. I tried to do another startup, building a marketing platform for musical artists and learned the hard way why real-time analytics at scale is such a rare feature in big products (hint: it’s really hard).

I started working with the Flask web framework for Python, instead of Django. I could see differences and similarities in how things are done, and started to recognize high-level design patterns. I also started learning more advanced web topics like caching and database optimizations and service-oriented architectures.

That summer, I got a job as employee #1 at an ad-tech startup. In my interview, I remember being able to answer most of their questions about load balancing algorithms and their tradeoffs, and what HTTP, TCP and UDP were and how they differed. I struggled a bit doing some joins in SQL but I went home and watched some coursera classes and was teaching the topic to some junior developers we hired a few months later.

I told them in my interview that I didn’t know much about lua or nginx or high performance computing or “big data” or any of the other technologies they were using, but I was told not to worry. I’d figure it out.

And eventually, I did.

I started working more independently and designing systems with decreasing oversight. I still managed to ship code on a Friday afternoon that broke everything, but this time I realized right away, fixed it, and wrote up a post-mortem to share what went wrong with the team.

I was able to keep track of “code smells” — small symptoms of some larger issues — and got better at spotting them in my own code. I kept an eye on the things we were doing as a team and looked for patterns of behavior we could improve.

I led the charge to come up with our own internal git best practices and organized a weekly tech talk to help people share what they’d learned with the team. I no longer felt like an impostor when I told people I was a software developer.

It only took me six years.

Lessons & Advice

I realize this story is already pretty long, so I’ll try to summarize some takeaways for people thinking about starting this journey.

Don’t just read and watch other people code, do it yourself
There are a lot of well-meaning people who will tell you to watch these videos or follow this tutorial and you’ll be able to code in no time.

In my experience, this method of learning “feels” fast cause look at what you just built! But when you have to step away from the lesson and build your own thing, you often haven’t actually learned much about how to solve problems with the technology. Just do whatever feels natural to you and get started.

Start with projects you want to work on
It doesn’t matter if what you’re doing has already been done. Just build a version of something you’d use yourself and iterate on it.

Having a personal interest in the problem will help you come up with creative solutions and will keep you interested when you inevitably get stuck on something.

There will be lows
You’re going to get frustrated and screw things up and be embarrassed and want to quit. That never stops happening, no matter what level you get to (as far as I can tell).

Err on the side of making the mistakes yourself, not getting it “right”
There are a million different blog posts containing conflicting information about the “proper” ways to do everything. If you feel stuck, just keep on pushing forward with your original idea and make it work. You will eventually figure it out and will be so much prouder of your solution because you stuck it out. Plus, you may pick up a battle scar or two which helps guide you next time you’re solving a similar problem.

Learn only as much as you need to know, just when you need to know it
Start with a problem you’re trying to solve — like getting information from RSS feeds or sending email — and learn just enough so that your program does that. Then move on to solving the next problem.

It’s going to be difficult enough to cram a new technology into your brain and make sense of it all, don’t try to pull in superfluous information with it.

As you get more experience, your learning style will change. But don’t try to learn everything right out of the gates. Just what you need to solve the next problem.

Learning is about solving problems, not memorizing syntax
Even though all of those weird symbols and rules for writing things feel like some esoteric mumbo-jumbo, you’ll get past that relatively quickly.

No one gets good at programming by memorizing all of the weird little rules and bits of the language. Instead, focus on using code to solve problems. You can build tons of stuff only knowing arrays, hashes/dicts, if-statements and for-loops. In fact, I got by for years only knowing those 4 things.
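To make that concrete, here is a tiny Python sketch that uses nothing but those four things. The filenames are made up for illustration. It counts how many blog entries exist per month.

```python
# A list (array) of hypothetical filenames.
filenames = ["2009-10-04.html", "2009-10-19.html", "2009-11-02.html", "notes.txt"]

# A dict (hash) mapping "YYYY-MM" to a count of entries.
posts_per_month = {}

for name in filenames:               # for-loop
    if name.endswith(".html"):       # if-statement
        month = name[:7]             # e.g. "2009-10"
        if month in posts_per_month:
            posts_per_month[month] += 1
        else:
            posts_per_month[month] = 1

print(posts_per_month)
```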

It’s an ongoing learning process
There’s always more to know, and the state-of-the-art and “best practices” are constantly evolving over time.

If you’re expecting it to be like riding a bike, where once you get over the hump of figuring it out for the first time, you can stay competent and relevant forever, you might get overwhelmed. But if you love learning — from others and from your own mistakes — it’s a great place to be.

They say that if you aren’t embarrassed by the quality of the code you wrote 6 months ago, you’re not learning fast enough.

You probably know more than you realize
There’s something called impostor syndrome which the astute reader might have noticed I was exhibiting at several points in this article.

It’s a psychological condition that’s fairly common amongst people who write code, where you don’t really recognize all that you’ve learned and accomplished. You chalk successes up to luck and good timing, whereas failures feel like something you really deserved.

I’m not really sure why it’s so common in this business, but I see it everywhere when people downplay their successes and accomplishments. You might end up doing it too, so be sure to give a hearty fist pump when you finally fix that bug or solve that problem. You earned it.

—

Send me a note on Twitter if you enjoyed the post.

Thanks for reading this far! If I sound like someone you might want to work with, I’m currently accepting new clients. Get in touch

Look Ma, No Servers! How Javascript is Changing the Modern Web Stack

Hartley Brody — Wed, 18 Mar 2015 16:22:10 +0000

Formerly titled “The Rise of the Server-less Web Stack”

Javascript has lots of cool stuff built on top of it now. These days, there are tons of well-worn frameworks that bring all sorts of powerful programming paradigms into the browser.

Want easy object-orientation? Use backbone. More of a functional programmer? There’s underscore, lodash and many others. And I can’t keep up with the latest template rendering libraries, but there are dozens.

Plus ECMAscript 6 is rolling out quickly and with it, some long-awaited language features, syntactic sugar and new APIs.

Additionally, there are a lot of JS SDKs and simple integrations for things like accepting payments (stripe) and analytics (mixpanel, customer.io, etc.) if you don’t want to write the code or support the infrastructure to do those things yourself.

With all of these features, one can build an entire, bonafide web application in pure javascript. This certainly isn’t a new idea — single-page javascript applications have been around for years.

But what if we take the power of javascript to its logical conclusion and make the entire app live in the user’s browser?

Do we even need to deal with setting up servers and maintaining a separate codebase for a server-side backend at all?

Hosting Benefits

I read a great piece recently on my friend Jonathan’s startup’s “stack” and how they have built their application to live on a simple static hosting service.

I myself just shipped a project that will be entirely hosted on s3, and have collected email addresses from statically-hosted landing pages.

It’s a pretty great selling point to tell your client that their web app can have 99.99% availability and no servers that need to be patched or updated or maintained. Nothing to crash or get hacked or set off someone’s pager at 2 in the morning.

Having an entirely static app means you won’t have to worry about many of the traditional scaling issues and performance bottlenecks that most web apps have to consider if (when!) usage takes off.

Serving static files off disk is trivial to scale compared to CPU- or memory-bound applications. Each client is running their own version of the app on their desktop, laptop or phone, using their own CPU, memory and IO resources, not yours.

Plus it’s fast — your entire application can live on a globally-distributed CDN so it’s physically close to users around the globe, without needing to have servers scattered across various data centers.

Granted, this is all because there are already some pretty great abstractions on top of static file hosting that do the “heavy lifting” to deliver the SLA with that many nines. But those abstractions are so cheap and simple that they’re basically a commodity at this point.

All of these benefits mean that a single developer or small team can get pretty far without needing to have any devops skills at all. Just push your code to a static file host and make sure your domain is pointed to the right place.

Remaining Problems

Okay so building a web app for a static hosting environment has some awesome benefits. But there are definitely some drawbacks to consider — depending on your application’s needs.

Data Persistence
If you need any sort of centralized datastore across multiple clients (e.g. Users and Accounts) you’re probably going to need a database server and some application logic that lives on top of it.

But you might also be able to use the API of an existing third-party system like Firebase or Parse.

You could leverage the existing API of your company’s CRM or other existing internal datastore. I was able to use a Marketing SaaS app as the “database backend” for my recent client project, by simply sending data from the app directly to their marketing software’s API.

It’s also a good idea to consider whether your app really needs a database at all. Is it really a requirement that users need an account before they can use your product? Don’t collect and store data needlessly.

3rd-Party Integrations
If your app needs to integrate with a third party for storing data, sending emails, tracking user analytics or other tasks, you might need to keep that integration on the server-side if it requires authentication — you don’t want to send API keys or other credentials to the client. But if the API doesn’t require auth or has a client-side integration option, you may not have to worry.

Another problem you might run into is CORS support: if the vendor’s API doesn’t return the right headers, the browser will block your script’s cross-origin requests to it (or refuse to let your code read the responses). But if an API endpoint expects GET or POST requests, a potential workaround is to use form submissions in a hidden iframe to send requests without reloading the page. Hack-y, but doable.

Some APIs are specifically designed to accept form submissions, like MailChimp for managing email lists or Google Forms for collecting arbitrary data in a spreadsheet.

Private Business Logic
This one admittedly has no really good answer. You could run your javascript through obfuscation (renaming variables and functions to meaningless letters) and minification tools, but the instructions and logic are still there if anyone wants to take the time to piece them together.

Having had to read through obfuscated and minified javascript in a previous job at an ad-tech company, I can tell you that it’s certainly not easy or pleasant to reverse engineer code that has been run through these steps, but it is possible.

So if your app uses any sensitive, valuable or proprietary business logic, it’s best you let that run on the server-side.

Crashes and Bug Reporting
When a 500 happens on a web server, most web frameworks can send you a pretty stack trace in your inbox. But when building a javascript app, things fail on the client’s machine, out of sight of your web server’s logging features.

It’s important to consider how things might fail — network being unavailable, unexpected input or result of computation, etc — and figure out if or how the exceptional condition should be reported back to your system, and also how to communicate about the exception to the user.

Heavy “Offline” Processing
If your app processes a ton of data, you probably don’t want that happening synchronously on a web server during the request-response cycle anyway.

With javascript, the risk is that you’ll hang the user’s browser if you’re doing expensive calculations on the main thread.

On the server, you might do something like kick off some offline job to do the processing and store the output or notify the client when it’s done.

In javascript, there is a pretty simple Web Worker API that’s widely supported. If you need to do heavy rendering or data crunching, it’s worth taking a look at Web Workers.

—

There are obviously still some drawbacks to the “server-less web stack” and servers certainly aren’t going away any time soon. But building a server-less application has some really cool benefits.

You should consider if it’d be a good fit for your next project.

Minimum Viable Git Best Practices for Small Teams

Hartley Brody — Wed, 21 Jan 2015 01:01:58 +0000

When I started as the first employee at Burstworks, the cofounders and I could easily hold the information about who was working on what at any given moment in our brains.

But as we worked on new projects and the scope and size of the engineering team grew, all of our code mostly stayed organized in one central repository:

Our high-performance ad server
Data Pipeline
One-off scripts
Nightly jobs
Everything…

While we generally weren’t working on the exact same files at the same time, there was still lots of stepping on toes. Having your git push rejected was a common occurrence.

Inevitably we had issues with merge conflicts, which led me to send this tweet from our company account:

When multiple people are committing to the same repo and trying to push at the same time #git pic.twitter.com/m7KP19KdP4

— Burstworks (@burstworks) July 2, 2014

And so I decided to take a step back and think about how we managed our version control system at Burstworks.

I definitely didn’t want to come up with something heavy-handed or overly prescriptive. The goal was to come up with just enough process to grease the wheels, and not slow things down.

I did some reading, came up with some initial ideas and pitched them to the team. We iterated a bit and here’s what we came up with.

I should start out by saying that it’s nothing revolutionary or new. It’s what I would consider the Minimum Viable Git Best Practices™ for a small engineering organization.

Diff Everything

You should review every single change before you stage or commit something. Whenever print statements or “# don’t commit this” make it to production, it’s almost always because someone blindly ran

git add .
git commit -m "my git is bad and i should feel bad"

without diffing their changes.

If you’ve already added a bunch of changes to staging and want to confirm them before you commit, you can run

git diff --staged

or alternatively, run

git commit -v

to bring up your editor (vim by default on many systems) to edit your commit message, while also viewing the diff you’re about to commit.

Pro tip:
If that seems like a lot of commands to type and remember, make some aliases in your shell’s profile.

Commits Should be Small and Frequent

You should be committing code whenever you have made a single logical change. This allows you to write concise but descriptive commit messages, which offers great context for others who might be reading through your code.

Committing small changes frequently makes it much easier to handle bugs or other bad situations:

When did we push this bug? Oh I see the commit… but a bunch of other things changed too. What was being worked on here?

I need to roll back that commit but what else might break if I do that?

Here are some code smells or signs that usually mean you’re not committing frequently enough:

A 100+ line file is being committed for the first time
You’re changing more than 20 lines of a file in one commit
You’re only committing when you take breaks (e.g. lunch, end of day, etc.)
You have trouble succinctly describing what has changed (see the section below)

Pro tip:
Sometimes when you’re in the zone, you end up making a bunch of different logical changes to the code base without stopping to commit each one. That’s okay: git add -p to the rescue!

Running git add in patch mode (sometimes called “partial” mode) lets you stage a few lines out of a file for a commit, leaving the rest of the changes to a file unstaged.

Git will automatically show you chunks of changes from the file (git calls them “hunks”), and ask if you want to stage them or skip them. There’s also an option to split the current hunk into smaller ones for really fine-grained control.

Now you can take the huge refactor you just cranked through and break it down into smaller, logical commits.

Commit Messages Should be Semantic

Every commit message should describe why the code was changed — or what a change accomplished — at an appropriate level of detail.

You shouldn’t just use words to describe what parts of the code have changed — anyone can see that from reading the diff.

If someone wonders why a line of code was created or edited, the commit message should make it clear.

Here are some code smells or signs that you’re writing a bad commit message:

The message is fewer than 3 words
The message is more than 10 words
Your message is too high-level (it’s hard to be too low level)
You don’t know what changes are being committed (see above section)

Pro tip:
Did you just make a commit with a bad message like “refactor” or “business logic”?

It happens to the best of us. Just use:

git commit --amend

which opens your editor (vim by default on many systems) so you can change the last commit’s message.

Use Feature Branches

If you’ll be making multiple commits that are related to each other, they belong on a separate branch. One of the nicest things about using branches is that both GitHub and Bitbucket support Pull Requests, which allow for a discussion about a collection of commits before they’re merged back into master.

Using branches makes the merge into master feel very concrete and important. It gives you a chance to see all of your final changes while ignoring work-in-progress commits. It also means that the master branch is always ready to deploy, with no half-ready changes mixed into the code.

Signs you should be using a branch:

You will be committing “work in progress” changes to save progress that leave your application in a broken state and shouldn’t go to production
It will take multiple logical changes to the codebase that are part of a larger project
There will be several commits in a row that all depend on each other and should be in order

Signs you don’t need a feature branch — that is, it’s okay to commit to master:

You’re making a small change that fits nicely inside one commit
Bug fixes/hot fixes for fixing typos, etc
Previous/future commits won’t affect this commit

When you’re working on a branch, make sure you run git pull origin master frequently (at least once a day) so that your branch doesn’t get left behind, and to decrease the likelihood of merge conflicts.

Once we switched to using branches, we found that we spent a lot more time reading each other’s code. This helped us learn from each other and gave us all a chance to recognize new or potentially troubling idioms and have a discussion around them, “in the code”.

It also allowed us to catch mistakes sooner and keep problematic code from being merged into master and shipped to production.

More Git Resources

If you want to learn more, here are some of the resources I used when coming up with these tips.

I’m hoping to avoid starting a flame war between different workflow models, but if you have constructive suggestions or more git power tips, feel free to leave a comment or drop me a note on Twitter!

Lightning Fast Data Serialization in Python

Hartley Brody — Tue, 09 Dec 2014 03:55:26 +0000

A few months ago, I got a chance to dive into some Python code that was performing slower than expected.

The code in question was taking tiny bits of data off of a queue, translating some values from strings to primary keys, and then saving the data back to another queue for another worker to process.

The translation step should have been fast. We were loading the data into memory from a MySQL database during initialization, and had organized the data structure so that the string -> id lookups were constant time.
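As a rough illustration of that kind of constant-time lookup table, here is a minimal sketch. The rows, the table they might come from and the build_lookup helper are all hypothetical stand-ins, not the original code.

```python
def build_lookup(rows):
    """Map each string value to its primary key so a lookup is a single dict access."""
    return {name: primary_key for primary_key, name in rows}


# Hypothetical rows, shaped like the result of a "SELECT id, name FROM ..." query
# that would be run once during initialization.
rows = [(1, "spring-sale"), (2, "summer-sale"), (3, "clearance")]
campaign_ids = build_lookup(rows)

# Average-case O(1): no scanning through the rows at translation time.
print(campaign_ids["summer-sale"])
```

Because the dict is built once up front, each message only pays for a hash lookup, which is why this step was not expected to be the slow part.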

Finding the Problem

In order to figure out where the bottlenecks were, I used Python’s built-in cProfile module and combed through the results using the awesome cprofilev package, written by a former Quora intern.
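A minimal sketch of a profiling run like this, using only the standard library’s cProfile and pstats modules. The process_message function and the sample message are hypothetical stand-ins for the real worker, so treat the shape of the report as illustrative only.

```python
import cProfile
import io
import json
import pstats


def process_message(raw):
    """Hypothetical worker step: deserialize, change a value, reserialize."""
    message = json.loads(raw)
    message["campaign"] = 42  # stand-in for the string -> primary key translation
    return json.dumps(message)


def profile_worker(iterations=10000):
    """Run the worker step under cProfile and return the formatted report."""
    raw = json.dumps({"campaign": "spring-sale", "user": "abc123", "clicks": 3})
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(iterations):
        process_message(raw)
    profiler.disable()

    # Sort by cumulative time so the most expensive call paths are listed first.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_worker())
```

Reading down the cumulative-time column is usually enough to spot which calls dominate the runtime.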

After letting the script run for a while, the bottleneck jumped out right away: the workers were spending about 40% of their time serializing and deserializing data.

In order to keep messages on the queue for other workers to pick up, we were translating the Python dicts into JSON objects using the standard library’s json package.

Our worker was reading the text data from the queue, deserializing it into a Python dict, changing a few values and then serializing it back into text data to save onto a new queue.

Those serialization and deserialization steps were taking up about 40% of the total runtime.

So I set out to see if there was a faster way to serialize a Python dict.

Our Concerns

When you’re optimizing code, it’s helpful to think about what sort of gains you’re looking for.

Getting a few percentage points faster usually isn’t too challenging
Getting several times faster (i.e. 200-800%) requires more strategic thinking
Getting orders of magnitude faster often requires rearchitecture or starting over

Since we were already using Python’s built-in json module, we knew it’d be hard to eke out an order of magnitude improvement. But a few percentage points wasn’t going to cut it. It had to be a meaningful speedup in order to take a big chunk out of that 40% of time spent doing serialization/deserialization.
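To put rough numbers on what a “meaningful speedup” has to look like, here is a back-of-the-envelope sketch using Amdahl’s law. The 40% figure comes from the profiler results above; the speedup factors in the loop are illustrative assumptions, not measurements.

```python
def overall_speedup(fraction, factor):
    """Amdahl's law: whole-job speedup when `fraction` of the runtime gets `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


SERIALIZATION_FRACTION = 0.40  # share of runtime spent serializing/deserializing

# Illustrative factors only: how much faster the serialization step itself gets.
for factor in (1.05, 2.0, 4.0, 10.0):
    speedup = overall_speedup(SERIALIZATION_FRACTION, factor)
    print("serialization %5.2fx faster -> pipeline %.2fx faster" % (factor, speedup))
```

Since the other 60% of the runtime is untouched, even an infinitely fast serializer would cap the overall gain at 1 / 0.6, or about 1.67x.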

Each message of data was small — 5 keys with small values tipped the scales at a few dozen bytes each — so we weren’t worried about saturating the network card. Bandwidth and latency also weren’t a huge factor since the queue and all the workers were in the same availability zone on EC2.

I should note that all of the workers that’d be touching this data were in-house, so interoperability with common data serialization standards wasn’t a huge concern.

If the fastest way to encode data was to string it together with pipes | and backslashes \, that was fine. We could update all of the workers to accommodate it.
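A home-brewed scheme along those lines might look something like the sketch below. The FIELDS schema and the escaping rules are assumptions made up for illustration; this is not the actual format that was tested.

```python
FIELDS = ("campaign", "user", "country", "device", "clicks")  # hypothetical schema


def encode(message):
    """Join the values in a fixed field order, escaping backslashes and pipes."""
    parts = []
    for field in FIELDS:
        value = str(message[field])
        parts.append(value.replace("\\", "\\\\").replace("|", "\\|"))
    return "|".join(parts)


def decode(text):
    """Split on unescaped pipes and undo the escaping. Every value comes back as a string."""
    values, current, escaped = [], [], False
    for char in text:
        if escaped:
            current.append(char)  # character after a backslash is taken literally
            escaped = False
        elif char == "\\":
            escaped = True
        elif char == "|":
            values.append("".join(current))
            current = []
        else:
            current.append(char)
    values.append("".join(current))
    return dict(zip(FIELDS, values))
```

The tradeoff is that the field names and value types live in the code rather than in the message, so every worker has to agree on the schema, and numbers need converting back from strings after decoding.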

I tried searching for as many Python data serialization libraries as I could find — as well as coming up with my own serialization schemes.

The Code

I learned a bunch about the performance of different string building functions while building my own home_brew’d serialization process. If you have any more ideas, let me know and I’ll be sure to add them!

Note that some packages required a file handle in order to write the serialized data, while others just dumped it to a string in-memory.

The overhead of opening and closing the file was undetectable on the order of time I was examining, but I commented the file-handling code out for the packages that didn’t need it, to simulate the actual cost of using each package in production.

Results & Observations

To get the times for each of the different serialization functions, I ran the script with the unix time command and summed the user and system time.

I ran the script 10 times for each package and made a mental average. That’s the number you see listed in the comments next to each function.
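If you’d rather not average by eye, a similar comparison can be run from inside Python with the standard library’s timeit module instead of the unix time command. This is a sketch of that alternative, not the script used for the numbers here; the sample message is hypothetical and only two standard library serializers are compared.

```python
import json
import pickle
import timeit

# Hypothetical message: 5 keys with small values.
MESSAGE = {"campaign": "spring-sale", "user": "abc123", "country": "US",
           "device": "mobile", "clicks": 3}


def json_round_trip():
    return json.loads(json.dumps(MESSAGE))


def pickle_round_trip():
    return pickle.loads(pickle.dumps(MESSAGE, protocol=pickle.HIGHEST_PROTOCOL))


def benchmark(func, number=10000, repeat=5):
    """Return the best total time in seconds for `number` calls, over `repeat` runs."""
    return min(timeit.repeat(func, number=number, repeat=repeat))


if __name__ == "__main__":
    for func in (json_round_trip, pickle_round_trip):
        print("%-20s %.4f s" % (func.__name__, benchmark(func)))
```

Taking the minimum across repeats filters out runs that were slowed down by other activity on the machine. Note that this measures wall-clock time, whereas summing user and system time from the time command measures CPU time.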

If speed is your primary concern, I’d recommend checking out “Ultra JSON” aka ujson from these fine folks.

We switched to using ujson and saw a roughly one-third overall increase in our pipeline processing speed, which was in line with our expectations from the test results.
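Because ujson provides dumps and loads functions that are called the same way as the standard library’s for simple cases like this, the switch can be close to a one-line import change. Here is a sketch of what that might look like, assuming ujson is installed (pip install ujson) and falling back to json otherwise. The translate_message function and campaign_ids table are hypothetical.

```python
try:
    import ujson as serializer  # third-party, C-backed
except ImportError:
    import json as serializer  # standard library fallback


def translate_message(raw, campaign_ids):
    """Deserialize a message, swap the campaign name for its primary key, reserialize."""
    message = serializer.loads(raw)
    message["campaign"] = campaign_ids[message["campaign"]]
    return serializer.dumps(message)


if __name__ == "__main__":
    raw = '{"campaign": "spring-sale", "clicks": 3}'
    print(translate_message(raw, {"spring-sale": 1}))
```

Keeping the import behind a single name means the rest of the worker doesn’t need to know which library is doing the work.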

Any packages I missed? Different ideas for home brewed serialization? Shoot me a note on Twitter.